# Data Preparation
Data preparation is a critical phase in machine learning; it is often said that as much as 80% of the effort goes into collecting and preparing data. Cleaning and organizing the data helps direct learning towards the intended goal, while skipping these steps will likely produce an unsuccessful model. Data can contain discrepancies, errors, outliers and missing attributes of interest, and the following steps show how some of these issues can be handled.

## 1 Importing the libraries
As with most work in a notebook, the libraries used in the data preparation process need to be imported first.

In [1]:
# import numpy, matplotlib.pyplot and pandas
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

# import helpers for downloading, unzipping and reading ARFF files
import requests, io, zipfile
from scipy.io import arff

# import imputers for handling missing value and encoders
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder


## 2 Importing the dataset

Data can be retrieved in various formats. The examples below read data from ARFF, JSON and CSV.

### Reading from ARFF

In [2]:
# download a copy of an archived data set and extract the zip file to the notebook's folder
f_zip = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00212/vertebral_column_data.zip'
r = requests.get(f_zip, stream=True)
Vertebral_zip = zipfile.ZipFile(io.BytesIO(r.content))
Vertebral_zip.extractall()

In [8]:
# read the ARFF file and store it as a dataframe
data = arff.loadarff('column_2C_weka.arff')
df1 = pd.DataFrame(data[0])   # data[0] holds the records; data[1] holds the metadata (attribute names and types)
print(df1)

     pelvic_incidence  pelvic_tilt  lumbar_lordosis_angle  sacral_slope  \
0           63.027817    22.552586              39.609117     40.475232   
1           39.056951    10.060991              25.015378     28.995960   
2           68.832021    22.218482              50.092194     46.613539   
3           69.297008    24.652878              44.311238     44.644130   
4           49.712859     9.652075              28.317406     40.060784   
..                ...          ...                    ...           ...   
305         47.903565    13.616688              36.000000     34.286877   
306         53.936748    20.721496              29.220534     33.215251   
307         61.446597    22.694968              46.170347     38.751628   
308         45.252792     8.693157              41.583126     36.559635   
309         33.841641     5.073991              36.641233     28.767649   

     pelvic_radius  degree_spondylolisthesis        class  
0        98.672917                 -0.2
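
One detail worth knowing when reading ARFF files: scipy returns nominal attributes (such as the class column) as byte strings. The following is a minimal sketch of decoding them, using a small hypothetical in-memory ARFF document rather than the downloaded file; the attribute names `width` and `label` are made up for illustration.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical ARFF content, standing in for a real file on disk
arff_text = """@relation demo
@attribute width numeric
@attribute label {A,B}
@data
1.5,A
2.5,B
"""

# loadarff accepts a file-like object as well as a file name
records, meta = arff.loadarff(io.StringIO(arff_text))
df_demo = pd.DataFrame(records)

# Nominal values arrive as bytes, so decode them into ordinary strings
df_demo["label"] = df_demo["label"].str.decode("utf-8")

print(meta.names())
print(df_demo)
```

The same decoding step could be applied to the class column of df1 if plain strings are preferred.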

### Reading from JSON

In [9]:
# Create a JSON file from Excel
df2 = pd.read_excel('data2.xlsx',index_col=0) # use column 0 as the row labels
df2.to_json('data2.json')
df2


Name,Age,Salary,Married
Jenny,54.0,72000.0,Y
Tommy,27.0,48000.0,N
Gilbert,30.0,54000.0,N
Dorothy,38.0,61000.0,Y
David,40.0,,N
Francis,35.0,58000.0,N
Julie,,52000.0,N
Apple,48.0,79000.0,Y
Peter,50.0,83000.0,N
Joe,37.0,67000.0,Y


In [10]:
# Read the newly created JSON as a dataframe
df3 = pd.read_json("data2.json")
df3

Unnamed: 0,Age,Salary,Married
Jenny,54.0,72000.0,Y
Tommy,27.0,48000.0,N
Gilbert,30.0,54000.0,N
Dorothy,38.0,61000.0,Y
David,40.0,,N
Francis,35.0,58000.0,N
Julie,,52000.0,N
Apple,48.0,79000.0,Y
Peter,50.0,83000.0,N
Joe,37.0,67000.0,Y


### Reading from CSV

In [11]:
# Create a CSV file from Excel
df4 = pd.read_excel('data2.xlsx',index_col=0)
df4.to_csv('data2.csv')

In [12]:
# Read the CSV file into a dataframe
dataset = pd.read_csv('data2.csv')
dataset

Unnamed: 0,Name,Age,Salary,Married
0,Jenny,54.0,72000.0,Y
1,Tommy,27.0,48000.0,N
2,Gilbert,30.0,54000.0,N
3,Dorothy,38.0,61000.0,Y
4,David,40.0,,N
5,Francis,35.0,58000.0,N
6,Julie,,52000.0,N
7,Apple,48.0,79000.0,Y
8,Peter,50.0,83000.0,N
9,Joe,37.0,67000.0,Y


## 3 Taking care of missing data

There are several ways to handle missing data, but only the following are covered in this exercise:
* remove the rows with missing data
* impute missing values with the mean, median or mode

### Dropping rows with missing data
The dropna function's axis argument defaults to 0 (rows), so any row containing a NaN value is removed. Set it to 1 to remove columns that contain NaN values instead.

Dropping rows leaves a dataset with no gaps, which is simple to work with, but it can discard a lot of data. This approach works poorly when a significant share of the dataset has to be removed.

In [13]:
dataset.dropna()

Unnamed: 0,Name,Age,Salary,Married
0,Jenny,54.0,72000.0,Y
1,Tommy,27.0,48000.0,N
2,Gilbert,30.0,54000.0,N
3,Dorothy,38.0,61000.0,Y
5,Francis,35.0,58000.0,N
7,Apple,48.0,79000.0,Y
8,Peter,50.0,83000.0,N
9,Joe,37.0,67000.0,Y
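
The effect of the axis argument can be seen on a small example. This is a minimal sketch on a hypothetical three-row frame (not data2.csv itself); it also shows the subset argument, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kind of gaps as the dataset above
toy = pd.DataFrame({
    "Name": ["David", "Julie", "Joe"],
    "Age": [40.0, np.nan, 37.0],
    "Salary": [np.nan, 52000.0, 67000.0],
})

rows_dropped = toy.dropna()            # axis=0 (default): drop rows containing any NaN
cols_dropped = toy.dropna(axis=1)      # axis=1: drop columns containing any NaN
age_known = toy.dropna(subset=["Age"]) # only look for NaN in the Age column

print(rows_dropped)
print(cols_dropped)
print(age_known)
```

Note that dropna returns a new dataframe and leaves the original unchanged unless the result is assigned back.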


### Impute missing values with mean, median or mode

With continuous numerical values, the mean, median or mode can be used to fill in the missing values. Missing values can also be set to zero or another scalar value.

In [16]:
# Replacing with a scalar value
#dataset.fillna(0)
dataset.replace({np.nan: 0})

Unnamed: 0,Name,Age,Salary,Married
0,Jenny,54.0,72000.0,Y
1,Tommy,27.0,48000.0,N
2,Gilbert,30.0,54000.0,N
3,Dorothy,38.0,61000.0,Y
4,David,40.0,0.0,N
5,Francis,35.0,58000.0,N
6,Julie,0.0,52000.0,N
7,Apple,48.0,79000.0,Y
8,Peter,50.0,83000.0,N
9,Joe,37.0,67000.0,Y


In [17]:
# Extract the values into features and target
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [18]:
print(x)

[['Jenny' 54.0 72000.0]
 ['Tommy' 27.0 48000.0]
 ['Gilbert' 30.0 54000.0]
 ['Dorothy' 38.0 61000.0]
 ['David' 40.0 nan]
 ['Francis' 35.0 58000.0]
 ['Julie' nan 52000.0]
 ['Apple' 48.0 79000.0]
 ['Peter' 50.0 83000.0]
 ['Joe' 37.0 67000.0]]


In [19]:
print(y)

['Y' 'N' 'N' 'Y' 'N' 'N' 'N' 'Y' 'N' 'Y']


In [20]:
# Replacing with mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [21]:
print(x)

[['Jenny' 54.0 72000.0]
 ['Tommy' 27.0 48000.0]
 ['Gilbert' 30.0 54000.0]
 ['Dorothy' 38.0 61000.0]
 ['David' 40.0 63777.77777777778]
 ['Francis' 35.0 58000.0]
 ['Julie' 39.888888888888886 52000.0]
 ['Apple' 48.0 79000.0]
 ['Peter' 50.0 83000.0]
 ['Joe' 37.0 67000.0]]
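
The median and mode work the same way through the strategy argument of SimpleImputer. Below is a minimal sketch comparing the three strategies on a hypothetical single column of ages; the values are made up so that the three results differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age column with one missing entry (row 3)
ages = np.array([[54.0], [27.0], [30.0], [np.nan], [27.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer_demo = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer_demo.fit_transform(ages)
    results[strategy] = filled[3, 0]   # the value that replaced the NaN

print(results)
```

The median is often preferred over the mean when a column contains outliers, since a few extreme values do not shift it as much.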


## 4 Encoding categorical data

Categorical data can only take on a limited, and usually fixed, number of values. For example, gender (described as Male or Female) and job position are categorical.

Categorical data can be:
* Nominal
* Ordinal

In general, nominal data are labels with no specific order, while ordinal data have a specific order. Gender is nominal, while a level of satisfaction (poor/average/good) is ordinal.
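
For ordinal data, the order can be preserved by mapping the categories to integers in a stated sequence. The following is a minimal sketch using scikit-learn's OrdinalEncoder on a hypothetical satisfaction column; the category order is supplied explicitly because the default would sort the labels alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey answers, one column
satisfaction = np.array([["good"], ["poor"], ["average"], ["good"]])

# Spell out the intended order: poor < average < good
ordinal_encoder = OrdinalEncoder(categories=[["poor", "average", "good"]])
codes = ordinal_encoder.fit_transform(satisfaction)

print(codes.ravel())
```

With this ordering, poor maps to 0, average to 1 and good to 2, so larger numbers mean higher satisfaction.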


### Encoding the Independent Variable

Most machine learning algorithms cannot process categorical data directly, so it has to be converted to numbers. One-hot encoding is widely used because simply labelling categories with integers introduces an order that may not be valid.

The basic strategy in One-Hot encoding is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.
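
As a minimal sketch of this idea, the example below applies OneHotEncoder to a hypothetical single column of country names. The categories_ attribute shows which category each new column stands for.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column
countries = np.array([["Singapore"], ["Malaysia"], ["Thailand"], ["Singapore"]])

onehot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing
encoded = onehot.fit_transform(countries).toarray()

print(onehot.categories_[0])   # column order of the encoded output
print(encoded)
```

Each row contains exactly one 1, in the column that corresponds to its category.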

In [23]:
# Read in a new dataset from CSV

df6 = pd.read_csv('categorical.csv')
df6

Unnamed: 0,Country,Transactions,Salary,Expiry
0,Singapore,54,72000,2007-02
1,Malaysia,47,58000,2007-03
2,Thailand,30,54000,2021-04
3,Singapore,20,61000,2019-05
4,Singapore,40,58000,2019-06
5,Thailand,35,58000,2018-04
6,Singapore,45,52000,2007-04
7,Vietnam,10,58000,2016-06
8,Indonesia,50,52000,2006-09


In [24]:
# one-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
df6 = np.array(ct.fit_transform(df6))

In [25]:
print(df6)

[[0.0 0.0 1.0 0.0 0.0 54 72000 '2007-02']
 [0.0 1.0 0.0 0.0 0.0 47 58000 '2007-03']
 [0.0 0.0 0.0 1.0 0.0 30 54000 '2021-04']
 [0.0 0.0 1.0 0.0 0.0 20 61000 '2019-05']
 [0.0 0.0 1.0 0.0 0.0 40 58000 '2019-06']
 [0.0 0.0 0.0 1.0 0.0 35 58000 '2018-04']
 [0.0 0.0 1.0 0.0 0.0 45 52000 '2007-04']
 [0.0 0.0 0.0 0.0 1.0 10 58000 '2016-06']
 [1.0 0.0 0.0 0.0 0.0 50 52000 '2006-09']]


A sparse matrix is a matrix made up mostly of zero values. Storing only the non-zero entries can save a great deal of memory and computation. The Compressed Sparse Row (CSR) format is often used to represent sparse matrices in machine learning because it supports efficient row access and matrix multiplication.
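
To make the CSR layout concrete, here is a minimal sketch on a hypothetical 3x3 matrix. CSR keeps three arrays: the non-zero values, the column index of each value, and a pointer array marking where each row starts.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix with three non-zero entries
dense = np.array([[1, 0, 0],
                  [0, 0, 2],
                  [0, 3, 0]])

csr = sparse.csr_matrix(dense)

print(csr.data)      # non-zero values, stored row by row
print(csr.indices)   # column index of each stored value
print(csr.indptr)    # row i occupies data[indptr[i]:indptr[i+1]]
print(csr.toarray()) # convert back to a dense array
```

Only three values are stored instead of nine; on a wide one-hot encoded dataset the saving is far larger.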

In [28]:
# one-hot encode the categorical Name column (column 0)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x_final = scipy.sparse.csr_matrix(ct.fit_transform(x)).toarray()
print(x_final)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 5.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
  0

### Encoding the Dependent Variable

Label encoding converts each distinct value in a column to an integer. LabelEncoder sorts the classes before numbering them, so here N becomes 0 and Y becomes 1.

In [29]:
le = LabelEncoder()
y = le.fit_transform(y)

In [30]:
print(y)

[1 0 0 1 0 0 0 1 0 1]
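
The mapping between labels and integers can be inspected and reversed. The following is a minimal sketch on a hypothetical list of labels, using the classes_ attribute and inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
labels = ["Y", "N", "N", "Y"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_)                    # position in this array is the integer code
print(encoded)
print(le_demo.inverse_transform(encoded))  # recover the original labels
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original labels.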


## 5 Splitting the dataset into the Training set and Test set

A machine learning algorithm essentially works in two stages, training and testing, but you may come across the following definitions.

Training dataset - The sample of data used to fit the model.

Validation dataset - The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

Test dataset - The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

The test dataset should be carefully sampled to span the various scenarios that a model would encounter in the real world. It is used once, after a model is completely trained, while the validation dataset is used as part of development.

For ease of understanding, we will focus on just the training and test data. For self-learning, you can look up cross validation, in which the training set is used to generate multiple splits into train and validation sets.
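
As a pointer for that self-learning, here is a minimal sketch of how cross validation generates its splits, using scikit-learn's KFold on a hypothetical array of 10 samples. The array contents and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

fold_sizes = []
validated = []
for train_idx, val_idx in kfold.split(X_demo):
    # each fold trains on 8 samples and validates on the remaining 2
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())
    print("train:", train_idx, "validation:", val_idx)
```

Every sample is used for validation exactly once across the five folds, so no data is wasted on a fixed validation set.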

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_final, y, test_size = 0.2, random_state = 1)

In [32]:
print(X_train)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 0.00000000e+00 3.98888889e+01 5.20000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 5.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0

In [33]:
print(X_test)

[[0.0e+00 0.0e+00 0.0e+00 0.0e+00 1.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00
  0.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 0.0e+00 1.0e+00 0.0e+00 0.0e+00
  0.0e+00 3.7e+01 6.7e+04]]


In [34]:
print(y_train)

[0 0 1 1 0 1 0 0]


In [35]:
print(y_test)

[0 1]


## 6 Feature Scaling

Feature scaling is a method used to normalize or standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

You will see that feature scaling is carried out after separating the data into training and test sets. This prevents information from the test data from being used when scaling the training data.

When data are used in machine learning, the values of features can have very different ranges. One feature could be in kg while another could be in grams. The values can also differ greatly in magnitude. For example:

|Transaction | Volume | Average Price |
|---|---|---|
|1|50000| 1.45|
|2|120000| 2.44|
|3|450000| 2.11|
|4|700000| 1.60|
|5|800000| 1.72|

In this scenario, because the volume values are so much larger, a machine learning algorithm that cannot recognize the context of '800000' versus '1.72' may put more emphasis and priority on the volume.

By scaling the values in each column to a similar range, the performance of a machine learning algorithm can be improved. However, not all machine learning algorithms benefit from feature scaling. Distance-based algorithms often benefit from it, while tree-based algorithms are insensitive to the scaling of features. Algorithms that benefit include:
* linear and logistic regression
* nearest neighbors
* neural networks
* support vector machines with radial basis function (RBF) kernels
* principal components analysis
* linear discriminant analysis

The StandardScaler subtracts each feature's mean and divides by its standard deviation, so that the feature is centred around 0 with a standard deviation of 1. It works best when the data within each feature is roughly normally distributed; if it is not, this may not be the best scaler to use.
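
The calculation behind standardization is z = (x - mean) / standard deviation, applied to each feature separately. The following is a minimal sketch that computes it by hand for the Volume column from the table above and compares the result with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Volume column from the example table, as a single feature
volume = np.array([[50000.0], [120000.0], [450000.0], [700000.0], [800000.0]])

# Manual z-score; np.std uses the population standard deviation, as StandardScaler does
manual = (volume - volume.mean(axis=0)) / volume.std(axis=0)

scaled = StandardScaler().fit_transform(volume)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

After scaling, the column has a mean of 0 and a standard deviation of 1, regardless of its original units.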

The MinMaxScaler is probably the best-known scaling algorithm. It shrinks each feature so that its range is between 0 and 1 by default; a different range, such as -1 to 1, can be requested through the feature_range argument. This scaler can work better in cases where the standard scaler does not work so well, for example when the distribution is not Gaussian or the standard deviation is very small.

There are other scalers such as the RobustScaler, which centres each feature on its median and scales by the interquartile range rather than using the minimum and maximum, making it more robust to outliers.

The Normalizer rescales rows (samplewise) to unit norm, and not columns (featurewise).

Most business data aims to study relations across samples and to predict for new samples, which will likely benefit from featurewise normalization. 
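
The differences between these scalers are easier to see side by side. Below is a minimal sketch on a hypothetical two-feature array in which the first feature contains one outlier (100.0); all scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler

# Hypothetical data: the first column has an outlier in the last row
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0],
                 [100.0, 50.0]])

minmax = MinMaxScaler().fit_transform(data)    # each column mapped to [0, 1]
robust = RobustScaler().fit_transform(data)    # (x - median) / interquartile range, per column
unit_rows = Normalizer().fit_transform(data)   # each row rescaled to length 1

print(minmax)
print(robust)
print(np.linalg.norm(unit_rows, axis=1))
```

In the min-max result the outlier squeezes the other first-column values close to 0, while the robust scaler keeps them evenly spread and leaves only the outlier far from the rest.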

In [36]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 10:] = sc.fit_transform(X_train[:, 10:])
X_test[:, 10:] = sc.transform(X_test[:, 10:])

In [37]:
print(X_train)

[[ 0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.         -0.19434578 -1.07812594]
 [ 0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.         -0.18082607 -0.07013168]
 [ 0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          1.52265694  0.63356243]
 [ 0.          0.          1.          0.          0.          0.
   0.          0.          0.          0.         -0.42418079 -0.30786617]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          0.          1.         -1.76263173 -1.42046362]
 [ 1.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.79259279  1.23265336]
 [ 0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          1.03594751  1.57499104]
 [ 0.        

In [38]:
print(X_test)

[[ 0.          0.          0.          0.          1.          0.
   0.          0.          0.          0.         -1.39759966 -0.9069571 ]
 [ 0.          0.          0.          0.          0.          0.
   1.          0.          0.          0.         -0.54585815  0.20564034]]


# Exercise

Import the dataset from 'data_practice.xlsx' and use the steps you have gone through in this practical to prepare the data.

## Import the libraries 
(Only need to import libraries/modules once)


## Import the dataset

In [40]:
#todo
# import numpy, matplotlib.pyplot and pandas
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy

# import arff
import requests, io, zipfile
from scipy.io import arff

# import imputers for handling missing value and encoders
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder



In [41]:
# read in the Excel file and remove the two unnecessary empty columns
x = pd.read_excel('data_practice.xlsx')
x = x.drop(x.columns[5],axis=1)
x = x.drop(x.columns[5],axis=1)   # the second empty column has shifted into position 5
#x = x.dropna(axis='columns')
print(x)

          Name    Age    Salary         Position Joined in (Year) Married
0        Jenny   54.0   72000.0      Snr manager          2007-02       Y
1        Tommy   47.0   48000.0              Mgr          2007-03       N
2      Gilbert   30.0   54000.0          Manager          2021-04       N
3      Dorothy   38.0   61000.0      Project mgr          2019-05       Y
4        David   40.0       NaN         Engineer          2019-06       N
5      Francis   35.0   58000.0          Manager          2018-04       N
6        Julie    NaN   52000.0         Engineer          2007-04       N
7        Apple   48.0  790000.0         Director          2016-06       Y
8        Peter   50.0  830000.0         Director          2006-09       N
9          Joe   37.0   67000.0         Engineer          2007-11       Y
10   Billy See   31.0   60000.0              Mgr          2007-12       Y
11         Tim   34.0   40000.0             Engr          2007-13       Y
12        Zack   49.0   45000.0       

## Take care of missing values

In [42]:
#todo (make the Position string values consistent and drop rows with no joining date)
x = x[x['Joined in (Year)'].notna()].copy()   # copy so the assignments below modify this frame directly
x.iloc[:,3] = x.iloc[:,3].str.upper()
x.iloc[:,3] = x.iloc[:,3].str.replace('ENGINEER', 'ENGR')
x.iloc[:,3] = x.iloc[:,3].str.replace('MANAGER', 'MGR')
#x.iloc[:,3] = x.iloc[:,3].replace('SNR MANAGER', 'SNR MGR')
#x.iloc[:,3] = x.iloc[:,3].replace('PROJECT MANAGER', 'PROJECT MGR')
print(x)

          Name    Age    Salary     Position Joined in (Year) Married
0        Jenny   54.0   72000.0      SNR MGR          2007-02       Y
1        Tommy   47.0   48000.0          MGR          2007-03       N
2      Gilbert   30.0   54000.0          MGR          2021-04       N
3      Dorothy   38.0   61000.0  PROJECT MGR          2019-05       Y
4        David   40.0       NaN         ENGR          2019-06       N
5      Francis   35.0   58000.0          MGR          2018-04       N
6        Julie    NaN   52000.0         ENGR          2007-04       N
7        Apple   48.0  790000.0     DIRECTOR          2016-06       Y
8        Peter   50.0  830000.0     DIRECTOR          2006-09       N
9          Joe   37.0   67000.0         ENGR          2007-11       Y
10   Billy See   31.0   60000.0          MGR          2007-12       Y
11         Tim   34.0   40000.0         ENGR          2007-13       Y
12        Zack   49.0   45000.0         ENGR          2007-06       N
13         Roy   42.

In [43]:
# parse as datetime and keep only the year; invalid dates such as 2007-13 become NaN
x['Joined in (Year)'] = pd.to_datetime(x['Joined in (Year)'], format="%Y-%m", errors="coerce")
x['Joined in (Year)'] = x['Joined in (Year)'].dt.year

# drop the Name column, which is not used as a feature
x = x.iloc[:,1:]
print(x)

      Age    Salary     Position  Joined in (Year) Married
0    54.0   72000.0      SNR MGR            2007.0       Y
1    47.0   48000.0          MGR            2007.0       N
2    30.0   54000.0          MGR            2021.0       N
3    38.0   61000.0  PROJECT MGR            2019.0       Y
4    40.0       NaN         ENGR            2019.0       N
5    35.0   58000.0          MGR            2018.0       N
6     NaN   52000.0         ENGR            2007.0       N
7    48.0  790000.0     DIRECTOR            2016.0       Y
8    50.0  830000.0     DIRECTOR            2006.0       N
9    37.0   67000.0         ENGR            2007.0       Y
10   31.0   60000.0          MGR            2007.0       Y
11   34.0   40000.0         ENGR               NaN       Y
12   49.0   45000.0         ENGR            2007.0       N
13   42.0   44000.0         ENGR            2007.0       N
14   52.0   50000.0         ENGR            2007.0       Y
15   55.0  400000.0     DIRECTOR            2007.0      

In [44]:
x = x[x['Joined in (Year)'].notna()]
x

Unnamed: 0,Age,Salary,Position,Joined in (Year),Married
0,54.0,72000.0,SNR MGR,2007.0,Y
1,47.0,48000.0,MGR,2007.0,N
2,30.0,54000.0,MGR,2021.0,N
3,38.0,61000.0,PROJECT MGR,2019.0,Y
4,40.0,,ENGR,2019.0,N
5,35.0,58000.0,MGR,2018.0,N
6,,52000.0,ENGR,2007.0,N
7,48.0,790000.0,DIRECTOR,2016.0,Y
8,50.0,830000.0,DIRECTOR,2006.0,N
9,37.0,67000.0,ENGR,2007.0,Y


## Encode categorical data

In [46]:
# Extract the values into features and target
x_final = x.iloc[:, :-1].values
y = x.iloc[:, -1].values

#todo (for simplicity, there is no need to use scipy.sparse.csr_matrix)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
x_final = np.array(ct.fit_transform(x_final))

#todo (encode the target)
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 1]


## Split dataset for training and test

In [47]:
#todo
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_final, y, test_size = 0.2, random_state = 1)

## Feature scaling

In [48]:
print(X_train)

[[0.0 0.0 0.0 1.0 0.0 42.0 58000.0 2007.0]
 [0.0 0.0 1.0 0.0 0.0 31.0 60000.0 2007.0]
 [0.0 0.0 1.0 0.0 0.0 42.0 60000.0 2019.0]
 [0.0 1.0 0.0 0.0 0.0 40.0 nan 2019.0]
 [0.0 0.0 1.0 0.0 0.0 30.0 54000.0 2021.0]
 [0.0 0.0 0.0 0.0 1.0 45.0 70000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 nan 52000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 51.0 42000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 52.0 50000.0 2007.0]
 [1.0 0.0 0.0 0.0 0.0 48.0 790000.0 2016.0]
 [0.0 1.0 0.0 0.0 0.0 47.0 42000.0 2012.0]
 [0.0 0.0 1.0 0.0 0.0 47.0 48000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 24.0 40000.0 2021.0]
 [0.0 0.0 0.0 0.0 1.0 54.0 72000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 23.0 35000.0 2021.0]
 [0.0 1.0 0.0 0.0 0.0 35.0 40000.0 2021.0]
 [0.0 1.0 0.0 0.0 0.0 278.0 42000.0 2021.0]
 [0.0 1.0 0.0 0.0 0.0 37.0 67000.0 2007.0]
 [1.0 0.0 0.0 0.0 0.0 50.0 830000.0 2006.0]
 [0.0 1.0 0.0 0.0 0.0 42.0 44000.0 2007.0]
 [0.0 1.0 0.0 0.0 0.0 49.0 45000.0 2007.0]
 [0.0 0.0 1.0 0.0 0.0 35.0 58000.0 2018.0]]


In [49]:
# Replacing with mean value
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the training set only, then apply it to both sets
# so that no information from the test set is used
imputer.fit(X_train[:, 5:7])
X_train[:, 5:7] = imputer.transform(X_train[:, 5:7])
X_test[:, 5:7] = imputer.transform(X_test[:, 5:7])

# Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 5:7] = sc.fit_transform(X_train[:, 5:7])
X_test[:, 5:7] = sc.transform(X_test[:, 5:7])

print(X_train)

[[0.0 0.0 0.0 1.0 0.0 -0.2095214512869798 -0.30188862307930087 2007.0]
 [0.0 0.0 1.0 0.0 0.0 -0.42951897513830867 -0.29270736155190724 2007.0]
 [0.0 0.0 1.0 0.0 0.0 -0.2095214512869798 -0.29270736155190724 2019.0]
 [0.0 1.0 0.0 0.0 0.0 -0.24952100107813052 0.0 2019.0]
 [0.0 0.0 1.0 0.0 0.0 -0.449518750033884 -0.32025114613408817 2021.0]
 [0.0 0.0 0.0 0.0 1.0 -0.14952212660025377 -0.24680105391493895 2007.0]
 [0.0 1.0 0.0 0.0 0.0 0.0 -0.32943240766148185 2007.0]
 [0.0 1.0 0.0 0.0 0.0 -0.02952347722680167 -0.3753387152984501 2007.0]
 [0.0 1.0 0.0 0.0 0.0 -0.009523702331226323 -0.3386136691888755 2007.0]
 [1.0 0.0 0.0 0.0 0.0 -0.08952280191352772 3.058453095946777 2016.0]
 [0.0 1.0 0.0 0.0 0.0 -0.10952257680910307 -0.3753387152984501 2012.0]
 [0.0 0.0 1.0 0.0 0.0 -0.10952257680910307 -0.34779493071626916 2007.0]
 [0.0 1.0 0.0 0.0 0.0 -0.5695173994073361 -0.38451997682584377 2021.0]
 [0.0 0.0 0.0 0.0 1.0 0.030475847459924373 -0.2376197923875453 2007.0]
 [0.0 1.0 0.0 0.0 0.0 -0.589517174302