# Data Preprocessing Tools

## Importing the libraries

Import the libraries used in this preprocessing lesson.

In [12]:
import numpy as np
import matplotlib.pyplot as plt # library.module
import pandas as pd

## Importing the dataset

Read the ```Data.csv``` file, then separate the independent variables (```X```) from the dependent variable (```y```).

In [13]:
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values # independent variables: all rows, all columns except the last
y = dataset.iloc[:, -1].values # dependent variable: all rows, last column only

In [14]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [15]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
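
The slicing above can be tried without the original file. The sketch below assumes a hypothetical three-row stand-in for ```Data.csv``` with the column names ```Country```, ```Age```, ```Salary``` and ```Purchased```; the rows and the ```sample``` variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Data.csv (same column names, one empty Salary cell).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
"""
sample = pd.read_csv(io.StringIO(csv_text))

X_sample = sample.iloc[:, :-1].values  # every column except the last
y_sample = sample.iloc[:, -1].values   # last column only

print(X_sample.shape)       # two-dimensional: rows x feature columns
print(y_sample.shape)       # one-dimensional: one label per row
print(sample.isna().sum())  # number of missing cells in each column
```

Counting missing cells per column with ```isna().sum()``` is a quick way to see which columns need the imputation step.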


## Taking care of missing data

```Age``` and ```Salary``` each have one missing cell. One way to fill them in is to replace each missing value with the mean of its column.

In [16]:
from sklearn.impute import SimpleImputer # module.class

# replace missing values (NaN) with the mean of their column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit on and fill in only the Age and Salary columns (indices 1 and 2)
# in practice, include all columns with numerical data
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
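
As a sanity check, the fill values can be compared with a direct NumPy calculation. This is a minimal sketch that copies the ```Age``` and ```Salary``` values from the printed ```X``` above into a standalone array; the names ```numeric``` and ```check_imputer``` are introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary values copied from the printed X above (before imputation).
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

check_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = check_imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit
print(check_imputer.statistics_)
# nanmean ignores NaN cells, so it should give the same two numbers
print(np.nanmean(numeric, axis=0))
```

Both lines should print the same pair of values, which are the means that appear in the imputed rows for Germany and Spain.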


## Encoding categorical data

### Encoding the Independent Variable

If the countries were encoded as integers (e.g. 0, 1, 2), a machine learning model could interpret their numerical order as meaningful.
To avoid this, apply one-hot encoding.
One-hot encoding creates one binary vector per category, e.g. ```France``` = ```[1,0,0]```, ```Spain``` = ```[0,0,1]```, ```Germany``` = ```[0,1,0]```.
The vector positions follow the sorted category names, not the order in which the countries first appear in the data.

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# the third element of each transformer tuple lists the column indices to encode; remainder="passthrough" keeps the other columns unchanged
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

In [18]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
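
To see which output column corresponds to which country, the fitted encoder can be inspected. The sketch below uses a small illustrative country column rather than the lesson's ```X```, and the names ```countries``` and ```demo_ct``` are assumptions made for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input: only the Country feature.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]], dtype=object)

demo_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = demo_ct.fit_transform(countries)

# Depending on the output density, the result may be a sparse matrix; densify if so.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# categories_ lists the category for each output column, in column order
print(demo_ct.named_transformers_["encoder"].categories_)
print(encoded)
```

The categories come back sorted, which is why ```Germany``` occupies the second position even though ```Spain``` appears before it in the data.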


### Encoding the Dependent Variable

The ```Purchased``` values ```Yes``` and ```No``` are strings, which some models may not handle correctly, so they are converted into the binary values ```1``` and ```0```.

In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # no arguments needed: y is a single vector of labels
y = le.fit_transform(y)

In [20]:
print(y) # where "No" = 0, "Yes" = 1

[0 1 0 0 1 1 0 1 0 1]
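
The mapping between labels and integers can be read from the fitted encoder rather than inferred from the output. A minimal sketch, assuming a short illustrative label list and a separate encoder named ```demo_le```:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]  # illustrative labels

demo_le = LabelEncoder()
codes = demo_le.fit_transform(labels)

# The position of each label in classes_ is its integer code
print(demo_le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
restored = demo_le.inverse_transform(codes)
print(restored)
```

```inverse_transform``` is useful later for turning model predictions back into ```Yes``` and ```No```.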


## Splitting the dataset into the Training set and Test set

Splitting produces four arrays from ```X``` (```Country```, ```Age```, ```Salary```) and ```y``` (```Purchased```):
1. ```X_train``` = independent variables of Training set
2. ```X_test``` = independent variables of Test set 
3. ```y_train``` = dependent variable of Training set
4. ```y_test``` = dependent variable of Test set

Recommended split size: 80% Training, 20% Test

The optional ```random_state``` seed is fixed at ```1``` to reproduce the output shown in the lesson.

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [23]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [24]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [25]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [26]:
print(y_test)

[0 1]
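
The effect of ```test_size``` and ```random_state``` can be checked on placeholder data. This sketch assumes a ten-row dummy feature matrix (```X_demo```) rather than the lesson's encoded ```X```, and splits it twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)                # 10 rows, 2 dummy features
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])    # labels as printed above

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
# With 10 rows and test_size=0.2, 8 rows go to training and 2 to test
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# The same seed should give the same split on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Leaving ```random_state``` unset shuffles differently on each run, so printed results would no longer match from one execution to the next.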


## Feature Scaling

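Feature scaling puts features on a comparable scale so that one with a large range (e.g. ```Salary```) does not dominate another (e.g. ```Age```).\
Fit the scaling parameters on the training set only, then apply them to the test set.\
Examples of applications: multiple linear regression, polynomial linear regression\
Always apply after the dataset split, so that the test set does not influence the scaling parameters.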

**Standardization**: Works in all cases\
Most values fall roughly within -3 <= range <= 3 \
$$ x_{\text{stand}} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)} $$

**Normalization**: Recommended if there is a normal distribution\
0 <= range <= 1 \
$$ x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)} $$
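
Both formulas can be written directly in NumPy and compared with scikit-learn's scalers. This is a sketch on a few illustrative ages; ```MinMaxScaler``` is not used elsewhere in this lesson and is brought in here on the assumption that, with its default settings, it corresponds to the normalization formula above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])  # illustrative values, one feature column

# Standardization: subtract the column mean, divide by the column standard deviation.
# NumPy's std defaults to the population form (ddof=0), which StandardScaler also uses.
x_stand = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# Normalization: shift by the column minimum, divide by the column range.
x_norm = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

stand_matches = np.allclose(x_stand, StandardScaler().fit_transform(ages))
norm_matches = np.allclose(x_norm, MinMaxScaler().fit_transform(ages))
print(stand_matches, norm_matches)
```

The normalized column always spans exactly 0 to 1 on the data it was fitted on, while the standardized column is centred on 0 with unit standard deviation.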

In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # fit on and scale only the Age and Salary columns of X_train; the one-hot columns stay as 0/1
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # reuse the training mean and standard deviation on X_test (no refitting)

In [28]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [29]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
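
The scaled output can be verified by checking that the training columns end up with mean 0 and standard deviation 1, and that the test rows are scaled with the training statistics. The sketch below copies the ```Age``` and ```Salary``` values from the ```X_train``` and ```X_test``` printed before scaling; ```demo_sc```, ```train_num``` and ```test_num``` are names introduced for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns copied from the unscaled X_train and X_test printed earlier.
train_num = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778], [44.0, 72000.0],
    [38.0, 61000.0], [27.0, 48000.0], [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test_num = np.array([[30.0, 54000.0], [37.0, 67000.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(train_num)  # learns mean and std from training rows
test_scaled = demo_sc.transform(test_num)        # reuses those training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)

# inverse_transform undoes the scaling and recovers the original units
recovered = demo_sc.inverse_transform(test_scaled)
print(recovered)
```

The test columns are not expected to have mean 0 and standard deviation 1 themselves, because they are scaled with statistics learned from the training rows only.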
