# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np # scientific computing library - perform fast operations on arrays
import matplotlib.pyplot as plt # plotting library
import pandas as pd # data analysis library

## Importing the dataset

In [3]:
# Load the CSV file into a pandas DataFrame
dataframe = pd.read_csv('Data.csv')
print(dataframe)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


### About Dependent Variables
The dependent variable is what you want to predict. In the dataset above, the dependent variable is the `Purchased` column; the other columns are features. The dependent variable is usually the last column in the dataset.

We always separate our dependent variable from the features.



In [58]:
# Separate the dependent variable from the matrix of features.

# The matrix of features.
# iloc[rows, columns]: ":" means a range. No lower or upper bound selects everything along that axis.
# ":-1" retrieves all columns except the last one; .values returns only the values as a NumPy array.
x = dataframe.iloc[:, :-1].values

# The dependent variable vector.
y = dataframe.iloc[:, -1].values
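
To make the slicing concrete, here is a minimal sketch on a cut-down frame. The name `df_demo` and its three rows are assumptions used only for illustration; they are a trimmed copy of the first rows shown above, not the full `Data.csv`.

```python
import pandas as pd

# Illustrative frame: the first three rows of the dataset, without Salary.
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one -> 2D matrix of features.
features = df_demo.iloc[:, :-1].values
# All rows, only the last column -> 1D dependent variable vector.
target = df_demo.iloc[:, -1].values

print(features.shape)  # (3, 2)
print(target.shape)    # (3,)
```

The features come back as a 2D array and the dependent variable as a 1D vector, which is the shape most scikit-learn estimators expect.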

## Taking care of missing data


Scikit-learn is a machine learning library with data preprocessing tools.

We can replace each missing value with the mean of its column.

In [59]:
# Count the 0 values in each column (uncomment to run)
# print((dataframe == 0).sum())

# Replace empty values with an average value
from sklearn.impute import SimpleImputer
# Create a SimpleImputer instance
# Specify the missing values you want to replace (np.nan, "not a number")
# Specify the strategy (mean)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit() computes the mean of each selected column.
# transform() replaces the missing values with those means.
# Select only the numerical columns:
# start at col1 and end at col3, not including col3.
imputer.fit(x[:, 1:3])
# transform() returns a new matrix with the replacement values,
# which we assign back to the same columns.
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [60]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
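
As a sanity check, the imputed Age value can be reproduced by hand. The sketch below copies the Age column from the dataset printed earlier and compares `np.nanmean` with what `SimpleImputer` fills in; the variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age column from the dataset above; np.nan marks the missing entry (row 6).
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float).reshape(-1, 1)

# Mean of the nine known ages, ignoring the missing one.
manual_mean = np.nanmean(age)

# SimpleImputer with strategy='mean' should fill the gap with the same value.
imputed = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(age)

print(manual_mean)
print(imputed[6, 0])
```

Both printed values should be approximately 38.78, matching the Age shown for row 6 in the output above.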


## Encoding categorical data

For a machine learning algorithm to learn relationships between the features and the dependent variable, we must turn the strings into numbers. However, simply numbering the countries 0, 1, 2 would suggest an order between them that does not exist, and we don't want the algorithm to interpret that order as meaningful. How do we solve this?

### One-hot encoding

One-hot encoding is a powerful technique for handling categorical data, but it can increase dimensionality and sparsity and, with them, the risk of overfitting. It creates one new column per category, so the country feature becomes 3 columns, one for each country.

One-hot encoding represents each row as a binary vector: a 1 in the column of its category and 0s in the others.
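
The idea can be sketched with plain NumPy, independent of scikit-learn. This is only an illustration of the technique, not how `OneHotEncoder` is implemented, and the four sample countries are chosen just for the example.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the binary vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

No row is "larger" than another in this representation, so no artificial order between the countries is introduced.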

### Encoding the Independent Variable

In [61]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# [0] is the index of the column to encode (Country).
# remainder='passthrough' keeps the columns not affected by OneHotEncoder.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# fit and transform at the same time
# force output to be a numpy array
x = np.array(ct.fit_transform(x))

In [62]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
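
To see which of the three new columns belongs to which country, the fitted encoder can be inspected. The sketch below rebuilds a small transformer on an illustrative array (`x_demo` and `ct_demo` are assumed names, not part of the notebook) and reads the `categories_` attribute of the fitted `OneHotEncoder`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative feature matrix: Country and Age for three rows of the dataset.
x_demo = np.array([["France", 44.0], ["Spain", 27.0], ["Germany", 30.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
ct_demo.fit(x_demo)

# named_transformers_ gives access to the fitted encoder by the name chosen above.
fitted_encoder = ct_demo.named_transformers_["encoder"]
country_order = list(fitted_encoder.categories_[0])

print(country_order)  # ['France', 'Germany', 'Spain']
```

The categories are sorted, which is consistent with the output above: France is the first dummy column, Germany the second, and Spain the third.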


### Encoding the Dependent Variable

In [63]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder maps each distinct label to an integer; with two labels, (no,yes) => (0,1).
# These labels are thereby converted into a binary outcome.
le = LabelEncoder()

# Unlike the matrix of features expected by machine learning models,
# the dependent variable is a single vector and does not need to be a 2D array.

y = le.fit_transform(y)

In [64]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
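
The mapping chosen by the encoder can be checked, and reversed, after fitting. A minimal sketch on the first five labels of the dataset follows; `le_demo` and the other names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the original labels; the position of each label is its integer code.
print(list(le_demo.classes_))  # ['No', 'Yes']
print(list(encoded))           # [0, 1, 0, 0, 1]

# inverse_transform maps the integers back to the original labels.
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))           # ['No', 'Yes', 'No', 'No', 'Yes']
```

Being able to invert the encoding is handy later, when predictions come back as 0s and 1s and need to be reported as `No` and `Yes`.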


## Splitting the dataset into the Training set and Test set

In [65]:
# import the train_test_split function
from sklearn.model_selection import train_test_split
# train_test_split returns 4 arrays: X_train, X_test, y_train, y_test
# x = matrix of features
# y = dependent variable
# test_size=0.2 holds out 20% of the rows for testing
# random_state fixes the seed so that consecutive runs split the data the same way (not required, only for education purposes)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, random_state=1)

In [66]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [67]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [68]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [69]:
print(y_test)

[0 1]
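
The effect of `test_size` and `random_state` can be seen on a toy array. In this sketch `x_demo` and `y_demo` are made-up placeholders with ten rows, like the dataset; only the shapes and the reproducibility of the split are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten illustrative rows with two features each, plus a matching target vector.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state should produce the same split.
first = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

With ten rows and `test_size=0.2`, eight rows go to training and two to testing, which matches the sizes of the printed arrays above.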


## Feature Scaling

Feature scaling puts all features on a comparable scale so that a feature with large values (such as Salary) does not dominate one with small values (such as Age).

### Do's and Don'ts

---
- __Do__ feature scaling after splitting to __prevent information leakage__. You're not supposed to work with the test dataset when training the model. 
---
- __Don't__ apply feature scaling on dummy variables.
  - Doing so would destroy the integrity of the data, making it harder to interpret. For example, applying standardization would change the 0s and 1s into arbitrary positive and negative decimals. It would then be hard to know which category is active for a given row.

- __Don't__ include the test dataset when training the model.
  - You are not supposed to use the test dataset when training the model. The test dataset should appear as new data to the model during our tests. 
  
- __Don't__ apply feature scaling before splitting data.
  - Feature scaling computes the mean and standard deviation of each feature. If we apply feature scaling before the split, it will compute the mean and the standard deviation of all the values, including the ones in the test set. This would create information leakage. 
---
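
The rule "fit on the training set, then reuse those statistics on the test set" can be sketched by hand. The example below uses the Age values from the `X_train` and `X_test` printed earlier and assumes standardization, z = (x - mean) / std, with the population standard deviation; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of X_train and X_test as printed above (349 / 9 is the imputed 38.78).
train_age = np.array([349 / 9, 40, 44, 38, 27, 48, 50, 35], dtype=float).reshape(-1, 1)
test_age = np.array([30, 37], dtype=float).reshape(-1, 1)

# Statistics come from the training data only.
mu = train_age.mean()
sigma = train_age.std()  # population standard deviation (ddof=0)

manual_train = (train_age - mu) / sigma
manual_test = (test_age - mu) / sigma  # reuse the training mean and std

# The same two steps with StandardScaler: fit on train, transform both.
sc_demo = StandardScaler()
sk_train = sc_demo.fit_transform(train_age)
sk_test = sc_demo.transform(test_age)

print(np.allclose(manual_train, sk_train))  # True
print(np.allclose(manual_test, sk_test))    # True
```

Because the test rows never contribute to `mu` or `sigma`, nothing about them leaks into the transformation applied during training.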

In [70]:
from sklearn.preprocessing import StandardScaler
# no parameters required
sc = StandardScaler()

# Don't apply standardization to the dummy variables: you would lose the meaning
# of the data. Keep the interpretability of the model.

# fit() computes the mean and standard deviation of each of the features.
# transform() applies the formula to the features so everything ends up on the same scale.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Use only sc.transform(...) on the test dataset!
# Calling fit again would compute a new mean and standard deviation from the test data.
# X_test will be the input of the predict function, so to make predictions that are
# consistent with the way the model was trained, we need to apply the scaler
# fitted on the training set to the test dataset.
# This is the only way to get relevant predictions.
X_test[:, 3:] = sc.transform(X_test[:, 3:])


In [71]:
print(X_train)

# Age and Salary are transformed.
# These values are now on the same scale, which helps the training of certain machine learning models.

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [72]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
