# Data Preprocessing

## Importing the libraries

In [17]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler

## Importing the dataset

In [18]:
dataset = pd.read_csv('Data.csv')
# iloc[rows, columns] selects rows and columns by integer position
X = dataset.iloc[:, :-1].values # Matrix of features: every column except the last (the slice's upper bound of -1 is excluded)
y = dataset.iloc[:, -1].values # Dependent variable vector: the last column

In [19]:
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
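
Before handling the gaps, it can help to count them. The following is a minimal sketch that rebuilds the ten printed rows inline so it runs without `Data.csv`; the column names `Country`, `Age`, `Salary` and `Purchased` are assumptions made for illustration and may not match the real file.

```python
import numpy as np
import pandas as pd

# Inline copy of the rows printed above; column names are assumed, not read from Data.csv.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```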


## Taking care of missing data
- One viable way to deal with missing data is to delete the affected rows.
- Another option is to replace each missing value with the mean of its column.

Notice the missing **salary** value in the 5th entry, and the missing value in the second column of the 7th entry

In [20]:
from sklearn.impute import SimpleImputer
# SimpleImputer(missing_values, strategy)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # Replace each missing value with the mean of its own column
imputer.fit(X[:, 1:3])   # Fit on the numerical columns only (indexes 1 and 2)
X[:, 1:3] = imputer.transform(X[:, 1:3])     # Fill the missing values with the column means and write the result back to the same columns
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
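
To compare the two options listed at the top of this section, here is a small sketch using pandas only. The numeric values are copied from the printed data, the column names `Age` and `Salary` are assumed for illustration, and `fillna(df.mean())` is used as a stand-in for the mean strategy rather than as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the printed data; column names are assumed.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: delete every row that has a missing value
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(len(df), len(dropped))   # rows before and after dropping
print(filled.loc[[4, 6]])      # the two rows that had gaps
```

Dropping rows discards the observed values in those rows as well, which matters on a dataset this small.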


## Encoding categorical data
This involves turning categorical strings into numerical values. We want to avoid an encoding that would lead the model to believe there is an order or numerical relationship between the categories. One solution involves One Hot Encoding.

**One Hot Encoding** <br>
Replace the categorical column with a number of columns equal to the number of unique categories. Each row has a 1 in the column for its category and a 0 in the others.

In [21]:
# One Hot Encoding  - features

# ColumnTransformer(
#     transformers=[(nameOfTheStep, transformerObject, [columnsToApplyTo])],
#     remainder = what to do with the columns that are not transformed
#     )
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') # 'passthrough' leaves the other columns as is
X = np.array(ct.fit_transform(X)) # Fit and transform the data in one step
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
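
To see which new column stands for which category, the fitted encoder can be inspected. This is a minimal sketch on a standalone four-row array (not the notebook's `X`); it relies on `OneHotEncoder` storing its categories in sorted order in the `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A small standalone column of country names, used for illustration only
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for printing
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # column order of the one-hot output
print(encoded)
```

The column order here (France, Germany, Spain) matches the first three columns of the output printed above.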


In [22]:
# Label Encoding  - dependent variable
le = LabelEncoder()
y = le.fit_transform(y) # Returns a NumPy array of integer labels, so no np.array() wrapper is needed

print(y) # Binary outcome corresponds to the final column of the data ('No' -> 0, 'Yes' -> 1)

[0 1 0 0 1 1 0 1 0 1]
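
The mapping between the original labels and the integers can be checked on the fitted encoder. A short sketch on a standalone label array, using the `classes_` attribute and `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A few labels copied from the dependent variable, for illustration
labels = np.array(["No", "Yes", "No", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # position in this array is the integer assigned to each label
print(encoded)

# Map the integers back to the original strings
decoded = le.inverse_transform(encoded)
print(decoded)
```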


## Feature Scaling
Feature scaling puts all features on a comparable scale.
Some models require it to ensure that features with naturally higher numerical values do not dominate the prediction.
- Feature scaling is not required in linear regression, because the coefficients compensate for the scale of each feature

### Standardisation Feature Scaling
This will generally place most values in the range \[-3:+3], although the range is not strictly bounded <br>
$x_{stand} = \frac{x-\overline{x}}{\sigma(x)}$

### Normalisation Feature Scaling
This will place feature values in range \[0:1] <br>
$x_{norm} = \frac{x-\min(x)}{\max(x)-\min(x)}$

There generally isn't a large difference in accuracy between standardisation and normalisation
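
As a worked check of the two formulas, the sketch below applies them by hand with NumPy to a small illustrative column and compares the results with scikit-learn's `StandardScaler` and `MinMaxScaler`. The sample values are arbitrary; note that `StandardScaler` uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative feature column (values chosen arbitrarily)
ages = np.array([[27.0], [30.0], [35.0], [44.0], [50.0]])

# Standardisation: subtract the mean, divide by the standard deviation
standardised = (ages - ages.mean()) / ages.std()

# Normalisation: subtract the minimum, divide by the range
normalised = (ages - ages.min()) / (ages.max() - ages.min())

# Both hand-written formulas should agree with the library scalers
print(np.allclose(standardised, StandardScaler().fit_transform(ages)))
print(np.allclose(normalised, MinMaxScaler().fit_transform(ages)))
```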


In [23]:
# Feature scaling
sc = StandardScaler()
X = sc.fit_transform(X)

In [24]:
print(X) # All the values are now in the same range

[[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01  7.58874362e-01
   7.49473254e-01]
 [-8.16496581e-01 -6.54653671e-01  1.52752523e+00 -1.71150388e+00
  -1.43817841e+00]
 [-8.16496581e-01  1.52752523e+00 -6.54653671e-01 -1.27555478e+00
  -8.91265492e-01]
 [-8.16496581e-01 -6.54653671e-01  1.52752523e+00 -1.13023841e-01
  -2.53200424e-01]
 [-8.16496581e-01  1.52752523e+00 -6.54653671e-01  1.77608893e-01
   6.63219199e-16]
 [ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -5.48972942e-01
  -5.26656882e-01]
 [-8.16496581e-01 -6.54653671e-01  1.52752523e+00  0.00000000e+00
  -1.07356980e+00]
 [ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01  1.34013983e+00
   1.38753832e+00]
 [-8.16496581e-01  1.52752523e+00 -6.54653671e-01  1.63077256e+00
   1.75214693e+00]
 [ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -2.58340208e-01
   2.93712492e-01]]


## Splitting the dataset into the Training set and Testing set
We always keep the testing set separate so that the model is evaluated on data it has not seen during training. A model that overfits the training set is bad at predicting new scenarios, and a held-out testing set is how we detect that.

In [26]:
# train_test_split(X matrix, y vector, test_size=fraction of the data to put in the test set, random_state=seed for a reproducible split (optional))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
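
A common variation, shown here as an assumption rather than as this notebook's method, is to split first and then fit the scaler on the training rows only, so that no statistics from the test set influence the scaling. The sketch also leaves the one-hot columns unscaled. The array below is an inline copy of the encoded and imputed data printed earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Columns 0-2: one-hot country columns; columns 3-4: the two numeric features.
# 349 / 9 and 574000 / 9 are the column means that were imputed earlier.
X = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
    [0.0, 1.0, 0.0, 40.0, 574000 / 9],
    [1.0, 0.0, 0.0, 35.0, 58000.0],
    [0.0, 0.0, 1.0, 349 / 9, 52000.0],
    [1.0, 0.0, 0.0, 48.0, 79000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
    [1.0, 0.0, 0.0, 37.0, 67000.0],
])
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the training statistics on the test rows
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train.shape, X_test.shape)
```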