# Data Preprocessing Tools

## Importing the libraries

In [14]:
# import 3 libraries: numpy (work with arrays), matplotlib (plot charts), pandas
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd

## Importing the dataset

In [15]:
# import Data.csv
dataset = pd.read_csv('Data.csv') 
# features (X): the columns used to predict the dependent variable (the first 3 columns in the dataset)
# dependent variable (y): the value to be predicted (the last column in the dataset)

X = dataset.iloc[:, :-1].values
# iloc selects by integer position, in the form [rows, columns]. The first ":" selects all rows.
# -1 in Python refers to the last column, so ":-1" selects every column except the last one (here, the first 3 columns)

y = dataset.iloc[:, -1].values
# only extract last column

In [16]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [17]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
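
The `iloc` slicing above can be tried without `Data.csv`. The following is a minimal sketch that reads three of the rows printed above from an in-memory string. The column names `Country`, `Age` and `Salary` are assumptions made for illustration; only `Purchased` is named in this notebook.

```python
import io
import pandas as pd

# Stand-in for Data.csv: three rows, with assumed column names
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
demo = pd.read_csv(io.StringIO(csv_text))

# all rows, every column except the last -> matrix of features
X_demo = demo.iloc[:, :-1].values
# all rows, only the last column -> dependent variable
y_demo = demo.iloc[:, -1].values

print(X_demo.shape)  # (3, 3)
print(y_demo)        # ['No' 'Yes' 'No']
```

`X_demo` is two-dimensional (rows by feature columns), while `y_demo` is a one-dimensional array with one entry per row.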


## Taking care of missing data

In [18]:
## if some data is missing, let's replace each missing value with the average of the other values in the same column. Otherwise, missing values can lead to errors in our learning model.
from sklearn.impute import SimpleImputer

## create instance of above class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# np.nan means we want to replace only empty values; strategy='mean' means we want to replace empty values with mean (average) value

## apply imputer object to data; use of methods
# 1:3 => select only the numerical columns 1 and 2 (age and salary; the upper bound 3 is excluded), since strings can cause errors
imputer.fit(X[:, 1:3])

## transform method => used to perform replacement
X[:, 1:3] = imputer.transform(X[:, 1:3])

## check our changes
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
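
To check where the two filled-in values come from, the column means can be computed by hand and compared with what the imputer learned. This is a small sketch using the age and salary values printed above. The names `numeric`, `manual_means` and `demo_imputer` are introduced here for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# age and salary columns as printed above (nan marks a missing value)
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# column means computed by hand, skipping the nan entries
manual_means = np.nanmean(numeric, axis=0)

demo_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = demo_imputer.fit_transform(numeric)

print(manual_means)               # means of the 9 known ages and 9 known salaries
print(demo_imputer.statistics_)   # the values the imputer stored during fit
print(np.isnan(filled).any())     # False: no missing values remain
```

The `statistics_` attribute holds one mean per column; `transform` writes that mean into every missing cell of the column.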


## Encoding categorical data

### Encoding the Independent Variable

To turn France, Germany and Spain from the dataset into categories, we will use a vector-based (one-hot) encoding. For example, Germany = 0 1 0.
We don't use Germany=0, Spain=1 and France=2, because the model may then interpret the numbers as an order or priority among the countries.

We will also encode the "Purchased" column, as 0 and 1.

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

## create object of ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])] , remainder='passthrough')
# [0] = index of the column to encode (the first column, country); remainder='passthrough' makes sure the numerical columns (age and salary) are kept unchanged.
X = np.array(ct.fit_transform(X))
# fit_transform fits the encoder and transforms the data in one step
# np.array() makes sure X is a NumPy array, which the later steps expect

print(X)



[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
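
To see which dummy column belongs to which country, the encoder can be run on the country column alone. This is a sketch with a hypothetical four-row `countries` array; `categories_` lists the categories in the order used for the output columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# one column of country names, shaped (n_samples, 1) as the encoder expects
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

demo_encoder = OneHotEncoder()
# the default output is a sparse matrix; toarray() converts it to a dense array
one_hot = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(one_hot)
```

The categories are sorted alphabetically, so France maps to 1 0 0, Germany to 0 1 0 and Spain to 0 0 1, which matches the first three columns of the output above.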


### Encoding the Dependent Variable

In [20]:
# Encoding for Purchased column => use of LabelEncoder

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# convert 'No' and 'Yes' in the column into the integers 0 and 1
y = le.fit_transform(y)

In [21]:
print(y)

# 0=no, 1=yes

[0 1 0 0 1 1 0 1 0 1]
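
The mapping between labels and integers can be read back from the fitted encoder. This is a short sketch on a hypothetical five-item `labels` list, using `classes_` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['No', 'Yes', 'No', 'No', 'Yes']

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(labels)
decoded = demo_le.inverse_transform(encoded)

print(demo_le.classes_)  # ['No' 'Yes']: the position in this array is the integer code
print(encoded)           # [0 1 0 0 1]
print(decoded)           # back to the original strings
```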


## Splitting the dataset into the Training set and Test set

Question: Should we split the dataset BEFORE feature scaling, or AFTER?

Answer: BEFORE

Why: The test set is supposed to be a brand new set on which you evaluate your ML model. If we apply feature scaling before the split, the mean and standard deviation are computed from all values, including the ones in the test set.
Splitting first prevents this information leakage from the test set.

We will make 4 sets (2 for training, 2 for test):
X_train = matrix of features of the training set
X_test = matrix of features of the test set
y_train = dependent variable of the training set (column 'Purchased')
y_test = dependent variable of the test set (column 'Purchased')

=> Recommended data split: 80% in training set, 20% in test set (using test_size = 0.2 below)
=> By setting random_state to a fixed number (in this case, 1), the split will always be the same every time you run the code, ensuring reproducibility.

In [22]:
# use of function 'train_test_split'

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [23]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [24]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [25]:
print(y_train)
# 8 purchased decisions. These will correspond to same 8 customers in X_train

[0 1 0 0 1 1 0 1]


In [26]:

print(y_test)
# 2 purchased decisions. These will correspond to same 2 customers in X_test

[0 1]
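
The effect of `test_size` and `random_state` can be checked on a toy array. This is a minimal sketch: `X_demo` and `y_demo` are made-up data with 10 rows, built so that row `i` of `X_demo` is `[2*i, 2*i + 1]` and can be matched back to `y_demo`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y_demo = np.arange(10)                 # y value i belongs to row i

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# the same random_state gives exactly the same split on every run
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True

# rows and labels stay aligned after shuffling
print(np.array_equal(X_tr[:, 0], 2 * y_tr))  # True
```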


## Feature Scaling

Feature scaling is a preprocessing step in machine learning where the range of features (input variables) is normalized to ensure they have comparable magnitudes. 

Main goal: to transform all values of a feature into a limited range, so that features are easy to compare.

----
Standardisation:
$ X_{stand} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)} $

Subtract the mean of the feature from each value of the feature, then divide by the standard deviation of the feature.

----------

Normalisation:
$ X_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} $

Subtract the minimum value of the feature from each value of the feature, then divide by (maximum value of the feature - minimum value of the feature).
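
Both formulas can be written out with NumPy and compared against scikit-learn's `StandardScaler` and `MinMaxScaler`. This is a sketch on a small, made-up `ages` column; it assumes the default settings of both scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# standardisation: (x - mean) / standard deviation
standardised = (ages - ages.mean()) / ages.std()

# normalisation: (x - min) / (max - min)
normalised = (ages - ages.min()) / (ages.max() - ages.min())

# the scalers apply the same formulas column by column
print(np.allclose(standardised, StandardScaler().fit_transform(ages)))  # True
print(np.allclose(normalised, MinMaxScaler().fit_transform(ages)))      # True

print(normalised.min(), normalised.max())  # 0.0 1.0
```

Normalised values always fall between 0 and 1. Standardised values have mean 0 and standard deviation 1, so they are not bounded by a fixed range.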

----

Question: Which one to use from above 2?

Answer: 

=> Normalisation is recommended when most of your features follow a normal distribution.

=> Standardisation works well in most cases, so it is the safer default.

We will not apply feature scaling to the country columns, since their values (0 or 1) are already within the range of -3 to +3 (e.g. 1.0 0.0 0.0 for France).
Age and salary, however, should be scaled.


In [35]:
# use of "StandardScaler"

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

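# fit our standardisation tool on the training set only. Skip the first 3 columns (country dummy variables) and scale the last 2 (age and salary).
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# reuse the mean and standard deviation learned from the training set: transform only, do not fit again on the test set
X_test[:, 3:] = sc.transform(X_test[:, 3:])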


In [36]:
print(X_train)

# in the output, age and salary lie between -3 and +3. The country columns are unchanged.

[[0.0 0.0 1.0 -0.19159184384578554 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057874 -0.07013167641635415]
 [1.0 0.0 0.0 0.566708506533324 0.6335624327104545]
 [0.0 0.0 1.0 -0.3045301939022487 -0.30786617274297906]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757337]]


In [38]:
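print(X_test)
# note: the values printed below are the same unscaled values as in In [24]; they do not reflect the scaling step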

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
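
A common convention is to fit the scaler on the training set only and then reuse its mean and standard deviation for the test set. The sketch below uses the age and salary values printed for `X_train` and `X_test` in the splitting section; the names `train_num`, `test_num` and `demo_sc` are introduced here for illustration. It also shows what refitting on the two test rows would produce instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age and salary columns of X_train and X_test as printed after the split
train_num = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778],
    [44.0, 72000.0], [38.0, 61000.0], [27.0, 48000.0],
    [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test_num = np.array([[30.0, 54000.0], [37.0, 67000.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(train_num)  # learns mean and std from the training rows
test_scaled = demo_sc.transform(test_num)        # reuses the training mean and std

# refitting on the test rows uses only those 2 rows to compute mean and std
refit_scaled = StandardScaler().fit_transform(test_num)

print(test_scaled)   # test rows expressed on the training set's scale
print(refit_scaled)  # [[-1. -1.] [ 1.  1.]]: the link to the training scale is lost
```

With only two rows, refitting always maps the smaller value of each column to -1 and the larger to +1, whatever the original numbers were. That is why `transform` rather than `fit_transform` is the usual choice for the test set.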
