# Data Preprocessing Tools

## Importing the libraries

In [67]:
import numpy as np # Working with arrays
import matplotlib.pyplot as plt # Plotting graphs
import pandas as pd # Importing and managing datasets

## Importing the dataset

In [68]:
dataset = pd.read_csv('Data.csv') # Dataframe

# iloc -> integer-location based indexing
# .values -> returns the selection as a NumPy array
X = dataset.iloc[:, :-1].values # Independent variables (the features used to predict the dependent variable)
y = dataset.iloc[:, -1].values # Dependent variable vector (the variable to predict)

# Tip: keep the dependent variable in the last column so that this slicing works
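
To make the slicing concrete, here is a minimal sketch on a three-row frame. The frame and its column names are assumptions for illustration only; they are not read from `Data.csv`.

```python
import pandas as pd

# Hypothetical frame: features first, target in the last column.
# The column names are illustrative assumptions.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # every row, every column except the last
target = toy.iloc[:, -1].values     # every row, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```

`features` is a 2-D array (a matrix) while `target` is a 1-D vector, which is the layout scikit-learn estimators expect.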

In [69]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [70]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [71]:
# Method 1: Remove the rows with missing data
# Method 2: Replace the missing data with the mean of the column

from sklearn.impute import SimpleImputer # scikit-learn
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # missing_values -> marker used for missing entries; strategy -> 'mean', 'median', 'most_frequent' or 'constant'
imputer.fit(X[:, 1:3]) # Fit the imputer on the numerical columns (indices 1 and 2); this computes their means
X[:, 1:3] = imputer.transform(X[:, 1:3]) # Replace the missing values with the column means (only these 2 columns are imputed, so the same slice appears on both sides)

In [72]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
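
For comparison, Method 1 (removing incomplete rows) could look like the sketch below. It uses pandas `dropna` on a small hypothetical frame, not on the notebook's `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of gaps as the dataset above.
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 52000.0, np.nan, 61000.0],
})

complete_rows = toy.dropna()  # drop every row that contains at least one NaN

print(len(toy), "->", len(complete_rows))  # 4 -> 2
```

On the ten-row dataset used here, dropping the two incomplete rows would discard 20% of the data, which is one reason to prefer imputation.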


## Encoding categorical data

### Encoding the Independent Variable

In [73]:
# One-hot encoding -> convert each category into a 0/1 indicator vector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') # transformers -> list of (name, transformer object, column indices); remainder='passthrough' keeps the other columns unchanged
X = np.array(ct.fit_transform(X)) # Run the encoder and wrap the result in a NumPy array

In [74]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
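
The order of the three indicator columns follows the sorted category values. A minimal sketch on a standalone country column (the array `countries` is an illustrative assumption) shows how to inspect that order:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input with the same three countries.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories in sorted order: France, Germany, Spain
print(one_hot)
```

This is why 'Spain' appears as `0 0 1` and 'Germany' as `0 1 0` in the printed matrix above.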


### Encoding the Dependent Variable

In [75]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y) # Convert 'Yes' to 1 and 'No' to 0

In [76]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
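
To check which integer was assigned to which label, the fitted encoder can be inspected. The sketch below uses a short hypothetical label list rather than the notebook's `y`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes"]  # hypothetical labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                    # position in this array is the assigned integer
print(encoded)
print(encoder.inverse_transform(encoded))  # map the integers back to the original strings
```

Classes are sorted, so 'No' maps to 0 and 'Yes' maps to 1.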


## Splitting the dataset into the Training set and Test set

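What is the purpose of splitting the dataset into a training set and a test set? The test set is held out so that the model can be evaluated on data it has not seen during training. This is how overfitting is detected: a situation where an ML model fits its training data closely but fails to generalise to new data.

Why split before feature scaling? The test set should stand in for new, unseen data, so it must not influence anything computed during training. Scaling before the split would let the test rows contribute to the mean and standard deviation, leaking information into training.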

In [77]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1) # random_state -> seed for the shuffle, so the split is reproducible

In [78]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [79]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [80]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [81]:
print(y_test)

[0 1]
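
The effect of `test_size` and `random_state` can be checked on toy data. This is a minimal sketch; `X_toy` and `y_toy` are made-up arrays, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_toy = np.arange(10)

first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same seed yields the same partition on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

With ten samples and `test_size=0.2`, two samples go to the test set, which matches the 8/2 split printed above.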


## Feature Scaling

Standardisation, $\frac{x - \text{mean}(x)}{\text{std}(x)}$, puts most values roughly in the range [-3, 3]. Normalisation, $\frac{x - \min(x)}{\max(x) - \min(x)}$, puts all values in the range [0, 1]. Feature scaling is applied column by column.
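
Both formulas can be written directly in NumPy. The sketch below applies them to a handful of ages taken from the dataset; the array name `ages` is an assumption for illustration.

```python
import numpy as np

ages = np.array([27.0, 30.0, 35.0, 38.0, 40.0])  # a few ages from the dataset

# Standardisation: subtract the mean, divide by the (population) standard deviation.
standardised = (ages - ages.mean()) / ages.std()

# Normalisation: subtract the minimum, divide by the range.
normalised = (ages - ages.min()) / (ages.max() - ages.min())

print(standardised.round(3))  # zero mean, unit standard deviation
print(normalised.round(3))    # smallest value maps to 0, largest to 1
```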

As a rule of thumb, normalisation is recommended when most features follow a normal distribution, while standardisation works well for all kinds of data.

Remember: don't apply standardisation to dummy variables (e.g. the 0/1 columns produced by encoding categorical data), only to numerical values.

Remember to scale the test set with the mean and standard deviation computed on the training set!
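
A minimal sketch of that rule, using made-up `train` and `test` arrays: the scaler is fitted on the training values only, and the test values are transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [50.0]])  # hypothetical training ages
test = np.array([[30.0], [37.0]])                   # hypothetical test ages

scaler = StandardScaler().fit(train)   # learns the mean and standard deviation from train only
scaled_test = scaler.transform(test)   # applies the training statistics to the test rows

# Same computation by hand, using the training mean and standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(scaled_test, manual))  # True
```

Calling `fit_transform` on the test set instead would recompute the statistics from the test rows, and the two sets would no longer be on the same scale.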

In [82]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # fit -> compute the mean and standard deviation; transform -> apply the scaling
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # transform only -> reuse the mean and standard deviation from the training set

In [83]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [84]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
