# Data Preprocessing Tools

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset
So far, the only way I know to read the data is with the full path.

Google Colab can collaborate with GitHub, but the problem is loading data files into the notebooks.

- features (also called independent variables) - the data contained in the columns we predict from
- dependent variable - the value we want to predict

In [2]:
dataset = pd.read_csv('c:/Users/to068616/Disk Google/Colab Notebooks/UDEMY - 1 - data_preprocessing_tools/Data.csv')
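
A possible way to avoid hard-coding the full path is to build it with `pathlib`. The sketch below is only an illustration: the temporary folder and the column names `Age`, `Salary` and `Purchased` are assumptions (the notes only name the Country column), and the two-row file is a stand-in for the real `Data.csv`.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical folder standing in for wherever Data.csv actually lives.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "Data.csv"

# Write a tiny stand-in file so the sketch runs anywhere.
# The header names are assumed, not taken from the real file.
csv_path.write_text(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
)

# pd.read_csv accepts a Path object as well as a plain string.
dataset = pd.read_csv(csv_path)
print(dataset.shape)
```

With the real file, only `data_dir` would need to change; the rest of the notebook can keep using `dataset` as before.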

In [3]:
x = dataset.iloc[:, :-1].values  # take all rows and all columns but the last
y = dataset.iloc[:, -1].values  # take all rows and only the last column

In [4]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [5]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data
A few methods:
- if less than 1% of the data is missing, we can ignore (drop) those rows
- we can replace the missing values (for example with the mean, median, or most frequent value)
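
To decide between the two options, it may help to measure how much is actually missing first. This is a minimal sketch on a made-up toy frame (the values and column names are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing age and one missing salary.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Fraction of missing values per column (isna gives booleans, mean averages them).
missing_fraction = df.isna().mean()
print(missing_fraction)

# Option 1: drop every row that has at least one missing value.
df_complete = df.dropna()
print(len(df), "->", len(df_complete))
```

On such a small frame dropping rows loses half the data, which is why replacing the values is usually the better choice for small datasets.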

OK, so here comes the first tool I had never used before: **SimpleImputer** from the **scikit-learn** library. SimpleImputer finds all missing values (what counts as missing is set by the *missing_values* argument) and replaces them with a value chosen by the *strategy* argument.

In [7]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
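
The other strategies work the same way. Here is a small sketch comparing `mean`, `median` and `most_frequent` on a made-up column (the numbers are chosen only so the three results are easy to check by hand):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (made-up values) with one missing entry at row index 3.
col = np.array([[1.0], [2.0], [2.0], [np.nan], [9.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the statistic and fills the gap in one call.
    filled[strategy] = float(imputer.fit_transform(col)[3, 0])

print(filled)
```

The mean is pulled up by the outlier 9, while the median and the most frequent value are not, which is one reason to look at the data before picking a strategy.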

In [8]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data
We want to avoid replacing categorical data (like 'Male' and 'Female') with plain numbers, because learning models can read a numerical order into them and then find a relationship that does not really exist.

*Example: Male would be 0 and Female would be 1, so the model could treat Male as less than Female.*

### Handling independent variable
First, let's handle the Country column. It is a column with categorical data, so we create so-called **dummy variables**.

**Dummy variable trap** - the dummy variables created from one column are perfectly correlated (each row has exactly one 1, so any one of them can be derived from the others), so we should omit one of them in a model because it is redundant. But the models we will use usually avoid this trap.

*Example - two dummy variables Male/Female. Anyone who is not Male is Female, so one column is enough.*
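
If a model does need the redundant column removed, one option is the `drop` argument of `OneHotEncoder`. This is a sketch on a made-up Country column, not something the notebook itself does:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column input; the encoder expects a 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# drop="first" removes the column of the first category (alphabetically
# 'France'), so France is represented by zeros in both remaining columns.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(countries).toarray()
print(encoded)
```

The result has two columns (Germany, Spain) instead of three, and no information is lost.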

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x)) # we need to transform to np array

print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
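
To see which dummy column belongs to which country, the fitted encoder can be inspected. The sketch below repeats the transformation on the first three rows of the dataset only; `x_demo` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the dataset above, as an object array.
x_demo = np.array([["France", 44.0, 72000.0],
                   ["Spain", 27.0, 48000.0],
                   ["Germany", 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
x_encoded = np.array(ct.fit_transform(x_demo))

# Categories are sorted alphabetically; their order is the order
# of the dummy columns at the start of the encoded array.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
```

So the first column stands for France, the second for Germany and the third for Spain, which matches the printed output above.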


### Handling dependent variable
Second, let's handle the dependent variable (y).

In [17]:
from sklearn.preprocessing import LabelEncoder # we can use label encoding when we have binary data (two categories)
le = LabelEncoder()
y = le.fit_transform(y)

print(y)

[0 1 0 0 1 1 0 1 0 1]
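
The encoder remembers the mapping, so the numbers can be turned back into labels later. A short sketch on a made-up label array (`y_demo` is an illustrative name):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_demo = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)

# classes_ lists the original labels; the index of each label is its code.
print(le.classes_)

# inverse_transform maps the codes back to the original labels.
y_back = le.inverse_transform(y_encoded)
print(y_back)
```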


## Splitting the dataset into the Training set and Test set
Recommended sizes:
- 80% train size
- 20% test size


In [18]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=1)
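
A quick sketch of what `test_size` and `random_state` do, on made-up arrays of 10 rows (all names here are illustrative): the split is 8/2, features and labels stay aligned, and the same `random_state` gives the same split again.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of x_demo is [2*i, 2*i + 1] and its label is i, so alignment is easy to check.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)
print(len(a_train), len(a_test))

# Repeating the call with the same random_state reproduces the split.
again = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)
same_split = np.array_equal(a_test, again[1])
print(same_split)
```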

In [19]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [20]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [21]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [22]:
print(y_test)

[0 1]


## Feature Scaling
This has to be done after splitting the dataset, because the scaler must be fitted on the Training set only; otherwise information from the Test set leaks into training.
And we do this because in some models (not all of them; regressions, for example, don't need it) we don't want one feature to dominate the others.

*Example: an age of 27 is much smaller than a salary (or whatever) of 83000.*

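There are two techniques:
**Standardisation** (subtract the mean and divide by the standard deviation; works in general) and **Normalisation** (subtract the minimum and divide by the range, giving values between 0 and 1; recommended when the features follow a normal distribution, which they often but not always do).
We will use **Standardisation** here.
We don't apply it to the **dummy variables**.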
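
To make the two techniques concrete, here is a sketch that compares scikit-learn's `StandardScaler` and `MinMaxScaler` with the formulas written out by hand. The ages are made-up values, and `MinMaxScaler` is used only for comparison; the notebook itself uses standardisation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature column.
ages = np.array([[27.0], [35.0], [44.0], [50.0]])

# Standardisation: (x - mean) / standard deviation.
standardised = StandardScaler().fit_transform(ages)
manual_std = (ages - ages.mean()) / ages.std()  # np.std defaults to ddof=0, as StandardScaler does

# Normalisation: (x - min) / (max - min), which lands in [0, 1].
normalised = MinMaxScaler().fit_transform(ages)
manual_norm = (ages - ages.min()) / (ages.max() - ages.min())

print(np.allclose(standardised, manual_std), np.allclose(normalised, manual_norm))
```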

In [23]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])  # we reuse the same scaler; we must not fit a new one on the test set
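
Why only `transform` on the test set: the scaler keeps the mean and standard deviation it learned from the training rows and applies them to anything it sees later. A minimal sketch with made-up numbers (`train` and `test` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[60.0]])

sc = StandardScaler()
sc.fit(train)

# mean_ and scale_ were computed from the training rows only.
print(sc.mean_, sc.scale_)

# The test row is scaled with the training statistics, not its own.
test_scaled = sc.transform(test)
expected = (60.0 - train.mean()) / train.std()
print(np.isclose(test_scaled[0, 0], expected))
```

Calling `fit_transform` on the test set instead would compute new statistics from the test rows, and the two sets would no longer be on the same scale.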

In [25]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [27]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
