# Data Preprocessing Tools

## Importing the libraries

We need to import three libraries for data preprocessing.

1.numpy-:NumPy (Numerical Python) is a library for working with arrays.

2.pandas-:pandas is an open-source library built on top of NumPy. It is a Python package that provides data structures and operations for manipulating tabular and numerical data. We also use it to import the dataset.

3.matplotlib-:Matplotlib is a Python library for 2D plotting.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

The first line calls the pandas function "read_csv", which takes the path of the dataset as its argument.

X is known as the design matrix: the matrix of input features. It is formed by taking every column except the last, because the last column is the output feature.

y contains the last column of our dataset.

In [0]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, last column only
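
The slicing above can be tried without Data.csv. The following is a minimal sketch on a small hand-made DataFrame; the column names and values are assumptions chosen for illustration, not the real file contents.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv; column names are assumed.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, last column only

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```

Because the input columns mix strings and numbers, ".values" returns an array of generic Python objects, which is why the printed design matrix shows quoted strings next to floats.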

This is what the design matrix X looks like.

In [0]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


This is what the output feature (the last column) looks like.

In [0]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In the design matrix X we can see some fields that contain "nan", which means the value is missing. Before applying a machine learning model to the dataset we have to fill these missing fields. We can do that in different ways.

I-:fill with the mean value of the column.

II-:fill with the median value of the column.

III-:fill with the most frequent value in the column.

Here I am going to use the 'mean' strategy.


In [0]:
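from sklearn.impute import SimpleImputer
# Replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 are numeric; column 0 holds the country names
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])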

After filling the missing fields with the column means, our design matrix looks like this:

In [0]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
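
For comparison, the three filling strategies listed earlier can be sketched on a toy column. The values below are made up so that the mean, median and most frequent value all differ; they are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing entry (illustrative values only).
ages = np.array([[20.0], [20.0], [30.0], [50.0], [np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer_demo = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer_demo.fit_transform(ages)
    results[strategy] = float(filled[-1, 0])  # value written into the gap

print(results)  # {'mean': 30.0, 'median': 25.0, 'most_frequent': 20.0}
```

The median and most frequent value are less affected by outliers than the mean, so the best choice depends on how the column is distributed.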


## Encoding categorical data

### Encoding the Independent Variable

Our dataset may contain features (columns) that hold categorical data. In our dataset the country column and the purchased column are categorical. We have to encode them as numbers. There are three ways to do that.

I-:Integer Coding ("Where each unique label is mapped to an integer")

II-:One Hot Encoding ("Where each label is mapped to a binary vector")

III-:Learned Embedding ("Where each label is mapped to a vector learned during training")

Here I am using One Hot Encoding for the country column.

In [0]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (country); pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encode', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

After applying One Hot Encoding, the country column is replaced by three binary columns, one per country, so the design matrix now has five columns.

In [0]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
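
To make the difference between Integer Coding and One Hot Encoding concrete, here is a small sketch on a standalone country column. Using "OrdinalEncoder" for the integer coding is my choice for illustration; the notebook itself only uses "OneHotEncoder".

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Integer coding: categories are sorted alphabetically, then numbered.
integer_codes = OrdinalEncoder().fit_transform(countries)
print(integer_codes.ravel())  # [0. 2. 1. 2.]

# One-hot encoding: one column per category, in the same alphabetical order
# (France, Germany, Spain). The result is sparse, so convert it to print it.
one_hot = OneHotEncoder().fit_transform(countries).toarray()
print(one_hot)
```

Integer codes imply an order (Spain > Germany > France) that does not exist for countries, which is why One Hot Encoding is the safer choice for this column.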


### Encoding the Dependent Variable

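As with the design matrix, we have to encode the output feature if it contains categorical data. Because it is a label with only two classes ('No' and 'Yes'), we use Label Encoding (integer coding) here instead of One Hot Encoding.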

In [0]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Our new output feature after Label Encoding.

In [0]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
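
The mapping that "LabelEncoder" learned can be inspected and reversed. A minimal sketch on a short, made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes"]  # illustrative labels only

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_)  # ['No' 'Yes'] -> 'No' becomes 0, 'Yes' becomes 1
print(encoded)           # [0 1 0 1]

# inverse_transform maps the integers back to the original labels
decoded = le_demo.inverse_transform(encoded)
print(decoded)           # ['No' 'Yes' 'No' 'Yes']
```

This is useful later, when model predictions come back as 0/1 and need to be reported as 'No'/'Yes'.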


## Splitting the dataset into the Training set and Test set

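We have to divide our dataset into two parts. The first part is called the training set and is used to train our model. The second part is called the test set and is used to check how accurately the model predicts the output. With test_size=0.2, 20% of the rows (2 of our 10) go to the test set, and random_state=1 makes the split reproducible.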

In [0]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

This prints the training set.

In [0]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [0]:
print(y_train)

[0 1 0 0 1 1 0 1]


This prints the test set.

In [0]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [0]:
print(y_test)

[0 1]
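
The behaviour of "train_test_split" can be checked on synthetic data. In this sketch the arrays are made up so that each row of X is easy to match to its label; it only illustrates the split sizes, the row alignment and the effect of "random_state".

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i of X_demo is [2*i, 2*i + 1], and its label is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Rows and labels are shuffled together, so they stay aligned.
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True

# The same random_state gives the same split on every run.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(y_te, y_te_again))  # True
```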


## Feature Scaling

Feature scaling is an important step of data preprocessing. It puts all features on the same scale. With standardization most values end up roughly between -3 and +3; with normalization they lie between 0 and 1.

In our dataset some fields have small values and some have large values. If we apply a machine learning model without feature scaling, the features with large values can dominate the features with small values, which hurts the predictions of many models.
So before applying a model we have to perform feature scaling.

We can perform feature scaling in two ways.

I-:Standardization
    x_scaled = (x - mean(X)) / standard_deviation(X)

II-:Normalization
    x_scaled = (x - min(X)) / (max(X) - min(X))
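
Both formulas can be written directly in NumPy. The sketch below applies them to a made-up salary column; the numbers are illustrative only.

```python
import numpy as np

salary = np.array([48000.0, 52000.0, 61000.0, 79000.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (salary - salary.mean()) / salary.std()

# Normalization: rescale so the minimum maps to 0 and the maximum to 1.
normalized = (salary - salary.min()) / (salary.max() - salary.min())

print(standardized)
print(normalized.min(), normalized.max())  # 0.0 1.0
```

After standardization the column has mean 0 and standard deviation 1. Note that "np.std" divides by n by default (the population standard deviation), which is also what scikit-learn's "StandardScaler" uses.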

In [0]:
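from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric columns (index 3 onward); the one-hot columns stay 0/1
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the mean and standard deviation learned from the training set
X_test[:, 3:] = sc.transform(X_test[:, 3:])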

In [0]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [0]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
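
Note that the scaler is fitted on the training set only and then reused on the test set, so the test rows are scaled with the training mean and standard deviation. A minimal sketch with made-up one-column data shows that "transform" does exactly that:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only, not the notebook's actual split.
train = np.array([[27.0], [38.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

sc_demo = StandardScaler()
sc_demo.fit(train)  # learns mean_ and scale_ from the training data only

# Applying the training statistics by hand gives the same result as transform().
manual = (test - sc_demo.mean_) / sc_demo.scale_
print(np.allclose(manual, sc_demo.transform(test)))  # True
```

Keeping the test set out of the fit means the model is evaluated on data whose statistics it has never seen.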
