Data preprocessing is one of the most critical steps to complete before you even start thinking about or using any machine learning model. Good data preprocessing can greatly improve the performance of a model. On the other hand, if the data is not prepared properly, the result of any model may be just "garbage in, garbage out".

Below are the typical steps to process a dataset:
1. Load the dataset to get a sense of the data
2. Take care of missing data (optional)
3. Encode categorical data (optional)
4. Split the dataset into a training set and a test set (validation set)
5. Scale the features

Thanks to a number of powerful libraries, today we can implement the steps above very easily with Python.
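As a rough preview, the five steps listed above can be sketched end to end. This is a minimal sketch using only pandas and NumPy as a stand-in, not the scikit-learn tools used later. The toy data, the variable names (`df`, `num_cols`, `features`, `labels`), the one-third test split, and the choice of standardization as the scaling method are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Step 1: load the data (here a hypothetical toy frame instead of a CSV file)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan, 58000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Step 2: fill missing numeric values with the mean of their column
num_cols = ["Age", "Salary"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Step 3: one-hot encode the categorical feature and map the label to 0/1
features = pd.get_dummies(
    df[["Country", "Age", "Salary"]], columns=["Country"], dtype=float
)
labels = df["Purchased"].map({"No": 0, "Yes": 1})

# Step 4: shuffle the row positions and hold out a third of them for testing
rng = np.random.default_rng(0)
order = rng.permutation(len(df))
n_test = len(df) // 3
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, X_test = features.iloc[train_idx].copy(), features.iloc[test_idx].copy()
y_train, y_test = labels.iloc[train_idx], labels.iloc[test_idx]

# Step 5: standardize the numeric columns with training-set statistics only
mu = X_train[num_cols].mean()
sigma = X_train[num_cols].std(ddof=0)
X_train[num_cols] = (X_train[num_cols] - mu) / sigma
X_test[num_cols] = (X_test[num_cols] - mu) / sigma

print(X_train)
print(X_test)
```

The scaling statistics are computed on the training rows and then reused for the test rows, so the test set does not influence the scaling of the training set.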

## Import the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

- **[numpy](http://www.numpy.org/)** is a popular library for scientific computing. Here we will mainly use its N-dimensional array object. It also has very useful linear algebra, Fourier transform, and random number capabilities.
- **[matplotlib](https://matplotlib.org/)** is a Python 2D plotting library that can help us visualize the dataset.
- **[pandas](https://pandas.pydata.org/)** provides easy-to-use data structures and data analysis tools for Python. We use it to load and separate datasets.
- **[sklearn](http://scikit-learn.org/stable/)** is another library we will use later. It is a very powerful tool for data analysis. Because it is so comprehensive, we will introduce its tools individually as we use them.

## Import the dataset

In [5]:
# read a CSV file with pandas
dataset = pd.read_csv('Data.csv')
# print out the loaded dataset
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [6]:
# separate the dataset into X and y
# X holds the independent variables: columns 'Country', 'Age' and 'Salary'
X = dataset.iloc[:, :-1].values
# y holds the dependent variable: column 'Purchased'
y = dataset.iloc[:, -1].values

In [9]:
# value of X
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [10]:
# value of y
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
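The `iloc` slicing used to build `X` and `y` can be illustrated on a smaller frame. The sketch below assumes a hypothetical three-row DataFrame with the same column layout as `Data.csv`; the names `df`, `features` and `target` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as Data.csv
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1]  # all rows, every column except the last
target = df.iloc[:, -1]     # all rows, only the last column

X = features.values  # 2-D array: one row per sample, one column per feature
y = target.values    # 1-D array: one label per sample

print(list(features.columns))
print(X.shape, y.shape)
```

Selecting by position rather than by name works as long as the label is the last column of the file.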


## Taking care of missing data

If you look closely, there are two missing values in the dataset: the age of customer 6 and the salary of customer 4. Most of the time we need to fill in the missing values to make the model work. There are three main ways to do it: using the 'mean', the 'median' or the 'most frequent' value of the column. Here I will use the mean of each column.

In [11]:
# Import the imputer class from sklearn
# (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
# Instantiate the SimpleImputer class
# missing_values: the placeholder for the missing value, here np.nan
# strategy: 'mean', 'median', 'most_frequent' or 'constant'
# SimpleImputer always imputes along columns, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on columns Age and Salary, then replace the missing entries
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])


In [12]:
# value of X
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Here we used sklearn's imputer to take care of the missing data. As the code shows, the library makes this very easy: the missing Age and Salary values are filled with the mean of their respective columns.
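For comparison, the three strategies mentioned earlier (mean, median, most frequent) can be tried on the Age column alone. This is a minimal sketch that uses pandas `fillna` as a stand-in rather than the scikit-learn imputer; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Age column from the dataset above, with the missing entry at index 6
age = pd.Series([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])

filled_mean = age.fillna(age.mean())
filled_median = age.fillna(age.median())
# mode() returns every tied value in sorted order; take the first (smallest)
filled_mode = age.fillna(age.mode().iloc[0])

print(filled_mean[6])    # mean of the nine known ages
print(filled_median[6])  # middle value of the nine known ages
print(filled_mode[6])    # smallest of the tied most frequent values
```

Because every known age in this column is unique, the "most frequent" strategy has no clear winner here, which is one reason the mean or the median is usually a better fit for continuous values like these.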

Normalization: $$x' = \frac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$$
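The normalization formula can be written directly with NumPy. The sketch below is one possible reading of it; the function name `mean_normalize`, the sample values, and the handling of a constant column (returning zeros) are assumptions that do not come from the formula itself.

```python
import numpy as np

def mean_normalize(x):
    """Apply x' = (x - mean(x)) / (max(x) - min(x)) to a 1-D array."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Assumption: a constant column has no spread, so return zeros
        # instead of dividing by zero
        return np.zeros_like(x)
    return (x - x.mean()) / value_range

ages = np.array([27.0, 30.0, 35.0, 38.0, 50.0])
scaled = mean_normalize(ages)
print(scaled)
```

After this transformation the values are centered on zero and the difference between the largest and smallest value is exactly 1.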

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(20)
y = x**2

plt.plot(x, y)