# Data Pre-processing

Machine learning and artificial intelligence depend heavily on the quantity and quality of the data you have, because this data is used to train and evaluate your models.

Thus, it is important to ensure that your data is 'clean', i.e. there are no missing values and every value is of the correct type.

Python makes this task easy thanks to its wide variety of libraries. For this stage of ML development, the most important libraries are NumPy, Pandas and scikit-learn (imported as sklearn).

First, we import the libraries and read in the file.

In [1]:
import numpy as np
import pandas as pd

myData = pd.read_csv("Data.csv")
print(myData)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
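
Before going further, it can help to confirm the two 'clean' criteria (no missing values, correct types) programmatically rather than by eye. The sketch below is only an illustration: it rebuilds the table printed above inline so that it runs without Data.csv, and the names `df`, `missing_per_column` and `column_types` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same values as the table printed above, rebuilt inline (no Data.csv needed)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

missing_per_column = df.isna().sum()   # number of NaN cells in each column
column_types = df.dtypes               # data type Pandas inferred for each column

print(missing_per_column)
print(column_types)
```

A non-zero count in `missing_per_column` points to a column that needs attention, and an unexpected entry in `column_types` (for example, a numeric column read in as text) points to values of the wrong type.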


Now we must split the data frame into independent and dependent variables (the information given and what we are trying to predict). In this scenario, given the country, age and salary, we want to predict whether a consumer purchases a product.

Thus, whether the consumer PURCHASED something is dependent on the COUNTRY, AGE, and SALARY. 

Using the Pandas .iloc indexer (it is used with square brackets rather than called like a function), we can separate the original data frame into a feature set and a target set.

In [5]:
# iloc[rows, cols]; a slice start:end excludes end, so :-1 selects every column except the last
xSet = myData.iloc[:, :-1].values       # features (Country, Age, Salary) as a NumPy array
ySet = myData.iloc[:, -1]               # target (Purchased, the last column) as a Pandas Series
print(ySet)

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object
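
The difference between the two selections above is easy to miss: adding `.values` returns a NumPy array, while leaving it off returns a Pandas Series that keeps its index and name. The following is a minimal sketch on a three-row excerpt of the data; `df`, `features`, `target` and `target_array` are names chosen for this example only.

```python
import numpy as np
import pandas as pd

# Three-row excerpt of the data above, enough to show the slicing behaviour
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values      # all rows, every column except the last -> NumPy array
target = df.iloc[:, -1]                # all rows, last column only -> Pandas Series
target_array = df.iloc[:, -1].values   # same column, converted to a NumPy array

print(type(features).__name__, features.shape)
print(type(target).__name__, target.shape)
print(type(target_array).__name__, target_array.shape)
```

Either form can work for the target; converting it with `.values` simply keeps both sets in the same container type.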


As you can see, some values are missing from the Age and Salary columns, denoted by NaN (Not a Number).

There are many ways of rectifying this, but one common way is to fill each missing cell with the mean of the remaining values in its column.
Using scikit-learn, we can apply this fix to the numeric columns of xSet.

In [7]:
from sklearn.impute import SimpleImputer
# scan for missing values (NaN) and replace them using the "mean" strategy
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(xSet[:, 1:3])                           # learn the means of columns 1 and 2 (Age, Salary); 1:3 excludes index 3
xSet[:, 1:3] = imputer.transform(xSet[:, 1:3])      # write the filled-in columns back into xSet
print(xSet)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
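
To see where the filled-in numbers come from, the fitted imputer can be compared against column means computed directly with NumPy. This is a small sketch on a made-up four-row array (`ages_salaries` is a hypothetical name, not part of the notebook above); `fit_transform` is used here as a shorthand for calling `fit` and then `transform` on the same data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data: column 0 is Age, column 1 is Salary
ages_salaries = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],
    [np.nan, 52000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(ages_salaries)         # fit and transform in one step

learned_means = imputer.statistics_                   # per-column means the imputer learned
direct_means = np.nanmean(ages_salaries, axis=0)      # same means computed while ignoring NaN

print(learned_means)
print(direct_means)
print(filled)
```

The two sets of means should agree, and each NaN in `ages_salaries` is replaced in `filled` by the mean of its own column.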
