# Data Pre-processing

Machine learning and artificial intelligence depend heavily on the quantity and quality of the data you have, because that data is used to train and evaluate your models.

Thus, it is important to ensure that your data is 'clean', i.e. there are no missing values and every value is of the correct type.

Python makes this task easy thanks to its wide variety of libraries. For this stage of ML development, the most important libraries are NumPy, pandas and scikit-learn (imported as sklearn).

First we will import the libraries and read in the file. 

In [2]:
import numpy as np
import pandas as pd

myData = pd.read_csv("Data.csv")
print(myData)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
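
Before fixing anything, it helps to measure how 'clean' the data is. The sketch below is a minimal example on a small hypothetical frame (the name `sample` and its three rows are made up for illustration; they are not read from Data.csv). It counts the missing cells in each column and lists each column's type.

```python
import numpy as np
import pandas as pd

# A small hypothetical frame with the same columns as Data.csv
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
    "Purchased": ["No", "Yes", "No"],
})

missing_per_column = sample.isna().sum()   # count of NaN cells in each column
column_types = sample.dtypes               # dtype of each column

print(missing_per_column)
print(column_types)
```

Running the same two calls on myData should report one missing value each for Age and Salary.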


Now we must split the data frame into independent and dependent variables (the information given and what we are trying to predict). In this scenario, given the country, age and salary, we want to predict whether a consumer purchases a product.

Thus, whether the consumer PURCHASED something is dependent on the COUNTRY, AGE, and SALARY.

Using the .iloc indexer of pandas, we can separate the original DataFrame into two NumPy arrays: one holding the independent variables and one holding the dependent variable.

In [3]:
xSet = myData.iloc[:, :-1].values       # iloc[rows, cols]: all rows, every column except the last (a slice's end is exclusive)
ySet = myData.iloc[:, -1].values        # all rows, last column only
print(ySet)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
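
To see what the two slices select, here is a small sketch on a hypothetical two-row frame (the names `df`, `features` and `target` are illustrative only). The first slice keeps every column except the last, and the second keeps only the last.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = df.iloc[:, :-1].values   # every row, every column except the last
target = df.iloc[:, -1].values      # every row, last column only

print(features.shape)  # (2, 3)
print(target.shape)    # (2,)
```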


As you can see, some values are missing from the Age and Salary columns, denoted by NaN (Not a Number).

There are many ways of rectifying this, but one common way is to fill each empty cell with the average of the other values in its column.
Using scikit-learn, we can apply this fix to the xSet array.

In [4]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")         # scan for missing values (NaN) and use the "mean" replacement strategy
imputer.fit(xSet[:, 1:3])                                               # learn the means of columns 1 and 2 (Age and Salary)
xSet[:, 1:3] = imputer.transform(xSet[:, 1:3])                          # replace the NaNs and write the result back into xSet
print(xSet)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


So in plain English:

First we import the SimpleImputer class from the sklearn.impute module and initialize it with two settings: what kind of missing values it should look for and how it should replace them. In this case we want to find values that are Not a Number (np.nan) and use the "mean" strategy (fill each missing value with the average of all the other values in the same column).

Then we fit the imputer on the part of the set we want to fix, in this case the numerical columns of xSet (Age and Salary). We then update xSet using the transform() method of the SimpleImputer class, which takes the same rows and columns as an argument and returns them with the gaps filled.
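
As a sanity check, the values the imputer fills in can be compared with means computed by hand. The sketch below copies the Age and Salary columns from the table above into a hypothetical array called `numeric` and compares NumPy's NaN-skipping mean with what the imputer learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the table above, with the two gaps as NaN
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

column_means = np.nanmean(numeric, axis=0)   # per-column mean, skipping NaN

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)      # fit and transform in one call

print(column_means)          # mean Age, mean Salary
print(imputer.statistics_)   # the means the imputer learned
```

Both lines should print the same pair of numbers, matching the 38.777... and 63777.777... that appear in the imputed output above.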

### Encoding The Categorical Data

Machine learning is all about math, which we can't do on categorical values such as country names (in this case). 

Instead, we can use one-hot encoding to assign a binary vector to each unique entry in the country column, or in any categorical column for that matter.

Each vector has one position per category, with a single 1 marking the category and 0's everywhere else. We could instead number the 3 distinct countries [1, 2, 3], but that would imply an order between them that does not exist.

One-hot vectors: France = [1, 0, 0], Germany = [0, 1, 0], Spain = [0, 0, 1]
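
One-hot encoding can also be written out by hand, which makes the idea concrete. This is only an illustrative sketch in plain NumPy (the `countries` array is made up); it builds one column per distinct country and puts a single 1 in each row.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

categories = np.unique(countries)   # sorted unique values: France, Germany, Spain
# Compare every entry against every category; True becomes 1.0, False 0.0
one_hot = (countries[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Every row sums to 1, and no category is numerically "larger" than another.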

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columnTransformer = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = "passthrough")

Here we initialize a columnTransformer object of the ColumnTransformer class by giving it 2 arguments: transformers and remainder.

transformers takes a list of tuples, each holding a name for the step, an instance of the transformer class we want to apply, and the columns we want to transform. remainder says what to do with the columns that are not listed: "passthrough" leaves them untouched, whereas the default, "drop", would discard them.
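
The layout of the result is easier to see on a tiny input. The sketch below uses a hypothetical three-row array called `rows` with the same column order as xSet; the encoded columns come first, followed by the passed-through columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

rows = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # (name, transformer, columns)
    remainder="passthrough",                           # keep Age and Salary as they are
)
encoded = np.array(ct.fit_transform(rows))

print(encoded.shape)   # (3, 5)
print(encoded)
```

Three one-hot columns (France, Germany, Spain, in sorted order) plus Age and Salary give five columns in total.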

In [7]:
xSet = np.array(columnTransformer.fit_transform(xSet))

# apply fit_transform and cast the output to a NumPy array,
# which the later steps expect

print(xSet)

[[0.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 0.0 37.0 67000.0]]


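Note that a single pass of the encoder over this data yields five columns: three one-hot columns for Country, followed by Age and Salary. The six-column output above is what you get when fit_transform is applied a second time to the already-encoded array, so run this cell only once. The column indices used later in this section assume the five-column layout.

Notice that when encoding a field with more than two unique values (N > 2), we must use one-hot encoding; it is not enough to assign a state of 0 or 1.

But for labels such as yes or no, you CAN just represent them with 0's and 1's, and that is what we will do for the dependent variable, ySet.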

In [8]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()    # initializing an object of the LabelEncoder class
ySet = encoder.fit_transform(ySet)

print(ySet)

[0 1 0 0 1 1 0 1 0 1]


Note that LabelEncoder should only be used to encode the dependent variable (y); the scikit-learn documentation says the same.
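
The mapping that LabelEncoder learns can be inspected and reversed. A minimal sketch, using a hypothetical `labels` array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)        # classes are sorted, so "No" -> 0, "Yes" -> 1
decoded = le.inverse_transform(encoded)   # map the integers back to the strings

print(le.classes_)
print(encoded)
print(decoded)
```

Keeping the fitted encoder around is useful later, when a model's 0/1 predictions need to be turned back into 'No'/'Yes'.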

### Creating the Training and Test set

Now we must create two separate sets, one to train the actual model on and another to evaluate model performance on. 

We can use a scikit-learn function to split the initial data set into the two desired sets.

In [9]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(xSet, ySet, test_size=0.2, random_state=1)

We see that train_test_split() returns four arrays and, here, takes 4 arguments.

We must of course pass in the independent and dependent arrays, but we also pass in a split size: test_size=0.2 reserves 20% of the original data for testing and leaves the remaining 80% for training. random_state fixes the seed of the shuffle so that the split is the same on every run.
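
The effect of test_size and random_state can be checked on dummy data. In this sketch `X` and `y` are hypothetical arrays of ten rows; splitting twice with the same seed should give identical results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 hypothetical rows, 2 features
y = np.arange(10)                  # one label per row

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# Same seed, same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)        # (8, 2) (2, 2)
print(np.array_equal(X_test, X_test2))    # True
```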

In [11]:
print(xTrain)

[[1.0 0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 0.0 35.0 58000.0]]


In [12]:
print(yTrain)

[0 1 0 0 1 1 0 1]


### Feature Scaling
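To ensure that some features don't dominate others, it is important to bring all features onto the same scale. Remember that this must always be done AFTER splitting the data into a training and test set, so that the scaler learns its parameters from the training set alone and nothing about the test set leaks into training.

There are two common methods of feature scaling: standardisation and normalisation.

Standardisation subtracts the mean and divides by the standard deviation, so most values end up between roughly -3 and +3. Normalisation (min-max scaling) subtracts the minimum and divides by the range, so values end up between 0 and 1.

Depending on the shape of your data, one method may work better than the other. Despite the name, normalisation has nothing to do with the normal distribution: it suits features with known bounds and few outliers, because a single extreme value squeezes everything else into a narrow band.

Standardisation is a reasonable default in most situations.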
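
Both methods are short formulas, and writing them out by hand shows what the scikit-learn scalers compute. The sketch below uses a hypothetical single-column array `ages` and checks the hand-written versions against StandardScaler and MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])   # one feature column

# Standardisation: subtract the mean, divide by the standard deviation
standardised = (ages - ages.mean(axis=0)) / ages.std(axis=0)
# Normalisation (min-max): map the smallest value to 0 and the largest to 1
normalised = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The scikit-learn scalers give the same numbers
assert np.allclose(standardised, StandardScaler().fit_transform(ages))
assert np.allclose(normalised, MinMaxScaler().fit_transform(ages))

print(standardised.ravel())
print(normalised.ravel())
```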

In [15]:
from sklearn.preprocessing import StandardScaler        # importing the scaler
sc = StandardScaler()                                   # creating an instance of the class
# DO NOT APPLY FEATURE SCALING ON DUMMY VARIABLES

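xTrain[:, 3:] = sc.fit_transform(xTrain[:, 3:])         # fit on the training set only, then scale it
xTest[:, 3:] = sc.transform(xTest[:, 3:])               # reuse the training mean and std; don't refit

For reference, this is the full xSet (all ten rows, before any split) with columns 3 onward standardised. In the six-column array printed earlier, column 3 is still a dummy column, so it is scaled here as well:

array([[0.0, 1.0, 0.0, -0.6546536707079772, 0.7588743615900191,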
        0.7494732544921673],
       [1.0, 0.0, 0.0, 1.5275252316519468, -1.7115038793306814,
        -1.4381784072687536],
       [1.0, 0.0, 1.0, -0.6546536707079772, -1.2755547779917342,
        -0.8912654918285233],
       [1.0, 0.0, 0.0, 1.5275252316519468, -0.11302384108787522,
        -0.25320042381492147],
       [1.0, 0.0, 1.0, -0.6546536707079772, 0.17760889313808953,
        2.357833340847479e-16],
       [0.0, 1.0, 0.0, -0.6546536707079772, -0.5489729424268224,
        -0.5266568815350365],
       [1.0, 0.0, 0.0, 1.5275252316519468, 8.881784197001253e-17,
        -1.0735697969752667],
       [0.0, 1.0, 0.0, -0.6546536707079772, 1.3401398300419485,
        1.3875383225057691],
       [1.0, 0.0, 1.0, -0.6546536707079772, 1.6307725642679132,
        1.752146932799256],
       [0.0, 1.0, 0.0, -0.6546536707079772, -0.25834020820085757,
        0.2937124916253087]], dtype=object)
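
Putting the steps together in order, a minimal end-to-end sketch might look like the following. It rebuilds the ten rows printed at the top inline instead of reading Data.csv, uses names such as `data`, `X_train` and `X_test` that are not in the original, and applies the encoder exactly once so that the array has five columns and index 3 onward covers only Age and Salary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# The ten rows printed at the top, rebuilt inline so the sketch is self-contained
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = data.iloc[:, :-1].to_numpy(dtype=object)
y = data.iloc[:, -1].to_numpy()

# 1. Fill missing Age and Salary values with the column means
X[:, 1:3] = SimpleImputer(strategy="mean").fit_transform(X[:, 1:3])

# 2. One-hot encode Country (once): 3 dummy columns, then Age and Salary
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X), dtype=float)

# 3. Encode the target as 0/1
y = LabelEncoder().fit_transform(y)

# 4. Split first, 5. then scale only the numeric columns, fitting on the training rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train.shape, X_test.shape)   # (8, 5) (2, 5)
```

After this runs, the Age and Salary columns of X_train have mean 0 and standard deviation 1, while the three dummy columns still contain only 0's and 1's.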