# Data Preprocessing

Preprocessing your data is critical to ensuring that your machine learning method works properly, and it should be done near the beginning of any data analysis. Consider backpacking in some remote woods for a few days. If you just set out, you'll most likely find that when you are hungry, you'll have to figure out what nature has in her pantry. When you're thirsty, say hello to giardia as you drink water from a nearby stream. Or you could take some time to perform the boring part: preparation. *That's what data preprocessing is to machine learning.*

##### Doing the boring part means we can have more fun later.

Here, we will explore the essentials for preparing any dataset for any machine learning model.

## Libraries

### What is a library?

A library, in Python, is a tool that is used for a specific job. Giving the library inputs allows it to do work so you don't have to. It will then give you the outputs you need in order to complete your task. For this project we will rely on three primary libraries that will help us prepare our data for making efficient machine learning models.

The three libraries are:
+ Numpy: includes mathematical tools we'll need
+ Pandas: used for importing and managing data sets
+ Matplotlib: includes useful tools for plotting data; we'll primarily use a sub-library, called pyplot

*Let's begin by importing these libraries:*

In [1]:
# import libraries
import numpy as np  # the "as np" aliases numpy so that we can call it by using "np"
import pandas as pd
import matplotlib.pyplot as plt

# plots our figures inline
%matplotlib inline

## The Data

Of course we cannot process any data, let alone answer any data science question, without data. To import data, we should know in what directory our data are located. If you have cloned this repository, then the data are found in the *data/* directory. Once we know where our data are located, we will use the pandas library to import them into our workspace. It is best to save the data as a variable so we may access the data more efficiently as we move through any analysis or model building.

In [2]:
# import data set as a pandas dataframe: data_df
data_df = pd.read_csv('data/Data.csv')

The above line of code saves our data in a dataframe object called *data_df*. It is highly recommended that you look at your data immediately after importing it. We will look at the first and last few rows of the dataframe, and its shape to start with.

In [3]:
# look at the first rows of the dataframe
data_df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [4]:
# look at the last rows of the dataframe
data_df.tail()

Unnamed: 0,Country,Age,Salary,Purchased
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
# look at the shape of the dataframe (rows = observations, columns = fields)
data_df.shape

(10, 4)

There are, of course, a number of other things to do with your data, all more or less important for performing your initial exploratory data analysis (EDA), but our focus here is on preprocessing the data for ingesting directly into a machine learning method.
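
Two quick checks are still worth running before any preprocessing: the column types and the number of missing values per column. The sketch below is only illustrative. It rebuilds a stand-in for `data_df` from the rows printed above (an assumption, so that the block runs without the CSV file); in the notebook you would call the same methods on the frame returned by `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Stand-in for data_df, rebuilt from the rows printed above (an assumption;
# in the notebook you would use the frame returned by pd.read_csv).
data_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Column types: Country and Purchased hold text, Age and Salary hold floats
print(data_df.dtypes)

# Number of missing values in each column
missing_counts = data_df.isna().sum()
print(missing_counts)
```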

### Differentiate between features and the response

For this example, we want to process data for a machine learning model that will help us predict whether or not a customer will purchase a product. The dependent variable, or response, $y$, is whether or not the customer will purchase. This is reflected in the last column of our data set. Our machine learning model will base its decision on the independent variables, or features, $x_{n}$, in the data set. The features are the customer's *country*, their *age*, and their *salary*.

As we can see from the above peeks into the data, $x_{1}=Country$, $x_{2}=Age$, and $x_{3}=Salary$, while $y=Purchased$. We will want to separate these into a feature matrix $\textbf{X}$ and a response vector $\textbf{y}$.

In [6]:
# take all rows of the feature columns (every column but the last): X
X = data_df.iloc[:, :-1].values

# take all rows of the response column (the last column): y
y = data_df.iloc[:, -1].values

# view the separated data
print('X:', '\n', X, '\n', 'y:', '\n', y)

X: 
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]] 
 y: 
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Preparing The Data

Data, in real life, almost always has issues with it. One common issue is missing data. Python libraries such as pandas will often show missing data as 'NaN'. What can we do with this missing data? There are some options available, but we should always know why to use a given option. For example, we see, above, that we have some missing data in our dataframe. We could just remove the entire observation since there is something missing. This can be very dangerous, as removing data can skew your analysis and models. Other options come in the form of imputation. This can mean replacing missing data with values like the mean, or median, of the other values in that field. We won't go into when to use what method here, but *you should always know why you're handling missing data the way you are.*

With our dataset here, we have one missing value in each of the Age and Salary fields. As these are numeric, and don't have any immediately eye-catching outliers, we'll just impute the missing values with the mean of the other values in the respective fields. So as to not reinvent the wheel, there is a library that is already able to impute data: Scikit-Learn. We will import this library and impute the values with the mean, as we already said.
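
To make these options concrete, here is a minimal sketch using plain pandas on a hypothetical four-row frame (the frame and the variable names are made up for illustration, and this is not the method used in the rest of this notebook). It drops incomplete observations, then imputes with the mean and with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one missing Age and one missing Salary
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Option 1: drop every observation that has a missing value
dropped = df.dropna()

# Option 2: impute each missing value with its column mean
mean_filled = df.fillna(df.mean())

# Option 3: impute each missing value with its column median
median_filled = df.fillna(df.median())

print(len(df), "rows before,", len(dropped), "rows after dropping")
print(mean_filled)
```

Notice that dropping costs half of this tiny data set, which is exactly why it deserves a second thought.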

In [7]:
# import the Imputer class from the scikit-learn sub-library, preprocessing
from sklearn.preprocessing import Imputer

# create class object
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)  # mean is default strategy, typed for education

# fit the imputer
imputer = imputer.fit(X[:, 1:3])  # fit to Age and Salary columns

# handling missing values in data
X[:, 1:3] = imputer.transform(X[:, 1:3])

Now that the missing values have been imputed, let's look at the data to see if it was done correctly.

In [8]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
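
A note for readers on newer releases: the `Imputer` class was removed from `sklearn.preprocessing` in scikit-learn 0.22, and `SimpleImputer` in `sklearn.impute` takes its place. The sketch below assumes such a release and re-creates only the Age and Salary columns printed above (the array name `age_salary` is made up for this example). It should reproduce the two imputed values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed above, with the two missing entries
age_salary = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# SimpleImputer always works column-wise, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(age_salary)

# Rows 4 and 6 held the missing Salary and Age values, respectively
print(filled[4], filled[6])
```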


## Encoding Categorical Data

Our dataset has two categorical variables: Country and Purchased. Since the values in the fields are categories, they are categorical. Go figure. Machine learning methods are mathematical models and therefore require that the inputs be numerical. This requires us to encode these categorical fields with numbers. We will do this by again calling on Scikit-Learn, but instead of the Imputer class, we use the LabelEncoder class. LabelEncoder will convert our variable for us.

In [9]:
# import LabelEncoder class from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

# create the label encoder object
labelencoder_X = LabelEncoder()

# fit the label encoder object to our data
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])  # fit to Country column and transform the values

Were the values encoded correctly?

In [10]:
print(X)

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
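
If you want to know which number stands for which country, the fitted encoder keeps the mapping in its `classes_` attribute. A small sketch on a made-up list of countries (the variable names here are illustrative, not from the notebook):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["France", "Spain", "Germany", "Spain"])

# classes_ lists the labels in sorted order; a label's position is its code
print(list(encoder.classes_))
print(list(codes))
```

The labels are sorted alphabetically, which is why France gets 0, Germany 1, and Spain 2.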


They were, indeed, encoded, but there is an issue here: since machine learning models are mathematically based, the numbers corresponding to each country carry the same ordering properties as any other numbers. In this case, since 2 > 1 > 0, mathematically, Spain > Germany > France. Disregarding your world view, this is inaccurate since there is no relational value here. We cannot say that Spain has a higher value than France, but machine learning models will! So we will introduce *dummy variables* to retain the categorical properties in the machine learning model.

This means instead of having a single column for Country with $0 = France$, and so forth, we will have *three* columns, one for each individual country (France, Germany, and Spain, respectively), where the values in the columns are '1' if the observation is for that column's country, and '0' otherwise.
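
To picture what those three columns look like, here is a tiny sketch that uses pandas' `get_dummies` on a made-up series of countries. This is a swapped-in tool purely for illustration, not the encoder used in this notebook.

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain"], name="Country")

# One column per country; the columns come out in alphabetical order
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```

Every row has exactly one '1', sitting in the column of that observation's country.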

To create these dummy variables, we will use another class from the preprocessing sub-library, called OneHotEncoder.

In [11]:
# import OneHotEncoder class from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# create OneHotEncoder object
onehotencoder = OneHotEncoder(categorical_features=[0])

# fit to X
X = onehotencoder.fit_transform(X).toarray()

Let's see if this worked.

In [12]:
# convert X to a dataframe just for readability, not for use!
print(pd.DataFrame(X))

     0    1    2          3             4
0  1.0  0.0  0.0  44.000000  72000.000000
1  0.0  0.0  1.0  27.000000  48000.000000
2  0.0  1.0  0.0  30.000000  54000.000000
3  0.0  0.0  1.0  38.000000  61000.000000
4  0.0  1.0  0.0  40.000000  63777.777778
5  1.0  0.0  0.0  35.000000  58000.000000
6  0.0  0.0  1.0  38.777778  52000.000000
7  1.0  0.0  0.0  48.000000  79000.000000
8  0.0  1.0  0.0  50.000000  83000.000000
9  1.0  0.0  0.0  37.000000  67000.000000
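
A note for readers on newer releases: the `categorical_features` argument was removed from `OneHotEncoder` in scikit-learn 0.22. One way to get the same layout there is `ColumnTransformer` from `sklearn.compose`, which can also encode the country strings directly. The sketch below assumes such a release and uses three made-up rows in the same layout as `X`; the names `X_demo` and `ct` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A few rows in the same layout as X: Country, Age, Salary
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X_demo).astype(float)
print(X_encoded)
```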


Perfect! Now our machine learning models will treat this categorical data appropriately. The last categorical field we have, Purchased, is binary: either they purchased or they didn't. This means we will not need One Hot Encoding to create the dummy variables like we did with the Country field. We can just use LabelEncoder and be done.

In [13]:
# create an instance of the LabelEncoder class (make an object)
labelencoder_y = LabelEncoder()

# fit the object to our Purchased column and encode the values
y = labelencoder_y.fit_transform(y)

How's it look?

In [14]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


Awesome sauce! We have encoded our categorical variables! The dummy variable issue is very important when considering multiple linear regression models as well as multiple logistic regression models.

## Splitting Into Training and Testing Sets

Our dataset is all the data we have. We cannot train our machine learning models with the entire data set because we will then have no way of knowing if the model will deliver an accurate response given a previously unknown observation. This means we need to split our data set into two sets: training and testing. The training set will be used, obviously, for training our machine learning models, while the test set (you guessed it!) will be used to test the model and determine its performance.

To do this, we will make use of the model_selection sub-library from Scikit-Learn, which has a function called train_test_split. This function will take our data set and perform the split.

In [15]:
# import the train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# create the train and test sets and define them at the same time, with 20% in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now let's take a look at our train and test sets. Note that we expect, since we had 10 observations, we should have 8 observations in the train set and 2 in the test.
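
That expectation is easy to check on stand-in arrays. The sketch below uses made-up data of the same size as ours (10 observations, 5 feature columns); the names `X_demo`, `y_demo`, and the `_tr`/`_te` suffixes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 observations, 5 features
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 20% of 10 observations is 2, leaving 8 for training
print(X_tr.shape, X_te.shape)
```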

In [16]:
print('X Training Set \n', pd.DataFrame(X_train), ' \n', 'Y Training Set \n', pd.DataFrame(y_train))

X Training Set 
      0    1    2          3             4
0  0.0  1.0  0.0  40.000000  63777.777778
1  1.0  0.0  0.0  37.000000  67000.000000
2  0.0  0.0  1.0  27.000000  48000.000000
3  0.0  0.0  1.0  38.777778  52000.000000
4  1.0  0.0  0.0  48.000000  79000.000000
5  0.0  0.0  1.0  38.000000  61000.000000
6  1.0  0.0  0.0  44.000000  72000.000000
7  1.0  0.0  0.0  35.000000  58000.000000  
 Y Training Set 
    0
0  1
1  1
2  1
3  0
4  1
5  0
6  0
7  1


In [17]:
print('X Test Set \n', pd.DataFrame(X_test), ' \n', 'Y Test Set \n', pd.DataFrame(y_test))

X Test Set 
      0    1    2     3        4
0  0.0  1.0  0.0  30.0  54000.0
1  0.0  1.0  0.0  50.0  83000.0  
 Y Test Set 
    0
0  0
1  0


Yeah, boy! We are almost there! Note that we expect our machine learning model can learn from the training set, and apply what it learned to the test set. We will be able to see how well the model learned the correlations from the training set because we will know the actual outcomes (whether a customer purchased) in the test set. If the model learned the correlations from the training set *too well* without understanding the logic behind those correlations, then we will encounter a situation called *over-fitting* our model, and the model will not make good predictions. We can address this with regularization, but not in this project...

## Feature Scaling

Feature scaling matters for many machine learning methods because features in a dataset are rarely on the same scale. If you look at our data, above, the Salary values are about three orders of magnitude higher than the Age values. Many machine learning methods rely on the Euclidean distance between points:

$
d = \sqrt{(x_{2}-x_{1})^{2}+(y_{2}-y_{1})^{2}}
$,

which means, in the case of Salary and Age, the squared difference in Salary will be about *six* orders of magnitude higher, and will, therefore, dominate the Euclidean distance calculation and push out the Age feature.
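
A quick numeric check makes the point. The sketch below takes the Age and Salary of two customers from the data above and computes how much each feature contributes to the squared distance (the variable names are made up for this example).

```python
import numpy as np

# Two customers from the data above: (Age, Salary)
a = np.array([27.0, 48000.0])
b = np.array([48.0, 79000.0])

squared_diffs = (a - b) ** 2             # per-feature squared differences
distance = np.sqrt(squared_diffs.sum())  # Euclidean distance

# Share of the squared distance contributed by each feature
share = squared_diffs / squared_diffs.sum()
print(distance, share)
```

A 21-year age gap barely registers next to a 31,000 salary gap, so the unscaled distance is, for all practical purposes, just the salary difference.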

So we will need to put our values on the same scale. This can be accomplished in many ways, but often it is done with *standardization* or *normalization*, where:

$
x_{stand}=\frac{x-mean(x)}{S.D.(x)},
$

and

$
x_{norm}=\frac{x-min(x)}{max(x)-min(x)}.
$

These methods will ensure that no feature dominates another.
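
Both formulas are one-liners in NumPy. The sketch below applies them to a made-up column of ages and checks the results against scikit-learn's `StandardScaler` and `MinMaxScaler`. Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default for `.std()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [48.0]])  # one feature, as a column

# Standardization: subtract the mean, divide by the standard deviation
x_stand = (ages - ages.mean()) / ages.std()

# Normalization (min-max): map the smallest value to 0 and the largest to 1
x_norm = (ages - ages.min()) / (ages.max() - ages.min())

# The scikit-learn scalers give the same numbers
assert np.allclose(x_stand, StandardScaler().fit_transform(ages))
assert np.allclose(x_norm, MinMaxScaler().fit_transform(ages))
```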

We can do this with the StandardScaler class from the Scikit-Learn sub-library, preprocessing.

In [18]:
# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# instantiate the object
sc_X = StandardScaler()

# fit and transform the training set
X_train = sc_X.fit_transform(X_train)

# only transform the test set, because the scaler is already fit to the training set
X_test = sc_X.transform(X_test)

##### So, do we need to fit and transform the dummy variables?

Well, if you Google it, some say yes, and others say no. Realistically, it depends. It won't break your model if you don't scale them. If we scale them, everything ends up on the same scale, which is good for predictions, but we lose the easy interpretation of which observations correspond to which Country. If we leave them alone, the 0s and 1s stay readable. Since we have no interpretation to make for this project, we've scaled them.

Let's look.

In [19]:
print('X Train Set\n', X_train, '\n', 'X Test Set\n', X_test)

X Train Set
 [[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]] 
 X Test Set
 [[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]


We can see here that all the values are on the same scale. Success, yes!
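
If you would rather keep the dummy columns as readable 0s and 1s, one option is to scale only the numeric columns. The sketch below is one possible way to do that, assuming a scikit-learn release that includes `ColumnTransformer`; the rows and the names `X_demo` and `scale_numeric` are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical encoded rows: three country dummies, then Age and Salary
X_demo = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])

# Scale only columns 3 and 4; pass the dummy columns through unchanged.
# Note that ColumnTransformer puts the transformed columns first.
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), [3, 4])],
    remainder="passthrough",
)
X_scaled = scale_numeric.fit_transform(X_demo)
print(X_scaled)
```

The output has scaled Age and Salary in the first two columns, followed by the untouched dummy columns.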

##### Should we apply feature scaling to the response variable?

Nope.

Since this is a categorical response variable, the problem is of the *classification* type. If we were dealing with a regression, where the dependent variable takes on a large range of values (i.e., more than just the '0' and '1' presented in this example), we may need to perform feature scaling on *y* as well.
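
For that regression case, a minimal sketch might look like the following. The response values are made up (think of amounts spent), and `sc_y` is a hypothetical name chosen to mirror `sc_X`. The scaler for *y* is kept separate so that predictions can be mapped back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical continuous response, e.g. amounts spent
y_demo = np.array([120.0, 950.0, 430.0, 60.0])

# StandardScaler expects a 2-D array, so reshape y into a single column
sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y_demo.reshape(-1, 1)).ravel()

# Values on the scaled response can be mapped back afterwards
y_back = sc_y.inverse_transform(y_scaled.reshape(-1, 1)).ravel()
print(y_scaled, y_back)
```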

Okey-dokey. Now stand up and do a little dance, because your data has been preprocessed and that high you get from doing machine learning is ready to happen. Yeehaw!