# Data Preprocessing

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv('Data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
X = dataset.iloc[:, :-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [4]:
y = dataset.iloc[:, 3].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## 1. Taking care of missing data
[sklearn impute docs](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy = 'mean')

In [6]:
imputer = imputer.fit(X[:, 1:3])

In [7]:
X[:, 1:3] = imputer.transform(X[:, 1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
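As a quick sanity check on the imputation above, the following sketch recomputes the column means by hand and compares them with what `SimpleImputer` learned. The array `age_salary` is just the Age and Salary columns retyped from the dataset output; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset above (np.nan marks the gaps)
age_salary = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Column means computed while ignoring the missing entries
column_means = np.nanmean(age_salary, axis=0)

# SimpleImputer learns the same means in fit() and fills them in during transform()
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(age_salary)

print(column_means)         # per-column means of the observed values
print(imputer.statistics_)  # the values the imputer substitutes for NaN
```

The two printed arrays should agree, which is why the missing Age and Salary entries above were replaced by 38.78 and 63777.78.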

## 2. Encoding categorical data


Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.

Categorical variables are _nominal_ or _ordinal_.

Some examples include:

 - A “pet” variable with the values: “dog” and “cat”.
 - A “color” variable with the values: “red”, “green” and “blue”.
 - A “place” variable with the values: “first”, “second” and “third”.
 
 
__Nominal categorical variable__: e.g. pet and color. The values have no natural ordering.  
__Ordinal categorical variable__: e.g. place. The “place” variable has a natural ordering of values.

### What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.


### How to Convert Categorical Data to Numerical Data?

1. Integer Encoding / LabelEncoder

As a first step, each unique category value is assigned an integer value.

This is called a label encoding or an integer encoding and is easily reversible.

For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

For example, ordinal variables like the “place” example above would be a good example where a label encoding would be sufficient.

__For ordinal categorical variables, label encoding is sufficient.__
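A minimal sketch of integer encoding for the “place” example, assuming we want to state the category order ourselves. `LabelEncoder` assigns integers in sorted (alphabetical) order, which does not always match the intended ranking, so this sketch swaps in scikit-learn's `OrdinalEncoder` with an explicit `categories` list. The `places` array is made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "place" column; the category order is stated explicitly
places = np.array([['second'], ['first'], ['third'], ['first']])
encoder = OrdinalEncoder(categories=[['first', 'second', 'third']])

codes = encoder.fit_transform(places)        # first -> 0.0, second -> 1.0, third -> 2.0
restored = encoder.inverse_transform(codes)  # the encoding is reversible

print(codes.ravel())
print(restored.ravel())
```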

2. One-Hot Encoding

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

The binary variables are often called “dummy variables” in other fields, such as statistics.
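The “color” example can be sketched with pandas' `get_dummies`, which creates one binary column per category. The small `colors` frame is illustrative data, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical "color" column with three categories
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One binary (dummy) column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(colors, columns=['color'], dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column of its color.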


In [8]:
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,0]

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0], dtype=object)

But this encoding introduces an order among the categories, even though there is no relational order between these countries that would make one worth more than another. So use dummy variables to avoid this issue.

It is often more convenient to go with the pandas `get_dummies` function.

In [9]:
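from sklearn.compose import ColumnTransformer
# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=float)  # dummy columns come first, then Age and Salary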

If you used a LabelEncoder before the OneHotEncoder only to convert the categories to integers, note that recent scikit-learn versions let you apply OneHotEncoder to string categories directly.
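As a sketch of that direct route, the snippet below one-hot encodes the Country strings without an intermediate `LabelEncoder`. The `countries` array is the Country column retyped from the dataset above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column from the dataset above, as a 2-D array of strings
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain'], ['Germany'],
                      ['France'], ['Spain'], ['France'], ['Germany'], ['France']])

encoder = OneHotEncoder()
# The default output is a sparse matrix, so convert it to a dense array for display
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_)  # learned categories, in sorted order
print(one_hot[:3])          # rows for France, Spain, Germany
```

The column order follows the sorted categories (France, Germany, Spain), which matches the dummy columns shown in the output of `X` in this notebook.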


In [10]:
X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

In [11]:
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
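Because the label encoding is reversible, encoded predictions can be mapped back to the original “Yes”/“No” labels with `inverse_transform`. A minimal sketch, where `predictions` is a made-up stand-in for model output:

```python
from sklearn.preprocessing import LabelEncoder

# Purchased column from the dataset above
purchased = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
encoder = LabelEncoder()
encoded = encoder.fit_transform(purchased)

# Hypothetical model predictions, expressed as encoded integers
predictions = [1, 0, 1]
labels = encoder.inverse_transform(predictions)

print(encoder.classes_)  # position in this array is the integer code
print(labels)
```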

## 3. Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [14]:
X_train

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04]])

In [15]:
X_test

array([[0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04]])

In [16]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1])

In [17]:
y_test

array([0, 0])
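The split above can be sketched on stand-in data to show two properties: `test_size = 0.2` on 10 rows yields 8 training rows and 2 test rows, and a fixed `random_state` reproduces the same split on every run. `X_demo` and `y_demo` are illustrative arrays, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 10 rows, like the dataset above (values are illustrative)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# The same random_state reproduces the same split
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```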

## 4. Feature scaling

Observe that the data columns __Age__ and __Salary__ are not on the same scale. This can cause problems in our machine learning model, because many ML models rely on the Euclidean distance between observations.

![](Euclidean-distance.PNG)

Assume Age -> x-axis  
and    Salary -> y-axis

But Salary has a much wider range, so the Euclidean distance will be dominated by Salary. Take two observations (Age, Salary): 1. (48, 79000) and 2. (27, 48000).

Salary: (79000 - 48000)^2 = 961000000  
Age: (48 - 27)^2 = 441  

So observe that the Age difference is dominated by the Salary difference. In the ML equations it would be as if Age doesn't even exist. So it is VERY IMPORTANT to put the variables on the same scale.
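The arithmetic for these two observations can be checked with a few lines of NumPy. This is only a sketch of the distance computation; the variable names are illustrative.

```python
import numpy as np

a = np.array([48.0, 79000.0])  # (Age, Salary) of observation 1
b = np.array([27.0, 48000.0])  # (Age, Salary) of observation 2

squared_diffs = (a - b) ** 2              # squared Age and Salary differences
distance = np.sqrt(squared_diffs.sum())   # Euclidean distance
salary_share = squared_diffs[1] / squared_diffs.sum()

print(squared_diffs)
print(distance)
print(salary_share)  # fraction of the squared distance that comes from Salary
```

Almost the entire squared distance comes from the Salary term, so on the raw scale Age has practically no influence.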
    
Ways of scaling the data:
![](feature_scaling.PNG)


The scaling algorithm puts your data on one scale. This is helpful when the values are vastly spread out, for example across several orders of magnitude. The values of X may look like this:

X = [1, 4, 400, 10000, 100000]

The issue with such a spread is that the largest values dominate everything else. Scaling brings all your features onto one comparable scale, although a linear rescaling does not by itself change the shape (skew) of a distribution.

Here I will be using [sklearn's StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which uses standardisation.
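Assuming the two methods meant here are standardisation and min-max normalisation, both can be written out by hand and compared against scikit-learn's `StandardScaler` and `MinMaxScaler`. This is a sketch on the example values of X from above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# The example values from above, as a single feature column
x = np.array([[1.0], [4.0], [400.0], [10000.0], [100000.0]])

# Standardisation: subtract the mean, divide by the (population) standard deviation
standardised = (x - x.mean(axis=0)) / x.std(axis=0)
# Normalisation (min-max): map the values onto the range [0, 1]
normalised = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

matches_standard = np.allclose(standardised, StandardScaler().fit_transform(x))
matches_minmax = np.allclose(normalised, MinMaxScaler().fit_transform(x))
print(matches_standard, matches_minmax)
```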

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [20]:
X_train = scaler.fit_transform(X_train)
# Overwrite X_train with its scaled version: fit_transform fits the scaler on the training set, then transforms it
# The test set is not fitted; it is only transformed with the statistics learned from the training set

In [21]:
X_test = scaler.transform(X_test)  # transform only: the scaler was already fitted on the training set

##### Do we really need to fit/transform dummy variables, as they are already either 0 or 1 (i.e., effectively scaled)?

Some will say that no, you don't need to scale these dummy variables.

Others say that yes, you should, because it can improve the accuracy of your predictions. In my opinion, it depends on the context.

It depends on how much interpretability you want to keep in your model. If we scale the dummy variables, everything will be on the same scale, which is good for our predictions, but we will lose the ability to read off which observation belongs to which country.

So do as you prefer: it won't break your model if you don't scale the dummy variables, because they are already on roughly the same scale as the scaled features.
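If you prefer to keep the dummy variables interpretable, one option is to scale only the numeric columns and leave the 0/1 columns untouched. A minimal sketch, assuming the same column layout as X (three dummy columns followed by Age and Salary); `X_demo` holds a few illustrative rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows in the same layout as X: three dummy columns, then Age and Salary
X_demo = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])

X_scaled = X_demo.copy()
# Fit and apply the scaler to the numeric columns only; the dummies stay 0/1
X_scaled[:, 3:] = StandardScaler().fit_transform(X_demo[:, 3:])
print(X_scaled)
```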

In [22]:
X_train

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])

Here is X_train. As you can see, all the variables are now on the same scale, with values roughly between -2 and +3.


That can considerably improve your machine learning models.

And even when a model is not based on Euclidean distances, feature scaling can still help, because many optimisation-based algorithms converge much faster on scaled features. Decision trees are the usual counterexample: they are not based on Euclidean distances and split on one feature at a time, so scaling generally does not change their results.

In [26]:
X_test

array([[-1.        ,  2.64575131, -0.77459667, -1.45882927, -0.90166297],
       [-1.        ,  2.64575131, -0.77459667,  1.98496442,  2.13981082]])

Let's have a quick look at X_test.

It is also feature scaled, and the scaling applied to X_test is the same as the scaling applied to X_train, simply because the scaler object was fitted to X_train.
That's why it's important to fit the object to X_train first, so that X_train and X_test are scaled on the same basis.
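The point about scaling on the same basis can be sketched directly: `transform` on the test rows uses the mean and standard deviation learned from the training rows. The arrays below are the Age and Salary values of the training and test rows shown earlier, retyped with the imputed values rounded to two decimals.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the training and test rows shown above (imputed values rounded)
train = np.array([[40.0, 63777.78], [37.0, 67000.0], [27.0, 48000.0], [38.78, 52000.0],
                  [48.0, 79000.0], [38.0, 61000.0], [44.0, 72000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler().fit(train)  # statistics come from the training set only
test_scaled = scaler.transform(test)

# transform() applies the training mean and standard deviation to the test rows
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))
print(test_scaled)
```

Up to rounding, the printed values agree with the last two columns of the scaled X_test above.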


## We need not apply feature scaling to y because this is a classification problem with a categorical dependent variable.

But you will see that for regression, when the dependent variable takes a huge range of values, we may need to apply feature scaling to the dependent variable y as well.
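For the regression case, a minimal sketch of scaling a target and mapping values back to the original units afterwards. `y_reg` is a hypothetical regression target, not part of this dataset's classification task.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression target with a wide range (e.g. salaries)
y_reg = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

y_scaler = StandardScaler()
# StandardScaler expects a 2-D array, so reshape to one column and flatten back
y_scaled = y_scaler.fit_transform(y_reg.reshape(-1, 1)).ravel()

# Values on the scaled target can be mapped back to the original units
y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(np.allclose(y_restored, y_reg))
```

Keeping a separate scaler for y makes it straightforward to convert model predictions back with `inverse_transform`.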

In [27]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1])

In [28]:
y_test

array([0, 0])