# Data Preprocessing Tools

## Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [30]:
dataset = pd.read_csv('Data.csv')
# X holds the features (the factors that influence the decision): the independent variables.
# Usually this is every column except the last one.
X = dataset.iloc[:, :-1].values

# y holds the decision itself: the dependent variable.
# Usually this is the last column, and it is a vector.
y = dataset.iloc[:, -1].values

In [31]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [32]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Missing values cause problems when the model is trained and when it makes predictions, so they have to be handled before going any further.

There are several approaches to handling missing data, for example:

1. Remove the complete row containing the missing data.
2. Replace each missing value with the average of its column.
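
The two approaches can be compared on a small example. This is only a sketch using pandas: the frame `df` below is made up for illustration and is not read from `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Approach 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Approach 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```

Dropping rows keeps only two of the four rows here, which shows why removal can be costly on a small dataset.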

The averaging can be done with the **scikit-learn** library (imported as `sklearn`).

The scikit-learn library has many tools related to data science. Its **SimpleImputer** class handles missing data.
It offers several options for the replacement value, such as:
1. Mean value
2. Median value
3. Most frequent value
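
As a rough illustration of how these options differ, the sketch below runs three `strategy` values of `SimpleImputer` on a made-up single column (`ages` is hypothetical data, chosen so that the three strategies give different answers).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data; the last entry is missing
ages = np.array([[30.0], [30.0], [40.0], [50.0], [np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the replacement value and fills it in
    filled = imputer.fit_transform(ages)
    # Keep the value that replaced the missing entry
    results[strategy] = filled[-1, 0]

print(results)
```

For this column the mean is 37.5, the median is 35 and the most frequent value is 30, so the choice of strategy visibly changes the imputed value.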

In [6]:
from sklearn.impute import SimpleImputer
help(SimpleImputer)

Help on class SimpleImputer in module sklearn.impute._base:

class SimpleImputer(_BaseImputer)
 |  SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
 |  
 |  Imputation transformer for completing missing values.
 |  
 |  Read more in the :ref:`User Guide <impute>`.
 |  
 |  .. versionadded:: 0.20
 |     `SimpleImputer` replaces the previous `sklearn.preprocessing.Imputer`
 |     estimator which is now removed.
 |  
 |  Parameters
 |  ----------
 |  missing_values : number, string, np.nan (default) or None
 |      The placeholder for the missing values. All occurrences of
 |      `missing_values` will be imputed. For pandas' dataframes with
 |      nullable integer dtypes with missing values, `missing_values`
 |      should be set to `np.nan`, since `pd.NA` will be converted to `np.nan`.
 |  
 |  strategy : string, default='mean'
 |      The imputation strategy.
 |  
 |      - If "mean", then replace missing values using th

In [7]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

'''
Apply the imputer to the numeric columns of X (indices 1 and 2).
(The fit method accepts only numerical values, so column 0, which holds strings, is left out of the call.)
The fit method only calculates the replacement values for the missing entries; it does not change the data.

To write these values into the data, the transform method has to be used.
'''
imputer.fit(X[:, 1:3])

# To apply the computed replacement values to the array, the transform method has to be used.
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [8]:
imputer.fit(X[:, :3])

# Raises a ValueError because the first column contains strings.

ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'France'

In [9]:
imputer.fit(X[:, 1:3])

# To apply the computed replacement values to the array, the transform method has to be used.

imputer.transform(X[:, 1:3])

array([[4.40000000e+01, 7.20000000e+04],
       [2.70000000e+01, 4.80000000e+04],
       [3.00000000e+01, 5.40000000e+04],
       [3.80000000e+01, 6.10000000e+04],
       [4.00000000e+01, 6.37777778e+04],
       [3.50000000e+01, 5.80000000e+04],
       [3.87777778e+01, 5.20000000e+04],
       [4.80000000e+01, 7.90000000e+04],
       [5.00000000e+01, 8.30000000e+04],
       [3.70000000e+01, 6.70000000e+04]])

In [10]:
# Write the transformed values back into X
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [11]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data : One hot encoding

In [12]:
dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


For the machine to make sense of the strings in the dataset, they need to be converted ("encoded") into a machine-readable format, usually numbers.

In this dataset, two columns hold string values: the **Country** column and the **Purchased** column. The Country column contains only three countries, so we could simply assign 0, 1 and 2 (or 1, 2 and 3) to them. But if we do so, the machine learning model may interpret the numbers as an order among the countries, which can hurt its performance. To avoid that, **one hot encoding** is used.

Because we have three different countries, one hot encoding replaces the original single column with three columns, holding [1,0,0], [0,1,0] or [0,0,1] depending on the country. **One hot encoding creates a binary vector for each value.**
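
To make the idea concrete, here is a minimal sketch of one hot encoding done by hand with NumPy. It is not how scikit-learn implements it internally; the `countries` array is a made-up sample.

```python
import numpy as np

# Hypothetical sample of the Country column
countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each entry,
# the index of its category
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the binary vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Because the categories are sorted alphabetically, France maps to [1, 0, 0], Germany to [0, 1, 0] and Spain to [0, 0, 1].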

The Purchased column has only the values [Yes, No], so it can simply be **converted into binary values** without any one hot encoding.

These encodings can be done with the **sklearn.compose** and **sklearn.preprocessing** modules of the *scikit-learn* library.

### Encoding the Independent Variable

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

'''
The column transformer object (here ct) is created from the class ColumnTransformer.
The class takes two main parameters: transformers and remainder.
transformers takes a list of tuples in the format (name, transformer, columns). See help for more.

name : str
        Like in Pipeline and FeatureUnion, this allows the transformer and
        its parameters to be set using ``set_params`` and searched in grid
        search.
    transformer : {'drop', 'passthrough'} or estimator
        Estimator must support :term:`fit` and :term:`transform`.
        Special-cased strings 'drop' and 'passthrough' are accepted as
        well, to indicate to drop the columns or to pass them through
        untransformed, respectively.
    columns :  str, array-like of str, int, array-like of int,                 array-like of bool, slice or callable
        Indexes the data on its second axis. Integers are interpreted as
        positional columns, while strings can reference DataFrame columns
        by name.  A scalar string or int should be used where
        ``transformer`` expects X to be a 1d array-like (vector),
        otherwise a 2d array will be passed to the transformer.
        A callable is passed the input data `X` and can return any of the
        above. To select multiple columns by name or dtype, you can use
        :obj:`make_column_selector`.

remainder : {'drop', 'passthrough'} or estimator, default='drop'
    By default, only the specified columns in `transformers` are
    transformed and combined in the output, and the non-specified
    columns are dropped. (default of ``'drop'``).
    By specifying ``remainder='passthrough'``, all remaining columns that
    were not specified in `transformers` will be automatically passed
    through. This subset of columns is concatenated with the output of
    the transformers.
    By setting ``remainder`` to be an estimator, the remaining
    non-specified columns will use the ``remainder`` estimator. The
    estimator must support :term:`fit` and :term:`transform`.
    Note that using this feature requires that the DataFrame columns
    input at :term:`fit` and :term:`transform` have identical order.

We have to set remainder to 'passthrough' to keep the remaining columns in the resulting array.
'''
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# fit_transform does the fit and the transform in one step. Depending on how sparse the result is,
# it may return a SciPy sparse matrix instead of a NumPy array; here it returns an ndarray, as the type check shows.
print(ct.fit_transform(X))
print(type(ct.fit_transform(X)))


[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
<class 'numpy.ndarray'>


In [14]:
# Wrap the result in np.array to make sure we keep working with a NumPy array.
X = np.array(ct.fit_transform(X))

In [15]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [16]:
type(X)

numpy.ndarray
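
For context on why a conversion step appears here: `OneHotEncoder` used on its own returns a SciPy sparse matrix by default rather than a NumPy array. The sketch below, using a made-up `countries` array, shows what that looks like and how `.toarray()` produces a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input (the encoder expects 2D data)
countries = np.array([["France"], ["Spain"], ["Germany"]])

encoded = OneHotEncoder().fit_transform(countries)
print(type(encoded))       # a SciPy sparse container by default

dense = encoded.toarray()  # explicit conversion to a dense ndarray
print(type(dense))
print(dense)
```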

### Encoding the Dependent Variable

As the **Purchased** column has only two values, a binary encoding is enough.
The **LabelEncoder** class from the **sklearn.preprocessing** module does this conversion.

In [17]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# After creating the object, fit_transform learns the labels and encodes them as integers
y = le.fit_transform(y)

In [18]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
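
To check which label was mapped to which integer, the fitted encoder can be inspected. A small sketch on a made-up `labels` array, using the `classes_` attribute and the `inverse_transform` method of `LabelEncoder`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Purchased column
labels = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the original labels; the position of a label is its code
print(le.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(decoded)
```

The labels are sorted before codes are assigned, so 'No' becomes 0 and 'Yes' becomes 1.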


### The data after encoding:

This is how the data looks after encoding:

In [19]:
# Independent Variable
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [20]:
# Dependent variable
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

## Splitting the dataset into the Training set and Test set

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [22]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [23]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [24]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [25]:
print(y_test)

[0 1]


## Feature Scaling

Feature scaling puts all features on a comparable scale, so that no feature dominates the others simply because its values are larger.
There are two main methods for feature scaling:
1. **Standardization**: $x_{\text{stand}} = \dfrac{x - \text{mean}(x)}{\text{standard deviation}(x)}$. The results mostly fall in the range of -3 to +3.
2. **Normalization**: $x_{\text{norm}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$. The results always lie between 0 and 1.
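
Both formulas can be worked through directly with NumPy. This is a sketch on a hypothetical `ages` array; it uses the population standard deviation (`np.std` with its default `ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical values for a single feature
ages = np.array([27.0, 35.0, 44.0, 48.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (ages - ages.mean()) / ages.std()

# Normalization: rescale so the minimum maps to 0 and the maximum to 1
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)
print(normalized)
```

After standardization the feature has mean 0 and standard deviation 1; after normalization its smallest value is 0 and its largest is 1.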

As a rule of thumb, standardization is a safe default for most datasets, whereas normalization is sensitive to outliers because it depends on the extreme values of each feature. So standardization is mostly used.

**Feature scaling is normally applied after splitting the dataset**, because the mean, standard deviation, minimum and maximum of the full dataset differ from those of the split datasets. Computing them on the training set only keeps the test set untouched, as it would be for genuinely new data.
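
The difference between the statistics can be seen on a small example. A sketch with made-up `train` and `test` salary columns, showing that the scaler fitted on the training part stores the training mean, not the mean of all the data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salary column, already split into train and test parts
train = np.array([[48000.0], [52000.0], [61000.0], [79000.0]])
test = np.array([[54000.0], [67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # mean and SD come from train only
test_scaled = sc.transform(test)        # reuses the train mean and SD

full_mean = np.vstack([train, test]).mean(axis=0)
print(sc.mean_)    # mean learned from the training part
print(full_mean)   # mean of train and test together
```

Here the training mean is 60000, while the mean over all six values is slightly higher, so scaling before the split would let the test rows influence the scaler.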

We **don't need to apply feature scaling to the dummy variables** we created. In this dataset, the countries (a categorical variable) were converted into three columns using one hot encoding. These columns need no scaling: their values are 0 or 1, so they are already within the range of -3 to +3. Scaling them would produce values that are hard to interpret (for example, we could no longer tell which country a row refers to if the value is something other than 0 or 1).




In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

'''
In StandardScaler, the fit method calculates the mean and SD of each column.
The transform method applies the scaling to the data it is given.

So the fit_transform method does both jobs in one step, here on the train dataset.

The features in the test data must be scaled in the same way as the train dataset, so the mean and SD have to be the same.
Hence we don't calculate the mean and SD for the test data. We use only the transform method, which applies the values calculated
on the train dataset to the test dataset.

'''
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [28]:
print(X_train)

[[0.0 0.0 1.0 -0.1915918438457856 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057902 -0.07013167641635401]
 [1.0 0.0 0.0 0.5667085065333239 0.6335624327104546]
 [0.0 0.0 1.0 -0.3045301939022488 -0.30786617274297895]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200352 -0.5646194287757336]]


In [29]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830127 -0.9069571034860731]
 [1.0 0.0 0.0 -0.44973664397484425 0.20564033932253029]]
