# Preprocessing

## Agenda
- What is preprocessing, and how is it done?

In [1]:
import numpy as np
import pandas as pd
import pickle
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

## sklearn.preprocessing

In simple terms, preprocessing refers to the transformations applied to your data before feeding it to the algorithm.

The scikit-learn library provides pre-built functionality for this under **sklearn.preprocessing**, which we will explore in this module.

## Train-Test split

Before doing anything else, we need to set a portion of the data aside. Without held-out data, we cannot say anything about how a model will perform on unseen data.

A core practice in machine learning is to split the dataset into different partitions for training and testing.

Scikit-learn has a convenient method to assist in that process:

train_test_split(sample, response, test_size=0.25, shuffle=True)

The split size is controlled by the test_size parameter. If neither test_size nor train_size is given, 25% of the dataset is held out for testing. It is standard practice to shuffle the dataset before splitting, which is what shuffle=True (the default) does.


In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
# split in train and test sets
iris = datasets.load_iris()
iris.data.shape

(150, 4)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, shuffle=True)
X_train.shape

(112, 4)

In [4]:
X_test.shape

(38, 4)

In [5]:
y_train.shape

(112,)

In [6]:
y_test.shape

(38,)
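
The split above changes every time the cell is re-run, because the shuffle is random. As a minimal sketch (the 20% test size and the seed value 42 are arbitrary choices for illustration, not part of the original notebook), passing random_state makes the split repeatable:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 20% of the rows; random_state fixes the shuffle so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
)
print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)

# Calling it again with the same random_state reproduces the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

The variable names here differ from X_train and X_test on purpose, so running this cell does not overwrite the split used in the rest of the notebook.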

## Scaling or not scaling

Scaling a feature is a monotonic transformation: the relative order of smaller and larger values within a variable is preserved after scaling.

Normalization and standardization are explained below.

Algorithms that rely on rules do not require normalization or scaling, because they are not affected by monotonic transformations of the variables. Examples are CART, Random Forests and Gradient Boosted Decision Trees.

Algorithms that rely on the distributions of the variables, such as Naive Bayes, also do not need scaling.
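
The claim about rule-based models can be checked with a small experiment. The sketch below is an illustration only: the per-column factors are hypothetical, and they are powers of two so that the rescaling is exact in floating point. It fits the same decision tree on the raw and on the rescaled iris features and compares the predictions on held-out rows.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_a, X_b, y_a, y_b = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Hypothetical factors that put the four columns on very different scales.
factors = np.array([1.0, 16.0, 256.0, 1024.0])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_a * factors, y_a)

# A tree only asks "is feature j <= threshold?", so rescaling a column
# moves the thresholds but leaves the resulting partition unchanged.
same = np.array_equal(tree_raw.predict(X_b), tree_rescaled.predict(X_b * factors))
print(same)
```

A distance-based model such as KNN would generally not pass this check, which is the motivation for the scaling methods that follow.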

## Standardization

Standardization rescales each feature to have a mean of 0 and a standard deviation of 1. It shifts and rescales the values but does not change the shape of their distribution, so the result is a standard Gaussian (normal) distribution only if the feature was normally distributed to begin with.

Standardization is used when an algorithm computes (Euclidean) distances, so that large-scale features do not dominate the others (e.g. KNN, K-means, minimum distance classifier).

It also matters in PCA, to avoid bias towards high-magnitude features. For gradient-descent-based algorithms, feature scaling helps convergence; in SVMs it can reduce the time needed to find the support vectors.

Scikit-learn implements data standardization in the StandardScaler class.

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
standardize_Xtrain = scaler.transform(X_train)

In [9]:
X_train[0:5,:]

array([[4.9, 2.4, 3.3, 1. ],
       [6. , 2.2, 4. , 1. ],
       [6.3, 2.5, 4.9, 1.5],
       [6.3, 2.7, 4.9, 1.8],
       [5.1, 3.8, 1.5, 0.3]])

In [10]:
standardize_Xtrain[0:5,:]

array([[-1.1071133 , -1.55550783, -0.19828204, -0.18186407],
       [ 0.20818428, -2.02008617,  0.18745771, -0.18186407],
       [ 0.5669018 , -1.32321866,  0.68340883,  0.46271743],
       [ 0.5669018 , -0.85864032,  0.68340883,  0.84946633],
       [-0.86796829,  1.69654054, -1.19018428, -1.08427816]])
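
To make the transformation concrete, the following sketch reproduces StandardScaler by hand on the full iris feature matrix (using the whole dataset here is only for illustration): each column has its mean subtracted and is divided by its standard deviation, z = (x - mean) / std.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

# Column-wise z-score: subtract the mean, divide by the population standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))          # True
print(np.allclose(scaled.mean(axis=0), 0))  # True
print(np.allclose(scaled.std(axis=0), 1))   # True
```

After fitting, the learned column means and standard deviations are available as the scaler's mean_ and scale_ attributes.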

## Exercise

Standardize X_test using the scaler already fitted on X_train, then print the first 5 rows of X_test and of standardize_Xtest.

## Normalization

Normalization rescales each sample (row) of the dataset so that it has unit norm, i.e. a magnitude or length of 1.

The length of a vector is the square root of the sum of squares of the vector elements.

A unit vector (or unit norm) is obtained by dividing the vector by its length.

Note: Normalizing the dataset is particularly useful when the dataset is sparse (i.e., a large number of the values are zero) and the features have differing scales.

Normalization in scikit-learn is implemented in the Normalizer class.

In [11]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalize_Xtrain = scaler.transform(X_train)

In [12]:
X_train[0:5,:]

array([[4.9, 2.4, 3.3, 1. ],
       [6. , 2.2, 4. , 1. ],
       [6.3, 2.5, 4.9, 1.5],
       [6.3, 2.7, 4.9, 1.8],
       [5.1, 3.8, 1.5, 0.3]])

In [13]:
normalize_Xtrain[0:5,:]

array([[0.75916547, 0.37183615, 0.51127471, 0.15493173],
       [0.78892752, 0.28927343, 0.52595168, 0.13148792],
       [0.74143307, 0.29421947, 0.57667016, 0.17653168],
       [0.73122464, 0.31338199, 0.56873028, 0.20892133],
       [0.77964883, 0.58091482, 0.22930848, 0.0458617 ]])
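
The definition can be verified by hand. This sketch takes the first row of X_train shown above, divides it by its length, and compares the result with Normalizer (norm="l2" is the default and is spelled out only for clarity):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

row = np.array([[4.9, 2.4, 3.3, 1.0]])  # first row of X_train shown above

# Length = square root of the sum of squares of the row's elements.
length = np.sqrt((row ** 2).sum(axis=1, keepdims=True))
manual = row / length

print(np.allclose(manual, Normalizer(norm="l2").fit_transform(row)))  # True
print(np.linalg.norm(manual, axis=1))  # length of the normalized row
```

Because each row is rescaled on its own, Normalizer learns nothing from the data during fit, unlike StandardScaler.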

## Exercise

Normalize X_test, then print the first 5 rows of X_test and of normalize_Xtest.

## Label Encoding or Encoding Categorical Variables

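For background on the two preceding sections, see this discussion of the difference between normalization and standardization: https://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization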

Encoding categorical variables is the technique of converting non-numerical features with labels into a numerical representation for use in machine learning modeling.

Scikit-learn provides classes for encoding categorical variables, including LabelEncoder, which encodes labels as integers.

LabelEncoder is typically used on the target variable to transform a vector of hashable categories (or labels) into an integer representation, encoding each label with a value between 0 and the number of categories minus 1. This is illustrated in the figure below.

![LabelEncoder](LabelEncoder.png)

In [2]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# create dataset
LabelEncodingdata = np.array([[5,8,"calabar"],[9,3,"uyo"],[8,6,"owerri"],
                    [0,5,"uyo"],[2,3,"calabar"],[0,8,"calabar"],
                    [1,8,"owerri"]])
LabelEncodingdata

array([['5', '8', 'calabar'],
       ['9', '3', 'uyo'],
       ['8', '6', 'owerri'],
       ['0', '5', 'uyo'],
       ['2', '3', 'calabar'],
       ['0', '8', 'calabar'],
       ['1', '8', 'owerri']], dtype='<U11')

In [3]:
# separate features and target
X = LabelEncodingdata[:,:2]
y = LabelEncodingdata[:,-1]

# encode y
encoder = LabelEncoder()
encode_y = encoder.fit_transform(y)

# adjust dataset with encoded targets
LabelEncodingdata[:,-1] = encode_y
LabelEncodingdata

array([['5', '8', '0'],
       ['9', '3', '2'],
       ['8', '6', '1'],
       ['0', '5', '2'],
       ['2', '3', '0'],
       ['0', '8', '0'],
       ['1', '8', '1']], dtype='<U11')
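
The mapping that LabelEncoder learned can be inspected and reversed. The sketch below uses the same city names as the target column above; classes_ lists the labels in sorted order, and the position of each label in that list is its integer code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

cities = np.array(["calabar", "uyo", "owerri", "uyo", "calabar", "calabar", "owerri"])

enc = LabelEncoder()
codes = enc.fit_transform(cities)

print(enc.classes_)  # ['calabar' 'owerri' 'uyo']
print(codes)         # [0 2 1 2 0 0 1]

# inverse_transform maps the integer codes back to the original labels.
decoded = enc.inverse_transform(codes)
print(np.array_equal(decoded, cities))  # True
```

Note that the integer codes carry no meaningful order; they are simply positions in the sorted list of labels.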

## Impute Missing Data

Datasets often contain missing observations.

Scikit-learn implements the SimpleImputer class (in sklearn.impute) for filling in missing values.

In [8]:
from sklearn.impute import SimpleImputer 

# create dataset
Missing_data = np.array([[5,np.nan,8],[9,3,5],[8,6,4],
                    [np.nan,5,2],[2,3,9],[np.nan,8,7],
                    [1,np.nan,5]])
Missing_data

array([[ 5., nan,  8.],
       [ 9.,  3.,  5.],
       [ 8.,  6.,  4.],
       [nan,  5.,  2.],
       [ 2.,  3.,  9.],
       [nan,  8.,  7.],
       [ 1., nan,  5.]])

In [13]:
# impute missing values with the mean of each column (NaNs are ignored when computing the mean)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit_transform(Missing_data)

array([[5., 5., 8.],
       [9., 3., 5.],
       [8., 6., 4.],
       [5., 5., 2.],
       [2., 3., 9.],
       [5., 8., 7.],
       [1., 5., 5.]])
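
A fitted imputer stores the fill value for each column, so the same values can be reused on data it has not seen. In this sketch the array is the one from the example above, while new_rows is a hypothetical unseen row added purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[5, np.nan, 8], [9, 3, 5], [8, 6, 4],
                 [np.nan, 5, 2], [2, 3, 9], [np.nan, 8, 7],
                 [1, np.nan, 5]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean").fit(data)

# statistics_ holds one fill value per column (here the column means).
print(imp.statistics_)

# Hypothetical unseen row: its NaNs are filled with the means learned from data.
new_rows = np.array([[np.nan, 4.0, np.nan]])
filled = imp.transform(new_rows)
print(filled)
```

Other built-in strategies include "median" and "most_frequent", which can be more robust when a column contains outliers.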