# Preprocessing

**Agenda**

- What is preprocessing, and how is it done?

We will learn how to do the following:
1. Train/Test split
2. Normalization/Standardization
3. Label Encoding
4. Handling Missing Data

---


## Preprocessing overview

In simple words, preprocessing refers to the transformations applied to your data before feeding it to the algorithm

The Scikit-learn library has pre-built functionality under **sklearn.preprocessing** that we will explore in this module

## Train-Test split

Setting a portion of the data aside - a prerequisite for measuring generalization

Different partitions for training and testing - a core practice in machine learning

Scikit-learn has a convenient function to assist in that process:

**train_test_split(sample, response, test_size=0.25, shuffle=True)**

The split size is controlled using the **parameter test_size**

**Default test_size** - 25% of the dataset size

**Standard practice** - shuffle the dataset before splitting by setting the parameter **shuffle=True** (this is also the default)


In [1]:
import numpy as np
import pandas as pd
import pickle
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load the iris dataset and check its shape (samples, features)
iris = datasets.load_iris()
iris.data.shape

(150, 4)

In [3]:
# split into train and test sets (test_size defaults to 25%)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, shuffle=True)
X_train.shape

(112, 4)

In [4]:
X_test.shape

(38, 4)

In [5]:
y_train.shape

(112,)

In [6]:
y_test.shape

(38,)
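As a variation on the cells above, the following sketch fixes `random_state` so the split is reproducible and passes `stratify` so each class keeps roughly the same proportion in both partitions. Neither argument is used in the original cells; the value `42` and the names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# random_state makes the shuffle repeatable; stratify keeps class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    test_size=0.25, shuffle=True,
    random_state=42, stratify=iris.target,
)

print(X_tr.shape, X_te.shape)  # 112 training rows and 38 test rows
print(np.bincount(y_te))       # number of test samples per class
```

Running the cell twice with the same `random_state` should give the same partitions, which makes later results easier to compare.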

## To Scale or Not To Scale

Scaling is a monotonic transformation
* The relative order of values within a variable (smaller to larger) is preserved after scaling

Many algorithms require or benefit from scaling, in particular those based on distances or gradient descent

Algorithms that do not require scaling:

> Algorithms that rely on rules 
* They would not be affected by any monotonic transformations of the variables
* e.g. CART, Random Forests, Gradient Boosted Decision Trees

> Algorithms that rely on distributions of the variables
* e.g. Naive Bayes

[Standardization and Normalization](https://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization) are ways to do scaling 
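The two claims above can be checked with a small sketch. It assumes `StandardScaler` (introduced in the next section) as the scaling step and a shallow `DecisionTreeClassifier` as the rule-based model; `max_depth=3` and `random_state=0` are arbitrary illustrative settings.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 1) The rank order of values within each column is unchanged by scaling
order_before = np.argsort(X, axis=0, kind="stable")
order_after = np.argsort(X_scaled, axis=0, kind="stable")
same_order = np.array_equal(order_before, order_after)

# 2) A tree splits on that order, so it should learn the same partitions
#    on raw and scaled inputs (compared here on the training data)
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)
agreement = np.mean(tree_raw.predict(X) == tree_scaled.predict(X_scaled))

print(same_order, agreement)
```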

## Standardization

Rescales each feature so that it has a mean of 0 and a standard deviation of 1, like a standard Gaussian (or normal) distribution; the shape of the feature's distribution itself is not changed

Used when the algorithm computes (Euclidean) distances, so that large-scale features do not dominate the others
* e.g. KNN, K-means, Minimum distance classifier

It matters in PCA, to avoid bias towards high-magnitude features

For gradient-descent-based algorithms, standardization helps in faster convergence

In SVM it can reduce the time to find support vectors

Scikit-learn implements data standardization in the StandardScaler class
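To illustrate how a large-scale feature can dominate a Euclidean distance, here is a minimal sketch on made-up data. The four samples, the column meanings (height in metres, income in dollars) and the helper `squared_distance_share` are all hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: column 0 is height in metres, column 1 is income in dollars
X = np.array([
    [1.50, 50000.0],
    [1.95, 50500.0],
    [1.60, 80000.0],
    [1.85, 20000.0],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Raw data: the income column accounts for almost all of the distance
share_raw = squared_distance_share(X[0], X[1])

# Standardized data: each column is measured relative to its own spread,
# so the large height difference between rows 0 and 1 now shows up
X_std = StandardScaler().fit_transform(X)
share_std = squared_distance_share(X_std[0], X_std[1])

print(share_raw.round(4))
print(share_std.round(4))
```

Rows 0 and 1 differ a lot in height relative to the spread of heights, but only slightly in income relative to the spread of incomes. Before scaling, that information is hidden by the units.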

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
standardize_Xtrain = scaler.transform(X_train)

In [9]:
X_train[0:5,:]

array([[4.9, 2.4, 3.3, 1. ],
       [6. , 2.2, 4. , 1. ],
       [6.3, 2.5, 4.9, 1.5],
       [6.3, 2.7, 4.9, 1.8],
       [5.1, 3.8, 1.5, 0.3]])

In [10]:
standardize_Xtrain[0:5,:]

array([[-1.1071133 , -1.55550783, -0.19828204, -0.18186407],
       [ 0.20818428, -2.02008617,  0.18745771, -0.18186407],
       [ 0.5669018 , -1.32321866,  0.68340883,  0.46271743],
       [ 0.5669018 , -0.85864032,  0.68340883,  0.84946633],
       [-0.86796829,  1.69654054, -1.19018428, -1.08427816]])
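As a sanity check on what `StandardScaler` does, the sketch below recomputes the transformation by hand from the statistics stored on the fitted scaler (`mean_` and `scale_`). It rebuilds the split with an arbitrary `random_state=0` so that it runs on its own, which means the numbers will not match the shuffled split shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # arbitrary seed for repeatability
)

scaler = StandardScaler().fit(X_train)
standardize_Xtrain = scaler.transform(X_train)

# Manual z-score using the per-column mean and standard deviation learned in fit()
manual = (X_train - scaler.mean_) / scaler.scale_

print(np.allclose(manual, standardize_Xtrain))
print(standardize_Xtrain.mean(axis=0).round(6))  # approximately 0 per column
print(standardize_Xtrain.std(axis=0).round(6))   # approximately 1 per column
```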

## Exercise

Standardize X_test (reuse the scaler that was fitted on X_train) and print the first 5 rows of X_test and standardize_Xtest

In [None]:
# Standardize X_test

## Normalization

Normalization rescales each sample (row) in the dataset so that it has a unit norm, i.e. a magnitude or length of 1

The length of a vector is the square root of the sum of squares of the vector elements

A unit vector (or unit norm) is obtained by dividing the vector by its length

Note: Normalizing the dataset is particularly useful in scenarios where the dataset is sparse (i.e., a large number of values are zeros) and the features have differing scales

Normalization in Scikit-learn is implemented in the Normalizer class
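The definitions of length and unit vector above can be worked through by hand with NumPy. The vector `v` is an arbitrary example chosen because its length is a whole number.

```python
import numpy as np

v = np.array([3.0, 4.0])

length = np.sqrt(np.sum(v ** 2))  # sqrt(9 + 16) = 5.0
unit_v = v / length               # [0.6, 0.8]

print(length, unit_v, np.linalg.norm(unit_v))
```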

In [11]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalize_Xtrain = scaler.transform(X_train)

In [12]:
X_train[0:5,:]

array([[4.9, 2.4, 3.3, 1. ],
       [6. , 2.2, 4. , 1. ],
       [6.3, 2.5, 4.9, 1.5],
       [6.3, 2.7, 4.9, 1.8],
       [5.1, 3.8, 1.5, 0.3]])

In [13]:
normalize_Xtrain[0:5,:]

array([[0.75916547, 0.37183615, 0.51127471, 0.15493173],
       [0.78892752, 0.28927343, 0.52595168, 0.13148792],
       [0.74143307, 0.29421947, 0.57667016, 0.17653168],
       [0.73122464, 0.31338199, 0.56873028, 0.20892133],
       [0.77964883, 0.58091482, 0.22930848, 0.0458617 ]])
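The output above can be reproduced by hand. The sketch below copies the five rows of `X_train` printed earlier into an array named `rows` (a name introduced here for illustration), divides each row by its own Euclidean length, and compares the result with `Normalizer`, which uses the L2 norm by default.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# The five rows of X_train printed above
rows = np.array([
    [4.9, 2.4, 3.3, 1.0],
    [6.0, 2.2, 4.0, 1.0],
    [6.3, 2.5, 4.9, 1.5],
    [6.3, 2.7, 4.9, 1.8],
    [5.1, 3.8, 1.5, 0.3],
])

# Divide every row by its own Euclidean (L2) length
by_hand = rows / np.linalg.norm(rows, axis=1, keepdims=True)

# Normalizer works row by row, so it learns nothing from the data in fit()
by_sklearn = Normalizer().fit_transform(rows)

print(np.allclose(by_hand, by_sklearn))
print(np.linalg.norm(by_sklearn, axis=1))  # each row has length 1, up to rounding
```

Unlike `StandardScaler`, which works column by column, the result for one row does not depend on any other row.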

## Exercise

Normalize X_test and print the first 5 rows of X_test and normalize_Xtest

In [None]:
# Normalize X_test

## Label Encoding or Encoding Categorical Variables

Encoding - converting non-numerical features with labels into a numerical representation

Encoding in Scikit-learn using LabelEncoder - for encoding labels as integers 

We have already seen how species were encoded in iris dataset:
* 0 = setosa, 1 = versicolor, 2 = virginica

In the avocado dataset:
The `region_categories` and `type_categories` variables store the unique categories in the `region` and `type` columns

In [3]:
import pandas as pd
avocado_path = 'avocado.csv' # please make sure avocado.csv is in the same directory as the IPython notebook
avocado_df = pd.read_csv(avocado_path,header=0)
avocado_df.drop('Unnamed: 0', axis=1, inplace=True)

region_categories = avocado_df['region'].unique()
type_categories = avocado_df['type'].unique()

print('Region categories are: \n',region_categories,'\n')
print('Type categories are: \n',type_categories,'\n')

Region categories are: 
 ['Albany' 'Atlanta' 'BaltimoreWashington' 'Boise' 'Boston'
 'BuffaloRochester' 'California' 'Charlotte' 'Chicago' 'CincinnatiDayton'
 'Columbus' 'DallasFtWorth' 'Denver' 'Detroit' 'GrandRapids' 'GreatLakes'
 'HarrisburgScranton' 'HartfordSpringfield' 'Houston' 'Indianapolis'
 'Jacksonville' 'LasVegas' 'LosAngeles' 'Louisville' 'MiamiFtLauderdale'
 'Midsouth' 'Nashville' 'NewOrleansMobile' 'NewYork' 'Northeast'
 'NorthernNewEngland' 'Orlando' 'Philadelphia' 'PhoenixTucson'
 'Pittsburgh' 'Plains' 'Portland' 'RaleighGreensboro' 'RichmondNorfolk'
 'Roanoke' 'Sacramento' 'SanDiego' 'SanFrancisco' 'Seattle'
 'SouthCarolina' 'SouthCentral' 'Southeast' 'Spokane' 'StLouis' 'Syracuse'
 'Tampa' 'TotalUS' 'West' 'WestTexNewMexico'] 

Type categories are: 
 ['conventional' 'organic'] 



In [4]:
from sklearn.preprocessing import LabelEncoder

region_encoder = LabelEncoder()
encoded_region_cats = region_encoder.fit_transform(region_categories)
print('The encoded region categories are:', encoded_region_cats)

The encoded region categories are: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53]
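The integer codes follow the sorted order of the category names, which is why the already-sorted region list maps to 0 through 53. A minimal sketch on a short, deliberately unsorted list (a hypothetical sample of three region names) makes this visible and also shows how to map codes back to names with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# A small hypothetical sample of region names, deliberately unsorted
regions = ["Boston", "Albany", "Atlanta", "Albany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

print(encoder.classes_)                   # categories in sorted order
print(codes)                              # [2 0 1 0]
print(encoder.inverse_transform([2, 0]))  # ['Boston' 'Albany']
```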


In [5]:
print('Before Encoding the region column was: \n', avocado_df['region'])
print('\n')

avocado_df['region'] = region_encoder.transform(avocado_df['region'])
print('After Encoding the region column is: \n', avocado_df['region'])


Before Encoding the region column was: 
 0                  Albany
1                  Albany
2                  Albany
3                  Albany
4                  Albany
5                  Albany
6                  Albany
7                  Albany
8                  Albany
9                  Albany
10                 Albany
11                 Albany
12                 Albany
13                 Albany
14                 Albany
15                 Albany
16                 Albany
17                 Albany
18                 Albany
19                 Albany
20                 Albany
21                 Albany
22                 Albany
23                 Albany
24                 Albany
25                 Albany
26                 Albany
27                 Albany
28                 Albany
29                 Albany
               ...       
18219             TotalUS
18220             TotalUS
18221             TotalUS
18222             TotalUS
18223             TotalUS
18224             Total

## Exercise

Encode the type categories, print the encoded type categories, and print the `type` column before and after encoding

In [None]:
# Encoded Type Categories

In [None]:
# Print the column before, encode type categories

In [None]:
# Print the column after encoding

## Imputing Missing Data

It is often the case that a dataset contains missing observations

Scikit-learn implements the SimpleImputer class (in **sklearn.impute**) for completing missing values

In [38]:
# We will first make some holes in our data
avocado_path = 'avocado.csv' # please make sure avocado.csv is in the same directory as the IPython notebook
avocado_df = pd.read_csv(avocado_path,header=0)
avocado_df.drop('Unnamed: 0', axis=1, inplace=True)

import numpy as np

# drop columns 0 and 9-12 so that only the eight numeric columns shown below remain
avocado_df.drop(avocado_df.columns[[0, 9, 10, 11, 12]], axis=1, inplace=True)

# build a random boolean mask with the same shape as the data; True marks a value to remove
np.random.seed(100)
mask = np.random.choice([True, False], size=avocado_df.shape)

# if every value in a row is masked, keep the last column so that no row is completely empty
mask[mask.all(1), -1] = False

# show the data with the masked values replaced by NaN
avocado_df.mask(mask).head()


   AveragePrice  Total Volume     4046       4225   4770  Total Bags  Small Bags  Large Bags
0           NaN           NaN  1036.74   54454.85  48.16     8696.87         NaN         NaN
1           NaN           NaN      NaN   44638.81    NaN         NaN         NaN         NaN
2          0.93           NaN      NaN  109149.67    NaN     8145.35         NaN         NaN
3           NaN      78992.15   1132.0   71976.41    NaN         NaN      5677.4         NaN
4           NaN       51039.6   941.48   43838.39  75.78     6183.95         NaN         NaN


In [35]:
avocado_df.mask(mask).tail()


       AveragePrice  Total Volume     4046       4225   4770  Total Bags  Small Bags  Large Bags
18244          1.63      17074.83      NaN        NaN    NaN    13498.67    13066.82      431.85
18245          1.71           NaN   1191.7     3431.5    NaN         NaN     8940.04       324.8
18246          1.87           NaN      NaN        NaN 727.94     9394.11      9351.8       42.31
18247           NaN           NaN      NaN        NaN 727.01    10969.54    10919.54         NaN
18248           NaN           NaN  2894.77    2356.13    NaN         NaN    11988.14         NaN
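The masking step is easier to follow on a tiny example. The sketch below applies the same three operations (random boolean mask, un-masking the last column of fully masked rows, `DataFrame.mask`) to a hypothetical 4x3 frame named `df` that stands in for the avocado columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame standing in for the avocado columns
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=["a", "b", "c"])

np.random.seed(100)
mask = np.random.choice([True, False], size=df.shape)  # True marks a cell to blank out

# If a row would be masked entirely, un-mask its last column
mask[mask.all(axis=1), -1] = False

holes = df.mask(mask)      # masked cells become NaN
print(holes)
print(holes.isna().sum())  # number of missing values per column
```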


In [37]:

from sklearn.impute import SimpleImputer 

# impute missing values with the column mean (SimpleImputer always works column by column)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(avocado_df.mask(mask)))

[[1.40089245e+00 8.76004541e+05 1.03674000e+03 ... 8.69687000e+03
  1.78906158e+05 5.54393126e+04]
 [1.40089245e+00 8.76004541e+05 2.90554507e+05 ... 2.42302277e+05
  1.78906158e+05 5.54393126e+04]
 [9.30000000e-01 8.76004541e+05 2.90554507e+05 ... 8.14535000e+03
  1.78906158e+05 5.54393126e+04]
 ...
 [1.87000000e+00 8.76004541e+05 2.90554507e+05 ... 9.39411000e+03
  9.35180000e+03 4.23100000e+01]
 [1.40089245e+00 8.76004541e+05 2.90554507e+05 ... 1.09695400e+04
  1.09195400e+04 5.54393126e+04]
 [1.40089245e+00 8.76004541e+05 2.89477000e+03 ... 2.42302277e+05
  1.19881400e+04 5.54393126e+04]]
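Because the avocado output is large, the effect of mean imputation is easier to verify on a tiny array. The sketch below uses a made-up 3x2 matrix `X`; the fitted column means are stored in `statistics_` and can be reused on new rows. `strategy="median"` is shown only as one alternative that `SimpleImputer` supports.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column
X = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # column means learned in fit: [2. 6.]
print(X_filled)             # NaNs replaced by 2.0 (column 0) and 6.0 (column 1)

# The fitted means can be reused to fill new rows
new_row = imputer.transform(np.array([[np.nan, np.nan]]))
print(new_row)              # [[2. 6.]]

# The median is an alternative strategy, less sensitive to outliers
X_median = SimpleImputer(strategy="median").fit_transform(X)
```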
