<a href="https://colab.research.google.com/github/franklin-univ-data-science/comp411/blob/master/Module05_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 6 - Building Good Training Sets – Data Preprocessing

# Overview

- [Dealing with missing data](#Dealing-with-missing-data)
- [Handling categorical data](#Handling-categorical-data)
- [Partitioning a dataset into separate training and test set](#Partitioning-a-dataset-into-separate-training-and-test-set)
- [Bringing features onto the same scale](#Bringing-features-onto-the-same-scale)
- [Selecting meaningful features](#Selecting-meaningful-features)

In [27]:
import numpy as np
import pandas as pd

# Dealing with missing data

## Identifying missing values in tabular data

In [28]:
from io import StringIO
import sys

csv_data = \
'''
A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''

df = pd.read_csv(StringIO(csv_data))
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |


In [29]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

## Eliminating samples or features with missing values

In [30]:
# remove rows that contain missing values
df.dropna(axis=0) # axis=0 means by row; default

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |


In [31]:
df.dropna()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |


In [32]:
# remove columns that contain missing values
df.dropna(axis=1) # axis=1 means by column

|   | A | B |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 5.0 | 6.0 |
| 2 | 10.0 | 11.0 |
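`dropna` accepts a few further parameters that make the removal more selective. The sketch below rebuilds the same small table under the name `df_demo` (a name chosen only for this illustration) and shows three of them; which setting is appropriate depends on the data at hand.

```python
from io import StringIO
import pandas as pd

# same toy data as above, rebuilt so the example is self-contained
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df_demo = pd.read_csv(StringIO(csv_data))

# drop only rows in which *all* columns are NaN (none here, so nothing is removed)
all_nan_dropped = df_demo.dropna(how='all')

# keep only rows that have at least 4 non-NaN values
thresh_dropped = df_demo.dropna(thresh=4)

# drop rows only if NaN appears in a specific column (here: 'C')
subset_dropped = df_demo.dropna(subset=['C'])

print(all_nan_dropped.shape, thresh_dropped.shape, subset_dropped.shape)
```

Dropping rows or columns is simple, but it can discard a large share of the data, which is one motivation for imputation.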


## Imputing missing values

In [33]:
# our original data
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |


In [34]:
# impute missing values via the column mean

from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
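As a cross-check, the same mean imputation can be sketched with pandas alone via `fillna`. The DataFrame name `df_demo` is ours, and the comparison with `SimpleImputer` is only meant to show that both routes give the same numbers on this toy table.

```python
from io import StringIO
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df_demo = pd.read_csv(StringIO(csv_data))

# pandas route: replace each NaN by the mean of its column
filled = df_demo.fillna(df_demo.mean())

# scikit-learn route, as in the cell above
imputed = SimpleImputer(missing_values=np.nan,
                        strategy='mean').fit_transform(df_demo.values)

# both routes should agree
same = np.allclose(filled.values, imputed)
print(filled)
print('identical results:', same)
```

`SimpleImputer` also supports other strategies such as `'median'` and `'most_frequent'`; the latter is useful for categorical columns.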

**Question 6-1**: 

Should we impute before or after splitting data into training and test? Why is that?

## Understanding the scikit-learn transformer and estimator API

<div>
<img src="https://github.com/franklin-univ-data-science/data/blob/master/images/04_01.png?raw=true" width="600"/>
</div>

We **never fit a transformer or estimator on the test data set**.

Separating the data into training and test sets mimics the situation in which you have past information and build a model that will be evaluated on future, unseen information: the training set plays the role of the past and the test set plays the role of the future. The test data is therefore used only once, at the very end, to check the trained model.

With this past/future analogy in mind, any preprocessing step that learns something from the data, such as imputing missing values, should learn its parameters from the training set only. The same fitted transformation is then applied, unchanged, to the test set.

<div>
<img src="https://github.com/franklin-univ-data-science/data/blob/master/images/04_02.png?raw=true" width="400"/>
</div>
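The fit-on-training, transform-on-test pattern described above can be sketched with `SimpleImputer`. The two small arrays are made-up toy data, and the variable names are hypothetical; the point is only that the value used to fill the test set comes from the training set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical toy data: a single feature column with missing entries
X_train_demo = np.array([[1.0], [3.0], [np.nan], [8.0]])
X_test_demo = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train_demo)                      # learns the mean from the training data only

X_train_imputed = imputer.transform(X_train_demo)
X_test_imputed = imputer.transform(X_test_demo)  # reuses the training mean

print('learned mean:', imputer.statistics_)    # mean of 1, 3 and 8
print('imputed test column:', X_test_imputed.ravel())
```

The missing test value is filled with the training mean (4.0), not with anything computed from the test column, so the extreme test value 100.0 has no influence on the imputation.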

# Handling categorical data

## Nominal and ordinal features

- Ordinal features can be understood as categorical values that can be sorted or ordered. 
- Nominal features don't imply any order.

In [35]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | M | 10.1 | class1 |
| 1 | red | L | 13.5 | class2 |
| 2 | blue | XL | 15.3 | class1 |


## Mapping categorical variables to numeric form

A Python dictionary (key: value) lets us map each ordinal `size` label to an integer that preserves its order:

In [36]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class1 |
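If the original string labels are needed again later, the mapping can be reversed with an inverted dictionary. The sketch below uses a one-column DataFrame named `df_demo` and a dictionary named `inv_size_mapping`; both names are ours and only serve the illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'size': ['M', 'L', 'XL']})

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df_demo['size'] = df_demo['size'].map(size_mapping)

# invert the dictionary: integer -> original label
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = df_demo['size'].map(inv_size_mapping)

print(df_demo['size'].tolist())
print(restored.tolist())
```

This only works because the mapping is one-to-one; if two labels were mapped to the same integer, the original labels could not be recovered.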


Nominal variables can also be mapped directly to integers:

In [37]:
# Label encoding with sklearn's LabelEncoder

from sklearn.preprocessing import LabelEncoder

color_le = LabelEncoder()

df['color'] = color_le.fit_transform(df['color'].values)

df

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | 1 | 1 | 10.1 | class1 |
| 1 | 2 | 2 | 13.5 | class2 |
| 2 | 0 | 3 | 15.3 | class1 |


**Question 6-2** 

What is the problem in the above mapping of color?

## Performing one-hot encoding on nominal features

Here is the feature matrix we just built:

In [38]:
X = df[['color', 'size', 'price']].values
X

array([[ 1. ,  1. , 10.1],
       [ 2. ,  2. , 13.5],
       [ 0. ,  3. , 15.3]])

Although the color values don't come in any particular order, a learning algorithm will 
assume that green is larger than blue, and red is larger than green.

In [39]:
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']

size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

|   | color | size | price | classlabel |
|---|---|---|---|---|
| 0 | green | 1 | 10.1 | class1 |
| 1 | red | 2 | 13.5 | class2 |
| 2 | blue | 3 | 15.3 | class1 |


### One-hot encoding via pandas

A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create **a new dummy feature for each unique value in the nominal feature column**. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of a sample; for example, a blue sample can be encoded as blue=1, green=0, red=0. 

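Applied to a DataFrame, the `get_dummies` function converts only string columns and leaves all other columns unchanged. (In pandas 2.0 and later, the dummy columns are boolean by default; pass `dtype=int` to obtain the 0/1 output shown here.)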

In [40]:
pd.get_dummies(df[['color', 'size', 'price']])

|   | size | price | color_blue | color_green | color_red |
|---|---|---|---|---|---|
| 0 | 1 | 10.1 | 0 | 1 | 0 |
| 1 | 2 | 13.5 | 0 | 0 | 1 |
| 2 | 3 | 15.3 | 1 | 0 | 0 |


Please note that one-hot encoding introduces **multicollinearity**: the dummy columns created from one feature always sum to 1, so each of them can be predicted exactly from the others.

To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue.

In [41]:
# multicollinearity guard in get_dummies
pd.get_dummies(df[['color', 'size', 'price']], drop_first=True)

|   | size | price | color_green | color_red |
|---|---|---|---|---|
| 0 | 1 | 10.1 | 1 | 0 |
| 1 | 2 | 13.5 | 0 | 1 |
| 2 | 3 | 15.3 | 0 | 0 |
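The same encoding can also be sketched with scikit-learn's `OneHotEncoder`, which follows the fit/transform API and can therefore be fitted on the training set and reused on the test set. The small `colors` array is toy data made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy nominal feature, one column
colors = np.array([['green'], ['red'], ['blue']])

ohe = OneHotEncoder()                       # returns a sparse matrix by default
onehot = ohe.fit_transform(colors).toarray()

print(ohe.categories_[0])                   # categories are sorted: blue, green, red
print(onehot)

# drop='first' removes the first category column (blue),
# which plays the same role as drop_first=True in get_dummies
ohe_drop = OneHotEncoder(drop='first')
onehot_dropped = ohe_drop.fit_transform(colors).toarray()
print(onehot_dropped)
```

A row of all zeros in `onehot_dropped` then stands for the dropped category, here blue.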


<br>
<br>

# Partitioning a dataset into separate training and test set

In [42]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

Class labels [1 2 3]


|   | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [43]:
df_wine.shape

(178, 14)

In [53]:
from sklearn.model_selection import train_test_split

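# features are columns 1..13, the class label is column 0
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# 70% training / 30% test; stratify=y keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)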
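To see what `stratify=y` does, one can compare the class proportions before and after the split. The sketch below uses `sklearn.datasets.load_wine`, the copy of the Wine data bundled with scikit-learn, so that it runs without a download; note that this copy labels the classes 0, 1, 2 instead of 1, 2, 3, and the `_demo` variable names are ours.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# bundled Wine data (assumption: used here instead of the UCI download)
X_demo, y_demo = load_wine(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# fraction of samples per class in the full data and in each part
full_prop = np.bincount(y_demo) / len(y_demo)
train_prop = np.bincount(y_tr) / len(y_tr)
test_prop = np.bincount(y_te) / len(y_te)

print('shapes:', X_tr.shape, X_te.shape)
print('full :', np.round(full_prop, 3))
print('train:', np.round(train_prop, 3))
print('test :', np.round(test_prop, 3))
```

With stratification, the three sets of proportions should be nearly identical; without it, a small test set can end up with noticeably different class frequencies.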

# Bringing features onto the same scale

Feature scaling brings the different features onto a similar scale.

**Question 6-3**: 

What algorithms are sensitive to feature scaling?

1. Logistic regression
2. KNN
3. Tree-based algorithms

**Normalization** refers to rescaling the features to the range [0, 1]. To normalize a feature x, for each sample $x^{(i)}$ we subtract the minimum value of x and divide by the difference between the maximum and minimum values of x:

$x^{(i)}_{norm} = \frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}$

In [45]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
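Because the minimum and maximum are learned from the training set only, transformed test values are not guaranteed to stay inside [0, 1]. The following sketch uses two tiny made-up arrays (names ending in `_demo` are ours) to show this and to check `MinMaxScaler` against the formula above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy single-feature data
X_train_demo = np.array([[1.0], [2.0], [5.0]])
X_test_demo = np.array([[0.0], [3.0], [7.0]])

mms_demo = MinMaxScaler()
train_scaled = mms_demo.fit_transform(X_train_demo)  # learns min=1, max=5
test_scaled = mms_demo.transform(X_test_demo)        # reuses the training min/max

# the same result computed by hand with the normalization formula
x_min = X_train_demo.min(axis=0)
x_max = X_train_demo.max(axis=0)
manual = (X_train_demo - x_min) / (x_max - x_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(np.allclose(train_scaled, manual))
```

The test values 0.0 and 7.0 lie outside the training range and therefore map to -0.25 and 1.5.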

**Standardization** can be more practical for many machine learning algorithms. Using standardization, we center each feature column at mean 0 and scale it to standard deviation 1, so that the column has the same mean and variance as a standard normal distribution (the shape of its distribution is not changed), which makes it easier to learn the weights. Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them than min-max scaling, which squeezes the data into a fixed range.

$x^{(i)}_{std} = \frac{x^{(i)}-\mu_x}{\sigma_x}$

Here, $\mu_x$ is the sample mean of the feature column and $\sigma_x$ is its standard deviation.

A small numeric example:

In [46]:
ex = np.array([0, 1, 2, 3, 4, 5])

print('standardized:', (ex - ex.mean()) / ex.std())

# normalize
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))

standardized: [-1.46385011 -0.87831007 -0.29277002  0.29277002  0.87831007  1.46385011]
normalized: [0.  0.2 0.4 0.6 0.8 1. ]


In [47]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
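As a sanity check, `StandardScaler` can be compared with the manual computation from the numeric example. The arrays below are toy data (the values 0 to 5 from the example plus two made-up test points), and the names are ours.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_demo = np.array([[2.5], [10.0]])

sc_demo = StandardScaler().fit(X_train_demo)   # learns mean and std from training data
train_std = sc_demo.transform(X_train_demo)
test_std = sc_demo.transform(X_test_demo)      # reuses the training mean and std

# manual version; np.std uses the population standard deviation (ddof=0),
# which is also what StandardScaler uses
manual = (X_train_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print('mean_ :', sc_demo.mean_, ' scale_:', sc_demo.scale_)
print(train_std.ravel())
print(test_std.ravel())
print(np.allclose(train_std, manual))
```

The standardized training column has mean 0 and standard deviation 1; the test column generally does not, because it is shifted and scaled with the training statistics.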

<br>
<br>

# Selecting meaningful features

**Question 6-4** 

In the following two charts:
1. Which is high variance, which is low variance?
2. Which is high bias, which is low bias?
3. Which is overfitting?


<div>
<img src="https://github.com/franklin-univ-data-science/data/blob/master/images/bias_variance.png?raw=true" width="600"/>
</div>

Overfitting usually occurs because the model is too complex for the given training data: it fits the training data very closely but does not generalize well to unseen data. Common ways to reduce the generalization error are:

- Collect more training data
- Introduce a penalty for complexity via regularization
- Choose a simpler model with fewer parameters
- Reduce the dimensionality of the data

Here we will discuss how to use regularization and feature selection to reduce the dimensionality (decrease the number of features).

## L1 and L2 regularization as penalties against model complexity

The basic idea of regularization is to add a penalty term to the cost function that discourages large weights. This introduces some bias in exchange for lower variance.

Two penalties are commonly used, L2 and L1:

L2: $||w||^2_2 = \sum_{j=1}^m w_j^2$

L1: $||w||_1 = \sum_{j=1}^m |w_j|$
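Both penalty terms are easy to compute by hand. The sketch below evaluates them for a made-up weight vector `w` (hypothetical values, chosen only for illustration) and checks the results against `np.linalg.norm`.

```python
import numpy as np

# hypothetical weight vector
w = np.array([0.5, -2.0, 0.0, 1.5])

l2_penalty = np.sum(w ** 2)        # squared L2 norm: sum of squared weights
l1_penalty = np.sum(np.abs(w))     # L1 norm: sum of absolute weights

print('L2 penalty:', l2_penalty)   # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
print('L1 penalty:', l1_penalty)   # 0.5 + 2.0 + 0.0 + 1.5 = 4.0

# cross-check with NumPy's norm function
print(np.isclose(l2_penalty, np.linalg.norm(w, 2) ** 2))
print(np.isclose(l1_penalty, np.linalg.norm(w, 1)))
```

Note how the large weight (-2.0) dominates the L2 penalty because it is squared, whereas in the L1 penalty every weight contributes in proportion to its magnitude.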


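For regularized models in scikit-learn that support L1 regularization, we can simply set the `penalty` parameter to `'l1'` to obtain a sparse solution (many $w_j$ are zero). Then, only the important features are selected. For `LogisticRegression`, the solver must also support the L1 penalty, for example `'liblinear'` or `'saga'`.

In [48]:
from sklearn.linear_model import LogisticRegression
# the default 'lbfgs' solver does not support L1, so pick one that does
LogisticRegression(penalty='l1', solver='liblinear')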

Applied to the standardized Wine data:

In [49]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1.0, multi_class='auto', solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

Training accuracy: 1.0
Test accuracy: 1.0


The weight array accessed via the `lr.coef_` attribute contains three rows of weight coefficients, one weight vector for each class. Each row consists of 13 weights, one for each of the 13 features in the Wine data. Many of the weights are exactly zero, which is the sparsity that L1 regularization produces.

In [50]:
lr.coef_

array([[ 1.24588709,  0.1811891 ,  0.74188464, -1.1599387 ,  0.        ,
         0.        ,  1.17764303,  0.        ,  0.        ,  0.        ,
         0.        ,  0.54025137,  2.51162868],
       [-1.53744228, -0.38707589, -0.99560534,  0.36487634, -0.05950696,
         0.        ,  0.66814793,  0.        ,  0.        , -1.933852  ,
         1.23396174,  0.        , -2.23189633],
       [ 0.13587886,  0.16832847,  0.35726326,  0.        ,  0.        ,
         0.        , -2.43799169,  0.        ,  0.        ,  1.56363602,
        -0.81893931, -0.49237373,  0.        ]])
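How sparse the solution becomes depends on the regularization strength. In scikit-learn, `C` is the inverse of that strength, so smaller `C` means a stronger penalty. The sketch below counts the non-zero weights for a few values of `C`. It is an illustration under assumptions: it uses the bundled `load_wine` data instead of the UCI download, and it wraps the model in `OneVsRestClassifier` (our choice, not part of the original cell) so that the one-weight-vector-per-class structure is explicit.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# standardize using training statistics only
X_tr_std = StandardScaler().fit_transform(X_tr)

nonzero_counts = {}
for c in [0.01, 0.1, 1.0, 10.0]:
    ovr = OneVsRestClassifier(
        LogisticRegression(penalty='l1', C=c, solver='liblinear',
                           random_state=0))
    ovr.fit(X_tr_std, y_tr)
    # stack the per-class weight vectors into a (3, 13) array
    coefs = np.vstack([est.coef_ for est in ovr.estimators_])
    nonzero_counts[c] = int(np.count_nonzero(coefs))
    print('C=%-5s non-zero weights: %d of %d' % (c, nonzero_counts[c], coefs.size))
```

With a very small `C` most or all weights are driven to zero, while a large `C` lets more features enter the model.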

There are other methods for feature selection. We skip **sequential feature selection** here because it tends to perform poorly and is rarely used in practice. We will discuss **dimensionality reduction** later.

## Assessing feature importance with Random Forests

**Random Forest:**

A random forest averages multiple decision trees that individually suffer from high variance, in order to build a more robust model that generalizes better and is less susceptible to overfitting.

The random forest algorithm can be summarized in four simple steps:

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
 - a. Randomly select $d$ features without replacement.
 - b. Split the node using the feature that provides the best split according to the objective function, for instance, maximizing the information gain.
3. Repeat the above steps many times to get a forest.
4. Aggregate the prediction by each tree to assign the class label by majority vote.
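The four steps above can be sketched by hand with `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented: the number of trees (25), the seed, and the use of the bundled `load_wine` data are our assumptions, and `max_features='sqrt'` is used so that each node considers a random subset of $d = \sqrt{m}$ features.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

rng = np.random.RandomState(1)
n_trees = 25
trees = []

for _ in range(n_trees):                       # step 3: repeat steps 1 and 2
    # step 1: bootstrap sample of size n, drawn with replacement
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    # step 2: grow a tree; at each node a random subset of features is
    # considered and the best split by information gain (entropy) is chosen
    tree = DecisionTreeClassifier(criterion='entropy',
                                  max_features='sqrt',
                                  random_state=rng.randint(0, 2**31 - 1))
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# step 4: majority vote over the predictions of all trees
all_preds = np.array([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
majority = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), 0, all_preds)

accuracy = np.mean(majority == y_te)
print('test accuracy of the hand-built forest: %.3f' % accuracy)
```

scikit-learn's `RandomForestClassifier` differs slightly in step 4: it averages the class probabilities predicted by the trees instead of counting hard votes.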


In [51]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

print(importances)

feature_labels = df_wine.columns[1:]
print(feature_labels)

[0.11852942 0.02564836 0.01327854 0.02236594 0.03135708 0.05087243
 0.17475098 0.01335393 0.02556988 0.1439199  0.058739   0.13616194
 0.1854526 ]
Index(['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
       'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
       'Proanthocyanins', 'Color intensity', 'Hue',
       'OD280/OD315 of diluted wines', 'Proline'],
      dtype='object')


In [52]:
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feature_labels[indices[f]], 
                            importances[indices[f]]))

 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
 6) Hue                            0.058739
 7) Total phenols                  0.050872
 8) Magnesium                      0.031357
 9) Malic acid                     0.025648
10) Proanthocyanins                0.025570
11) Alcalinity of ash              0.022366
12) Nonflavanoid phenols           0.013354
13) Ash                            0.013279
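Once the importances are known, they can be used to keep only the most relevant features. One way to sketch this is scikit-learn's `SelectFromModel` with an importance threshold. The threshold of 0.1, the 100 trees, and the use of the bundled `load_wine` data are assumptions made for this illustration; the exact set of selected features may vary with the data split and the library version.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

wine = load_wine()
X_demo, y_demo = wine.data, wine.target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# fit the forest on the training data and keep features whose
# importance is at least 0.1 (hypothetical threshold)
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1),
    threshold=0.1)
sfm.fit(X_tr, y_tr)

X_tr_selected = sfm.transform(X_tr)
X_te_selected = sfm.transform(X_te)   # same columns are kept for the test set

selected_names = np.array(wine.feature_names)[sfm.get_support()]
print('number of selected features:', X_tr_selected.shape[1])
print('selected features:', list(selected_names))
```

As with the scalers and imputers above, the selector is fitted on the training set and then applied unchanged to the test set.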
