# Data Preprocessing and Feature Selection

## Loading Data 

Scikit-Learn estimators expect numeric data stored as NumPy arrays or SciPy sparse matrices.

There are three common ways to get data into that form:

* A Pandas DataFrame in which every column is numeric; categorical variables must be converted to numeric values first.
* The loadtxt() function from NumPy; here too, every column must contain numeric values.
* The external "sklearn-pandas" library.
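
As a rough illustration of the first two routes, the sketch below loads the same small table once through Pandas and once through NumPy. The `csv_text` string and its column names are made up for this example and are not part of the Titanic file used later.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical all-numeric CSV content, used only for illustration.
csv_text = "age,fare,survived\n29.0,211.34,1\n1.0,151.55,1\n2.0,151.55,0\n"

# Route 1: Pandas DataFrame -> NumPy array
df_demo = pd.read_csv(io.StringIO(csv_text))
X_from_pandas = df_demo.values            # 2-D array, upcast to float

# Route 2: numpy.loadtxt, skipping the header row
X_from_numpy = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(X_from_pandas.shape)
print(X_from_numpy.shape)
```

Both routes end in a plain 2-D array, which is what the estimators in the rest of this notebook consume.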

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display  #To display multiple outputs from a cell
pd.set_option('display.max_rows', 40) #Limit max number of rows to display
pd.set_option('expand_frame_repr', False) #To display all columns in a single horizontal view without line wrapping
import warnings
warnings.simplefilter('ignore')

### Handling Categorical variables - Ordinal
The order of the values of an ordinal categorical variable carries information. We must not lose that information while performing the transformation.

In [2]:
df = pd.read_csv("./TITANIC_DP_FS.csv")
display(df.dtypes)
df = df.drop('Name', axis=1)

class_mapping = {'1st': 3,
                '2nd': 2,
                '3rd': 1}

survival_mapping = {'Survived': 1,
                   'Died': 0}

df['Class'] = df['Class'].map(class_mapping)
df['Survival'] = df['Survival'].map(survival_mapping)
display(df)


Name         object
Age         float64
Gender       object
Class        object
Fare        float64
Survival     object
dtype: object

Unnamed: 0,Age,Gender,Class,Fare,Survival
0,29.0,Female,3,211.34,1
1,1.0,Male,3,151.55,1
2,2.0,Female,3,151.55,0
3,30.0,Male,3,151.55,0
4,25.0,Female,3,151.55,0
5,48.0,Male,3,26.55,1
6,63.0,Female,3,77.96,1
7,39.0,Male,3,0.00,0
8,53.0,Female,3,51.48,1
9,71.0,Male,3,49.50,0


### Handling Categorical variables - Nominal
The order of the values of a nominal categorical variable carries no information. We must not introduce any artificial ordering while performing the transformation.

In [3]:
#df = pd.read_csv("C:\Users\prati\Desktop\My_Creations\Data_Preprocessing-Scikit\TITANIC_NM.csv")
#df = df.drop('Name', axis=1)

df_temp = pd.get_dummies(df[['Gender']])
display(df_temp)

df = df.drop('Gender', axis=1)
display(df)


Unnamed: 0,Gender_Female,Gender_Male
0,1,0
1,0,1
2,1,0
3,0,1
4,1,0
5,0,1
6,1,0
7,0,1
8,1,0
9,0,1


Unnamed: 0,Age,Class,Fare,Survival
0,29.0,3,211.34,1
1,1.0,3,151.55,1
2,2.0,3,151.55,0
3,30.0,3,151.55,0
4,25.0,3,151.55,0
5,48.0,3,26.55,1
6,63.0,3,77.96,1
7,39.0,3,0.00,0
8,53.0,3,51.48,1
9,71.0,3,49.50,0


In [4]:
df = pd.concat([df_temp, df], axis = 1)
display(df)

Unnamed: 0,Gender_Female,Gender_Male,Age,Class,Fare,Survival
0,1,0,29.0,3,211.34,1
1,0,1,1.0,3,151.55,1
2,1,0,2.0,3,151.55,0
3,0,1,30.0,3,151.55,0
4,1,0,25.0,3,151.55,0
5,0,1,48.0,3,26.55,1
6,1,0,63.0,3,77.96,1
7,0,1,39.0,3,0.00,0
8,1,0,53.0,3,51.48,1
9,0,1,71.0,3,49.50,0
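
With only two categories, the two indicator columns are perfectly redundant: each row sums to 1, so one column determines the other. A minimal sketch of one common way to avoid this, assuming a toy `Gender` column (the `toy` frame is hypothetical), is the `drop_first` option of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy column, not the Titanic data.
toy = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Male"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy[["Gender"]], dtype=int)

# Reduced encoding: the first category becomes the implicit baseline.
reduced = pd.get_dummies(toy[["Gender"]], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether to drop a column is a modelling choice. This notebook keeps both columns, which is why the gender coefficients in the later linear models mirror or cancel each other.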


In [5]:
display(df.iloc[:, 0:5].values)

display(df.iloc[:, 5:6].values)

array([[   1.  ,    0.  ,   29.  ,    3.  ,  211.34],
       [   0.  ,    1.  ,    1.  ,    3.  ,  151.55],
       [   1.  ,    0.  ,    2.  ,    3.  ,  151.55],
       ..., 
       [   0.  ,    1.  ,   27.  ,    1.  ,    7.23],
       [   0.  ,    1.  ,   27.  ,    1.  ,    7.23],
       [   0.  ,    1.  ,   29.  ,    1.  ,    7.88]])

array([[1],
       [1],
       [0],
       ..., 
       [0],
       [0],
       [0]], dtype=int64)

In [6]:

X, y = df.iloc[:, 0:5].values, df.iloc[:, 5:6].values




In [7]:
X.shape

(1045L, 5L)

In [8]:
y.shape

(1045L, 1L)

In [9]:
X   # 2-D array

array([[   1.  ,    0.  ,   29.  ,    3.  ,  211.34],
       [   0.  ,    1.  ,    1.  ,    3.  ,  151.55],
       [   1.  ,    0.  ,    2.  ,    3.  ,  151.55],
       ..., 
       [   0.  ,    1.  ,   27.  ,    1.  ,    7.23],
       [   0.  ,    1.  ,   27.  ,    1.  ,    7.23],
       [   0.  ,    1.  ,   29.  ,    1.  ,    7.88]])

In [10]:
y  # column vector: 2-D array of shape (n_samples, 1)

array([[1],
       [1],
       [0],
       ..., 
       [0],
       [0],
       [0]], dtype=int64)

In [11]:
X[[0, 1, 2]]   # rows 0,1,2

array([[   1.  ,    0.  ,   29.  ,    3.  ,  211.34],
       [   0.  ,    1.  ,    1.  ,    3.  ,  151.55],
       [   1.  ,    0.  ,    2.  ,    3.  ,  151.55]])

In [12]:
X[:5]   # 5 first rows

array([[   1.  ,    0.  ,   29.  ,    3.  ,  211.34],
       [   0.  ,    1.  ,    1.  ,    3.  ,  151.55],
       [   1.  ,    0.  ,    2.  ,    3.  ,  151.55],
       [   0.  ,    1.  ,   30.  ,    3.  ,  151.55],
       [   1.  ,    0.  ,   25.  ,    3.  ,  151.55]])

In [13]:
X[500:510, 0]  # values from row 500 to row 509 at column 0 (the end index is exclusive)

array([ 0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.])

In [14]:
y[0:2]  # first two values

array([[1],
       [1]], dtype=int64)

In [15]:
y[:5]  # first five values

array([[1],
       [1],
       [0],
       [0],
       [0]], dtype=int64)

In [16]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    

display(X_train)
display(y_train)




array([[  1.  ,   0.  ,  37.  ,   1.  ,   9.59],
       [  1.  ,   0.  ,  16.  ,   3.  ,  86.5 ],
       [  1.  ,   0.  ,   9.  ,   1.  ,  15.25],
       ..., 
       [  0.  ,   1.  ,  22.  ,   1.  ,   7.78],
       [  0.  ,   1.  ,  26.  ,   1.  ,   7.89],
       [  1.  ,   0.  ,  30.  ,   1.  ,   6.95]])

array([[0],
       [1],
       [1],
       ...
(output truncated)

## API design in Scikit-Learn

Objects in scikit-learn can be grouped by the methods they expose:
* Estimator - any object with a fit() method. Most estimators provide fit() and predict(); some also provide transform().
* Transformer - any object with a transform() method, usually alongside fit(). Transformers are therefore a subset of estimators.

In general, every such object in scikit-learn can be called an estimator, since they all estimate parameters from data through the fit() method.
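
The distinction can be sketched with two standard classes, assuming small synthetic data (`X_demo` and `y_demo` are made up for this example): StandardScaler acts as a transformer, LogisticRegression as a predictor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Transformer: fit() learns per-feature statistics, transform() applies them.
scaler = StandardScaler()
scaler.fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Predictor: fit() learns the model, predict() applies it.
clf = LogisticRegression()
clf.fit(X_scaled, y_demo)
pred = clf.predict(X_scaled)

print(X_scaled.mean(axis=0).round(6))
print((pred == y_demo).mean())
```

Both objects share fit(), which is why both count as estimators.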

## Scaling features

In [17]:
# Normalization


from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # fit + transform
X_test_norm = mms.transform(X_test)        # only transform

# Distance-based methods (and squared-error based ones) are sensitive to feature ranges, so scale the data first;
# otherwise a less significant feature can end up dominating the objective function simply because of its larger range.
# Features measured in different units should also be scaled, so that each feature starts with equal weight,
# which generally leads to a better prediction model.


display(mms.data_min_ )    # Per feature minimum seen in the data (training data - as fit is applied only on training data)
display(mms.data_max_)     # Per feature maximum seen in the data  (training data)
display(mms.data_range_)   # Per feature range (data_max_ - data_min_) seen in the data (training data)

array([ 0.,  0.,  0.,  1.,  0.])

array([   1.  ,    1.  ,   76.  ,    3.  ,  512.33])

array([   1.  ,    1.  ,   76.  ,    2.  ,  512.33])

In [18]:
# Standardization

# Standardization can be more practical than min-max scaling, as many algorithms behave best when each feature has zero mean and unit variance
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)   # fit + transform
X_test_std = stdsc.transform(X_test)         # only transform

# Parts of a learner's objective function, such as the L1 and L2 regularizers of linear models (logistic regression included)
# and the RBF kernel of SVMs, assume that all features are centered around zero and have variance of the same order.

display(stdsc.mean_)     # Per feature mean seen in the data (training data)
display(stdsc.var_)      # Per feature variance seen in the data (training data)

array([  0.37209302,   0.62790698,  30.16689466,   1.8125855 ,  37.76997264])

array([  2.33639805e-01,   2.33639805e-01,   1.96100737e+02,
         7.21373753e-01,   3.31483610e+03])

In [19]:
# Robust Scaling - robust to outliers
# This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). 
# The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

from sklearn.preprocessing import RobustScaler

rbtsc = RobustScaler()
X_train_rbt = rbtsc.fit_transform(X_train)   # fit + transform
X_test_rbt = rbtsc.transform(X_test)         # only transform

display(rbtsc.center_)  # Per feature median seen in the data (training data)
display(rbtsc.scale_)   # Per feature interquartile range seen in the data (training data)

array([  0.  ,   1.  ,  29.  ,   2.  ,  15.85])

array([  1. ,   1. ,  18. ,   2. ,  28.7])

Note: Choosing between the scaling methods is not obvious; the decision depends on both your data and the learner you plan to use. As a starting point, you can try all of the above methods and compare their cross-validation scores.
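
One way to run such a comparison is sketched below. It assumes synthetic data from `make_classification` and a logistic regression learner, both chosen only for illustration; the scaler sits inside a pipeline so that it is re-fitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, random_state=0)

scores = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Scaling happens inside each fold, so the held-out part never influences the fit.
    pipe = make_pipeline(scaler, LogisticRegression())
    scores[type(scaler).__name__] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

for name, score in sorted(scores.items()):
    print("{}: {:.3f}".format(name, score))
```

On real data the differences between scalers can be small, so the scores are best read together with their variation across folds.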

## Feature Selection

### Univariate methods
* Univariate feature selection methods evaluate each feature independently with respect to the response variable.
* They are generally most useful for getting a better understanding of the data, its structure and characteristics.
* They are not recommended as the only selection method, since they ignore correlation between features.
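
A minimal sketch of univariate selection, assuming synthetic data with one informative feature and two noise features (all names here are made up), and using SelectKBest with the ANOVA F-test as the scoring function:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
n = 200
y_demo = rng.randint(0, 2, size=n)
informative = y_demo + rng.normal(scale=0.5, size=n)   # depends on the label
noise = rng.normal(size=(n, 2))                        # unrelated to the label
X_demo = np.column_stack([informative, noise])

# Score every feature on its own and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X_demo, y_demo)

print(selector.scores_)
print(selector.get_support())
```

Each score is computed from one feature and the target only, which is exactly why redundancy between features goes unnoticed.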

### Linear Models and Regularization
* When all features are on the same scale, the coefficients of regression models can be used for selecting and interpreting features: the most important features should have the largest coefficients (in absolute value) in the model, while features uncorrelated with the output variable should have coefficient values close to zero

* When there are multiple (linearly) correlated features (as is the case with many real-life datasets), the model becomes unstable, meaning that small changes in the data can cause large changes in the model (i.e. in the coefficient values), making model interpretation very difficult (the so-called multicollinearity problem)

* This applies to linear regression as well as logistic regression 
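
The instability can be seen in a small experiment. The sketch below assumes synthetic data in which `x2` is almost a copy of `x1`, and refits an unregularized linear regression on a few bootstrap resamples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)       # nearly identical to x1
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + rng.normal(scale=0.5, size=n)

coefs = []
for seed in range(3):
    # Bootstrap resample: a small change in the data.
    idx = np.random.RandomState(seed).choice(n, size=n, replace=True)
    coef = LinearRegression().fit(X_demo[idx], y_demo[idx]).coef_
    coefs.append(coef)
    print(seed, np.round(coef, 2), "sum = {:.2f}".format(coef.sum()))
```

The sum of the two coefficients should stay close to 2 in every run, while the split between them is poorly determined and can change a lot from one resample to the next.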

### Regularized Models
* Regularization is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalization. Instead of minimizing a loss function E(X,Y), the loss function to minimize becomes E(X,Y)+α∥w∥, where w is the vector of model coefficients, ∥⋅∥ is typically L1 or L2 norm and α is a tunable free parameter, specifying the amount of regularization (so α=0 implies an unregularized model)
* For regression models, the two widely used regularization methods are L1 and L2 regularization, called lasso and ridge regression respectively when applied to linear regression; the same penalties can also be applied to logistic regression

### L1 regularization / Lasso
* L1 regularization adds a penalty α∑|wi|  to the loss function (L1-norm). Since each non-zero coefficient adds to the penalty, it forces weak features to have zero as coefficients. Thus L1 regularization produces sparse solutions, inherently performing feature selection.
* If we increase α further, the solution would be sparser and sparser, i.e. more and more features would have 0 as coefficients
* L1 regularized regression is unstable in a similar way as unregularized linear models are, meaning that the coefficients (and thus feature ranks) can vary significantly even on small data changes when there are correlated features in the data
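
The effect of α on sparsity can be sketched on synthetic data (the data and the α grid are assumptions made for this example): as α grows, more coefficients become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data, for illustration only: three relevant and three irrelevant features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([3.0, 1.5, 0.5, 0.0, 0.0, 0.0])
y_demo = X_demo.dot(true_coef) + rng.normal(scale=0.5, size=200)

n_zero = []
for alpha in (0.01, 0.1, 1.0, 5.0):
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    n_zero.append(int(np.sum(coef == 0)))       # number of eliminated features
    print(alpha, np.round(coef, 2))

print(n_zero)
```

For a large enough α every coefficient is driven to zero, so α is usually tuned by cross-validation rather than fixed by hand.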

In [20]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
print(X_train_std.shape)
print(y_train.shape)
lasso.fit(X_train_std, y_train)
feature_names = df.columns[df.columns != "Survival"]   # every column except the target
print("Lasso coef: {}".format(lasso.coef_))
print("Names: {}".format(feature_names))
print("Lasso Model: {}".format(list(zip(feature_names, [round(float(c), 6) for c in lasso.coef_]))))

(731L, 5L)
(731L, 1L)
Lasso coef:  [  2.25089898e-01  -3.15904773e-16  -5.45199511e-02   1.44097071e-01
   7.87665645e-03]
Names:  Index([u'Gender_Female', u'Gender_Male', u'Age', u'Class', u'Fare'], dtype='object')
Lasso Model:  [('Gender_Female', 0.22509), ('Gender_Male', -0.0), ('Age', -0.05452), ('Class', 0.144097), ('Fare', 0.007877)]


In [21]:
from sklearn.linear_model import LogisticRegression
# liblinear supports the L1 penalty; the default solver of recent scikit-learn versions (lbfgs) does not
lasso_logistic = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lasso_logistic.fit(X_train_std, y_train)
print("Lasso Logistic coef: {}".format(lasso_logistic.coef_))
print("Names: {}".format(df.columns[df.columns != "Survival"]))

# penalty - can be 'l1' or 'l2'
# C - Inverse of regularization strength (1/α), smaller values specify stronger regularization


Lasso Logistic coef:  [[  8.31014405e-05  -1.08095979e+00  -2.93505380e-01   7.91465302e-01
    4.09594372e-02]]
Names:  Index([u'Gender_Female', u'Gender_Male', u'Age', u'Class', u'Fare'], dtype='object')


### L2 regularization / Ridge regression
* L2 regularization (called ridge regression for linear regression) adds the L2 norm penalty (α∑wi<sup>2</sup>) to the loss function
* Since the coefficients are squared in the penalty expression, it has a different effect from the L1 norm: it forces the coefficient values to be spread out more equally. For correlated features, this means that they tend to get similar coefficients
* Example: Y=X1+X2, with strongly correlated X1 and X2. For L1, the penalty is the same whether the learned model is Y=1∗X1+1∗X2 or Y=2∗X1+0∗X2; in both cases the penalty is 2∗α. For L2, however, the first model's penalty is (1<sup>2</sup>+1<sup>2</sup>)α=2α, while the second model's penalty is (2<sup>2</sup>+0<sup>2</sup>)α=4α
* The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularized or L1 models)
* While L2 regularization does not perform feature selection the same way as L1 does, it is more useful for feature "interpretation": a predictive feature will get a non-zero coefficient, which is often not the case with L1
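
The arithmetic of the Y=X1+X2 example can be checked directly. In this small sketch `penalties` is a helper written for this example, not a scikit-learn function, and α is set to 1.

```python
import numpy as np

def penalties(w, alpha=1.0):
    """Return the (L1, L2) penalty terms for a coefficient vector w."""
    w = np.asarray(w, dtype=float)
    return alpha * np.sum(np.abs(w)), alpha * np.sum(w ** 2)

for w in ([1.0, 1.0], [2.0, 0.0]):
    l1, l2 = penalties(w)
    print("w = {}: L1 penalty = {}, L2 penalty = {}".format(w, l1, l2))
```

L1 cannot tell the two models apart (both cost 2α), while L2 prefers the evenly spread coefficients (2α against 4α).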

In [22]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=10)
ridge.fit(X_train_std, y_train)
print("Ridge coef: {}".format(ridge.coef_))
print("Names: {}".format(df.columns[df.columns != "Survival"]))


Ridge coef:  [[ 0.11503442 -0.11503442 -0.06838653  0.1535824   0.01403434]]
Names:  Index([u'Gender_Female', u'Gender_Male', u'Age', u'Class', u'Fare'], dtype='object')


### Random Forest
* A random forest provides feature importance scores as a by-product of training, which gives it built-in feature selection characteristics
* Note: If two or more features are highly correlated, one feature may be ranked very highly while the other correlated feature may be ranked lower. This is not an issue when we are concerned with prediction accuracy but it can lead to incorrect conclusion about feature importances when interpretation is concerned
* One thing to point out though is that the difficulty of interpreting the importance/ranking of correlated variables is not random forest specific, but applies to most model based feature selection methods

In [23]:
from sklearn.ensemble import RandomForestClassifier
feature_labels = df.columns[:-1]   # every column except the last one (the target)
rf = RandomForestClassifier(n_estimators=1000)  # no. of trees to be used in random forest
rf.fit(X_train_std, y_train)
print("Feature Importances: {}".format(rf.feature_importances_))
ranked = sorted(zip(feature_labels, rf.feature_importances_), key=lambda x: x[1], reverse=True)
print("Ranked List of feature importances: {}".format(ranked))

Feature Importances:  [ 0.13770352  0.14146328  0.3019057   0.09055831  0.32836919]
Ranked List of feature importances:  [('Fare', 0.32836918531301218), ('Age', 0.30190570401439798), ('Gender_Male', 0.14146328414486936), ('Gender_Female', 0.13770351832408353), ('Class', 0.090558308203637153)]


### Recursive Feature Elimination
* First, the estimator is trained on the initial set of features and a weight is assigned to each one of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features is reached
* The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. Just as non-regularized regression can be unstable, so can RFE when utilizing it, while using ridge regression can provide more stable results.

In [24]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
rfe = RFE(LinearRegression(), n_features_to_select=1) # rank all features, i.e continue the elimination until the last one
rfe.fit(X_train_std, y_train)
display(rfe.ranking_)
display(df.columns[:-1])
print("Ranked List of feature importances: {}".format(sorted(zip(df.columns[:-1], rfe.ranking_), key=lambda x: x[1])))

array([3, 1, 4, 2, 5])

Index([u'Gender_Female', u'Gender_Male', u'Age', u'Class', u'Fare'], dtype='object')

Ranked List of feature importances:  [('Gender_Male', 1), ('Class', 2), ('Gender_Female', 3), ('Age', 4), ('Fare', 5)]
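
Since the stability of RFE depends on the underlying model, the same procedure can be run with a ridge regressor in place of plain linear regression. The sketch below assumes synthetic regression data rather than the Titanic features, and an arbitrary α of 1.0.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Synthetic data, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 n_informative=2, random_state=0)

# Eliminate one feature per step until a single feature is left,
# which produces a full ranking (1 = eliminated last).
rfe_ridge = RFE(Ridge(alpha=1.0), n_features_to_select=1)
rfe_ridge.fit(X_demo, y_demo)

print(rfe_ridge.ranking_)
```

Only the estimator passed to RFE changes; the elimination loop itself is identical.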


### Stability Selection
* The high level idea is to apply a feature selection algorithm on different subsets of data and with different subsets of features. After repeating the process a number of times, the selection results can be aggregated, for example by checking how many times a feature ended up being selected as important when it was in an inspected feature subset.
* We can expect strong features to have scores close to 100%, since they are always selected when possible. Weaker, but still relevant features will also have non-zero scores.
* Scikit-learn implemented stability selection in the RandomizedLasso and RandomizedLogisticRegression classes. Both were deprecated in version 0.19 and removed in version 0.21, so the cell below needs an older release.
* Stability selection is useful both for pure feature selection to reduce overfitting and for data interpretation: in general, good features won't get 0 as coefficients just because there are similar, correlated features in the dataset (as is the case with lasso).

In [25]:
from sklearn.linear_model import RandomizedLasso  # requires scikit-learn < 0.21, where this class still exists
rlasso = RandomizedLasso(alpha=0.005)
rlasso.fit(X_train_std, y_train)
print("Ranked List of feature importances: {}".format(sorted(zip(df.columns[:-1], rlasso.scores_), key=lambda x: x[1], reverse=True)))

Ranked List of feature importances:  [('Gender_Female', 0.55500000000000005), ('Class', 0.47999999999999998), ('Gender_Male', 0.255), ('Fare', 0.044999999999999998), ('Age', 0.0)]
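
On scikit-learn releases without RandomizedLasso, the high-level idea can be approximated by hand. The sketch below is a simplified stand-in, not the RandomizedLasso algorithm: it only subsamples rows and does not randomize the penalty or the feature subsets. The function name `selection_frequency`, its parameters and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequency(X, y, alpha=0.05, n_rounds=100, sample_fraction=0.75, seed=0):
    """Fraction of subsampled lasso fits in which each feature gets a non-zero coefficient."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        rows = rng.choice(n_samples, size=n_sub, replace=False)   # random subset of rows
        coef = Lasso(alpha=alpha).fit(X[rows], y[rows]).coef_
        counts += (coef != 0)
    return counts / n_rounds

# Synthetic data, for illustration only: feature 0 is relevant, the rest are noise.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

freq = selection_frequency(X_demo, y_demo)
print(np.round(freq, 2))
```

A strong feature should be selected in every round, while noise features are selected only occasionally, if at all.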




* When selecting top features for model performance improvement, it is easy to verify if a particular method works well against alternatives simply by doing cross-validation.
* It’s not as straightforward when using feature ranking for data interpretation, where stability of the ranking method is crucial and a method that doesn’t have this property (such as lasso) could easily lead to incorrect conclusions.
* What can help there is subsampling the data and running the selection algorithms on the subsets. If the results are consistent across the subsets, it is relatively safe to trust the stability of the method on this particular data, and therefore straightforward to interpret the data in terms of the ranking.
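
Such a consistency check can be sketched as follows, assuming synthetic data with clearly separated effect sizes and a ridge model as the ranking method (both are illustrative choices): the ranking is recomputed on several random subsets and the results are compared.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo.dot(np.array([3.0, 2.0, 1.0, 0.0])) + rng.normal(scale=0.5, size=300)

rankings = []
for seed in range(5):
    # Random subset of 200 of the 300 rows.
    rows = np.random.RandomState(seed).choice(300, size=200, replace=False)
    coef = Ridge(alpha=1.0).fit(X_demo[rows], y_demo[rows]).coef_
    # Feature indices ordered from largest to smallest absolute coefficient.
    rankings.append(tuple(int(i) for i in np.argsort(-np.abs(coef))))

print(rankings)
print("consistent: {}".format(len(set(rankings)) == 1))
```

If the rankings disagree between subsets, the ranking should not be used for interpretation without further investigation.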