# Feature Selection Lab

In this lab we will explore feature selection on the Titanic dataset. First, let's load a few things:

- Standard packages
- The training set from lab 2.3
- The union we have saved in lab 2.3


You can load the titanic data as follows:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

In [15]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sqlalchemy import create_engine
engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

df = pd.read_sql('SELECT * FROM train', engine)

In [16]:
import os
os.getcwd()

'/Users/HudsonCavanagh/GA_dsi-projects/weekly_work/week05'

In [17]:
import gzip
import dill


with gzip.open('/Users/HudsonCavanagh/GA_dsi-projects/weekly_work/week05/assets/datasets/union.dill.gz') as fin:
    union = dill.load(fin)
    
X = df[[u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked']]
y = df[u'Survived']

X_transf = union.fit_transform(X)
X_transf

array([[-0.5924806 ,  0.        ,  0.        , ...,  1.        ,
         1.        , -0.50244517],
       [ 0.63878901,  1.        ,  0.        , ...,  0.        ,
         0.        ,  0.78684529],
       [-0.2846632 ,  0.        ,  0.        , ...,  1.        ,
         0.        , -0.48885426],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  1.        ,
         0.        , -0.17626324],
       [-0.2846632 ,  1.        ,  0.        , ...,  0.        ,
         1.        , -0.04438104],
       [ 0.17706291,  0.        ,  0.        , ...,  0.        ,
         1.        , -0.49237783]])

## 1. Column names

Uh oh, we have lost the column names along the way! We need to manually add them:
- age_pipe => 'scaled_age'
- one_hot_pipe => 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S'
- gender_pipe => 'male'
- fare_pipe => 'scaled_fare'

Now we need to:

1. Create a new pandas dataframe called `Xt` with the appropriate column names and fill it with the `X_transf` data.
2. Notice that the current pipeline completely discards the columns `SibSp` and `Parch`. Stack them, unchanged, onto the new dataframe.


In [27]:
col_list = ['scaled_age', 'Pclass_1', 'Pclass_2', 'Pclass_3',
            'Embarked_C', 'Embarked_Q', 'Embarked_S',
            'male', 'scaled_fare']

Xt = pd.DataFrame(X_transf, columns=col_list)
Xf = pd.concat([Xt,df[[u'SibSp', u'Parch']]], axis=1)
Xf.head(10)

Unnamed: 0,scaled_age,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,male,scaled_fare,SibSp,Parch
0,-0.592481,0,0,1,0,0,1,1,-0.502445,1,0
1,0.638789,1,0,0,1,0,0,0,0.786845,1,0
2,-0.284663,0,0,1,0,0,1,0,-0.488854,0,0
3,0.407926,1,0,0,0,0,1,0,0.42073,1,0
4,0.407926,0,0,1,0,0,1,1,-0.486337,0,0
5,0.0,0,0,1,0,1,0,1,-0.478116,0,0
6,1.870059,1,0,0,0,0,1,1,0.395814,0,0
7,-2.131568,0,0,1,0,0,1,1,-0.224083,3,1
8,-0.207709,0,0,1,0,0,1,0,-0.424256,0,2
9,-1.208115,0,1,0,1,0,0,0,-0.042956,1,0


## 2. Feature selection

Let's use the `SelectKBest` class in scikit-learn to see which are the top 5 features.

- What are the top 5 features for `Xt`?

=> store them in a variable called `kbest_columns`

In [45]:
from sklearn import feature_selection

# With no score_func given, SelectKBest uses the ANOVA F-test (f_classif)
checkKbest = feature_selection.SelectKBest(k=5)
# fit_transform fits the selector and returns the 5 selected columns in one step
Xf5_second = checkKbest.fit_transform(Xf, y)

In [46]:
Xf5_second

array([[ 0.        ,  1.        ,  0.        ,  1.        , -0.50244517],
       [ 1.        ,  0.        ,  1.        ,  0.        ,  0.78684529],
       [ 0.        ,  1.        ,  0.        ,  0.        , -0.48885426],
       ..., 
       [ 0.        ,  1.        ,  0.        ,  0.        , -0.17626324],
       [ 1.        ,  0.        ,  1.        ,  1.        , -0.04438104],
       [ 0.        ,  1.        ,  0.        ,  1.        , -0.49237783]])

In [52]:
Kbest_columns = Xf.columns[checkKbest.get_support()]
df_5best = pd.DataFrame(Xf5_second, columns=Kbest_columns)
df_5best

Unnamed: 0,Pclass_1,Pclass_3,Embarked_C,male,scaled_fare
0,0,1,0,1,-0.502445
1,1,0,1,0,0.786845
2,0,1,0,0,-0.488854
3,1,0,0,0,0.420730
4,0,1,0,1,-0.486337
5,0,1,0,1,-0.478116
6,1,0,0,1,0.395814
7,0,1,0,1,-0.224083
8,0,1,0,0,-0.424256
9,0,0,1,0,-0.042956
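
`SelectKBest` also exposes the score it computed for every column, which helps when you want to see how close the runners-up were. The following is a minimal sketch on synthetic data (the column names `informative`, `noise_a` and `noise_b` are made up for illustration and are not Titanic features):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data, assumed purely for illustration
rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(scale=0.5, size=200),  # tracks the target
    "noise_a": rng.normal(size=200),                          # unrelated to the target
    "noise_b": rng.normal(size=200),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

# scores_ holds one F-statistic per input column; larger means better class separation
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[selector.get_support()])

print(scores)
print(selected)
```

Applying the same `pd.Series(selector.scores_, index=...)` pattern to `checkKbest` and `Xf` should show the full ranking behind the five selected columns.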


## 3. Recursive Feature Elimination

scikit-learn also offers recursive feature elimination with cross-validation as a class named `RFECV`. Use it in combination with a logistic regression model to see which features this method would keep.

=> store them in a variable called `rfecv_columns`

In [61]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

lr = linear_model.LogisticRegression()
rf_model = feature_selection.RFECV(lr)
rf_model = rf_model.fit(Xf,y)

rfecv_columns = Xf.columns[rf_model.support_]

In [62]:
rf_model.support_

array([ True,  True,  True,  True,  True,  True, False,  True, False,
       False, False], dtype=bool)

In [63]:
rfecv_columns

Index([u'scaled_age', u'Pclass_1', u'Pclass_2', u'Pclass_3', u'Embarked_C',
       u'Embarked_Q', u'male'],
      dtype='object')
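
Besides `support_`, a fitted `RFECV` has a `ranking_` attribute: selected features get rank 1, and higher ranks were eliminated earlier. A small sketch on synthetic data (the feature names and the data-generating rule are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: only "signal" is related to the target
rng = np.random.RandomState(0)
n = 300
signal = rng.normal(size=n)
y_demo = (signal + rng.normal(scale=0.3, size=n) > 0).astype(int)
X_demo = pd.DataFrame({
    "signal": signal,
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

rfecv = RFECV(LogisticRegression(solver="liblinear"), cv=5).fit(X_demo, y_demo)

# Rank 1 marks a kept feature; larger ranks were dropped earlier in the elimination
ranking = pd.Series(rfecv.ranking_, index=X_demo.columns).sort_values()
kept = list(X_demo.columns[rfecv.support_])

print(ranking)
print(kept)
```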

## 4. Logistic regression coefficients

Let's see if the Logistic Regression coefficients correspond.

- Create a logistic regression model
- Perform grid search over penalty type and C strength in order to find the best parameters
- Sort the logistic regression coefficients by absolute value. Do the top 5 correspond to those above?
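> Answer: Not completely. Four of the top 5 match the `SelectKBest` picks, but `scaled_age` takes the place of `scaled_fare`. The difference could be due to scaling.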

=> choose which ones you would keep and store them in a variable called `lr_columns`

In [74]:
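# liblinear supports both the l1 and l2 penalties searched below;
# no fit is needed here because GridSearchCV fits the model itself
lr_model = linear_model.LogisticRegression(solver='liblinear')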

C_vals = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1','l2']

gs = GridSearchCV(lr_model, {'penalty': penalties, 'C': C_vals}, cv=15)
gs.fit(Xf, y)
gs.best_params_

{'C': 0.1, 'penalty': 'l2'}
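
To make the relationship between `best_params_` and `best_estimator_` concrete, here is a minimal sketch on synthetic data (the data and the reduced `C` grid are assumptions chosen to keep the example fast):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: only the first of three columns drives the target
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

# best_params_ is a plain dict of the winning settings ...
print(search.best_params_)
# ... while the fitted coefficients live on best_estimator_,
# which is refit on the full data by default
print(search.best_estimator_.coef_)
```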

In [88]:
# gs.best_params_.coeff_# doesn't work because output of best_params is a dict
cof = gs.best_estimator_.coef_
coef_df = pd.DataFrame(cof, columns = Xf.columns)

# cols_coefs = zip(Xf.columns, cof)
print(coef_df)


   scaled_age  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  \
0   -0.365079  0.853465  0.351926 -0.550361    0.358697    0.250084   

   Embarked_S     male  scaled_fare     SibSp     Parch  
0   -0.002806 -1.87419     0.222571 -0.230901 -0.010825  


In [107]:
coef_df_t =coef_df.transpose()
coef_df_t.columns = ['coefficient']
ranked_coefs = coef_df_t.abs().sort_values('coefficient', ascending=False)

# Keep the features whose absolute coefficient exceeds 0.3
coef_gs_cols = coef_df.columns[(coef_df.abs() > 0.3).values[0]]
lr_columns = coef_gs_cols  # the variable name the exercise asks for
coef_gs_cols

Index([u'scaled_age', u'Pclass_1', u'Pclass_2', u'Pclass_3', u'Embarked_C',
       u'male'],
      dtype='object')
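
The exercise asks for the coefficients sorted by absolute value. One way to do that, sketched here with the coefficient values copied from the printed output above so the block runs on its own:

```python
import pandas as pd

# Coefficients copied from the printed grid-search output
coefs = pd.Series({
    "scaled_age": -0.365079, "Pclass_1": 0.853465, "Pclass_2": 0.351926,
    "Pclass_3": -0.550361, "Embarked_C": 0.358697, "Embarked_Q": 0.250084,
    "Embarked_S": -0.002806, "male": -1.87419, "scaled_fare": 0.222571,
    "SibSp": -0.230901, "Parch": -0.010825,
})

# Rank by magnitude, ignoring the sign
ranked = coefs.abs().sort_values(ascending=False)
top5 = list(ranked.index[:5])
print(top5)
# → ['male', 'Pclass_1', 'Pclass_3', 'scaled_age', 'Embarked_C']
```

In the notebook itself, the equivalent starting point would be `coef_df.iloc[0]`, which gives the same values as a Series indexed by feature name.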

In [79]:
# best_estimator_ is already a LogisticRegression refit with the best parameters
gs_lr_model = gs.best_estimator_

# LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

## 5. Compare feature sets

Use the `best estimator` from question 4 on the 4 different feature sets:

- `kbest_columns`
- `rfecv_columns`
- `lr_columns`
- `all_columns`

Questions:

- Which scores the highest? (use cross_val_score)
- Is the difference significant?
> Answer: Not really. The mean scores only range from about 0.76 to 0.79.
- discuss in pairs

In [118]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
all_columns = Xf.columns

# cv=3 matches the default of the old cross_validation module
print("all columns cross val:", cross_val_score(gs.best_estimator_, Xf[all_columns], y, cv=3).mean(),
     "Kbest columns cross val:", cross_val_score(gs.best_estimator_, Xf[Kbest_columns], y, cv=3).mean(),
     "recursive columns cross val:", cross_val_score(gs.best_estimator_, Xf[rfecv_columns], y, cv=3).mean(),
     "GS logistic columns cross val:", cross_val_score(gs.best_estimator_, Xf[coef_gs_cols], y, cv=3).mean())

# All columns scored highest; among the reduced sets, the recursive (RFECV) columns were the best

print("recursive:", rfecv_columns, "logistic gs:", coef_gs_cols) #only difference is the latter drops Embarked_Q

('all columns cross val:', 0.79236812570145909, 'Kbest columns cross val:', 0.76206509539842882, 'recursive columns cross val:', 0.78451178451178449, 'GS logistic columns cross val:', 0.77890011223344569)
('recursive:', Index([u'scaled_age', u'Pclass_1', u'Pclass_2', u'Pclass_3', u'Embarked_C',
       u'Embarked_Q', u'male'],
      dtype='object'), 'logistic gs:', Index([u'scaled_age', u'Pclass_1', u'Pclass_2', u'Pclass_3', u'Embarked_C',
       u'male'],
      dtype='object'))
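
To judge whether such differences matter, it helps to look at the spread across folds and not only the mean. A rough sketch on synthetic data (the data, the two feature sets and the 5-fold setting are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: columns 0 and 1 carry signal, columns 2 and 3 are noise
rng = np.random.RandomState(0)
n = 300
X_demo = rng.normal(size=(n, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression(solver="liblinear")
feature_sets = {"all": [0, 1, 2, 3], "first_two": [0, 1]}

summary = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(model, X_demo[:, cols], y_demo, cv=5)
    summary[name] = (scores.mean(), scores.std())
    # If the means differ by less than the fold-to-fold spread,
    # the difference is unlikely to be meaningful
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```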


## Bonus

Use a bar chart to display the logistic regression coefficients. Start from the most negative on the left.
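
One possible way to draw this chart, sketched with the coefficient values printed in section 4 so that the block is self-contained (in the notebook you could start from `coef_df.iloc[0]`):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Coefficients copied from the grid-search output in section 4
coefs = pd.Series({
    "scaled_age": -0.365079, "Pclass_1": 0.853465, "Pclass_2": 0.351926,
    "Pclass_3": -0.550361, "Embarked_C": 0.358697, "Embarked_Q": 0.250084,
    "Embarked_S": -0.002806, "male": -1.87419, "scaled_fare": 0.222571,
    "SibSp": -0.230901, "Parch": -0.010825,
})

# Ascending sort puts the most negative coefficient on the left
ordered = coefs.sort_values()

fig, ax = plt.subplots(figsize=(8, 4))
ordered.plot(kind="bar", ax=ax)
ax.axhline(0, color="black", linewidth=0.8)  # reference line at zero
ax.set_ylabel("coefficient")
ax.set_title("Logistic regression coefficients")
fig.tight_layout()
```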