# Feature Selection Lab

In this lab we will explore feature selection on the Titanic dataset. First, let's load a few things:

- Standard packages
- The training set from lab 2.3
- The union we have saved in lab 2.3


You can load the titanic data as follows:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

In [3]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sqlalchemy import create_engine
engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

df = pd.read_sql('SELECT * FROM train', engine)

In [5]:
print(df.shape)
df.head()

(891, 13)


Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 2. Feature selection

Let's use the `SelectKBest` class in scikit-learn to find the top 5 features.

- What are the top 5 features for `Xt`?

=> store them in a variable called `kbest_columns`

In [6]:
from sklearn.feature_selection import SelectKBest

kbest = SelectKBest(k=5)
kbest

SelectKBest(k=5, score_func=<function f_classif at 0x7f55837de230>)

In [8]:
y = df['Survived']
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Features to prepare:

- `Pclass`: dummy variables
- `Sex`: transform to a dummy variable
- `Age`
- `Embarked`: dummy variables
- `SibSp` / `Parch`
- `Fare`

In [12]:
pclass_dummies = pd.get_dummies(df['Pclass'])
pclass_dummies = pclass_dummies[[1, 2]]
pclass_dummies.columns = ['pclass=1', 'pclass=2']
pclass_dummies.head()

Unnamed: 0,pclass=1,pclass=2
0,0.0,0.0
1,1.0,0.0
2,0.0,0.0
3,1.0,0.0
4,0.0,0.0


In [13]:
male = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
male.head()

0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

In [14]:
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='embarked')
embarked_dummies = embarked_dummies[['embarked_C', 'embarked_Q']]
embarked_dummies.head()

Unnamed: 0,embarked_C,embarked_Q
0,0.0,0.0
1,1.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


In [25]:
analytic_df = pclass_dummies.join(male)
analytic_df = analytic_df.join(df[['Age', 'SibSp', 'Parch']])
analytic_df = analytic_df.join(embarked_dummies)
analytic_df['Child'] = analytic_df['Age'].apply(lambda x: 1 if x < 12 else 0)
analytic_df['Old Person'] = analytic_df['Age'].apply(lambda x: 1 if x > 50 else 0)
analytic_df.head()

Unnamed: 0,pclass=1,pclass=2,Sex,Age,SibSp,Parch,embarked_C,embarked_Q,Child,Old Person
0,0.0,0.0,1,22.0,1,0,0.0,0.0,0,0
1,1.0,0.0,0,38.0,1,0,1.0,0.0,0,0
2,0.0,0.0,0,26.0,0,0,0.0,0.0,0,0
3,1.0,0.0,0,35.0,1,0,0.0,0.0,0,0
4,0.0,0.0,1,35.0,0,0,0.0,0.0,0,0


In [26]:
print(analytic_df.shape)
# Drop rows with missing values (Age is missing for some passengers)
analytic_df.dropna(inplace=True)
print(analytic_df.shape)

(891, 10)
(714, 10)


In [32]:
# Re-attach the target so y keeps only the rows that survived dropna()
drop_y = analytic_df.join(y, how='left')
print(drop_y.shape)
y = drop_y['Survived']
x = analytic_df
print(y.shape)
print(x.shape)

(714, 11)
(714,)
(714, 10)


In [33]:
kbest5 = kbest.fit_transform(x, y)
kbest5

array([[ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       ..., 
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.]])

In [37]:
kbest.get_params()

{'k': 5,
 'score_func': <function sklearn.feature_selection.univariate_selection.f_classif>}

In [42]:
kbest.scores_

array([  7.13662189e+01,   5.15136665e+00,   2.91287485e+02,
         4.27119493e+00,   2.14599289e-01,   6.25460704e+00,
         2.77276895e+01,   1.75232436e+00,   8.81191517e+00,
         1.13399986e+00])

In [39]:
kbest.pvalues_

array([  1.66459349e-16,   2.35263015e-02,   5.22470993e-55,
         3.91246540e-02,   6.43327731e-01,   1.26106500e-02,
         1.85139258e-07,   1.86009645e-01,   3.09315297e-03,
         2.87284760e-01])

In [40]:
kbest._get_param_names()

['k', 'score_func']
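
The arrays above are in the same order as the columns of `x`, so they are easier to read once paired with the column names. The following is a minimal sketch of that pairing, and of one way to fill `kbest_columns`, using `get_support()`. It runs on a small synthetic table (`toy_x`, `toy_y`, and the `signal_*` / `noise_*` column names are made up for illustration, not the Titanic data) and asks for `k=2` rather than 5.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-in for the lab's x and y (hypothetical data, not the Titanic table).
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=200), name="Survived")
toy_x = pd.DataFrame({
    "signal_a": toy_y + rng.normal(scale=0.5, size=200),   # related to the target
    "signal_b": -toy_y + rng.normal(scale=0.5, size=200),  # related to the target
    "noise_a": rng.normal(size=200),                       # unrelated
    "noise_b": rng.normal(size=200),                       # unrelated
})

selector = SelectKBest(score_func=f_classif, k=2).fit(toy_x, toy_y)

# Pair each F-score and p-value with its column name, highest score first.
score_table = pd.DataFrame({
    "feature": toy_x.columns,
    "score": selector.scores_,
    "pvalue": selector.pvalues_,
}).sort_values("score", ascending=False)

# get_support() returns a boolean mask over the input columns.
kbest_columns = list(toy_x.columns[selector.get_support()])

print(score_table)
print(kbest_columns)
```

In the lab itself the same two steps would presumably be applied to the fitted `kbest` object and `x.columns`.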

## 3. Recursive Feature Elimination

scikit-learn also offers recursive feature elimination as a class named `RFE`, and a cross-validated variant named `RFECV` that also chooses how many features to keep. Use `RFECV` in combination with a logistic regression model to see which features would be kept with this method.

=> store them in a variable called `rfecv_columns`

In [49]:
from sklearn.linear_model import LogisticRegression
# KFold lives in sklearn.model_selection from scikit-learn 0.18 onward
from sklearn.model_selection import KFold
from sklearn.feature_selection import RFE

logistic_regression_factory = LogisticRegression()

rfe_factory = RFE(estimator=logistic_regression_factory, step=1)

In [53]:
# Look at codealong
results_of_rfe = rfe_factory.fit(x, y)
results_of_rfe

RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
  estimator_params=None, n_features_to_select=None, step=1, verbose=0)
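
The cell above fits a plain `RFE`, while the exercise asks for `RFECV`. As a rough sketch of the cross-validated variant, the block below fits `RFECV` with a logistic regression and reads the kept columns from `support_`. The data is synthetic (`toy_x`, `toy_y`, and the column names are hypothetical), and `cv=5` with accuracy scoring is an assumption, not something the lab prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the lab's x and y (hypothetical data, not the Titanic table).
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=300), name="Survived")
toy_x = pd.DataFrame({
    "signal_a": toy_y + rng.normal(scale=0.5, size=300),
    "signal_b": -toy_y + rng.normal(scale=0.5, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

# RFECV repeatedly drops the weakest feature (step=1) and uses
# cross-validation to decide how many features to keep.
rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=5, scoring="accuracy")
rfecv.fit(toy_x, toy_y)

# support_ is a boolean mask of kept columns; kept features have ranking_ == 1.
rfecv_columns = list(toy_x.columns[rfecv.support_])

print(rfecv.n_features_)
print(rfecv_columns)
```

Which columns survive depends on the data and the folds, so the printed selection should be read as an illustration only.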

## 4. Logistic regression coefficients

Let's see if the Logistic Regression coefficients correspond.

- Create a logistic regression model
- Perform grid search over penalty type and C strength in order to find the best parameters
- Sort the logistic regression coefficients by absolute value. Do the top 5 correspond to those above?

=> choose which ones you would keep and store them in a variable called `lr_columns`
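
One possible shape for the steps above is sketched below: a grid search over `penalty` and `C`, followed by ranking the coefficients of the best model by absolute value. It assumes scikit-learn 0.18 or later (for `sklearn.model_selection`), uses the `liblinear` solver because it supports both L1 and L2 penalties, and runs on synthetic data. The grid values, `toy_x`, `toy_y`, and the column names are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the lab's x and y (hypothetical data, not the Titanic table).
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=300), name="Survived")
toy_x = pd.DataFrame({
    "signal_a": toy_y + rng.normal(scale=0.5, size=300),
    "signal_b": -toy_y + rng.normal(scale=0.5, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

# Search over penalty type and regularisation strength.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid.fit(toy_x, toy_y)
best_lr = grid.best_estimator_

# coef_ has shape (1, n_features) for a binary target.
coefs = pd.Series(best_lr.coef_[0], index=toy_x.columns)

# Reorder the coefficients by absolute value, largest first.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
lr_columns = list(ranked.index[:2])

print(grid.best_params_)
print(ranked)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so scaling `Age` and similar columns first may be worth considering.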

## 5. Compare feature sets

Use the `best estimator` from question 4 on the 4 different feature sets:

- `kbest_columns`
- `rfecv_columns`
- `lr_columns`
- `all_columns`

Questions:

- Which scores the highest? (Use `cross_val_score`.)
- Is the difference significant?
- Discuss in pairs.
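
A comparison like this can be sketched as a loop over a dictionary of column lists, scoring each with `cross_val_score`. The block below assumes scikit-learn 0.18 or later and uses synthetic data. The column lists are hypothetical placeholders for the ones built in the earlier questions, and a default `LogisticRegression` stands in for the best estimator from question 4.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the lab's x and y (hypothetical data, not the Titanic table).
rng = np.random.RandomState(0)
toy_y = pd.Series(rng.randint(0, 2, size=300), name="Survived")
toy_x = pd.DataFrame({
    "signal_a": toy_y + rng.normal(scale=0.5, size=300),
    "signal_b": -toy_y + rng.normal(scale=0.5, size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

# Placeholder feature sets; in the lab these would be kbest_columns,
# rfecv_columns, lr_columns and all_columns.
feature_sets = {
    "kbest_columns": ["signal_a", "signal_b"],
    "all_columns": list(toy_x.columns),
}

model = LogisticRegression()  # stand-in for the best estimator from question 4

results = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(model, toy_x[cols], toy_y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```

As a rough rule of thumb, a gap between two means that is smaller than the fold-to-fold standard deviation is hard to call significant without a more careful test.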

## Bonus

Use a bar chart to display the logistic regression coefficients. Start from the most negative on the left.
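
A minimal sketch of such a chart follows. The coefficient values and the `feature_*` names are made-up placeholders, not results from this lab. In practice `coefs` would be built from the fitted model's `coef_[0]` and the column names of `x`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder coefficients (hypothetical numbers for illustration only).
coefs = pd.Series({
    "feature_a": 0.8,
    "feature_b": -1.5,
    "feature_c": 0.1,
    "feature_d": -0.3,
})

# Ascending sort puts the most negative coefficient on the left.
ordered = coefs.sort_values()

ax = ordered.plot(kind="bar")
ax.axhline(0, color="black", linewidth=0.8)  # reference line at zero
ax.set_ylabel("coefficient")
ax.set_title("Logistic regression coefficients")
plt.tight_layout()
```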