# Feature selection lesson/lab

In this codealong we will explore different ways of performing **feature selection**.

Feature selection is the process of reducing the number of predictors in your data based on their calculated "usefulness". This is the flip side of the process of **feature engineering**, where you create new descriptive predictors from the data you already have.

You have already had experience with the **Lasso**, which (in my opinion) is the best feature selector despite its downsides. [ASIDE: The downsides of the Lasso are addressed by the "Elastic Net" penalty, which combines Ridge regularization with Lasso regularization!]
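
As a rough sketch of the Elastic Net idea from the aside, the snippet below fits a logistic regression with a mixed L1/L2 penalty. The synthetic data from `make_classification`, the `l1_ratio=0.5` mix, and `C=0.1` are illustrative assumptions, not values from this lesson. The `saga` solver is used because it is the scikit-learn solver that supports the `elasticnet` penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 30 features, 5 informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)  # saga behaves best on scaled features

# l1_ratio=1.0 would be a pure Lasso penalty, l1_ratio=0.0 pure Ridge; 0.5 mixes both.
enet = LogisticRegression(penalty='elasticnet', solver='saga',
                          l1_ratio=0.5, C=0.1, max_iter=5000)
enet.fit(X_demo, y_demo)

n_kept = int(np.sum(enet.coef_[0] != 0))
print("%d of %d coefficients are non-zero" % (n_kept, X_demo.shape[1]))
```

The L1 part of the penalty can still set coefficients exactly to zero, while the L2 part tends to spread weight across correlated predictors instead of arbitrarily keeping one of them.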

In the first half we will revisit and practice using the Lasso in a classification task: identifying spam text messages from the appearance of particular words in each message. This dataset was created with the `CountVectorizer` that you are familiar with from previous lessons, though you will not be using it here in the interest of time.
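
For reference, a wide word-count table of this kind could be built roughly as follows. This is only a sketch: the three messages and their labels are made up, and the actual preprocessing behind the lesson's CSV is not shown in this lesson.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages and labels, for illustration only (1 = spam).
messages = ["win a free prize now",
            "are you free for lunch",
            "urgent call this mobile number to claim your prize"]
labels = [1, 0, 1]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse matrix: rows = messages, columns = words

# vocabulary_ maps each word to its column index; sort the words by that index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

wide = pd.DataFrame(counts.toarray(), columns=words)
wide.insert(0, 'is_spam', labels)            # target column first, predictors after it
print(wide.shape)
```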

In the second half you will explore, in groups, alternative feature selection methods provided in the scikit-learn package and then present on how they work.

---

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

### 1.1 Load data

In [2]:
spam = pd.read_csv('../../assets/datasets/spam_words_wide.csv')

In [3]:
spam.shape

(5572, 1001)

In [4]:
spam.head()

Unnamed: 0,is_spam,getzed,86021,babies,sunoco,ultimately,thk,voted,spatula,fiend,...,itna,borin,thoughts,iccha,videochat,freefone,pist,reformat,strict,69698
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 1.2 Find baseline rate of spam

In [5]:
print(np.mean(spam.is_spam))

0.134063173008
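
The mean of a 0/1 label is the fraction of positives, so the accuracy to beat is that of always predicting the majority class. A minimal sketch, using a made-up label array in place of `spam.is_spam`:

```python
import numpy as np

# Hypothetical 0/1 labels standing in for spam.is_spam (1 = spam).
is_spam = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

spam_rate = np.mean(is_spam)                       # fraction of positive labels
baseline_accuracy = max(spam_rate, 1 - spam_rate)  # always predict the majority class

print(spam_rate)
print(baseline_accuracy)
```

With the spam rate printed above (about 0.134), the same calculation gives a majority-class accuracy of about 0.866.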


---

### 2.1 Cross-validate logistic regression accuracy

Save the X matrix column headers for later.

In [6]:
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

In [7]:
logreg = LogisticRegression()

X_cols = spam.columns[1:]

X = spam[X_cols].values
Y = spam.is_spam.values


In [8]:
cross_val_score(logreg, X, Y, cv=5)

array([ 0.92735426,  0.93363229,  0.93177738,  0.93806104,  0.94344704])

---

### 2.2 Cross-validate logistic regression area under ROC with 'roc_auc'

In [9]:
scores = cross_val_score(logreg, X, Y, cv=5, scoring='roc_auc')
print(np.mean(scores))

0.89977411181


---

### 2.3 Cross-validate LR area under precision recall curve with 'average_precision'

In [10]:
scores = cross_val_score(logreg, X, Y, cv=5, scoring='average_precision')
print(np.mean(scores))

0.8120150667
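
To make the two scoring strings concrete, here is a small sketch that calls the underlying metric functions directly. The four labels and predicted probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Tiny hypothetical example: true labels and predicted spam probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# ROC AUC: fraction of (spam, non-spam) pairs in which the spam message scores higher.
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75

# Average precision: precision at each recall step, weighted by the gain in recall.
ap = average_precision_score(y_true, y_prob)
print(ap)   # 0.5 * 1.0 + 0.5 * (2/3) = 0.8333...
```

Average precision only looks at how the positive class is ranked, which is why it is often the more informative of the two when positives are rare, as they are here.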


---

### 2.4 Fit on full data and find number of non-zero coefficients

In [11]:
logreg.fit(X, Y)
np.sum(logreg.coef_[0] != 0)

1000

---

### 3.1 Cross-validate logreg Lasso regularization using the 'average_precision' scoring metric

In [12]:
from sklearn.linear_model import LogisticRegressionCV

In [13]:
lr_lasso_cv = LogisticRegressionCV(solver='liblinear', cv=5, penalty='l1', Cs=25,
                                   scoring='average_precision')

In [14]:
lr_lasso_cv.fit(X, Y)

LogisticRegressionCV(Cs=25, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring='average_precision', solver='liblinear',
           tol=0.0001, verbose=0)

In [15]:
lr_lasso_cv.C_

array([ 4.64158883])
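
When `Cs` is an integer, `LogisticRegressionCV` searches that many values of `C` on a logarithmic scale between 1e-4 and 1e4. The sketch below rebuilds that grid with NumPy to show where the selected value comes from; it illustrates the grid and is not the library's internal code.

```python
import numpy as np

# Cs=25 corresponds to 25 values log-spaced between 1e-4 and 1e4.
C_grid = np.logspace(-4, 4, 25)

print(C_grid[:3])
# The C chosen above is one of the grid points.
on_grid = bool(np.isclose(C_grid, 4.64158883).any())
print(on_grid)
```

Remember that `C` is the inverse of the regularization strength: smaller values of `C` penalize the coefficients more heavily.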

### 3.2 Build a Lasso LR with the above C and cross-validate area under precision recall curve

In [16]:
lr_lasso = LogisticRegression(solver='liblinear', penalty='l1', C=lr_lasso_cv.C_[0])

In [17]:
scores = cross_val_score(lr_lasso, X, Y, cv=5, scoring='average_precision')
print(np.mean(scores))

0.822465579725


In [18]:
lr_lasso.fit(X, Y)

LogisticRegression(C=4.6415888336127722, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

### 3.3 Find how many non-zero coefficients there are for this model

In [19]:
np.sum(lr_lasso.coef_[0] != 0)

285
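
The drop from 1000 non-zero coefficients to 285 comes from the L1 penalty, which can shrink coefficients exactly to zero, whereas the default L2 penalty only makes them small. A minimal sketch of that contrast on synthetic data (the dataset and `C=0.05` are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 300 rows, 50 features, 5 informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=50,
                                     n_informative=5, n_redundant=0,
                                     random_state=0)

# Same solver and C for both models; only the penalty differs.
l2_model = LogisticRegression(solver='liblinear', penalty='l2', C=0.05).fit(X_demo, y_demo)
l1_model = LogisticRegression(solver='liblinear', penalty='l1', C=0.05).fit(X_demo, y_demo)

n_l2 = int(np.sum(l2_model.coef_[0] != 0))
n_l1 = int(np.sum(l1_model.coef_[0] != 0))
print("L2 non-zero coefficients: %d" % n_l2)
print("L1 non-zero coefficients: %d" % n_l1)
```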

### 3.4 Repeat this process but with scoring='precision'

In [20]:
lr_lasso_cv = LogisticRegressionCV(solver='liblinear', cv=5, penalty='l1', Cs=25,
                                   scoring='precision')

lr_lasso_cv.fit(X, Y)

print(lr_lasso_cv.C_)

lr_lasso = LogisticRegression(solver='liblinear', penalty='l1', C=lr_lasso_cv.C_[0])

scores = cross_val_score(lr_lasso, X, Y, cv=5, scoring='precision')
print(np.mean(scores))

lr_lasso.fit(X, Y)

print(np.sum(lr_lasso.coef_[0] != 0))

[ 0.1]
0.943116584027
18




---

### 3.5 Use the saved X column headers to find out which features were kept for precision

In [21]:
precision_coefs = pd.DataFrame({'feature':X_cols,
                                'coef':lr_lasso.coef_[0]})

In [22]:
precision_coefs = precision_coefs[precision_coefs.coef != 0]

In [23]:
precision_coefs

Unnamed: 0,coef,feature
39,1.190993,delivery
106,1.694938,mob
170,1.082155,send
204,0.795876,network
206,3.068542,mobile
250,3.169775,prize
448,1.450052,urgent
469,0.382425,2lands
525,-0.321725,good
530,1.282592,http
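
The same non-zero mask can also be used to rank the surviving features and to cut the predictor matrix down to them. A minimal sketch, with small made-up arrays standing in for `X_cols`, `lr_lasso.coef_[0]`, and `X`:

```python
import numpy as np

# Hypothetical stand-ins for X_cols, lr_lasso.coef_[0] and X.
feature_names = np.array(['prize', 'good', 'lunch', 'urgent'])
coefs = np.array([3.17, -0.32, 0.0, 1.45])
X_small = np.array([[1, 0, 0, 1],
                    [0, 1, 1, 0]])

keep = coefs != 0                         # boolean mask of surviving features
order = np.argsort(-np.abs(coefs[keep]))  # largest absolute coefficient first

ranked = feature_names[keep][order]
print(ranked)

X_reduced = X_small[:, keep]              # drop the zeroed-out columns
print(X_reduced.shape)
```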


---

### 4. Explore other feature selection methods

scikit-learn comes with [a variety of other feature selection methods](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection).

For the next section you will explore these methods in groups:

**Group 1**
    
    feature_selection.SelectPercentile
    feature_selection.SelectKBest
    
**Group 2**
    
    feature_selection.RFE
    feature_selection.RFECV
    
**Group 3**
    
    feature_selection.SelectFpr
    feature_selection.SelectFdr
    
**Group 4**

    feature_selection.VarianceThreshold
    feature_selection.SelectFwe
    
---

#### Expectations

After exploring the assigned feature selection methods, you will, as a group, present to the class on:

1. How the feature selection method is designed to reduce the number of predictors.
2. What scenario(s) you think the method would be particularly useful in.
3. How to implement the method in code.
4. [BONUS] Possible downsides to using the feature selection method (if any).
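
As a possible starting point for the implementation step, the selectors in `sklearn.feature_selection` share the same `fit` / `transform` / `get_support` interface. The sketch below shows that interface with two of the listed methods on a tiny made-up count matrix. The data, `k=2`, and the choice of `chi2` as the scoring function are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

# Hypothetical word-count matrix: 6 messages, 4 words; labels 1 = spam.
X_demo = np.array([[2, 0, 1, 0],
                   [3, 0, 0, 0],
                   [2, 1, 1, 0],
                   [0, 2, 1, 0],
                   [0, 3, 0, 0],
                   [0, 2, 1, 0]])
y_demo = np.array([1, 1, 1, 0, 0, 0])

# Unsupervised: drop columns whose variance does not exceed the threshold (default 0).
vt = VarianceThreshold()
vt_mask = vt.fit(X_demo).get_support()
print(vt_mask)               # the last column is constant, so it is dropped

# Supervised: keep the k columns with the highest chi-squared score against y.
# The constant column is left out here because its chi-squared score is undefined.
kbest = SelectKBest(score_func=chi2, k=2)
X_top = kbest.fit_transform(X_demo[:, :3], y_demo)
kbest_mask = kbest.get_support()
print(kbest_mask)
print(X_top.shape)
```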
