# Feature selection lesson/lab

In this codealong we will explore different ways of performing **feature selection**.

Feature selection is the process of reducing the number of predictors in your data based on their calculated "usefulness". This is the flip side of the process of **feature engineering**, where you create new descriptive predictors from the data you already have.

You have already had experience with the **Lasso**, which (in my opinion) is the best feature selector despite its downsides. [ASIDE: The downsides of the Lasso are in fact addressed by the "Elastic Net" penalty, which combines the Ridge regularization with the Lasso regularization!]
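
For reference, the three penalties mentioned in the aside can be written side by side. This is only one common parameterization: the overall strength $\lambda \ge 0$, the mixing weight $\alpha \in [0, 1]$ and the coefficients $\beta_j$ are symbols chosen here for illustration, and other texts (and scikit-learn itself) scale the terms differently.

$$
\begin{aligned}
\text{Lasso (L1):} \quad & \lambda \sum_{j} |\beta_j| \\
\text{Ridge (L2):} \quad & \lambda \sum_{j} \beta_j^2 \\
\text{Elastic Net:} \quad & \lambda \Big( \alpha \sum_{j} |\beta_j| + (1 - \alpha) \sum_{j} \beta_j^2 \Big)
\end{aligned}
$$

Under this convention, $\alpha = 1$ recovers the Lasso and $\alpha = 0$ recovers the Ridge.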

In the first half we will revisit and practice the Lasso on a classification task: identifying spam text messages from the appearance of particular words in each message. The dataset was created with the `CountVectorizer` that you are familiar with from previous lessons, though in the interest of time you will not be using `CountVectorizer` itself here.
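
As a rough illustration of how a wide word-count table of this kind could be produced, the sketch below runs `CountVectorizer` on a few made-up messages. The messages, labels and variable names are invented for the example; this is not the actual procedure that generated `spam_words_wide.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages and labels (not taken from the real dataset).
messages = [
    "free prize call now",
    "are you coming home now",
    "call to claim your free prize",
]
labels = [1, 0, 1]

# Learn the vocabulary and count how often each word occurs in each message.
vectorizer = CountVectorizer()
word_matrix = vectorizer.fit_transform(messages)

# vocabulary_ maps each word to its column index; order the words by that index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per message, one column per word, plus the target column in front.
toy = pd.DataFrame(word_matrix.toarray(), columns=words)
toy.insert(0, "is_spam", labels)
print(toy)
```

Each distinct word becomes one column, which is why a real corpus quickly leads to hundreds of mostly-zero predictors.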

In the second half you will work in groups to explore alternative feature selection methods provided in the scikit-learn package, and then present on how they work.

---

In [1]:
import pandas as pd
import numpy as np

### 1.1 Load data

In [2]:
spam = pd.read_csv('../../assets/datasets/spam_words_wide.csv')

In [3]:
spam.head()


Unnamed: 0,is_spam,getzed,86021,babies,sunoco,ultimately,thk,voted,spatula,fiend,...,itna,borin,thoughts,iccha,videochat,freefone,pist,reformat,strict,69698
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
spam.shape

(5572, 1001)

### 1.2 Find baseline rate of spam

In [6]:
print(spam.is_spam.value_counts())
spam.is_spam.mean()

0    4825
1     747
Name: is_spam, dtype: int64


0.13406317300789664
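
The mean of `is_spam` is the share of spam messages. A related number worth keeping in mind is the accuracy you would get by always predicting the majority class. The short sketch below recomputes both from the class counts printed above (Python 3 division is assumed).

```python
# Class counts copied from the value_counts() output above.
n_ham, n_spam = 4825, 747
n_total = n_ham + n_spam

spam_rate = n_spam / n_total          # same quantity as spam.is_spam.mean()
baseline_accuracy = n_ham / n_total   # accuracy of always predicting "not spam"

print(round(spam_rate, 4))            # 0.1341
print(round(baseline_accuracy, 4))    # 0.8659
```

Any classifier evaluated on accuracy should be compared against roughly 0.866 rather than against 0.5.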

---

### 2.1 Cross-validate logistic regression accuracy

Use these classes/methods from scikit-learn:

    LogisticRegression
    cross_val_score
    
Cross-validate the logistic regression with 5 folds.

Also save the X matrix column headers for later.

In [7]:
from sklearn.linear_model import LogisticRegression
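# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import cross_val_score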

In [26]:
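target = 'is_spam'
# Every column except the target is a predictor; `cols` keeps the headers for later.
# Note: this still includes 'Unnamed: 0', which looks like a row index saved in the CSV.
cols = [c for c in spam.columns if c != target]
x = spam[cols]
y = spam[target]
model = LogisticRegression()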

In [27]:
scores = cross_val_score(model,x,y,cv=5)

In [28]:
print(scores)
print(np.mean(scores))

[ 0.92735426  0.93363229  0.93177738  0.93806104  0.94344704]
0.934854400979


In [29]:
# CHANGING THRESHOLD!

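model.fit(x, y)
pp = model.predict_proba(x)  # columns follow model.classes_: P(class 0), P(class 1)
print(pp[0:5])
y_pred_50pct = model.predict(x)  # predict() uses the default 50% threshold
print(model.classes_)
# Our model is very confident that these points are 0: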

[[ 0.97997764  0.02002236]
 [ 0.93243347  0.06756653]
 [ 0.93243347  0.06756653]
 [ 0.97674543  0.02325457]
 [ 0.93243347  0.06756653]]
[0 1]


In [33]:
#from sklearn.metrics import confusion_matrix
#confusion_matrix(y, y_pred_50pct)
Ytrue = pd.Series(y)
Ypred50 = pd.Series(y_pred_50pct)
pd.crosstab(Ytrue, Ypred50, rownames=['True'], colnames=['Predicted'], margins=True)
#print pp[50:53]

True / Predicted,0,1,All
0,4813,12,4825
1,283,464,747
All,5096,476,5572


In [31]:
# Predict 0 (not spam) only when P(class 0) exceeds 0.95; otherwise predict 1 (spam).
y_pred_95pct = [0 if row[0] > 0.95 else 1 for row in pp]
Ypred95 = pd.Series(y_pred_95pct)
pd.crosstab(Ytrue, Ypred95, rownames=['True'], colnames=['Predicted'], margins=True)

True / Predicted,0,1,All
0,1584,3241,4825
1,5,742,747
All,1589,3983,5572
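
To make the effect of moving the threshold concrete, the sketch below computes precision and recall for the spam class from the counts in the two crosstabs above. The helper `precision_recall` is a hypothetical name introduced for this example.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive (spam) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts read off the two crosstabs (rows = true label, columns = predicted label).
p50, r50 = precision_recall(tp=464, fp=12, fn=283)    # default 50% threshold
p95, r95 = precision_recall(tp=742, fp=3241, fn=5)    # predict 0 only if P(0) > 0.95

print("50% threshold: precision={:.3f} recall={:.3f}".format(p50, r50))  # 0.975, 0.621
print("95% threshold: precision={:.3f} recall={:.3f}".format(p95, r95))  # 0.186, 0.993
```

Demanding 95% confidence before calling a message "not spam" catches almost every spam message, at the price of flagging most legitimate messages as spam.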


---

### 2.2 Cross-validate logistic regression area under ROC

The `scoring` keyword argument of `cross_val_score()` accepts scoring metrics other than the default "accuracy", such as "roc_auc".

For more information on how to use it, [read the documentation on model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html), particularly **Section 3.3.1**.

Why is using the area under the ROC curve more informative than the accuracy?
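
A minimal sketch of switching the scoring metric is shown below. To stay self-contained it uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the spam data; `X_demo` and `y_demo` are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the spam data (roughly 13% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.87, 0.13], random_state=0,
)

model = LogisticRegression(solver="liblinear")

# With no `scoring` argument, a classifier is scored on accuracy.
accuracy = cross_val_score(model, X_demo, y_demo, cv=5)
# The same call, scored on the area under the ROC curve instead.
roc_auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")

print("mean accuracy:", accuracy.mean())
print("mean ROC AUC :", roc_auc.mean())
```

When comparing the two numbers, keep in mind that accuracy on imbalanced data is already high for a model that always predicts the majority class, whereas ROC AUC is computed from how the predicted probabilities rank the two classes.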

---

### 2.3 Cross-validate the logistic regression, scoring on the area under the precision-recall curve

The "average_precision" is the area under the precision-recall curve, whereas "precision" is simply the precision without taking recall into consideration.
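
The difference between the two metrics can be seen on a tiny hand-made example: `precision_score` needs hard 0/1 predictions at one threshold, while `average_precision_score` takes the predicted scores and summarizes precision across all thresholds. The labels and scores below are invented for illustration.

```python
from sklearn.metrics import average_precision_score, precision_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]              # predicted P(class 1)
y_hard = [int(s >= 0.5) for s in y_score]    # hard labels at a 50% threshold

# Only one message is predicted positive and it is truly positive.
precision_at_50 = precision_score(y_true, y_hard)

# Summarizes the whole precision-recall curve from the raw scores.
avg_precision = average_precision_score(y_true, y_score)

print(precision_at_50)            # 1.0
print(round(avg_precision, 4))    # 0.8333
```

Precision alone looks perfect here even though half of the positives are missed, which is the information that the precision-recall summary brings back in.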

Why/when might you decide to use precision and recall in a classification task (area under the precision-recall curve) versus specificity and sensitivity (area under the ROC curve)?

These Wikipedia pages are good references for refreshing your memory or figuring out the differences and goals of each:

1. [Confusion matrices](https://en.wikipedia.org/wiki/Confusion_matrix)
2. [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
3. [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
4. [Type I and Type II errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors)

---

### 2.4 Fit the logistic regression on all the data and calculate the number of non-zero coefficients
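
One possible way to count the coefficients that are not exactly zero is sketched below on synthetic data (`X_demo` and `y_demo` are stand-ins, not the spam matrix). It relies on `coef_` having shape `(1, n_features)` for a binary problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 30 predictors, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=1,
)

# The default penalty is L2 (Ridge), which shrinks coefficients but rarely zeroes them.
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_demo, y_demo)

# Count the coefficients that are not exactly zero.
n_nonzero = int(np.sum(logreg.coef_[0] != 0))
print(n_nonzero, "of", logreg.coef_.shape[1], "coefficients are non-zero")
```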

---

### 3.1 Cross-validate logreg Lasso regularization using the 'average_precision' scoring metric

Use the sklearn class:

    LogisticRegressionCV
    
Remember that these keyword arguments need to be used for the Lasso:

    solver='liblinear'
    penalty='l1' (That is a lowercase 'L' first!)
    
Cross-validate with 25 folds and print out the best regularization parameter C.
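
The call pattern might look like the sketch below. It runs on synthetic stand-in data and, to keep it fast, uses 5 folds and a grid of 10 candidate `C` values; both numbers are choices made for this example, not the settings the exercise asks for.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic, imbalanced stand-in for the spam data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=25, n_informative=4,
    weights=[0.85, 0.15], random_state=2,
)

lasso_cv = LogisticRegressionCV(
    solver="liblinear",
    penalty="l1",                  # Lasso penalty
    Cs=10,                         # grid of 10 candidate C values
    cv=5,                          # number of cross-validation folds
    scoring="average_precision",
)
lasso_cv.fit(X_demo, y_demo)

# C_ holds the selected C; np.ravel copes with it being an array or a scalar.
best_C = float(np.ravel(lasso_cv.C_)[0])
print("best C:", best_C)
```

Remember that `C` is the inverse of the regularization strength, so a smaller selected `C` means a stronger Lasso penalty.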

---

### 3.2 Build a logistic regression using Lasso penalty with the optimal C and cross-validate area under the precision-recall curve

---

### 3.3 Find how many non-zero coefficients there are for this model

How does this compare to the non-Lasso model? Explain why the Lasso performs "feature selection".
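
One way to see the feature selection effect is to count non-zero coefficients under the L1 and L2 penalties as `C` shrinks. The sketch below does this on synthetic data; `count_nonzero` is a hypothetical helper and the `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 50 predictors, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=50, n_informative=5,
    n_redundant=0, random_state=3,
)

def count_nonzero(penalty, C):
    """Fit a logistic regression and count coefficients that are not exactly zero."""
    model = LogisticRegression(solver="liblinear", penalty=penalty, C=C)
    model.fit(X_demo, y_demo)
    return int(np.sum(model.coef_[0] != 0))

for C in [1.0, 0.1, 0.01]:
    print("C={:g}  L1 non-zero: {}  L2 non-zero: {}".format(
        C, count_nonzero("l1", C), count_nonzero("l2", C)))
```

The L2 penalty only shrinks coefficients toward zero, while the L1 penalty sets some of them to exactly zero, which removes those predictors from the model.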

---

### 3.4 Repeat this process but with scoring='precision'

In [None]:
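from sklearn.linear_model import LogisticRegressionCV

# Search 25 candidate C values with 5-fold CV, scoring each on precision.
lr_lasso_cv = LogisticRegressionCV(solver='liblinear', cv=5, penalty='l1', Cs=25,
                                   scoring='precision')

# x and y are the predictor matrix and target defined in section 2.1.
lr_lasso_cv.fit(x, y)

print(lr_lasso_cv.C_)

# Build a Lasso-penalized model at the selected C and cross-validate its precision.
lr_lasso = LogisticRegression(solver='liblinear', penalty='l1', C=lr_lasso_cv.C_[0])

scores = cross_val_score(lr_lasso, x, y, cv=5, scoring='precision')
print(np.mean(scores))

lr_lasso.fit(x, y)

# Number of features the Lasso kept (non-zero coefficients).
print(np.sum(lr_lasso.coef_[0] != 0))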

---

### 3.5 Use the X column names and the coefficients from the logistic regression to find out which features were kept when scoring for precision with the Lasso penalty

Explain what these chosen features are useful for when scoring for optimal precision.

Why are there so few compared to the model scored on the area under the precision-recall curve?
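
Pairing coefficients with column names can be done with a pandas `Series`, as in the sketch below. The tiny word-indicator table, its column names and the value of `C` are all invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up word-indicator table; the column names are hypothetical.
toy = pd.DataFrame({
    "free":  [1, 1, 1, 0, 0, 0, 1, 0],
    "prize": [1, 0, 1, 0, 0, 0, 1, 0],
    "home":  [0, 0, 0, 1, 1, 0, 0, 1],
    "the":   [1, 0, 1, 1, 0, 1, 0, 1],
})
labels = [1, 1, 1, 0, 0, 0, 1, 0]

lasso = LogisticRegression(solver="liblinear", penalty="l1", C=1.0)
lasso.fit(toy, labels)

# Index the coefficients by column name, then keep only the non-zero ones.
coefs = pd.Series(lasso.coef_[0], index=toy.columns)
kept = coefs[coefs != 0].sort_values(ascending=False)
print(kept)
```

With the real data, the same pattern applies with the saved column headers as the index.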

---

### 4. Explore other feature selection methods

scikit-learn comes with [a variety of other feature selection methods](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection).

For the next section you will explore these methods in groups:

**Group 1**
    
    feature_selection.SelectPercentile
    feature_selection.SelectKBest
    
**Group 2**
    
    feature_selection.RFE
    feature_selection.RFECV
    
**Group 3**
    
    feature_selection.SelectFpr
    feature_selection.SelectFdr
    
**Group 4**

    feature_selection.VarianceThreshold
    feature_selection.SelectFwe
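
As a starting point for exploring, the selectors in `sklearn.feature_selection` share the same `fit` / `transform` / `get_support` interface. The sketch below shows that interface with `VarianceThreshold` and `SelectKBest` on four made-up 0/1 word indicators; the data, the `chi2` score function and `k=2` are choices made for this example only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=200)

# Four hypothetical 0/1 word indicators.
constant = np.zeros(200, dtype=int)                          # word that never appears
random_word = rng.randint(0, 2, size=200)                    # unrelated to the label
noisy = np.where(rng.rand(200) < 0.8, y_demo, 1 - y_demo)    # agrees with the label ~80% of the time
signal = y_demo.copy()                                       # perfectly tracks the label
X_demo = np.column_stack([constant, random_word, noisy, signal])

# VarianceThreshold (default threshold 0.0) drops columns that never vary.
vt = VarianceThreshold()
X_var = vt.fit_transform(X_demo)
print(vt.get_support())     # the constant column is dropped

# SelectKBest keeps the k columns with the highest univariate score (chi2 here).
skb = SelectKBest(score_func=chi2, k=2)
X_best = skb.fit_transform(X_var, y_demo)
print(skb.get_support())    # mask over the three remaining columns
```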
    
---

#### Questions for presentation

After exploring the assigned feature selection methods, you will, as a group, present to the class on:

1. How the feature selection method is designed to reduce the number of predictors.
2. What scenario(s) you think the method would be particularly useful in.
3. How to implement the method in code.
4. [BONUS] Possible downsides to using the feature selection method (if any).
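
For point 3, a short runnable snippet is usually enough. As one possible template, the sketch below applies `RFE` (recursive feature elimination) with a logistic regression on synthetic data; the dataset and the choice of keeping 3 features are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 12 predictors, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=3,
    n_redundant=0, random_state=4,
)

# Repeatedly fit the model and drop the weakest feature (step=1) until 3 remain.
rfe = RFE(estimator=LogisticRegression(solver="liblinear"),
          n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

print(rfe.support_)   # boolean mask of the features that were kept
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```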
