In [79]:
%%html
<link rel="stylesheet" href="static/hyrule.css" type="text/css">

# Feature Selection and Feature Reduction

## Objectives

* Review the core concepts of why, when, and how to select features
* Use machine learning models to decide which features to keep
* See how data transformations like PCA reduce the dimensionality of your data while retaining most of its variance

## Class Notes

We'll primarily be concerned with two sklearn modules today: 

* A deeper dive into [feature_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
* Exploring use cases for [dimensionality reduction](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition)

### What makes a good feature?

For the sake of simplicity and generality, strong features tend to share the following attributes:

* **high variance**: a feature that takes only one value carries no information, so it is far less useful than one that varies across observations
* **correlation**: a positive or negative relationship with the target variable; low correlation suggests that relationship is weak or absent. (Keep in mind that correlation does not imply causation.)
* **predictive power**: features with large coefficients or importances are great; features with coefficients close to 0, not so great. (Coefficients are only comparable when the features are on similar scales.)
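
As a rough illustration of the first two attributes, the sketch below computes each Iris feature's variance and its Pearson correlation with the target. Treating the integer class label as a numeric target is a simplification made here for illustration, and the `summary` name is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

features = df.drop(columns='target')
summary = pd.DataFrame({
    # spread of each feature on its own
    'variance': features.var(),
    # strength of the linear relationship with the (numeric) class label
    'corr_with_target': features.corrwith(df['target']),
})
print(summary.sort_values('corr_with_target', ascending=False))
```

A feature can score well on one column and poorly on the other, which is why the attributes are worth checking together.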

### What makes a good model?

* **good features!**
* and.... **simplicity**

<img src='img/overfitting.png' />
_while we haven't used polynomials, there's still a balance for our models between simplicity and feature dependence_

#### Why?
We should aim to keep our models as simple as possible, so that any gain in performance can be attributed to specific features.  
Simple models are also much easier to understand.

### How do we reduce the number of features in our data?

There are a number of techniques available in sklearn that automate these processes for us:

sklearn_helper | technique
---------------|----------
`VarianceThreshold` | Remove features with low variance, based on a tolerance level
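`SelectKBest` | Select the k highest-scoring features according to a univariate test from `feature_selection` (e.g. `f_classif` or `chi2`). K (as usual) is something you search for and define.
L1 and trees | Calling `fit_transform` on a supervised learning algorithm that supports it drops features with low coefficients or importances. (scikit-learn 0.19 and later removed this method from estimators; `SelectFromModel` wraps the estimator instead.)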

While sklearn also has a `pipeline` module to _further_ automate this process for you, it is better to explore the data first to get a sense of what you are working with. There's no magic button that says "solve my problem," but if you are interested in automating a model fit (say, a nightly procedure on a deployed model with constantly updated data), then it might be something worth exploring.
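
For a sense of what that automation might look like, here is a minimal sketch that chains a feature selector and a classifier into one `Pipeline`. The choice of `KNeighborsClassifier`, `k=2`, and 5-fold cross-validation are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    # step 1: keep the 2 features with the highest F-test score
    ('select', SelectKBest(score_func=f_classif, k=2)),
    # step 2: fit a classifier on the reduced data
    ('model', KNeighborsClassifier(n_neighbors=5)),
])

# the selector is refit inside every fold, so the held-out data never influences it
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because selection happens inside the pipeline, the same object can be refit on new data without repeating the preprocessing by hand.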

For each technique below we'll work through the Iris data set and see how it picks out the best features for us. We'll use Iris because the data is well scaled (unscaled data would need extra tuning) and relatively predictive (we know some features are more predictive than others).

For each code sample below:

1. Review what the code is doing. Consider opening up the help function or reading the documentation on sklearn.
2. Find the `.shape` of the new array returned and compare it to the original dataset. Which columns did it keep, and which did it remove?
3. Adjust the parameters. Do the results change?
4. **\*** These are all considered data preprocessing steps. In your final project, which of these steps might you add, and where?

In [62]:
import pandas as pd
def make_irisdf():
    from sklearn.datasets import load_iris
    from pandas import DataFrame
    iris = load_iris()
    df = DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    return df

iris = make_irisdf()

In [9]:
from sklearn import feature_selection

#### `VarianceThreshold`

Goals:

1. What is variance?
2. How does changing the threshold change the fit_transform?

In [30]:
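X = iris.iloc[:, :4]  # the four measurement columns, without 'target'
print(X.var())        # sample variance (ddof=1) of each feature
print(X.head())
# Keep only the features whose variance is above the threshold.
# VarianceThreshold uses the population variance (ddof=0), so its values are slightly smaller.
print(feature_selection.VarianceThreshold(threshold=.6).fit_transform(X)[:5])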

sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
[[ 5.1  1.4]
 [ 4.9  1.4]
 [ 4.7  1.3]
 [ 4.6  1.5]
 [ 5.   1.4]]
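
To answer "which columns did it keep?" without comparing arrays by eye, the fitted selector can be inspected directly. This is a small sketch that rebuilds the Iris frame so it runs on its own; the `selector` and `kept` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)

selector = VarianceThreshold(threshold=0.6).fit(X)

# variance the selector computed for each column (population variance, ddof=0)
print(pd.Series(selector.variances_, index=X.columns))

# get_support() is a boolean mask over the original columns
kept = list(X.columns[selector.get_support()])
print(kept)
```

Raising or lowering `threshold` and rerunning shows directly which features cross the line.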


#### `SelectKBest`
Goals:

1. While the F-test and chi2 are different tests, are the results the same?
2. How might you solve for k?

_math sidebar:_

$\chi^2 = \sum \dfrac{(O-E)^2}{E}$<br />
O = observed frequencies<br />
E = expected frequencies<br />
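
The chi-squared statistic, the sum of $(O-E)^2/E$, can be reproduced by hand as a check. The sketch assumes, as I understand `feature_selection.chi2` to do, that feature values are treated like counts: O is the total of a feature within each class, and E is the share of that feature's overall total a class would receive based on its frequency alone.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# O: total of each feature within each class (rows = classes, columns = features)
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# E: feature totals split across classes in proportion to class frequency
class_freq = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_freq, X.sum(axis=0))

# sum (O - E)^2 / E over the classes, one score per feature
chi2_by_hand = ((observed - expected) ** 2 / expected).sum(axis=0)

chi2_sklearn, p_values = chi2(X, y)
print(chi2_by_hand)
print(chi2_sklearn)
```

A large score means a feature's total is distributed across classes very differently from what class frequency alone would predict.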

In [74]:
X, y = iris.iloc[:, :4], iris['target']
print(X.head())
# ANOVA F-test: score each feature against the class labels, keep the top k
ftest = feature_selection.SelectKBest(score_func=feature_selection.f_classif, k=3)
print(pd.Series(ftest.fit(X, y).scores_, index=X.columns))
print(ftest.fit_transform(X, y)[:5])

# chi-squared test: same idea with a different score (needs non-negative features)
chi = feature_selection.SelectKBest(score_func=feature_selection.chi2, k=3)
print(pd.Series(chi.fit(X, y).scores_, index=X.columns))
print(chi.fit_transform(X, y)[:5])

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
sepal length (cm)     119.264502
sepal width (cm)       47.364461
petal length (cm)    1179.034328
petal width (cm)      959.324406
dtype: float64
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]
sepal length (cm)     10.817821
sepal width (cm)       3.594499
petal length (cm)    116.169847
petal width (cm)      67.244828
dtype: float64
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]
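
One way to "solve for k" is to treat it like any other hyperparameter and search over it with cross-validation. The sketch below is one possible setup; the `KNeighborsClassifier` model and the 5-fold split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('model', KNeighborsClassifier()),
])

# 'select__k' addresses the k parameter of the step named 'select'
search = GridSearchCV(pipe, param_grid={'select__k': [1, 2, 3, 4]}, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

If several values of k score about the same, the smallest one gives the simplest model.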


#### `LogisticRegression`
Goals:

1. How is L1 deciding to keep features?
2. How does changing C change the fit_transform results?

In [69]:
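from sklearn import linear_model as lm
X, y = iris.iloc[:, :4], iris['target']
clf = lm.LogisticRegression(penalty='L1', C=0.1)
print(X.head())

# one row of coefficients per class; the L1 penalty pushes some to exactly 0
print(pd.DataFrame(clf.fit(X, y).coef_, columns=X.columns))
# fit_transform on an estimator only exists in scikit-learn < 0.19;
# later releases use feature_selection.SelectFromModel(clf) instead
print(clf.fit_transform(X, y)[:5])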


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           0.000000           1.12443          -1.344510                 0
1           0.000000          -0.38636           0.122739                 0
2          -0.987896           0.00000           1.277061                 0
[[ 3.5  1.4]
 [ 3.   1.4]
 [ 3.2  1.3]
 [ 3.1  1.5]
 [ 3.6  1.4]]
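
On scikit-learn 0.19 and later, estimators no longer have `fit_transform`, and `SelectFromModel` does the same job. The sketch below shows how the kept features might change with `C`. The `saga` solver, `max_iter`, the explicit `threshold`, and the list of `C` values are all assumptions made here, so the coefficients (and the features kept) need not match the output above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

kept_per_C = {}
for C in [0.01, 0.1, 1.0]:
    # smaller C means a stronger L1 penalty, so more coefficients shrink to 0
    clf = LogisticRegression(penalty='l1', solver='saga', C=C, max_iter=5000)
    # keep a feature if its summed absolute coefficient is above the threshold
    selector = SelectFromModel(clf, threshold=1e-5).fit(X, y)
    kept_per_C[C] = [name for name, keep
                     in zip(data.feature_names, selector.get_support()) if keep]
    print(C, kept_per_C[C])
```

Passing `threshold='mean'` instead keeps only the features whose importance is at least the average, which is a stricter cut.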


#### `DecisionTreeClassifier`
Goals:

1. What is Gini Importance?
2. How does fit_transform decide what features to keep?
3. How does changing the tree depth (or other preprocessing tools) change the result?

In [78]:
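from sklearn import tree
X, y = iris.iloc[:, :4], iris['target']
clf = tree.DecisionTreeClassifier(max_depth=4)
print(X.head())
# Gini importance of each feature; the importances sum to 1
print(pd.Series(clf.fit(X, y).feature_importances_, index=X.columns))
# fit_transform on an estimator only exists in scikit-learn < 0.19;
# later releases use feature_selection.SelectFromModel(clf) instead
print(clf.fit_transform(X, y)[:5])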

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
sepal length (cm)    0.013514
sepal width (cm)     0.000000
petal length (cm)    0.558165
petal width (cm)     0.428322
dtype: float64
[[ 0.2]
 [ 0.2]
 [ 0.2]
 [ 0.2]
 [ 0.2]]
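
The output above keeps a single column even though two importances look large. One likely reason is that the cell fits the tree twice, and without a fixed `random_state` two fits can choose different but equally good splits. The sketch below fits once with a seed and uses `SelectFromModel`, the current replacement for `fit_transform` on estimators. The `threshold='mean'` setting and the seed value are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# random_state fixes the tie-breaking between equally good splits
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=0)
# keep the features whose importance is at least the mean importance
selector = SelectFromModel(tree_clf, threshold='mean').fit(X, y)

# estimator_ is the fitted copy of the tree held by the selector
importances = pd.Series(selector.estimator_.feature_importances_, index=X.columns)
print(importances)

kept_by_tree = list(X.columns[selector.get_support()])
print(kept_by_tree)
```

Because the importances and the selection come from the same fitted tree, the two printouts always agree.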


### What if we believe there are hidden features in our data?
_I don't want to get rid of them!_

Then Principal Component Analysis to the rescue!

<!-- bah html coding! -->
<h4> What is principal component analysis? </h4>
<em>comparing how we measure residuals in regressions vs noise in pca</em>
<div style='text-align: center; margin: auto;'>
<img style='float: left;' src='img/ols.png' width='420' />
<img style='float: left;' src='img/pca.jpg' width='420' />
</div>
<br style='clear: both;'/>
<div>
    <div style='float: left;'>
        <h3>Consider Broadway...</h3>
        <p>Manhattan is built on a grid system, with the exception of a couple key points:</p>
        <ul>
            <li>West Village (that's its own story)</li>
            <li>Broadway</li>
        </ul>
        <p>If we needed to get from Herald Square to Eataly, which is easier to explain?</p>
        <ol>
            <li>Walk down 6th Avenue until 24th Street and then walk east until the park</li>
            <li>Walk down Broadway until you get to Eataly at 24th</li>
        </ol>
        <p>Why is that one easier to explain?</p>
    </div>
    <div style='float: left;'>
        <img src='img/flatiron.png' width='310' />
    </div>
</div>
<br style='clear: both;'/>

#### When should we use it?
PCA is a common technique that already shows up in your day-to-day:

* Compressing images or files
* Reducing computational expense
* Recognition (signal processing, speech, computer vision)
* Bioinformatics (microarray analysis)

#### How does it work?

* review variance (from above)
* explain covariance
* projection of space
* signal vs noise

examples: walk through two very simple matrices
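
The steps above (variance, covariance, projection) can be sketched by hand with NumPy and compared against sklearn. This uses Iris rather than two toy matrices, which is a substitution made here so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_centered = X - X.mean(axis=0)

# covariance between every pair of features
cov = np.cov(X_centered, rowvar=False)

# eigenvectors are the principal directions; eigenvalues are the variance along each
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# project the centered data onto the top two directions (the "signal")
projected_by_hand = X_centered @ eigenvectors[:, :2]

pca = PCA(n_components=2).fit(X)
projected_sklearn = pca.transform(X)
print(eigenvalues[:2], pca.explained_variance_)
```

The two projections can differ in sign per component, since an eigenvector and its negative describe the same direction.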

#### How do we evaluate the performance of it?

1. It's linear!! (Hey... almost everything we do is linear!)
2. Principal Components are sorted from most explanatory to least explanatory
3. We want to maximize the variance explained... and cut off the least informative part
4. What did we learn about that would help solve for this?
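
Points 2 and 3 can be checked with `explained_variance_ratio_`. The sketch below picks the smallest number of components that reaches a cumulative share of variance; the 95% cutoff is an arbitrary assumption, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# fit with every component so all of them can be inspected
pca = PCA().fit(X)

ratios = pca.explained_variance_ratio_  # sorted from most to least explanatory
cumulative = np.cumsum(ratios)

# smallest number of components whose cumulative share reaches 95%
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(ratios)
print(cumulative)
print(n_keep)
```

The cutoff itself could also be chosen by cross-validating a downstream model over `n_components`.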

#### How do we explain it?

1. This isn't "made up data." Use your original features to help explain the relationships.
2. Instead of using rules, we can use something else to explain relationships: what? (correlations)
3. Did we have features that were highly correlated? This may help us understand.
4. The order of the principal components follows the amount of variance each one explains; therefore, PC1 will likely be driven mostly by the features that are most strongly correlated with one another.
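
To tie components back to the original features, the rows of `components_` can be labelled with the feature names. This is a minimal sketch; the `loadings` name and the `PC1`/`PC2` labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)

pca = PCA(n_components=2).fit(X)

# each row of components_ is one principal component, written as
# weights on the original features
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=['PC1', 'PC2'])
print(loadings)

# compare with the pairwise correlations between the original features
print(X.corr())
```

Features with large weights of the same sign on PC1 are the ones moving together in the correlation matrix.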

#### What if we need a non linear solution?

Kernels
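
As a rough sketch of what a kernel buys us, `KernelPCA` can be compared with plain `PCA` on data that no straight line separates. The synthetic two-ring data set, the RBF kernel, and `gamma=10` are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# two concentric rings: a linear projection cannot pull them apart
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)

# the kernel implicitly maps the points to a higher-dimensional space
# before finding the principal components there
kernel = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

print(linear.shape, kernel.shape)
```

Plotting the first column of each result, colored by `y`, is a quick way to compare how well the two projections separate the rings.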