## Sidenotes (definitions, code snippets, resources, etc.)
- Note on data structure: list
    - An empty list has a truth value of `False`
- [Feature Selection with scikit-learn for intro_to_ml](http://napitupulu-jon.appspot.com/posts/feature-selection-ud120.html)
    - Looks very helpful for copying notes, course materials
    - Investigate meaning of `# %%writefile new_enron_feature.py` inserted at top of edited studentMain.py module

### ML Algorithms
- A classic way to overfit an algorithm is by using lots of features and not a lot of training data.
- _Decision Trees_ are easy to overfit.

### Python 2
```python
### there can be many "to" emails, but only one "from", so the
### "to" processing needs to be a little more complicated
# uses counter for iterating through, duplicates process for cc_emails
#   does not seem very pythonic, but maybe clearest method
if to_emails:
    ctr = 0  # counter for iterating through, perhaps not pythonic
    while not to_poi and ctr < len(to_emails):
        if to_emails[ctr] in poi_email_list:
            to_poi = True
        ctr += 1
```
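
A possibly more Pythonic alternative to the counter loop above is `any()`, which also stops at the first match. This is only a sketch: the function name `sent_to_poi` is hypothetical, and it assumes `to_emails` and `poi_email_list` are plain lists of address strings.

```python
def sent_to_poi(to_emails, poi_email_list):
    """Return True if at least one recipient address belongs to a POI."""
    # A set makes each membership test O(1) instead of scanning the list.
    poi_emails = set(poi_email_list)
    # any() short-circuits on the first POI address, like the while loop,
    # and returns False for an empty recipient list.
    return any(address in poi_emails for address in to_emails)


to_poi = sent_to_poi(
    ["someone@example.com", "ken.lay@enron.com"],  # made-up recipients
    ["ken.lay@enron.com"],                         # made-up POI list
)
```

Because `any()` of an empty iterable is `False`, the separate `if to_emails:` guard is not needed in this version.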

### Useful git code snippets
- `git reset --soft HEAD~`
    - Undoes the last commit but leaves the working tree and index untouched, so the committed changes stay staged


### Techniques


### IPython functions


## Why Feature Selection?
"Make everything as simple as possible, but no simpler" - Albert Einstein

Two major aspects:
1. Select the best features, leaving out unnecessary data
2. Add new features to explore the data, using intuition

## Process of exploring data with Enron corpus example
- Quiz 1: Coding up a new feature
- Quiz 2: Scaling this feature

### Steps in Katie's process:
- Use her human intuition
    - Quiz 1: POIs might send other POIs emails more often than the general population
    - Quiz 2: Feature scaling might provide more useful visualization
- Code up the new feature 
    - Quiz 1: Calculate aggregate of emails sent from POIs to each person
    - Quiz 2: Scale by total number of messages to and from that person
- Visualize
    - Quiz 1: See scatter plot below to check whether feature gives discriminating power between POIs and non-POIs.
- Repeat
    - Improve preceding parts of process
    - Zero in on the feature that would be most helpful
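
The scaling step from Quiz 2 might look roughly like the sketch below. The function name `compute_fraction` and the convention that missing counts are stored as the string `"NaN"` are assumptions made for illustration, not something these notes confirm.

```python
def compute_fraction(poi_messages, all_messages):
    """Return the share of a person's messages that involve a POI.

    Falls back to 0.0 when the ratio is undefined.
    """
    # Assumption: missing counts are stored as the string "NaN".
    if poi_messages == "NaN" or all_messages == "NaN" or all_messages == 0:
        return 0.0
    # float() keeps this a true division on Python 2 as well.
    return float(poi_messages) / all_messages


# Made-up counts: 10 of 200 received messages came from POIs.
fraction_from_poi = compute_fraction(10, 200)
```

Scaling this way puts a heavy emailer and a light emailer on the same 0 to 1 axis, which is what makes the scatter plot easier to read.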
    
__Visualizations:__

_Quiz 1:_ Not very useful without scaling
![quiz1 scatterplot](lesson_11_images/quiz1_scatter.png)


### Example of a buggy feature
When Katie was working on the Enron POI identifier, she engineered a feature that identified when a given person was on the same email as a POI. So for example, if Ken Lay and Katie Malone are both recipients of the same email message, then Katie Malone should have her "shared receipt" feature incremented. If she shares lots of emails with POIs, maybe she's a POI herself.

Here's the problem: there was a subtle bug, that Ken Lay's "shared receipt" counter would also be incremented when this happens. And of course, then Ken Lay always shares receipt with a POI, because he is a POI. So the "shared receipt" feature became extremely powerful in finding POIs, because it effectively was encoding the label for each person as a feature.

We found this first by being suspicious of a classifier that was always returning 100% accuracy. Then we removed features one at a time, and found that this feature was driving all the performance. Then, digging back through the feature code, we found the bug outlined above. We changed the code so that a person's "shared receipt" feature was only incremented if there was a different POI who received the email, reran the code, and tried again. The accuracy dropped to a more reasonable level.
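
The corrected counting rule can be sketched as follows. This is not the original course code: the function name, the data layout (one list of recipients per email), and the sample names are all hypothetical.

```python
def count_shared_receipt(emails, poi_addresses):
    """Count, per person, the emails they received alongside a *different* POI."""
    poi_addresses = set(poi_addresses)
    counts = {}
    for recipients in emails:
        recipients = set(recipients)
        pois_on_email = recipients & poi_addresses
        for person in recipients:
            # The fix: a POI must not count as sharing receipt with themselves.
            other_pois = pois_on_email - {person}
            if other_pois:
                counts[person] = counts.get(person, 0) + 1
    return counts


emails = [
    ["ken.lay", "katie"],          # katie shares receipt with a POI
    ["ken.lay", "jeff.skilling"],  # two POIs share receipt with each other
    ["katie", "bob"],              # no POI on this email
]
shared = count_shared_receipt(emails, ["ken.lay", "jeff.skilling"])
```

With the buggy rule (no `- {person}`), `ken.lay` would be counted on the first email too, so the feature would fire for every email any POI received.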

We take a couple of lessons from this:
- Anyone can make mistakes--be skeptical of your results!
- 100% accuracy should generally make you suspicious. Extraordinary claims require extraordinary proof.
- If there's a feature that tracks your labels a little too closely, it's very likely a bug!
- If you're sure it's not a bug, you probably don't need machine learning--you can just use that feature alone to assign labels.

## Ignoring features
### Reasons to ignore a feature
- Too much noise: hard to distinguish whether it reliably measures what you want it to measure, i.e. the data is not accurate/reliable enough.
- Causes overfitting for some reason
- Highly correlated with/strongly related to a feature that is already present, which can break the model because the mathematics stops working (e.g. perfectly collinear features leave a regression without unique coefficients).
- Unnecessarily slows down the training/testing process when the feature is clearly not useful.

## Features != Information
__definition:__ Features vs. Information
- A feature is a characteristic of a particular data point that attempts to _access_ information
    - In general, we want the bare minimum number of features that gives as much information as possible
- Can think of features as quantity vs. information as quality.

## Feature selection tools in sklearn
- Feature reduction, a.k.a. dimensionality reduction
- Very important to be skeptical of features, esp. with _high-dimensional data_
- In the example in tools/email_preprocess.py described below, 90% of the features were ignored, with:
    - insignificant impact on the classifier's accuracy, and
    - improved performance in terms of the _time complexity_ of the classifier algorithm.

### Univariate Feature Selection
There are several go-to methods of automatically selecting your features in sklearn. Many of them fall under the umbrella of univariate feature selection, which treats each feature independently and asks how much _power_ it gives you in classifying or regressing.

There are two big univariate feature selection tools in sklearn:
- `SelectPercentile` and `SelectKBest`. 
- The difference is pretty apparent by the names:
    - SelectPercentile selects the X% of features that are most powerful (where X is a parameter)
    - SelectKBest selects K number of features that are most powerful (where K is a parameter).

A clear candidate for feature reduction is text learning, since _the data has such high dimension_. We actually did feature selection in the Sara/Chris email classification problem during the first few mini-projects; you can see it in the code in tools/email_preprocess.py:
```python
from sklearn.feature_selection import SelectPercentile, f_classif
```
...

```python
### feature selection, because text is super high dimensional and 
### can be really computationally chewy as a result
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(
    features_train_transformed).toarray()
features_test_transformed  = selector.transform(
    features_test_transformed).toarray()
```
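
To see the two selectors side by side, here is a small self-contained sketch on synthetic data (the data, seed, and variable names are made up for illustration). Two of the ten features are shifted by the label, so they should score far higher than the noise features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

rng = np.random.RandomState(42)
n_samples = 200
labels = rng.randint(0, 2, size=n_samples)

# Ten features of pure noise, then make features 0 and 1 depend on the label.
features = rng.normal(size=(n_samples, 10))
features[:, 0] += 3.0 * labels
features[:, 1] -= 3.0 * labels

# Keep the 2 highest-scoring features.
k_selector = SelectKBest(f_classif, k=2).fit(features, labels)
# Keep the top 20% of features (2 out of 10 here).
pct_selector = SelectPercentile(f_classif, percentile=20).fit(features, labels)

kept_by_k = k_selector.get_support(indices=True)
kept_by_pct = pct_selector.get_support(indices=True)
reduced = k_selector.transform(features)  # shape: (200, 2)
```

`SelectKBest` is convenient when the downstream model needs a fixed number of inputs; `SelectPercentile` scales with the feature count, which suits text data where the vocabulary size is not known in advance.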

### Feature Selection in TfIdf Vectorizer
- NOTE: Usually not a good idea to mix univariate feature selection and Feature Selection with TfIdf Vectorizer parameter `max_df`

Example from tools/email_preprocess.py:
```python
### text vectorization--go from strings to lists of numbers
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                               stop_words='english')
features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed  = vectorizer.transform(features_test)
```
"`max_df=0.5`" parameter means that words with a document frequency of more than 0.5 will be removed.
- i.e. words that occur in more than 50% of the documents are not included as features
- used because words that common are probably not very 'powerful', i.e. they do not provide 'access' to information.
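
Conceptually, the `max_df` filter works roughly like the plain-Python sketch below. This is an illustration of the idea, not how `TfidfVectorizer` is implemented; the function name and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def filter_by_max_df(documents, max_df=0.5):
    """Return the vocabulary after dropping words found in too many documents."""
    n_docs = len(documents)
    doc_counts = Counter()
    for doc in documents:
        # set() so a word counts once per document, however often it repeats.
        doc_counts.update(set(doc.lower().split()))
    # Keep words whose document frequency does not exceed max_df.
    return sorted(
        word for word, count in doc_counts.items()
        if float(count) / n_docs <= max_df
    )


docs = [
    "the meeting is at noon",
    "the report is late",
    "the budget meeting",
    "send the numbers",
]
vocabulary = filter_by_max_df(docs, max_df=0.5)
```

Here "the" appears in all four documents and is dropped, while "meeting" and "is" (two of four documents, exactly 0.5) are kept.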

## Bias, Variance, And Number Of Features
__definition:__ Bias vs. Variance
- High bias algorithm pays _too little_ attention to data, not affected much by data (i.e. oversimplified)
    - characterized by _high error_ on training set, i.e. low r<sup>2</sup> / large SSE or sum of squared (residual) errors
    - _common when:_ too few features used in model
- High variance algorithm pays _too much_ attention to data, does not generalize well, i.e. it overfits to the data
    - characterized by _much higher error_ on test set than training (some variance between the two is expected).
    - _common when:_ carefully minimized SSE, with too many features i.e. overfit to data
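
A rough way to see both failure modes is to fit polynomials of increasing degree to noisy linear data. In this sketch the polynomial degree stands in for the number of features (each power of `x` acts as one feature); the data, seed, and names are invented for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)


def noisy_line(x):
    """True relationship is linear; noise is added on top."""
    return 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.shape)


x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_line(x_train)
x_test = rng.uniform(0.0, 1.0, size=100)
y_test = noisy_line(x_test)


def sse(coefficients, x, y):
    """Sum of squared residual errors for a fitted polynomial."""
    residuals = y - np.polyval(coefficients, x)
    return float(np.sum(residuals ** 2))


# degree -> (training SSE, test SSE)
results = {}
for degree in (0, 1, 7):
    coefficients = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        sse(coefficients, x_train, y_train),
        sse(coefficients, x_test, y_test),
    )
```

The degree-0 fit (too few features) has a large training SSE, which is the high-bias signature. The degree-7 fit drives training SSE lowest of the three; comparing its two SSE values in `results[7]` is how you would check for the high-variance signature.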

__trade-off:__ Quality of model vs. no. of features:

Goal is to balance:
- the performance/accuracy of the model on the training data
    - (without overfitting)
- with as few features as possible.
    - (while maintaining algorithm's performance)
    
### Visualizing Overfitting
#### An Overfit Regression
- Blue points are training data
- Red points are test data
![An Overfit Regression](lesson_11_images/overfit_regression.png)

### Regularization
__definition:__ automatically penalizing extra features in model

![Regularization graph](lesson_11_images/regularization_graph.png)

#### Lasso Regression (type of regularized regression)
`sklearn.linear_model.Lasso` [Documentation][lasso_doc] and [User Guide][lasso_user]
- Mathematical optimization of the bias-variance trade-off
- Minimizes SSE like basic regression
- But also minimizes a penalty term (penalty parameter * sum of absolute values of the regression coefficients)
    - the number of nonzero regression coefficients corresponds to the no. of features used
    - if a feature does not add enough precision, its coefficient is set to 0
        - simpler algorithm with this method since it can run through the formula in-place.
- This means that the penalty incurred by an additional feature must be offset by the gain in precision from adding that feature.
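
Written out, the quantity being minimized can be expressed as below. The symbols are assumptions chosen for this note: $y_i$ is the observed value, $\hat{y}_i$ the prediction, $\beta_j$ the coefficient of feature $j$, and $\lambda$ the penalty parameter.

$$
\min_{\beta} \; \underbrace{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}_{\text{SSE}} \; + \; \lambda \sum_{j=1}^{p} \left| \beta_j \right|
$$

Note that sklearn's `Lasso` scales the SSE term by $1/(2n)$ and calls the penalty parameter `alpha`, so its numeric values are not directly comparable to $\lambda$ in this form.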

   
![Lasso regression formula](lesson_11_images/lasso_regression_formula.png)
![Lasso regression penalty method](lesson_11_images/lasso_regression_penalty_method.png)
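
A minimal sketch of the coefficient-zeroing behaviour, using synthetic data where only the first of five features matters. The data, seed, and `alpha` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples = 200

# Five features; the target depends only on the first one.
features = rng.normal(size=(n_samples, 5))
target = 3.0 * features[:, 0] + rng.normal(scale=0.1, size=n_samples)

model = Lasso(alpha=0.1)  # alpha is the penalty parameter
model.fit(features, target)

coefficients = model.coef_
used_features = np.flatnonzero(coefficients)  # indices with nonzero coefficient
```

Inspecting `coefficients` shows which features the penalty discarded: those whose entries are exactly 0. The surviving coefficient is also shrunk slightly below 3, which is the price paid for the penalty.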

[lasso_doc]: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

[lasso_user]: http://scikit-learn.org/stable/modules/linear_model.html#lasso

## Mini-project! on feature selection
Katie explained in a video a problem that arose in preparing Chris and Sara’s email for the author identification project; it had to do with a feature that was a little too powerful (effectively acting like a signature, which gives an arguably unfair advantage to an algorithm). You’ll work through that discovery process here.

This bug was found when Katie was trying to make an overfit decision tree to use as an example in the decision tree mini-project. A decision tree is classically an algorithm that can be easy to overfit; one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.
If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low?

- Answer: Pretty low (accuracy on training set would be way higher)
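
The effect is easy to reproduce on synthetic data. This sketch is not the mini-project code; it assumes made-up data with 200 features (only one of them weakly informative) and just 30 training points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_features = 200


def make_data(n_points):
    """Labels depend weakly on feature 0; the other features are pure noise."""
    features = rng.normal(size=(n_points, n_features))
    labels = (features[:, 0] + rng.normal(size=n_points) > 0).astype(int)
    return features, labels


features_train, labels_train = make_data(30)   # small training set
features_test, labels_test = make_data(500)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features_train, labels_train)

train_accuracy = clf.score(features_train, labels_train)
test_accuracy = clf.score(features_test, labels_test)
```

An unrestricted tree can keep splitting until every training point is classified correctly, so `train_accuracy` is perfect while `test_accuracy` falls well short of it.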

A classic way to overfit an algorithm is by using lots of features and not a lot of training data. You can find the starter code in feature_selection/find_signature.py. Get a decision tree up and training on the training data, and print out the accuracy. How many training points are there, according to the starter code?

- Answer: 