## Sidenotes (definitions, code snippets, resources, etc.)
### ML Order of Operations
![order of operations](lesson_13_images/ml_order_of_operations.png)

### Python 3 change
- From Python 3.3, hash randomization is enabled by default, so dict keys can iterate in a different order from one run to the next (this can alter GridSearchCV's output).
    - See the note with the validation mini-project for info on converting code from 2.7 to 3.3.
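
If key order matters for reproducing a result, one possible workaround is to iterate over the keys in sorted order rather than relying on the dict's own iteration order. The snippet below is only a minimal sketch; the `parameters` dict mirrors the GridSearchCV example in these notes, and `ordered_keys` is a name introduced here for illustration.

```python
# Parameter dict in the same shape as the GridSearchCV example in these notes.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# sorted() returns the keys in a fixed order, independent of how the dict
# happens to iterate them in a given interpreter run.
ordered_keys = sorted(parameters)

for key in ordered_keys:
    print(key, parameters[key])
```

Uppercase letters sort before lowercase ones, so `'C'` comes before `'kernel'` here.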

# Cross Validation
sklearn User Guide [3.1. Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html)
- Used to estimate how well a model performs on unseen data by training and testing on several different splits of the data
- See the lesson video for an explanation
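
As a rough illustration of the idea, the sketch below scores a decision tree on five different train/test splits of the iris data. It assumes a scikit-learn version that provides `sklearn.model_selection` (0.18 or later); the dataset, the classifier, and `cv=5` are choices made here for the example, not something taken from the lesson.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
clf = DecisionTreeClassifier(random_state=42)

# cv=5 splits the data into 5 folds; each fold is used once as the test set
# while the remaining folds are used for training.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate of performance on unseen data
```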

## KFold in sklearn
- Does not shuffle the data by default; if the data is ordered (for example, sorted by class), the folds can be unrepresentative and the scores misleading.
- Use the keyword argument `shuffle=True` to randomize the events.
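
The effect of shuffling can be seen on a small made-up dataset whose labels are sorted by class. This is a sketch that assumes the newer `sklearn.model_selection.KFold` interface (`n_splits`, `shuffle`, `random_state`); the older `sklearn.cross_validation.KFold` takes different positional arguments. The toy arrays are invented for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 points whose labels are sorted by class (five 0s, then five 1s).
X = np.arange(10).reshape(-1, 1)
labels = np.array([0] * 5 + [1] * 5)

plain = KFold(n_splits=2)                                    # no shuffling
shuffled = KFold(n_splits=2, shuffle=True, random_state=42)  # shuffled folds

# Without shuffling, each test fold contains only a single class.
plain_test_labels = [labels[test].tolist() for _, test in plain.split(X)]
print(plain_test_labels)

# With shuffling, the indices are mixed before being divided into folds.
shuffled_test_idx = [test.tolist() for _, test in shuffled.split(X)]
print(shuffled_test_idx)
```

In the unshuffled case a classifier would be trained on one class and tested on the other, which is the kind of problem the note above warns about.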

## GridSearchCV in sklearn
`sklearn.grid_search.GridSearchCV` [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [User Guide](http://scikit-learn.org/stable/modules/grid_search.html#grid-search)

Example from documentation, explained:
```python
from sklearn import svm, grid_search, datasets
iris = datasets.load_iris()

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
```
- `parameters` is a dict mapping each parameter name to the values to try; each combination of values is used to train an SVM classifier.
- `svr = svm.SVC()` is the estimator passed to `GridSearchCV`, indicating which classifier to tune.
- `clf = grid_search.GridSearchCV(svr, parameters)` creates the classifier, which searches over a 'grid' of SVMs, one for each combination of values for (kernel, C).
- `clf.fit(iris.data, iris.target)` iterates through the grid, scoring each combination with cross-validation, and leaves `clf` fitted with the best-scoring parameter combination.
    - `clf.best_params_` holds those parameter values.
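
In scikit-learn 0.18 and later, `GridSearchCV` lives in `sklearn.model_selection`, and the `sklearn.grid_search` module was removed in 0.20. Assuming one of those newer versions is installed, the same example might be written as in the sketch below; the `n_candidates` name is introduced here only to show the size of the grid.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# 2 kernels x 2 values of C = 4 parameter combinations to evaluate.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

# cv_results_['params'] lists every combination that was tried.
n_candidates = len(clf.cv_results_['params'])
print(n_candidates)
print(clf.best_params_)  # the combination with the best cross-validated score
```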

Refer to the eigenfaces code, which you can find here. What parameters of the SVM are being tuned with GridSearchCV?

- _Answer:_ 5 values of C and 6 values of gamma are tested out.
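
To get a feel for the size of that search, the sketch below counts the combinations in a grid with 5 values of C and 6 values of gamma. The specific numbers in `param_grid` are placeholders chosen for illustration and are not copied from the eigenfaces code.

```python
from itertools import product

# Placeholder values for illustration only: 5 candidates for C, 6 for gamma.
# These are NOT taken from the eigenfaces code.
param_grid = {
    'C': [1, 10, 100, 1000, 10000],
    'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10],
}

# A grid search tries every (C, gamma) pair.
combinations = list(product(param_grid['C'], param_grid['gamma']))
print(len(combinations))  # 5 * 6 = 30
```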

## Mini-project! on validation
You’ll start by building the simplest imaginable (unvalidated) POI identifier. The starter code (validation/validate_poi.py) for this lesson is pretty bare--all it does is read in the data, and format it into lists of labels and features. Create a decision tree classifier (just use the default parameters), train it on all the data (you will fix this in the next part!), and print out the accuracy. THIS IS AN OVERFIT TREE, DO NOT TRUST THIS NUMBER! Nonetheless, what’s the accuracy?

- _Answer:_ 0.98947368421052628. 
    - "Pretty high accuracy, huh?  Yet another case where testing on the training data would make you think you were doing amazingly well, but as you already know, that's exactly what holdout test data is for..."


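```python
from validate_poi import *

### it's all yours from here forward!
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier()
clf.fit(features, labels)      # trains on ALL the data, so the tree overfits
pred = clf.predict(features)   # predicts on the same data it was trained on
print(accuracy_score(labels, pred))  # argument order is (y_true, y_pred)
```

Output: `0.98947368421052628`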

Now you’ll add in training and testing, so that you get a trustworthy accuracy number. Use the `train_test_split` validation available in `sklearn.cross_validation` (moved to `sklearn.model_selection` in scikit-learn 0.18):
- hold out 30% of the data for testing, and
- set the `random_state` parameter to 42 (`random_state` controls which points go into the training set and which are used for testing; setting it to 42 means we know exactly which events are in which set, and can check the results you get).

What’s your updated accuracy?

- _Answer:_ 0.72413793103448276
    - "Properly deployed, testing data brings us back down to earth after that 99% accuracy in the last quiz."

```python
from validate_poi import *

### it's all yours from here forward!
# In scikit-learn 0.18 and later, import this from sklearn.model_selection instead.
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)  # train only on the training split
pred = clf.predict(features_test)      # evaluate on the held-out 30%
print(accuracy_score(labels_test, pred))  # argument order is (y_true, y_pred)
```

Output: `0.72413793103448276`