## Sidenotes (definitions, code snippets, resources, etc.)
- Original in `nd_machine_learning/nd_ml_course_code/projects/boston_housing/`
- Symlinked to `intro_to_ml/ud120-projects/validation/` for lesson 13 in Intro to ML course.

### ML Order of Operations
![order of operations](cross_validation_images/ml_order_of_operations.png)

### Python 3 change
- From Python 3.3, hash randomization is enabled by default, so dict keys may be iterated in a different order on each run (which can alter GridSearchCV's output).
    - See the note with the validation mini-project for info on converting code from 2.7 to 3.3.

# Cross Validation
sklearn User Guide [3.1. Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html):
- When evaluating different settings (“hyperparameters”) for estimators, such as the `C` setting that must be manually set for an SVM, there is still a risk of overfitting _on the test set_ because the parameters can be tweaked until the estimator performs optimally. 
- This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. 
- To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
- However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
- A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called _k_-fold CV, the training set is split into _k_ smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the _k_ “folds”:
    - A model is trained using `k-1` of the folds as training data;
    - the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
    - The performance measure reported by _k_-fold cross-validation is then the average of the values computed in the loop. 
- This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
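
The _k_-fold procedure above can be sketched without any library. This is a minimal illustration, not sklearn's implementation: the helper names `k_fold_indices` and `cross_validate` are hypothetical, and the "model" is simply the mean of the training targets, scored by mean squared error.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(y, k):
    """Return one mean squared error per fold for a predict-the-mean model."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the k-1 folds that are not the current one.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        prediction = mean(y[j] for j in train_idx)
        # Validate on the held-out fold.
        scores.append(mean((y[j] - prediction) ** 2 for j in test_idx))
    return scores


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_scores = cross_validate(y, k=3)
cv_score = mean(fold_scores)  # the value reported by k-fold CV
print(fold_scores, cv_score)
```

Because the folds here are contiguous and the toy targets are sorted, the first and last folds score much worse than the middle one, which is one reason shuffling before splitting is often a good idea.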

## KFold in sklearn
- in sklearn: `sklearn.cross_validation.`[__`KFold()`__](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html)
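- Does not shuffle the data by default, so the folds follow the original sample order (this can hurt performance when the data is sorted, e.g. by label).
- Use the keyword argument `shuffle=True` to shuffle the samples before splitting.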
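
To see the effect of `shuffle`, here is a small sketch. It assumes a newer scikit-learn release, where `KFold` lives in `sklearn.model_selection` and takes `n_splits` (the `sklearn.cross_validation` module used elsewhere in these notes was removed in later versions). The toy array `X` is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)  # six toy samples, indices 0..5

# Default: no shuffling, so test folds are contiguous blocks of indices.
ordered = KFold(n_splits=3)
ordered_test_folds = [test.tolist() for _, test in ordered.split(X)]

# shuffle=True permutes the indices first; random_state makes it repeatable.
shuffled = KFold(n_splits=3, shuffle=True, random_state=42)
shuffled_test_folds = [test.tolist() for _, test in shuffled.split(X)]

print(ordered_test_folds)
print(shuffled_test_folds)
```

In both cases every sample lands in exactly one test fold; only the assignment of samples to folds changes.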

Example usage from [Cross-validation on diabetes Dataset Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html):
```python
lasso_cv = linear_model.LassoCV(alphas=alphas)
k_fold = cross_validation.KFold(len(X), 3)

...

for k, (train, test) in enumerate(k_fold):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_,
                 lasso_cv.score(X[test],
                 y[test])))
```

## GridSearchCV in sklearn
`sklearn.grid_search`.GridSearchCV [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) and [User Guide](http://scikit-learn.org/stable/modules/grid_search.html#grid-search):
- Parameters that are not directly learnt within estimators can be set by searching a parameter space for the best performance score. 
- Typical examples include `C`, `kernel` and `gamma` for Support Vector Classifier, `alpha` for Lasso, etc.
- Parameters passed to GridSearchCV in a _parameter space_ are often referred to as _hyperparameters_ (particularly in Bayesian learning), distinguishing them from the parameters optimised in a machine learning procedure.

Example from documentation, explained:
```python
from sklearn import svm, grid_search, datasets
iris = datasets.load_iris()

svr = svm.SVC()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
```
- `parameters` is a dict mapping parameter names to the values to try; each combination is used to train an SVM classifier.
- `svr = svm.SVC()` is passed to GridSearchCV to indicate which estimator to tune.
- `clf = grid_search.GridSearchCV(svr, parameters)` creates the classifier by generating a 'grid' of SVMs from each of the given combinations of values for (kernel, C).
- `clf.fit(iris.data, iris.target)` iterates through the grid, returning a fitted classifier automatically tuned to the optimal parameter combination.
    - `clf.best_params_` returns those parameter values.
    - `clf.best_estimator_` returns the optimized estimator.
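
For reference, the same example might look like this on a newer scikit-learn release. This is a sketch under the assumption that `GridSearchCV` is imported from `sklearn.model_selection` (the `sklearn.grid_search` module was removed in later versions); the parameter grid is the one from the example above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Same grid as above: 2 kernels x 2 values of C = 4 candidate SVMs.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)     # the winning parameter combination
print(clf.best_estimator_)  # an SVC refit on all the data with those parameters
print(clf.best_score_)      # mean cross-validated score of the winner
```

With the default `refit=True`, the fitted `clf` can be used directly for `predict`, since it delegates to `best_estimator_`.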

Refer to the eigenfaces code, which you can find here. What parameters of the SVM are being tuned with GridSearchCV?

- _Answer:_ 5 values of C and 6 values of gamma are tested out.
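
A grid of 5 values of `C` and 6 values of `gamma` means 30 candidate models per cross-validation split. The sketch below counts them with `ParameterGrid`, assuming a newer scikit-learn release (`sklearn.model_selection`). The specific values are hypothetical placeholders, not necessarily the ones in the eigenfaces code.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical values: 5 settings for C and 6 for gamma.
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 1e-1],
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)  # every (C, gamma) combination
first_candidate = list(grid)[0]

print(n_candidates)
print(first_candidate)
```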

## Mini-project! on validation
You’ll start by building the simplest imaginable (unvalidated) POI identifier. The starter code (validation/validate_poi.py) for this lesson is pretty bare--all it does is read in the data, and format it into lists of labels and features. Create a decision tree classifier (just use the default parameters), train it on all the data (you will fix this in the next part!), and print out the accuracy. THIS IS AN OVERFIT TREE, DO NOT TRUST THIS NUMBER! Nonetheless, what’s the accuracy?

- _Answer:_ 0.98947368421052628. 
    - "Pretty high accuracy, huh?  Yet another case where testing on the training data would make you think you were doing amazingly well, but as you already know, that's exactly what holdout test data is for..."


```python
from validate_poi import *

### it's all yours from here forward!
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
clf = DecisionTreeClassifier()
clf.fit(features, labels)
pred = clf.predict(features)
accuracy_score(pred, labels)
# Out: 0.98947368421052628
```

Now you’ll add in training and testing, so that you get a trustworthy accuracy number. Use the `train_test_split` validation available in `sklearn.cross_validation`:
- hold out 30% of the data for testing, and
- set the `random_state` parameter to 42 (`random_state` controls which points go into the training set and which are used for testing; setting it to 42 means we know exactly which events are in which set, and can check the results you get).

What’s your updated accuracy?

- _Answer:_ 0.72413793103448276
    - "Properly deployed, testing data brings us back down to earth after that 99% accuracy in the last quiz."

```python
from validate_poi import *

### it's all yours from here forward!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)
# Out: 0.72413793103448276
```
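
The gap between the two accuracies can be reproduced without the course data. The sketch below is an illustration under assumptions: it uses a synthetic dataset from `make_classification` (not the POI data) and the newer `sklearn.model_selection` import path, so the numbers will differ from the answers above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the POI data, with 20% of the labels flipped as noise.
features, labels = make_classification(
    n_samples=200, n_features=10, flip_y=0.2, random_state=42)

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(features_train, labels_train)

# Scoring on the training data rewards memorization, noise included.
train_accuracy = accuracy_score(labels_train, clf.predict(features_train))
# Scoring on held-out data estimates how the tree generalizes.
test_accuracy = accuracy_score(labels_test, clf.predict(features_test))

print(train_accuracy, test_accuracy)
```

An unpruned decision tree can memorize its training set, so the training accuracy says little about generalization; the held-out score is the one to trust.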