<img src='img/logo.png'>
<img src='img/title.png'>

# Table of Contents
* [Classification](#Classification)
	* [Logistic regression](#Logistic-regression)
	* [*k*-nearest neighbors](#*k*-nearest-neighbors)
* [Cross-validation](#Cross-validation)
	* [Cross-validation in scikit-learn](#Cross-validation-in-scikit-learn)
	* [Stratified K-Fold cross-validation and other strategies](#Stratified-K-Fold-cross-validation-and-other-strategies)
	* [More control over cross-validation](#More-control-over-cross-validation)
		* [Leave-One-Out cross-validation](#Leave-One-Out-cross-validation)
		* [Shuffle-Split cross-validation](#Shuffle-Split-cross-validation)
* [Grid Search](#Grid-Search)
	* [Simple Grid-Search](#Simple-Grid-Search)
	* [The danger of overfitting the parameters and the validation set](#The-danger-of-overfitting-the-parameters-and-the-validation-set)
	* [Grid-search with cross-validation](#Grid-search-with-cross-validation)
		* [`sklearn.model_selection.GridSearchCV`](#sklearn.model_selection.GridSearchCV)
	* [Analyzing the result of cross-validation](#Analyzing-the-result-of-cross-validation)
	* [Nested cross-validation](#Nested-cross-validation)
* [Summary](#Summary)


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.rcParams['image.interpolation'] = "none"
np.set_printoptions(precision=3)
import matplotlib as mpl
mpl.rcParams['legend.numpoints'] = 1
import seaborn as sns

import sys
import src.mglearn as mglearn

# Classification

Classification models assign each observation to one of a set of discrete labels.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
print(iris['DESCR'][:471])

In [None]:
sns.pairplot(pd.read_csv('data/iris.csv'), hue='species')

## Logistic regression

<div class='alert alert-info'>
<img src='./img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<big><big><br>
LogisticRegression models predict probabilities for association to discrete labels.
<br><br>
</big></big>
</div>

Logistic regression is conceptually similar to linear regression.
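
One way to see the similarity: both models start from a weighted sum of the features, and logistic regression then passes that sum through the logistic (sigmoid) function to get a probability. The block below is a minimal two-class sketch with made-up numbers; `weights`, `bias`, and `predict_proba_binary` are illustrative names, not scikit-learn API.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_binary(weights, bias, features):
    """Linear score (as in linear regression), then squashed to a probability."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical coefficients, chosen only for illustration
weights = [0.8, -1.2]
bias = 0.1

p = predict_proba_binary(weights, bias, [2.0, 1.0])
print("probability of the positive class:", round(p, 4))
```

A fitted `LogisticRegression` stores the analogous quantities in `coef_` and `intercept_`.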

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

Splitting a data set into separate sets of data for training and testing (validation) allows evaluation of model fit on data outside the training inputs.

The next cells show a single split using `train_test_split` from [sklearn.model_selection](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
X = iris.data
y = iris.target

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

We predicted the correct class on 87% of the samples in X_test

In [None]:
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)

In [None]:
logreg.predict(X_test)[:5]

Let's look at the individual class probabilities for each test observation. The class with the highest probability is the one returned by `.predict()`.

In [None]:
pd.DataFrame(logreg.predict_proba(X_test), columns=iris.target_names).head()
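
The relationship between `.predict_proba()` and `.predict()` can be checked directly. This sketch refits a model on the same split under separate variable names; `max_iter=1000` is an assumption added here only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(demo.data, demo.target, random_state=0)

# max_iter raised only to silence convergence warnings (assumption)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)                  # one row per sample, one column per class
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]
labels_from_predict = model.predict(X_te)

same = np.array_equal(labels_from_proba, labels_from_predict)
print("predict() matches argmax of predict_proba():", same)
```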

## *k*-nearest neighbors

<div class='alert alert-info'>
<img src='./img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<big><big><br>
KNN predicts new observations only by considering proximity to training set
<br><br>
</big></big>
</div>

The more neighbors that are considered the more *general* the model is. Considering only 1 nearest neighbor is likely *overfitting*.

The KNN classifier is considered a good exploratory model but may not be heavily used in production.
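
To make the effect of the neighbor count concrete, here is a plain-Python sketch of majority voting on a made-up toy set. `knn_predict` is a hypothetical helper, not the scikit-learn implementation. One stray "b" point sits inside the "a" cluster, so a single neighbor follows it while three neighbors outvote it.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up): the last "b" point lies inside the "a" cluster
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (1.4, 1.3)]
train_labels = ["a", "a", "a", "b", "b", "b"]
query = (1.35, 1.25)

pred_k1 = knn_predict(train_points, train_labels, query, k=1)
pred_k3 = knn_predict(train_points, train_labels, query, k=3)
print("k=1:", pred_k1, " k=3:", pred_k3)
```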

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Considering three neighbors in 4-dimensional space leads to 97% accuracy.

We got lucky that we didn't run into the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

In [None]:
knn3 = KNeighborsClassifier(n_neighbors=3)

knn3.fit(X_train, y_train)
knn3.score(X_test, y_test)

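This model looks much more decisive: with three equally weighted neighbors, each predicted probability can only be 0, 1/3, 2/3, or 1.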

In [None]:
pd.DataFrame(knn3.predict_proba(X_test), columns=iris.target_names).head()

# Cross-validation

What happens to our score if:
- we retain more or fewer rows in `train_test_split`?
- we call `train_test_split` with a different `random_state`?

A good model will have a stable score under each of the above conditions.
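
A quick way to probe the second question is to repeat the split with several seeds and compare the scores. This is a sketch under assumed settings: the seeds `0..4` and `max_iter=1000` are choices made here, not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()
split_scores = []
for seed in range(5):
    # A different random_state gives a different train/test partition
    X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

spread = max(split_scores) - min(split_scores)
print("scores per seed:", split_scores)
print("spread between best and worst split:", spread)
```

A small spread suggests the score does not hinge on one lucky split; cross-validation formalizes this idea.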

<div class='alert alert-info'>
<img src='./img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<big><big><br>
Cross validation fold: a random single split of training and testing data
<br><br>
</big></big>
</div>

The following graphic shows the idea of 5-fold cross validation. The data are divided into 5 groups (folds) and the model is trained in 5 rounds, where each round trains on 4 of the groups and then scores the model on the remaining group.

In [None]:
mglearn.plots.plot_cross_validation()
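
The bookkeeping behind the graphic can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for this illustration; it uses contiguous, unshuffled folds, which is also what scikit-learn's `KFold` does unless `shuffle=True` is passed.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds] * n_folds
    # Spread any remainder over the first folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for round_number, (train, test) in enumerate(folds, start=1):
    print("round", round_number, "train:", train, "test:", test)
```

Every row lands in exactly one test fold, so each observation is used for validation once and for training in the other rounds.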

## Cross-validation in scikit-learn

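In scikit-learn releases before 0.22, `sklearn.model_selection.cross_val_score` with default arguments runs 3 cross validation folds. Newer releases default to 5 folds, so the cell below passes `cv=3` explicitly.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logreg, iris.data, iris.target, cv=3)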

The model is very stable when choosing 3 folds with 1/3 of the data retained for validation in each fold

In [None]:
scores

In [None]:
# Changing to 5 folds
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
scores

In [None]:
# Summarize scores
scores.mean()

## Stratified K-Fold cross-validation and other strategies

Stratification in sampling can be useful when class membership in the input data is not uniformly distributed.  Some example problems where stratified sampling may be helpful include:
 * Fraud detection (find a few fraudsters out of many ok customers)
 * Health sciences (predict presence / absence of uncommon condition)
 * Reliability engineering (few failing samples and many ok samples)
 
This [lecture from Johns Hopkins University](http://ocw.jhsph.edu/courses/statmethodsforsamplesurveys/PDFs/Lecture4.pdf) provides an overview of stratified sampling in health sciences.

Notice that the observations are ordered by species.

Luckily, `cross_val_score` didn't grab rows sequentially!

In [None]:
iris.target

In [None]:
mglearn.plots.plot_stratified_cross_validation()
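
The difference shows up clearly on labels that are sorted by class, as in iris. The sketch below uses a stand-in label array of three classes with 50 rows each, plus a placeholder feature matrix, and records which classes appear in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Stand-in for iris.target: three classes, 50 rows each, sorted by class
labels = np.repeat([0, 1, 2], 50)
features = np.zeros((len(labels), 1))  # placeholder; only the row count matters here

plain_classes = [
    set(labels[test].tolist())
    for _, test in KFold(n_splits=3).split(features)
]
strat_classes = [
    set(labels[test].tolist())
    for _, test in StratifiedKFold(n_splits=3).split(features, labels)
]

print("classes per validation fold, KFold:          ", plain_classes)
print("classes per validation fold, StratifiedKFold:", strat_classes)
```

With unshuffled `KFold`, each validation fold holds exactly one class, so the model is scored on a class it never saw in training.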

## More control over cross-validation

Use `KFold` and `StratifiedKFold` to re-use the folding or control cross validation parameters.

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold

Retain 1/5 for validation

In [None]:
# no random_state: it only has an effect (and is only accepted by recent releases) with shuffle=True
kfold = KFold(n_splits=5)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)

Retain 1/3 for validation. Because the rows are ordered by species and the folds are not shuffled, a whole species is missing from training in every fold.

In [None]:
# unshuffled on purpose, so each validation fold is a single species
kfold = KFold(n_splits=3)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)

Take 1/3 of each species for validation in every fold, so each species contributes 1/9 of the full data set to each validation set.

For classification the default behavior of `cross_val_score` is to *stratify*

In [None]:
# no random_state: it only has an effect (and is only accepted by recent releases) with shuffle=True
kfold = StratifiedKFold(n_splits=3)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)

### Leave-One-Out cross-validation

Make *N* folds, one per row: each round trains on *N-1* rows and validates on the single row left out.

In [None]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("number of cv iterations: ", len(scores))
print("mean accuracy: ", scores.mean())
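
The splitting scheme itself is small enough to write out by hand. `leave_one_out` below is a hypothetical helper for illustration, not the scikit-learn class.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_indices) with exactly one held-out row per round."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]

splits = list(leave_one_out(4))
for train, test in splits:
    print("train:", train, "test:", test)
```

Since each round scores a single row, every per-round accuracy is either 0 or 1, and only the mean over all rounds is informative.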

### Shuffle-Split cross-validation

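[ShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html) draws each train/test split independently at random. Unlike K-Fold, it does not guarantee that the splits are all different, and the same row can appear in several test sets.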

In [None]:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
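
Because each split is drawn independently, test sets from different iterations can overlap. The sketch below makes that visible on ten placeholder rows; the `random_state=0` seed is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

samples = np.arange(10).reshape(-1, 1)  # ten placeholder rows
splitter = ShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

test_sets = [set(test.tolist()) for _, test in splitter.split(samples)]
total_test_slots = sum(len(s) for s in test_sets)
distinct_test_rows = len(set().union(*test_sets))

print("test sets:", test_sets)
print("test slots filled:", total_test_slots, " distinct rows used:", distinct_test_rows)
```

Three test sets of five rows each cannot all be disjoint when only ten rows exist, so some rows are validated more than once.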

# Grid Search

Grid search can eliminate hand tuning of model parameters by applying a range of values for each calibrated parameter.  The limitation is that the search space can become large.  Using a grid search approach requires having some model evaluation statistic, such as accuracy or __R<sup>2</sup>__ score.
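
The growth of the search space is easy to quantify: the number of candidate models is the product of the number of values tried per parameter. The sketch below enumerates a two-parameter grid with `itertools.product`, then adds a third parameter (`kernel`, picked here purely as an illustration) to show the multiplication.

```python
from itertools import product

grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

# Every combination of one value per parameter
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
n_two = len(combinations)

# Adding a third parameter multiplies the count again
grid_three = dict(grid, kernel=["rbf", "poly", "sigmoid"])
n_three = len(list(product(*grid_three.values())))

print("two parameters:", n_two, "candidates")
print("three parameters:", n_three, "candidates")
```

With 5-fold cross validation, each candidate is fitted five times, so the fit count is five times these numbers.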

## Simple Grid-Search

Here's an example of how a custom grid search for model parameters could be done: trying a variety of `gamma` and `C` parameters to a support vector classifier (`SVC`).

In [None]:
# naive grid search implementation
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Size of training set: %d   size of test set: %d" % (X_train.shape[0], X_test.shape[0]))

best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters
        # train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the test set 
        score = svm.score(X_test, y_test)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
            
print("best score: ", best_score)
print("best parameters: ", best_parameters)

In [None]:
best_score

## The danger of overfitting the parameters and the validation set

In [None]:
print("threefold_split")
mglearn.plots.plot_threefold_split()

In [None]:
from sklearn.svm import SVC
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
# split train+validation set into training and validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)

print("Size of training set: %d   size of validation set: %d   size of test set: %d" % (X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters
        # train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the validation set
        score = svm.score(X_valid, y_valid)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# rebuild a model on the combined training and validation set, and evaluate it on the test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("best score on validation set: ", best_score)
print("best parameters: ", best_parameters)
print("test set score with best parameters: ", test_score)

## Grid-search with cross-validation

This custom grid search example selects by the mean of cross validation scores taken from 5 folds.

In [None]:
# reference: manual_grid_search_cv
# reset the running best so results from the previous cell do not carry over
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters
        # train an SVC
        svm = SVC(gamma=gamma, C=C)
        # perform cross-validation
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        # compute mean cross-validation accuracy
        score = np.mean(scores)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
# rebuild a model on the combined training and validation set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)

In [None]:
mglearn.plots.plot_cross_val_selection()

### `sklearn.model_selection.GridSearchCV`

[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) automates the looping-over-models we have done in the cells above.

`GridSearchCV` takes a `param_grid` dictionary to specify the parameter search space.  

The example below will end up trying 36 models (`len(param_grid['C']) * len(param_grid['gamma'])`).

In [None]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
param_grid

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
grid_search = GridSearchCV(SVC(), param_grid, cv=5)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.score(X_test, y_test)

In [None]:
print(grid_search.best_params_)
print(grid_search.best_score_)

In [None]:
grid_search.best_estimator_
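
With the default `refit=True`, the search object retrains the best parameter combination on the full training data and forwards `predict` and `score` to that model. The sketch below checks this on a smaller grid; the grid values and variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris_data.data, iris_data.target, random_state=0)

small_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}  # reduced grid for a quick run
search = GridSearchCV(SVC(), small_grid, cv=5).fit(X_tr, y_tr)

# The search object delegates to the refitted best estimator
same_predictions = np.array_equal(
    search.predict(X_te), search.best_estimator_.predict(X_te)
)
same_score = search.score(X_te, y_te) == search.best_estimator_.score(X_te, y_te)

print("best parameters:", search.best_params_)
print("delegates predict:", same_predictions, " delegates score:", same_score)
```

Note that `best_score_` is the mean cross-validation score on the training data, while `score(X_te, y_te)` measures the refitted model on held-out data, so the two numbers generally differ.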

## Analyzing the result of cross-validation

After fitting, the `GridSearchCV` object has the attribute `cv_results_`, a dictionary of arrays holding the parameters, per-fold scores, and mean validation score for each model. Older releases exposed this information as `grid_scores_`, which was deprecated in scikit-learn 0.18 and removed in 0.20.

In [None]:
grid_search.cv_results_

In [None]:
# Can easily convert to dataframe
import pandas as pd
scores = pd.DataFrame(grid_search.cv_results_)
scores.head(n=3)
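
Reshaping the score column into a grid relies on knowing the order in which combinations were tried. A more explicit alternative, sketched below on a reduced grid, places each mean score by looking up its own parameters in `cv_results_['params']`. The grid values and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data = load_iris()
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}  # reduced grid for a quick run
search = GridSearchCV(SVC(), grid, cv=5).fit(data.data, data.target)

results = search.cv_results_
# Rows follow grid["C"], columns follow grid["gamma"]
table = np.full((len(grid["C"]), len(grid["gamma"])), np.nan)
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    row = grid["C"].index(params["C"])
    col = grid["gamma"].index(params["gamma"])
    table[row, col] = mean_score

best_row, best_col = np.unravel_index(np.argmax(table), table.shape)
print(table)
print("best cell: C =", grid["C"][best_row], " gamma =", grid["gamma"][best_col])
```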

In [None]:
# Get the mean of cross val scores for each item in grid
scores = np.array(scores.mean_test_score).reshape(6, 6)

# plot the mean cross-validation scores
mglearn.tools.heatmap(scores, xlabel='gamma', ylabel='C', xticklabels=param_grid['gamma'],
                      yticklabels=param_grid['C'], cmap="viridis");

In [None]:
# trial and error with different parameter grids, trying log vs linear spacing params
fig, axes = plt.subplots(1, 3, figsize=(13, 5))

param_grid_linear = {'C': np.linspace(1, 2, 6),
                     'gamma':  np.linspace(1, 2, 6)}

param_grid_one_log = {'C': np.linspace(1, 2, 6),
                     'gamma':  np.logspace(-3, 2, 6)}

param_grid_range = {'C': np.logspace(-3, 2, 6),
                     'gamma':  np.logspace(-7, -2, 6)}

for param_grid, ax in zip([param_grid_linear, param_grid_one_log,
                           param_grid_range], axes):
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    scores = pd.DataFrame(grid_search.cv_results_)
    scores = np.array(scores.mean_test_score).reshape(6, 6)

    # plot the mean cross-validation scores
    scores_image = mglearn.tools.heatmap(scores, xlabel='gamma', ylabel='C', xticklabels=param_grid['gamma'],
                                         yticklabels=param_grid['C'], cmap="viridis", ax=ax)
    
plt.colorbar(scores_image, ax=axes.tolist())
print("gridsearch_failures")

## Nested cross-validation

So far we have seen cross validation of each model within a grid of parameters, but we can also nest the `GridSearchCV` itself in an outer cross validation cycle, as shown below:

In [None]:
scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5), iris.data, iris.target, cv=5)
print("Cross-validation scores: ", scores)
print("Mean cross-validation score: ", scores.mean())

In [None]:
def nested_cv(X, y, inner_cv, outer_cv, Classifier, parameter_grid):
    outer_scores = []
    # for each split of the data in the outer cross-validation
    # (split method returns indices)
    for training_samples, test_samples in outer_cv.split(X, y):
        # find best parameter using inner cross-validation:
        best_params = {}
        best_score = -np.inf
        # iterate over parameters
        for parameters in parameter_grid:
            # accumulate score over inner splits
            cv_scores = []
            # iterate over inner cross-validation
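            for inner_train, inner_test in inner_cv.split(X[training_samples], y[training_samples]):
                # inner indices are positions within the outer training subset,
                # so map them back to row indices of the full data set
                inner_train = training_samples[inner_train]
                inner_test = training_samples[inner_test]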
                # build classifier given parameters and training data
                clf = Classifier(**parameters)
                clf.fit(X[inner_train], y[inner_train])
                # evaluate on inner test set
                score = clf.score(X[inner_test], y[inner_test])
                cv_scores.append(score)
            # compute mean score over inner folds
            mean_score = np.mean(cv_scores)
            if mean_score > best_score:
                # if better than so far, remember parameters
                best_score = mean_score
                best_params = parameters
        # build classifier on best parameters using outer training set
        clf = Classifier(**best_params)
        clf.fit(X[training_samples], y[training_samples])
        # evaluate 
        outer_scores.append(clf.score(X[test_samples], y[test_samples]))
    return outer_scores

In [None]:
from sklearn.model_selection import ParameterGrid, StratifiedKFold
nested_cv(iris.data, iris.target, StratifiedKFold(5), StratifiedKFold(5), SVC, ParameterGrid(param_grid))

# Summary

In this notebook, we reviewed the following topics in preparation for more advanced topics:

 * Train / test split
 * [Cross-validation](#Cross-validation)
 * [Cross-validation in scikit-learn](#Cross-validation-in-scikit-learn)
 * [Leave-One-Out cross-validation](#Leave-One-Out-cross-validation)
 * [Shuffle-Split cross-validation](#Shuffle-Split-cross-validation)
 * [Grid Search](#Grid-Search)
 * [Grid-search with cross-validation](#Grid-search-with-cross-validation)
 * [Analyzing the result of cross-validation](#Analyzing-the-result-of-cross-validation)
 

<a href='Cross_Validation_and_Grid_Searches_Exercises.ipynb' class='btn btn-primary btn-lg'>Exercises</a>

<img src='img/copyright.png'>