# Model Selection

The functions used here are listed below:

- [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)
- [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

## Cross Validation

In [12]:
import numpy as np
from sklearn import svm, preprocessing, datasets
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In scikit-learn, a random split into training and test sets can be computed quickly with the `train_test_split` helper function. The `train_size` and `test_size` parameters specify the sizes of the training and test sets; when neither is given, 75% of the samples go to the training set and 25% to the test set.
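As a small sketch of the size parameters described above, the snippet below first relies on the default split and then passes an explicit `test_size`. The `stratify` argument is an optional extra that the notebook itself does not use; it is included here only to illustrate how class proportions can be kept in both parts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# No sizes given: falls back to the default 75% / 25% split.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
print(X_train.shape, X_test.shape)

# Explicit 40% test split; `stratify` (not used in the notebook) keeps
# the class proportions of iris.target in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.4,
    stratify=iris.target, random_state=0)
print(np.bincount(y_te))  # number of test samples per class
```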

In [13]:
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test)) 

(150, 4) (150,)
(90, 4) (90,)
(60, 4) (60,)
0.966666666667


### Cross Validation using `cross_val_score`

The simplest way to use cross-validation is to call the `cross_val_score` helper function on the estimator and the dataset. **It can be parallelized using the `n_jobs` argument. For classifiers, when `cv` is an integer the folds are stratified, which preserves the percentage of samples of each class.**

In [14]:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
Accuracy: 0.98 (+/- 0.03)


By default, the score computed at each CV iteration is the result of the estimator's `score` method. This can be changed with the `scoring` parameter.
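A minimal sketch of the `scoring` parameter, reusing the linear SVC on the iris data. The choice of `'f1_macro'` is only an illustration; any of scikit-learn's predefined scorer names could be used instead.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Default scoring: clf.score, which is accuracy for classifiers.
acc = cross_val_score(clf, iris.data, iris.target, cv=5)

# Same folds, but scored with the macro-averaged F1 measure.
f1 = cross_val_score(clf, iris.data, iris.target, cv=5,
                     scoring='f1_macro')

print("Accuracy: %0.2f" % acc.mean())
print("F1 macro: %0.2f" % f1.mean())
```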

### Cross Validation using `cross_validate`

The `cross_validate` function differs from `cross_val_score` in two ways:

- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit times and score times in addition to the test scores, plus training scores when `return_train_score=True`.

Here is [an example](http://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation) showing the usage of `cross_validate`.
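The two differences can be sketched as follows. The metric names and `cv=5` are choices made for this illustration, not something the notebook prescribes.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Two metrics evaluated in one pass; training scores are requested
# explicitly with return_train_score=True.
results = cross_validate(clf, iris.data, iris.target, cv=5,
                         scoring=['precision_macro', 'recall_macro'],
                         return_train_score=True)

# The result is a dict of arrays, one entry per fold.
for key in sorted(results):
    print(key, results[key])
```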

## Pipeline

A pipeline chains a list of transforms with a final estimator and applies them sequentially. Intermediate steps of the pipeline must be 'transforms', that is, they must implement `fit` and `transform` methods; the final estimator only needs to implement `fit`. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
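To illustrate "setting different parameters", here is a short sketch of how the parameters of individual steps can be addressed. It assumes a pipeline built with `make_pipeline`, which names each step after its lowercased class name.

```python
from sklearn import preprocessing, svm
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# Step names generated by make_pipeline.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# A parameter of a step is addressed as <step name>__<parameter name>.
pipe.set_params(svc__C=10)
print(pipe.named_steps['svc'].C)
```

The same `<step>__<parameter>` naming is what a parameter grid uses when the pipeline itself is tuned.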

### Without Pipeline

Just as it is important to test a predictor on data held out from training, preprocessing steps (such as standardization or feature selection) and similar data transformations should be learnt from the training set and then applied to held-out data for prediction:

In [15]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)
clf.score(X_test_transformed, y_test) 

0.93333333333333335

### With Pipeline

In [16]:
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
# Note: this cross-validates `clf` (the bare SVC from the previous cell), not `pipe`
print(cross_val_score(clf, iris.data, iris.target, cv=10))


0.933333333333
[ 1.          0.93333333  1.          1.          1.          0.93333333
  0.93333333  1.          1.          1.        ]
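To cross-validate the scaling and the classifier together, the pipeline itself can be passed to `cross_val_score`. The sketch below assumes the same iris data and the same `StandardScaler` + `SVC(C=1)` pipeline. The scaler is then refit on the training part of every fold, so no statistics leak from the held-out part.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# Each fold fits the scaler and the SVC on that fold's training part only.
pipe_scores = cross_val_score(pipe, iris.data, iris.target, cv=10)
print(pipe_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (pipe_scores.mean(), pipe_scores.std() * 2))
```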


## Tuning the hyper parameters of an estimator

Two generic approaches to sampling search candidates are provided in scikit-learn: for given values, `GridSearchCV` exhaustively considers all parameter combinations, while `RandomizedSearchCV` can sample a given number of candidates from a parameter space with a specified distribution.
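Since the rest of this section only demonstrates `GridSearchCV`, here is a hedged sketch of the randomized alternative. The log-uniform ranges, `n_iter=8` and `cv=5` are assumptions picked for illustration; `loguniform` comes from `scipy.stats` (SciPy 1.4 or later).

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV, train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Distributions to sample from, instead of fixed lists of values.
param_distributions = {
    'kernel': ['rbf'],
    'C': loguniform(1, 1000),
    'gamma': loguniform(1e-4, 1e-3),
}

# n_iter controls how many parameter settings are sampled.
search = RandomizedSearchCV(svm.SVC(), param_distributions,
                            n_iter=8, cv=5, random_state=0)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Unlike the exhaustive search, the cost here is fixed by `n_iter` rather than by the size of the grid.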

### Exhaustive Grid Search

In [17]:
# Loading the Digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf', 'linear'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]}]

In [18]:
clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=10, n_jobs=-1)
clf.fit(X_train, y_train)
print("Best parameters set found on development set:")
print(clf.best_params_)
clf.cv_results_

Best parameters set found on development set:
{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}


{'mean_fit_time': array([ 0.07720406,  0.02886918,  0.06144266,  0.02776921,  0.0733516 ,
         0.02796974,  0.03527462,  0.03427384,  0.06794791,  0.03367379,
         0.03572505,  0.04703333,  0.12724025,  0.06049283,  0.07220111,
         0.05754068]),
 'mean_score_time': array([ 0.00610518,  0.00405285,  0.00740519,  0.00370264,  0.00755544,
         0.00290213,  0.00410295,  0.00290256,  0.00580423,  0.00420294,
         0.00390267,  0.00740511,  0.01075768,  0.00535431,  0.01145835,
         0.00550413]),
 'mean_test_score': array([ 0.98329621,  0.97438753,  0.95879733,  0.97438753,  0.98329621,
         0.97438753,  0.98106904,  0.97438753,  0.98329621,  0.97438753,
         0.98218263,  0.97438753,  0.98329621,  0.97438753,  0.98218263,
         0.97438753]),
 'mean_train_score': array([ 0.99888658,  1.        ,  0.96881809,  1.        ,  1.        ,
         1.        ,  0.99814385,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ,  1.        ,  1.   

In [19]:
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       0.99      1.00      0.99        76
          5       0.99      0.97      0.98       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899
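
A grid search can also wrap a pipeline, so that preprocessing is refit inside every cross-validation fold. The following is a minimal sketch under assumed settings: the small grid, `cv=5`, and the `StandardScaler` step are illustrative choices and are not part of the search run above.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC())

# Parameters of the SVC step are addressed with the 'svc__' prefix.
param_grid = {
    'svc__kernel': ['rbf'],
    'svc__gamma': [1e-2, 1e-3],
    'svc__C': [1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```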



## A classification example showing the use of GridSearchCV

[This example](http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py) shows how a classifier is optimized by cross-validation on a development set that comprises only half of the available labeled data.

The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step.

**We optimize the precision and recall measures in turn during grid search. This is useful for classification algorithms.**
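Besides running one search per metric, `GridSearchCV` also accepts several metrics in a single run. The sketch below assumes a reduced grid and uses `refit='precision_macro'` to choose which metric selects the final model; both are illustrative choices.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]}

# With several metrics, `refit` must name the one used to pick the best model.
multi = GridSearchCV(svm.SVC(), param_grid, cv=5,
                     scoring=['precision_macro', 'recall_macro'],
                     refit='precision_macro')
multi.fit(X_train, y_train)

# cv_results_ holds one mean_test_<metric> array per metric.
res = multi.cv_results_
for p, r, params in zip(res['mean_test_precision_macro'],
                        res['mean_test_recall_macro'],
                        res['params']):
    print("precision %0.3f, recall %0.3f for %r" % (p, r, params))
```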

In [20]:
# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier to this data, we need to flatten the images,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score, n_jobs=-1)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.959 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.025) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.982 (+/-0.025) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.975 (+/-0.014) for {'C': 1, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 10, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 100, 'kernel': 'linear'}
0.975 (+/-0.014) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed o