### Intro

* common practice: reserve part of data set as a **test set**.
* training/test splits easily done by [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)
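* basic approach = "k-fold" CV: training set split into k smaller sets (folds)
   * training done on k-1 folds
   * validation done on the remaining fold
   * repeated k times so each fold is used once for validation; reported score = average over the k rounds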
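
The k-fold procedure described above can be sketched by hand. This is only an illustration, assuming the iris data and a linear SVC (the same setup used in the cells that follow); the choice of k=5, the shuffling, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# iris is sorted by class, so shuffle before splitting into folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # train on k-1 folds, validate on the remaining fold
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", fold_scores.mean())
```

In practice `cross_val_score` wraps this loop; the explicit version is mainly useful for seeing what happens in each round.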


In [48]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

In [49]:
# hold 40% of data for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((90, 4), (90,), (60, 4), (60,))

In [50]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.96666666666666667

### Metrics

[API](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) |
[demo1](plot_cv_diabetes.ipynb) |
[demo2](plot_cv_digits.ipynb) |
[underfit vs overfit](plot_underfitting_overfitting.ipynb) |
[nested vs unnested](plot_nested_cross_validation_iris.ipynb)

* simplest example of CV: `cross_val_score`
* here: estimates accuracy of a linear SVM on the iris dataset

In [51]:
# CV base example - print mean score & 2*std (rough 95% interval half-width)
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print("Accuracy:",(scores.mean(), scores.std()*2))

Accuracy: (0.98000000000000009, 0.032659863237109031)


In [52]:
# CV with a different scoring method
from sklearn import metrics
scores = cross_val_score(
    clf, iris.data, iris.target, 
    cv=5, 
    scoring='f1_macro')
print("F1 (macro):",(scores.mean(), scores.std()*2))

F1 (macro): (0.9799498746867169, 0.032741717530936382)


In [53]:
# CV with custom iterator
from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]

cv = ShuffleSplit(
    n_splits=3, 
    test_size=0.3, 
    random_state=0)

scores = cross_val_score(
    clf, iris.data, iris.target, cv=cv)
print("Accuracy:",(scores.mean(), scores.std()*2))

Accuracy: (0.98518518518518527, 0.020951312035156995)


### Predictions via CV

* `cross_val_predict` returns, for each element in the input, the prediction obtained when that element was in the test fold

[cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict) | [demo](plot_cv_predict.ipynb)


In [54]:
#example
from sklearn.model_selection import cross_val_predict

predicted = cross_val_predict(
    clf, iris.data, iris.target, cv=10)

metrics.accuracy_score(iris.target, predicted) 

0.97333333333333338

[demo: ROC with CV](plot_roc_crossval.ipynb) | 
[recursive feature elimination (RFE) with CV](plot_rfe_with_cross_validation.ipynb) | 
[param estimation using grid search with CV](grid_search_digits.ipynb) 

[sample pipeline, text feature extract/eval](grid_search_text_feature_extraction.ipynb) | 
[plot CV'd predictions](plot_cv_predict.ipynb) |
[nested vs non-nested CV](plot_nested_cross_validation_iris.ipynb)

### Iterators (Utilities to generate indices for dataset splits)

### K-Fold

* Divide all samples into k groups of ideally equal sizes
* training done with k-1 groups; test with remaining set

In [55]:
# K=2 fold CV, 4-sample dataset
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


### Leave One Out

* each training set = all samples but one; the left-out sample is the test set
* for n samples: n different training sets, n different test sets
* computationally more expensive than K-fold (n model fits instead of k)
* rule of thumb: 5- or 10-fold CV is generally preferred to LOO
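
To make the cost difference concrete, the number of model fits each strategy requires can be compared directly. A minimal sketch, assuming a toy array of 100 samples (the data and variable names are illustrative only):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(100).reshape(-1, 1)  # 100 toy samples

# LOO produces one split per sample
n_loo = LeaveOneOut().get_n_splits(X)

# K-fold produces a fixed number of splits regardless of sample count
n_kfold = KFold(n_splits=10).get_n_splits(X)

print("LOO model fits:    ", n_loo)
print("10-fold model fits:", n_kfold)
```

Each LOO fit also trains on n-1 samples, so the total cost grows quickly with the size of the data set.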

In [56]:
# leave-one-out (LOO)
from sklearn.model_selection import LeaveOneOut

X = [1,2,3,4,5,6,7,8]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3 4 5 6 7] [0]
[0 2 3 4 5 6 7] [1]
[0 1 3 4 5 6 7] [2]
[0 1 2 4 5 6 7] [3]
[0 1 2 3 5 6 7] [4]
[0 1 2 3 4 6 7] [5]
[0 1 2 3 4 5 7] [6]
[0 1 2 3 4 5 6] [7]


### Leave P out

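* creates all possible training/test sets by removing p samples from the complete set
* for n samples: C(n, p) train/test pairs; unlike K-fold, test sets overlap for p > 1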
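
The number of splits produced by Leave-P-Out grows combinatorially. The sketch below, assuming the same 6-sample toy array used in the next cell, checks the split count against the binomial coefficient and counts how often one sample lands in a test set (`math.comb` needs Python 3.8+).

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(6)
lpo = LeavePOut(p=2)

# number of train/test pairs = "6 choose 2"
n_splits = lpo.get_n_splits(X)
print("splits:", n_splits, "expected:", comb(6, 2))

# test sets overlap: count the splits whose test set contains sample 0
n_with_0 = sum(0 in test for _, test in lpo.split(X))
print("test sets containing sample 0:", n_with_0)
```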

In [57]:
# leave P out (LPO)
from sklearn.model_selection import LeavePOut

X = np.ones(6)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3 4 5] [0 1]
[1 3 4 5] [0 2]
[1 2 4 5] [0 3]
[1 2 3 5] [0 4]
[1 2 3 4] [0 5]
[0 3 4 5] [1 2]
[0 2 4 5] [1 3]
[0 2 3 5] [1 4]
[0 2 3 4] [1 5]
[0 1 4 5] [2 3]
[0 1 3 5] [2 4]
[0 1 3 4] [2 5]
[0 1 2 5] [3 4]
[0 1 2 4] [3 5]
[0 1 2 3] [4 5]


### Shuffle & Split (Random Permutations)

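* generates a user-defined number of independent train/test splits; samples are shuffled, then split, on each iteration
* good alternative to K-fold: finer control over the number of iterations and the train/test proportions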
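
The extra control can be illustrated by setting the number of iterations and both split proportions independently. A minimal sketch, assuming 20 toy samples; the particular sizes are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)

# 10 iterations; each uses 50% of samples for training and 25% for testing,
# so a quarter of the data is left unused in every split
ss = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.25, random_state=0)

splits = list(ss.split(X))
for train_idx, test_idx in splits[:3]:
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Because every split is drawn independently, a sample can show up in the test sets of several splits, which K-fold never allows.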

In [58]:
# shuffle and split
from sklearn.model_selection import ShuffleSplit
X = np.arange(8)
ss = ShuffleSplit(
    n_splits=4, 
    test_size=0.25, 
    random_state=0)

for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[1 7 3 0 5 4] [6 2]
[3 7 0 4 2 5] [1 6]
[3 4 7 0 6 1] [5 2]
[6 7 3 4 1 0] [2 5]


### CV iterators with label-based stratification

* use case: classification with a large imbalance in the class distribution

[stratified k-fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) - each test set contains approximately the same percentage of samples from each class as the complete set.

stratified shuffle split (`StratifiedShuffleSplit`) - shuffled splits that preserve the same percentage for each target class as in the complete set
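
The effect of stratification is easiest to see by counting class labels in each test fold. The sketch below assumes a toy label vector with a 1:4 class imbalance, sorted by class; the helper name `class_counts_per_test_fold` is made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((50, 1))
y = np.array([0] * 10 + [1] * 40)  # 20% class 0, 80% class 1, sorted by class

def class_counts_per_test_fold(splitter):
    """Return [count of class 0, count of class 1] for every test fold."""
    return [np.bincount(y[test], minlength=2).tolist()
            for _, test in splitter.split(X, y)]

plain = class_counts_per_test_fold(KFold(n_splits=5))
strat = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

With plain K-fold on sorted labels, the first test fold holds every class-0 sample and the others hold none; the stratified folds each keep the 1:4 ratio.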

In [59]:
# stratified K-fold
from sklearn.model_selection import StratifiedKFold

X = np.ones(12)
y = [0,0,0,0,1,1,1,1,1,2,2,2]

skf = StratifiedKFold(
    n_splits=3)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[ 2  3  6  7  8 10 11] [0 1 4 5 9]
[ 0  1  3  4  5  8  9 11] [ 2  6  7 10]
[ 0  1  2  4  5  6  7  9 10] [ 3  8 11]


In [60]:
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

sss = StratifiedShuffleSplit(
    n_splits=3, 
    test_size=0.5, 
    random_state=0)
sss.get_n_splits(X, y)
      
for train_index, test_index in sss.split(X, y):
    print(train_index, test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

[1 2] [3 0]
[0 2] [1 3]
[0 2] [3 1]


### CV iterators for grouped (dependent) data

* Domain-specific problem
* Use case: determine if a model trained on a specific group generalizes well to the unseen groups.

### Grouped K-fold

* Ensures that the same group is never present in both the training & test sets of a split
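
The group-separation guarantee can be checked directly by intersecting the group labels seen on each side of a split. A small sketch, assuming toy data with three groups (values and names are illustrative only):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

gkf = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # groups that appear on both sides of the split (should be none)
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("train groups:", sorted(set(groups[train_idx])),
          "test groups:", sorted(set(groups[test_idx])))
```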

In [61]:
# example: 3 subjects (1-3); each subject in different test fold
from sklearn.model_selection import GroupKFold

X =      [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y =      ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1,    1,   1,   2,   2,   2,   3,   3,   3,   3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]


### [Leave One Group Out](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut)

In [62]:
# leave one group out
from sklearn.model_selection import LeaveOneGroupOut

X =      [1, 5, 10, 50, 60, 70, 80]
y =      [0, 1,  1,  2,  2,  2,  2]
groups = [1, 1,  2,  2,  3,  3,  3]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]


### [Leave P Groups out](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut)

* Removes the samples belonging to P groups for each training/test set; all combinations of P groups are enumerated

In [63]:
# leave P groups out
from sklearn.model_selection import LeavePGroupsOut

X =      np.arange(6)
y =      [1, 1, 1, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]


### [Group Shuffle Split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit)

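* generates a sequence of randomized partitions in which a subset of groups is held out for each split
* use case: when the behavior of LeavePGroupsOut is needed, but the number of groups is so large that enumerating every combination of P held-out groups would be prohibitively expensive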
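
To see why exhaustive enumeration becomes impractical, the split counts of the two iterators can be compared. A minimal sketch, assuming 20 toy groups of 3 samples each and half of the groups held out; all sizes are arbitrary.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeavePGroupsOut

groups = np.repeat(np.arange(20), 3)  # 20 groups, 3 samples each
X = np.zeros((len(groups), 1))
y = np.zeros(len(groups))

# exhaustive: one split per combination of 10 held-out groups
lpgo = LeavePGroupsOut(n_groups=10)
n_exhaustive = lpgo.get_n_splits(X, y, groups)

# randomized: a fixed, user-chosen number of splits
gss = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
n_random = gss.get_n_splits()

# groups are still kept entirely on one side of each split
disjoint = all(not (set(groups[tr]) & set(groups[te]))
               for tr, te in gss.split(X, y, groups=groups))

print("LeavePGroupsOut splits:  ", n_exhaustive)
print("GroupShuffleSplit splits:", n_random)
print("groups kept separate:    ", disjoint)
```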

In [64]:
# group shuffle split
from sklearn.model_selection import GroupShuffleSplit

X =      [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y =      ["a", "b", "b", "b", "c", "c", "c", "a"]
groups = [1,    1,   2,   2,   3,   3,   4,   4]

gss = GroupShuffleSplit(
    n_splits=4, 
    test_size=0.5, 
    random_state=0)

for train, test in gss.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]


### [Predefined splits](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit)

In [65]:
from sklearn.model_selection import PredefinedSplit

X =         np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y =         np.array([ 0,      0,      1,      1])
test_fold =          [ 0,      1,     -1,      1]

ps = PredefinedSplit(test_fold)
ps.get_n_splits()

print(ps)       

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

PredefinedSplit(test_fold=array([ 0,  1, -1,  1]))
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2] TEST: [1 3]


### [Time Series Split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit)

* in the k-th split: returns the first k folds as training set and the (k+1)-th fold as test set
* successive training sets are supersets of previous ones
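
Both properties (training data always precedes test data, and training sets only grow) can be verified on a toy series. A sketch assuming 12 ordered samples; the bookkeeping variables `ordered` and `nested` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 samples in time order
tscv = TimeSeriesSplit(n_splits=3)

prev_train = set()
ordered, nested = [], []
for train_idx, test_idx in tscv.split(X):
    # every training index comes before every test index
    ordered.append(train_idx.max() < test_idx.min())
    # the previous training set is contained in the current one
    nested.append(prev_train <= set(train_idx))
    prev_train = set(train_idx)
    print(train_idx, test_idx)

print("train before test:", all(ordered))
print("training sets nested:", all(nested))
```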

In [66]:
# example: 3 splits, 6 items
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)  

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

TimeSeriesSplit(n_splits=3)
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


### Shuffling

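* use case: when data order is not arbitrary (e.g. samples with the same class label are contiguous), shuffle first to get a meaningful CV result
* some iterators (e.g. `KFold` with `shuffle=True`) have a built-in shuffling option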
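
The problem with ordered data can be shown with labels that are sorted by class. A minimal sketch, assuming 10 toy samples and the built-in `shuffle` option of `KFold`; the data and the `random_state` value are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)  # labels sorted by class

# without shuffling, the first test fold is one contiguous block of samples
_, first_test = next(KFold(n_splits=2).split(X))
print("unshuffled test labels:", y[first_test])

# with shuffling, indices are permuted before being assigned to folds
kf = KFold(n_splits=2, shuffle=True, random_state=0)
shuffled_tests = [test for _, test in kf.split(X)]
for test in shuffled_tests:
    print("shuffled test labels:  ", y[test])
```

In the unshuffled case the model is trained on one class and tested on the other, so the score says nothing useful; shuffling will usually mix the classes across folds, and stratified iterators guarantee it.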