# Cross validation
Based on the [official scikit-learn docs](http://scikit-learn.org/stable/modules/cross_validation.html).


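We need three data sets:
- **training set** for fitting the model
- **validation set** for tuning hyperparameters
- **test set** for the final evaluation of the chosen model

Solution:
- k-fold: split the **training set** into k folds, use k-1 folds for **training** and the remaining fold for **validation**, then repeat so that each fold serves as the validation fold once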

In [1]:
import numpy as np
from sklearn import datasets, metrics, model_selection, pipeline, preprocessing, svm

iris = datasets.load_iris()

In [2]:
clf = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print('Accuracy: {:0.2f} (+/- {:0.2f})'.format(scores.mean(), scores.std() * 2))

[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
Accuracy: 0.98 (+/- 0.03)
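
The three data sets listed at the top of this section can be combined with k-fold cross validation roughly as follows. This is a minimal sketch, not the only way to do it: the candidate values of `C`, the 20% test size and the variable names (`candidates`, `mean_scores`, `best_C`) are assumptions made for illustration.

```python
from sklearn import datasets, model_selection, svm

iris = datasets.load_iris()

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target)

# Compare candidate values of C with 5-fold cross validation on the training set only.
candidates = [0.01, 0.1, 1, 10]  # illustrative values, not a recommendation
mean_scores = {}
for C in candidates:
    clf = svm.SVC(kernel='linear', C=C)
    scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5)
    mean_scores[C] = scores.mean()

# Pick the candidate with the best mean validation accuracy.
best_C = max(mean_scores, key=mean_scores.get)

# Refit on the whole training set and evaluate once on the held-out test set.
final_clf = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_accuracy = final_clf.score(X_test, y_test)
print(best_C, test_accuracy)
```

The validation folds are carved out of the training set, so the test set stays unseen until the last line.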


## Pipeline
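The same preprocessing must be applied when training the model and when using it for prediction. A pipeline bundles the preprocessing steps with the estimator, so during cross validation they are fitted on the training folds only and then reapplied to the held-out fold.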

In [3]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

cv = model_selection.ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
clf = pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
print(model_selection.cross_val_score(clf, X_train, y_train, cv=cv))
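# cross_val_predict needs every sample predicted exactly once,
# so it uses a partitioning 3-fold split rather than the ShuffleSplit above
predicted = model_selection.cross_val_predict(clf, X_train, y_train, cv=3)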

print(metrics.accuracy_score(y_train, predicted))

# clf = clf.fit(X_train, y_train)
# clf.predict(X_test) == y_test

[ 0.96296296  1.          0.85185185]
0.955555555556
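
To make the "same preprocessing" point concrete, here is a small sketch that fits the pipeline on the training split and then scores it on the test split. The step name `'standardscaler'` is the name `make_pipeline` generates from the class name; `scaler` and `test_accuracy` are illustrative variable names.

```python
import numpy as np
from sklearn import datasets, model_selection, pipeline, preprocessing, svm

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = pipeline.make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
clf.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
scaler = clf.named_steps['standardscaler']

# The scaler statistics were computed from the training data only.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# score() scales X_test with those training statistics before calling the SVC.
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```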


# Cross validation iterators

## k-fold

In [4]:
X = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
kf = model_selection.KFold(n_splits=3)
for train, test in kf.split(X):
    print('{} {}'.format(X[train], X[test]))

['c' 'd' 'e' 'f'] ['a' 'b']
['a' 'b' 'e' 'f'] ['c' 'd']
['a' 'b' 'c' 'd'] ['e' 'f']
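
By default `KFold` keeps the samples in their original order, as the output above shows. When the data is sorted (for example by class), shuffling before splitting is usually preferable. A minimal sketch, assuming a fixed `random_state=0` only for reproducibility; `test_indices` is an illustrative name.

```python
import numpy as np
from sklearn import model_selection

X = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=0)

test_indices = []
for train, test in kf.split(X):
    print('{} {}'.format(X[train], X[test]))
    test_indices.extend(test.tolist())

# Even with shuffling, every sample lands in a test fold exactly once.
print(sorted(test_indices) == list(range(len(X))))
```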


## Leave one out

In [5]:
loo = model_selection.LeaveOneOut()
for train, test in loo.split(X):
    print('{} {}'.format(X[train], X[test]))

['b' 'c' 'd' 'e' 'f'] ['a']
['a' 'c' 'd' 'e' 'f'] ['b']
['a' 'b' 'd' 'e' 'f'] ['c']
['a' 'b' 'c' 'e' 'f'] ['d']
['a' 'b' 'c' 'd' 'f'] ['e']
['a' 'b' 'c' 'd' 'e'] ['f']
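
Leave one out is the special case of k-fold where k equals the number of samples. The sketch below checks this by comparing the two splitters on the same array; `loo_splits`, `kf_splits` and `same` are illustrative names.

```python
import numpy as np
from sklearn import model_selection

X = np.array(['a', 'b', 'c', 'd', 'e', 'f'])

loo_splits = list(model_selection.LeaveOneOut().split(X))
kf_splits = list(model_selection.KFold(n_splits=len(X)).split(X))

# Compare the train and test indices of each split pairwise.
same = all(
    np.array_equal(loo_train, kf_train) and np.array_equal(loo_test, kf_test)
    for (loo_train, loo_test), (kf_train, kf_test) in zip(loo_splits, kf_splits))
print(len(loo_splits), same)
```

Because one model is fitted per sample, leave one out gets expensive on large data sets.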


## Imbalanced data

In [6]:
X = np.ones(13)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
skf = model_selection.StratifiedKFold(n_splits=3)
print('stratified:')
for train, test in skf.split(X, y):
    print('{} {}'.format(y[train], y[test]))
print('\nk-fold (without stratification):')
kf = model_selection.KFold(n_splits=3)
for train, test in kf.split(X):
    print('{} {}'.format(y[train], y[test]))

stratified:
[0 0 1 1 1 1 1 1] [0 0 1 1 1]
[0 0 0 1 1 1 1 1 1] [0 1 1 1]
[0 0 0 1 1 1 1 1 1] [0 1 1 1]

k-fold (without stratification):
[1 1 1 1 1 1 1 1] [0 0 0 0 1]
[0 0 0 0 1 1 1 1 1] [1 1 1 1]
[0 0 0 0 1 1 1 1 1] [1 1 1 1]
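
The difference is easier to see as per-class counts in each test fold. This is a small sketch; the helper `class_counts_per_test_fold` is hypothetical and written only for this illustration. The exact composition of the stratified folds may differ between scikit-learn versions, so no expected output is shown.

```python
import numpy as np
from sklearn import model_selection

X = np.ones(13)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

def class_counts_per_test_fold(splitter):
    """Return [count of class 0, count of class 1] for every test fold."""
    return [np.bincount(y[test], minlength=2).tolist()
            for _, test in splitter.split(X, y)]

stratified_counts = class_counts_per_test_fold(
    model_selection.StratifiedKFold(n_splits=3))
plain_counts = class_counts_per_test_fold(model_selection.KFold(n_splits=3))

print('stratified:', stratified_counts)
print('k-fold:    ', plain_counts)
```

With stratification every test fold contains both classes, while plain k-fold produces test folds with no samples of class 0 at all.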


## Grouped data
Test some groups against the others and never mix them: samples from the same group must not appear in both the training set and the test set.

In [7]:
X = np.arange(8)
y = np.arange(8)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
gss = model_selection.GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
for train, test in gss.split(X, y, groups=groups):
    print('{} {}'.format(groups[train], groups[test]))

[1 1 2 2] [3 3 4 4]
[2 2 4 4] [1 1 3 3]
[2 2 3 3] [1 1 4 4]
[3 3 4 4] [1 1 2 2]
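
`GroupShuffleSplit` draws random group subsets, so a group may be tested several times or not at all. If each group should be held out exactly once, `GroupKFold` is an alternative. A minimal sketch on the same data; `held_out` is an illustrative name.

```python
import numpy as np
from sklearn import model_selection

X = np.arange(8)
y = np.arange(8)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gkf = model_selection.GroupKFold(n_splits=4)
held_out = []
for train, test in gkf.split(X, y, groups=groups):
    # No group may appear on both sides of a split.
    assert set(groups[train].tolist()).isdisjoint(groups[test].tolist())
    held_out.append(sorted(set(groups[test].tolist())))
    print('{} {}'.format(groups[train], groups[test]))
```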


## Time series

In [8]:
X = np.arange(20)
y = np.arange(20)
tscv = model_selection.TimeSeriesSplit(n_splits=3)

for train, test in tscv.split(X):
    print('{} {}'.format(train, test))

[0 1 2 3 4] [5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9] [10 11 12 13 14]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] [15 16 17 18 19]
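
In the output above the training set grows with every split and always precedes the test set, so the model never sees the future. If a fixed-size sliding window is preferred, `TimeSeriesSplit` accepts `max_train_size`. A short sketch, where the window length of 5 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn import model_selection

X = np.arange(20)

# Keep at most the 5 most recent samples in each training set.
tscv = model_selection.TimeSeriesSplit(n_splits=3, max_train_size=5)
splits = list(tscv.split(X))

for train, test in splits:
    print('{} {}'.format(train, test))
```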
