# Cross-Validation (CV) -- splitting training and test sets
 
**Author: Meng Lu**

This tutorial is adapted from [scikit-learn CV](https://scikit-learn.org/stable/modules/cross_validation.html).

We have already seen the over-fitting problem in "linear models for regression", where we talked about regularization. Modern machine learning models are very flexible and can model complex relationships, which makes them prone to over-fitting: the model follows the training data too closely (it overemphasizes patterns in that data) to be generalised to new data. But how do we know whether a model is over-fitted before it is used to predict the next set of samples?

We need a systematic approach to evaluate models and to build trustworthy ones. The most commonly used approach is Cross-Validation (CV).
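To make the over-fitting question concrete, here is a minimal sketch that compares the score on the training data with a cross-validated score. The synthetic sine data, the degree-15 polynomial and the 5-fold setting are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy sine curve with 30 samples.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

# A deliberately over-flexible model: degree-15 polynomial regression.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())

# R^2 on the same data the model was fitted to.
train_r2 = model.fit(X, y).score(X, y)

# Mean R^2 over 5 folds, each fold scored on samples not used for fitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=cv).mean()

print("R^2 on training data:", train_r2)
print("mean R^2 under 5-fold CV:", cv_r2)
```

A large gap between the two numbers suggests that the model has fitted patterns that do not carry over to unseen samples.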

In [1]:
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit


Leave-one-out CV:
Each learning set is created by taking all the samples except one, the test set being the sample left out. 

In [2]:
X = ["A", "B", "C", "D"]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
    print([X[i] for i in train],[X[i] for i in test])

[1 2 3] [0]
['B', 'C', 'D'] ['A']
[0 2 3] [1]
['A', 'C', 'D'] ['B']
[0 1 3] [2]
['A', 'B', 'D'] ['C']
[0 1 2] [3]
['A', 'B', 'C'] ['D']
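As a small check of the description above, the sketch below counts the splits that LeaveOneOut produces and records which index is held out in each one. The variable names `n_splits` and `test_indices` are only illustrative.

```python
from sklearn.model_selection import LeaveOneOut

X = ["A", "B", "C", "D"]
loo = LeaveOneOut()

# One split per sample: the number of splits equals len(X).
n_splits = loo.get_n_splits(X)

# Each test set holds exactly one index; collect them in order.
test_indices = [int(test[0]) for _, test in loo.split(X)]

print(n_splits)      # 4
print(test_indices)  # [0, 1, 2, 3]
```

Because a model has to be fitted once per sample, LOO can become expensive for large data sets.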


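K-fold CV:
The samples are divided into k groups (folds) of equal size where possible. Each fold is used once as the test set, while the remaining k - 1 folds form the training set.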

In [27]:
X = ["A", "B", "C", "D"]
kf = KFold(n_splits=2) 
for train, test in kf.split(X):
    print([X[i] for i in train],[X[i] for i in test])

['C', 'D'] ['A', 'B']
['A', 'B'] ['C', 'D']
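The following sketch looks at what happens when the number of samples is not divisible by the number of folds. The choice of 10 samples and 3 folds is an assumption made for illustration. It also checks that every sample ends up in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=3)

fold_sizes = []  # size of the test fold in each split
seen = []        # all test indices, collected over the splits
for train, test in kf.split(X):
    fold_sizes.append(len(test))
    seen.extend(test.tolist())

print(fold_sizes)                       # [4, 3, 3]
print(sorted(seen) == list(range(10)))  # True: each sample is tested once
```

By default KFold does not shuffle, so the folds are consecutive blocks of the data; passing `shuffle=True` (optionally with `random_state`) randomises the assignment.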


Exercise: Change the example above into a 4-fold CV.
    

In [None]:
# If we set n_splits to len(X), we are actually doing LOO.

Leave P Out (LPO)
LeavePOut is very similar to LeaveOneOut: it creates all the possible training/test sets by removing p samples from the complete set.


In [3]:
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print([X[i] for i in train], [X[i] for i in test])

['C', 'D'] ['A', 'B']
['B', 'D'] ['A', 'C']
['B', 'C'] ['A', 'D']
['A', 'D'] ['B', 'C']
['A', 'C'] ['B', 'D']
['A', 'B'] ['C', 'D']
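Since LeavePOut enumerates every possible test set of size p, the number of splits for n samples is the binomial coefficient "n choose p". The sketch below checks this for the example above and counts how often each sample is held out; `n_lpo` and `counts` are illustrative names.

```python
from math import comb
from sklearn.model_selection import LeavePOut

X = ["A", "B", "C", "D"]
lpo = LeavePOut(p=2)

# Number of splits reported by scikit-learn vs. the binomial coefficient.
n_lpo = lpo.get_n_splits(X)
print(n_lpo, comb(len(X), 2))  # 6 6

# Count how many test sets each sample appears in.
counts = {x: 0 for x in X}
for _, test in lpo.split(X):
    for i in test:
        counts[X[i]] += 1
print(counts)  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}
```

The number of splits grows quickly with n, so LPO is usually practical only for small data sets.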


Exercise: What is the difference between LeavePOut and K-fold CV?

The ShuffleSplit iterator will generate a user-defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.

In [40]:
X = np.arange(20)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train, test in ss.split(X):
    print([X[i] for i in train], [X[i] for i in test])

[17, 6, 13, 4, 2, 5, 14, 9, 7, 16, 11, 3, 0, 15, 12] [18, 1, 19, 8, 10]
[12, 19, 16, 10, 0, 3, 4, 15, 8, 13, 9, 5, 14, 7, 6] [11, 1, 18, 17, 2]
[2, 8, 6, 3, 17, 4, 10, 16, 18, 9, 1, 0, 7, 14, 19] [15, 13, 12, 5, 11]
[17, 7, 12, 14, 16, 11, 10, 9, 15, 1, 19, 8, 6, 5, 4] [18, 0, 13, 2, 3]
[18, 8, 17, 15, 16, 6, 13, 11, 4, 10, 9, 12, 3, 14, 0] [7, 1, 2, 19, 5]
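Because the splits are drawn independently, the same sample can land in the test set of several splits (sample 1 does so three times in the output above). The sketch below checks a few properties of the same configuration: the train/test sizes, that train and test never share a sample within a split, and that a fixed integer `random_state` reproduces the same splits. The names `sizes`, `disjoint` and `same` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = list(ss.split(X))

# test_size=0.25 of 20 samples gives 5 test and 15 training samples.
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)

# Within one split, train and test never share a sample.
disjoint = all(set(train).isdisjoint(test) for train, test in splits)
print(disjoint)

# With an integer random_state, calling split() again gives the same test sets.
same = all(np.array_equal(t1, t2)
           for (_, t1), (_, t2) in zip(splits, ss.split(X)))
print(same)
```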


Exercise: Change the example above to use 20 splits and a test size of 0.3.
    