# Cross Validation

Prof. Dr. Georgios K. Ouzounis<br/>
[georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)

<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/02/17085331/scikit-learn-logo.png" alt="sci-kit learn" width="300" style="float: left; margin-right: 10px;" />

The contents of this session are taken directly from the source site
http://scikit-learn.org/stable/index.html 

## Contents

- cross validation
- computing cross validated metrics
- cross_validate() function and multiple metric evaluation
- cross validation iterators

## Cross validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake!

A model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. 

This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set **X_test, y_test**. 

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Let’s load the iris data set to fit a linear support vector machine on it:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
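
To make the earlier point about a model that merely repeats the labels it has seen more concrete, here is a minimal sketch. It assumes a 1-nearest-neighbour classifier (`KNeighborsClassifier`, not used elsewhere in this session) as a stand-in for such a "memorizing" model, and compares its score on the training data with its score on the held-out data:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Assumption for illustration: a 1-nearest-neighbour model effectively
# memorizes the training samples and their labels.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = memorizer.score(X_train, y_train)  # scored on data it has already seen
test_acc = memorizer.score(X_test, y_test)     # scored on held-out data
print("train accuracy: %0.2f, test accuracy: %0.2f" % (train_acc, test_acc))
```

Only the second number says anything about how the model behaves on yet-unseen data.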

When evaluating different settings (**“hyperparameters”**) for estimators, such as the **C setting** that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. 

This way, knowledge about the test set can **“leak”** into the model and evaluation metrics no longer report on generalization performance. 

To solve this problem, yet another part of the dataset can be held out as a so-called **“validation set”**: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
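
A three-way split of this kind could be sketched as follows. The split proportions and the candidate values for C are arbitrary choices made for illustration, not recommendations:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# First hold out a final test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Choose C using the validation set only (candidate values are hypothetical).
candidates = [0.01, 0.1, 1, 10]
best_C, best_val_score = None, -1.0
for C in candidates:
    val_score = svm.SVC(kernel='linear', C=C).fit(X_train, y_train).score(X_val, y_val)
    if val_score > best_val_score:
        best_C, best_val_score = C, val_score

# The test set is used once, after the choice of C is fixed.
final_clf = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_clf.score(X_test, y_test)
print(best_C, best_val_score, test_score)
```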

Partitioning the available data into three sets drastically reduces the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called **cross-validation (CV for short)**. A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called **k-fold CV**, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). 


The following procedure is followed for each of the k “folds”:

- a model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. 
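
The loop described above can be written out by hand. The sketch below assumes the same 60/40 train/test split used earlier and shuffles before splitting because the iris samples are stored sorted by class; the helper functions introduced in the next section do this work for you:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Split the training set (not the held-out test set) into k = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fit_idx, val_idx in kf.split(X_train):
    # Train on k-1 folds ...
    model = svm.SVC(kernel='linear', C=1).fit(X_train[fit_idx], y_train[fit_idx])
    # ... and validate on the remaining fold.
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

# The reported measure is the average over the k folds.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```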

This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

## Computing cross-validated metrics

The simplest way to use cross-validation is to call the **cross_val_score()** helper function on the estimator and the dataset. 

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
clf = svm.SVC(kernel='linear', C=1)

In [None]:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores

The mean score and a rough 95% interval for the score estimate (the mean plus or minus two standard deviations of the fold scores) are hence given by:

In [None]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:


In [None]:
from sklearn import metrics

In [None]:
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
scores

It is also possible to use other cross validation strategies by passing a **cross validation iterator** instead, for instance:

In [None]:
from sklearn.model_selection import ShuffleSplit

In [None]:
n_samples = iris.data.shape[0]


In [None]:
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

In [None]:
cross_val_score(clf, iris.data, iris.target, cv=cv)
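
To see what such an iterator actually produces, the following sketch runs the same `ShuffleSplit` configuration on a small toy array (the 10-sample array is made up for illustration) and prints the index arrays. Unlike k-fold, each split is drawn independently, so the test sets of different splits may overlap:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples, 2 features
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

splits = list(cv.split(X))
for train_idx, test_idx in splits:
    # Within one split, the train and test indices never share a sample.
    assert set(train_idx).isdisjoint(test_idx)
    print(train_idx, test_idx)
```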

Just as it is important to test a predictor on data held out from training, preprocessing steps (such as standardization and feature selection) and similar [data transformations](http://scikit-learn.org/stable/data_transforms.html#data-transforms) should be learnt from a training set and then applied to held-out data for prediction:

In [None]:
from sklearn import preprocessing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)

In [None]:
X_train_transformed = scaler.transform(X_train)

In [None]:
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)

In [None]:
X_test_transformed = scaler.transform(X_test)

In [None]:
clf.score(X_test_transformed, y_test)
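
The same discipline is needed inside cross-validation. One way to get it, sketched here under the assumption that a `Pipeline` built with `make_pipeline` suits the workflow, is to bundle the scaler and the classifier into a single estimator so that the scaler is re-fitted on the training folds of every split:

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()

# The scaler is fitted on the training folds of each split only,
# so no statistics from the held-out fold leak into training.
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
pipe_scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(pipe_scores)
```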

## cross_validate() function and multiple metric evaluation


The **cross_validate()** function differs from **cross_val_score()** in two ways:

- it allows specifying multiple metrics for evaluation,
- it returns a dict containing fit times and score times (and, optionally, training scores) in addition to the test score.

For single metric evaluation, where the scoring parameter is a string, callable or None, the keys will be `['test_score', 'fit_time', 'score_time']`.

For multiple metric evaluation, the return value is a dict with the following keys: `['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']`.

Setting **return_train_score** to **True** adds train score keys for all the scorers. Early releases enabled it by default, but since scikit-learn 0.21 the default is False.

Because the default depends on the installed version, it is safest to set this parameter explicitly, to False if train scores are not needed and to True if they are.
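
As a quick check of the single metric case, the sketch below passes **return_train_score** explicitly in both directions, so the result does not depend on the version-specific default, and lists the keys of the returned dict:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Single metric (scoring=None falls back to the estimator's score method).
without_train = cross_validate(clf, iris.data, iris.target, cv=5,
                               return_train_score=False)
with_train = cross_validate(clf, iris.data, iris.target, cv=5,
                            return_train_score=True)

print(sorted(without_train.keys()))
print(sorted(with_train.keys()))
```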


The multiple metrics can be specified either as a list, tuple or set of predefined scorer names:


In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score


In [None]:
scoring = ['precision_macro', 'recall_macro']


In [None]:
clf = svm.SVC(kernel='linear', C=1, random_state=0)


In [None]:
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=False)

In [None]:
sorted(scores.keys())

In [None]:
scores['test_recall_macro']

Or as a dict mapping scorer name to a predefined or custom scoring function:

In [None]:
from sklearn.metrics import make_scorer

In [None]:
scoring = {'prec_macro': 'precision_macro', 'rec_macro': make_scorer(recall_score, average='macro')}

In [None]:
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=True)

In [None]:
sorted(scores.keys())

In [None]:
scores['train_rec_macro']

## Cross validation iterators

Assuming that data is **Independent and Identically Distributed (i.i.d.)** means assuming that all samples stem from the same generative process and that this process has no memory of previously generated samples.

**Note:** While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. More sophisticated approaches exist for such cases but will not be discussed here.


### k-fold cross validation

[k-fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) divides all the samples into **k** groups, called folds, of equal size (if possible).

The prediction function is learned using k-1 folds, and the fold left out is used for testing.
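
When the number of samples is not divisible by k, the folds cannot all have the same size. A small sketch with a made-up array of 10 samples and k = 3 shows how `KFold` distributes the remainder, and that every sample still lands in exactly one test fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 samples cannot be split into 3 equal folds
kf = KFold(n_splits=3)

test_folds = [test for _, test in kf.split(X)]
test_sizes = [len(test) for test in test_folds]
print(test_sizes)  # → [4, 3, 3]

# Every sample index appears in exactly one test fold.
all_test_idx = np.sort(np.concatenate(test_folds))
print(all_test_idx.tolist())
```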

<a href="https://my.oschina.net/Bettyty/blog/751627"><img src="https://static.oschina.net/uploads/img/201609/26155106_OfXx.png" alt="k-fold cv" width="500" style="float: left; margin-right: 10px;"></a>

Example of 2-fold cross-validation on a dataset with 4 samples:

In [None]:
import numpy as np
from sklearn.model_selection import KFold

In [None]:
X = ["a", "b", "c", "d"]


In [None]:
kf = KFold(n_splits=2)

In [None]:
for train, test in kf.split(X):
    print("%s %s" % (train, test))

Each split yields two arrays of indices: the first one selects the *training set*, and the second one the *test set*. Thus, one can create the training/test sets using numpy indexing (below, train and test hold the indices from the last iteration of the loop above):

In [None]:
X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])

In [None]:
y = np.array([0, 1, 0, 1])

In [None]:
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

[**RepeatedKFold()**](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedKFold.html#sklearn.model_selection.RepeatedKFold) repeats K-Fold n times. It can be used when one needs to run KFold n times, producing different splits in each repetition. Example of 2-fold K-Fold repeated 2 times:

In [None]:
import numpy as np
from sklearn.model_selection import RepeatedKFold

In [None]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

In [None]:
random_state = 12883823

In [None]:
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)

In [None]:
for train, test in rkf.split(X):
    print("%s %s" % (train, test))
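
As a sanity check on the example above, the following sketch counts the splits and verifies that, within each repetition, the test folds together cover every sample exactly once. It relies on the splits being generated one repetition after another, which is an assumption about the iteration order rather than something stated in this session:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
n_splits, n_repeats = 2, 2
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=12883823)

splits = list(rkf.split(X))
total = rkf.get_n_splits(X)  # n_splits * n_repeats
print(total)

# Group consecutive splits by repetition and collect their test indices.
covered = []
for rep in range(n_repeats):
    rep_splits = splits[rep * n_splits:(rep + 1) * n_splits]
    test_idx = np.concatenate([test for _, test in rep_splits])
    covered.append(sorted(test_idx.tolist()))
print(covered)
```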

### Resources

| blog | article |
|:------|:---------|
| <img src="http://www.euro-langues.org/wp-content/uploads/2019/05/1*F0LADxTtsKOgmPa-_7iUEQ.jpeg" alt="towardsdatascience" width="100" style="float: left; margin-right: 10px;" /> | [Cross-Validation](https://towardsdatascience.com/cross-validation-70289113a072) by Georgios Drakos, Aug. 16, 2018 in Towards Data Science |
| <img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/02/17085331/scikit-learn-logo.png" alt="sci-kit learn" width="200" style="float: left; margin-right: 10px;" /> | [3.1. Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo) by scikit-learn.org, © 2007 - 2017, scikit-learn developers (BSD License). |