# $k$-fold cross-validation

This is a short note on **$k$-fold cross-validation** in data science.

The best way to find out how it's implemented in `sklearn` is to look up <a href="https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation">the official user guide</a>.

In general, cross-validation is not a training algorithm. It is a way of organizing the data so that whatever learning algorithm we apply to it can be evaluated.

When training a model on labeled data, we split the data into two parts, the training set and the test set, to measure the performance on prediction.

This is an important practice. Without a held-out test set, one can always overfit the training data, and the model loses its ability to predict new samples without us noticing.

However, in practice, we don't always have an abundance of labeled data, so we want to make good use of the limited data we are given.

$k$-fold cross-validation is a classical method to measure how good our model and the choice of hyperparameters are given a data set, especially when the size of the data is not as large as desired.

Idea: Suppose we want to train a model and check whether our choice of a hyperparameter is good.
- Shuffle the samples and split them into $k$ groups.
- For $i = 1, \dots, k$, leave the $i$-th group out as the test group, and train the model on the union of the remaining $k-1$ groups.
- For each $i$, compute a performance measure, such as the accuracy of the model, on the group that was left out. Average the $k$ values to get the final measure of the model's performance.
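
The three steps above can be sketched by hand with `KFold` from `sklearn.model_selection`. This is only a minimal illustration: the iris data, the linear `SVC`, $k = 5$ and `random_state=0` are choices made here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

k = 5  # number of groups (assumed for this example)
# shuffle=True shuffles the samples once before splitting them into k groups
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on the union of the k-1 remaining groups
    model = SVC(kernel="linear", C=1)
    model.fit(X[train_idx], y[train_idx])
    # Measure the accuracy on the group that was left out
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the k accuracies to get the final estimate
mean_score = np.mean(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 2))
print("Mean accuracy: %0.2f" % mean_score)
```

A fresh model is created inside the loop so that no fold can reuse anything learned from another fold.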

In summary, cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance (<a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Wikipedia</a>).

Then what should the number $k$ be?

There is no magic number $k(n)$ that works perfectly for every sample size $n$. However, if the sample size is small, as in a traditional medical experiment, one could use $k = n$ so that every model is trained on as much data as possible.
- Leave one sample out as the test set.
- Train the model on the remaining $n-1$ samples.
- Repeat until every sample has been left out exactly once.

This method is also called leave-one-out cross-validation (LOOCV).
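
As a rough sketch, LOOCV can be run in `sklearn` by passing a `LeaveOneOut` splitter as the `cv` argument of `cross_val_score`. The iris data and the linear `SVC` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="linear", C=1)

# One fold per sample: each test set contains exactly one sample,
# so each per-fold accuracy is either 0 or 1
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("Number of folds:", len(loo_scores))
print("LOOCV accuracy: %0.2f" % loo_scores.mean())
```

Since the model is trained $n$ times, this gets expensive quickly as $n$ grows.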

Here is some sample code using the `sklearn` implementation.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris))
iris.keys()

<class 'sklearn.utils.Bunch'>


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

According to <a href="https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation">the official user guide</a>, a test set should still be held out for the final evaluation, and the $k$-fold cross-validation should be run on the training set only. The sample code on the same page, however, runs the $k$-fold cross-validation on the whole data set, and the cell below does the same.

In [2]:
# k-fold cross-validation
from sklearn import svm
from sklearn import model_selection

x = iris.data
y = iris.target
model = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(model, x, y, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
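
For comparison, here is a minimal sketch of the workflow described in the user guide, where a test set is held out first and the $k$-fold cross-validation only sees the training set. The 80/20 split, `stratify=y` and `random_state=0` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples; they are not touched during cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = SVC(kernel="linear", C=1)

# 5-fold cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))

# Final check: refit on the whole training set, score once on the held-out test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: %0.2f" % test_accuracy)
```

The cross-validation scores are used to judge the model and its hyperparameters, while the held-out test set is used only once at the end.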


By default, `model_selection.cross_val_score()` scores each fold with the estimator's own `score` method, which is the accuracy for a classifier such as `SVC`, and returns one score per fold.

There are <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter">other options</a> available as well.

In [3]:
scores = model_selection.cross_val_score(model, x, y, cv=5, scoring='f1_macro')
print("f1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

f1: 0.98 (+/- 0.03)
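
Coming back to the question of whether a hyperparameter was chosen well, one possible sketch is to compute the cross-validated accuracy for several candidate values and compare the means. The candidate values of `C` below are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical candidate values for the regularization parameter C
candidate_C = [0.01, 0.1, 1, 10]

mean_scores = {}
for C in candidate_C:
    model = SVC(kernel="linear", C=C)
    # Mean 5-fold cross-validated accuracy for this value of C
    mean_scores[C] = cross_val_score(model, X, y, cv=5).mean()
    print("C = %-5s mean accuracy: %0.3f" % (C, mean_scores[C]))

# Pick the candidate with the highest mean accuracy
best_C = max(mean_scores, key=mean_scores.get)
print("Best C among the candidates:", best_C)
```

`sklearn` also provides `model_selection.GridSearchCV`, which automates this kind of search.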
