# K-fold cross-validation

Splitting our data into training and test sets is a great way to evaluate our model's out-of-sample performance, but it comes at a high cost: it actually increases the propensity to overfit. The reason for this is that we're effectively halving our training sample. As a result, our model has less data to work with, which means it will be more likely to capitalize on chance and fit noise (if this isn't completely intuitive to you yet, I suggest going back to the interactive plot at the end of the previous section and fiddling with it some more).

Is there a way to have our cake and eat it too? As it turns out, there is—at least, mostly. The solution is to use a form of cross-validation known as <i>k</i>-fold cross-validation. The idea here is very similar to splitting our data into training and testing halves. In fact, if we set <i>k</i>—a parameter that represents the number of *folds*, or data subsets—to 2, we again end up with two discrete subsets of the data.

But now, there's an important twist: instead of using one half of the data for training and the other half for testing, we're going to use both halves for both training and testing. The key is that we'll take turns. First, we'll use Half 1 to train, and Half 2 to test; then, we'll reverse the process. Our final estimate of the model's out-of-sample performance is obtained by averaging the performance estimates we got from the two testing halves. In this way, we've managed to use every single one of our data points for both training and testing, but—critically—never for both at the same time.
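To make the turn-taking concrete, here is a minimal sketch of the two-fold procedure. It runs on synthetic data generated purely for illustration (the arrays `X` and `y` and the coefficient vector are assumptions, not the dataset used in this tutorial).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=200)

# Shuffle the row indices, then cut them into two halves
inds = rng.permutation(len(X))
half_1, half_2 = inds[:100], inds[100:]

# Take turns: train on one half and test on the other, then swap
scores = []
for train, test in [(half_1, half_2), (half_2, half_1)]:
    est = LinearRegression().fit(X[train], y[train])
    scores.append(est.score(X[test], y[test]))

# The overall estimate is the average of the two test scores
two_fold_r2 = np.mean(scores)
print(f"Test R^2 per half: {scores[0]:.2f}, {scores[1]:.2f}")
print(f"Mean two-fold R^2: {two_fold_r2:.2f}")
```

Every row ends up in exactly one test half, so each observation is used once for testing and once for training, but never both within the same fit.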

Of course, we don't have to set <i>k</i> to 2; we can set it to any other value between 2 and the total sample size <i>n</i>. At the limit, if we set <i>k = n</i>, the approach is called *leave-one-out cross-validation* (because in every fold, we leave out a single data point for testing, and use the rest of the dataset for training). In practice, <i>k</i> is most commonly set to a value in the range of 3 to 10 (there are principled reasons to want to avoid large values of <i>k</i> in many cases, but we won't get into the details here).
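As a quick check of the limiting case, the sketch below compares scikit-learn's `KFold` with `n_splits` set to the sample size against its `LeaveOneOut` splitter. The six-row array `X_demo` is a toy input invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six toy observations: an assumption for illustration only
X_demo = np.arange(12).reshape(6, 2)
n = len(X_demo)

# k = n folds (no shuffling) versus the dedicated leave-one-out splitter
kfold_n = KFold(n_splits=n)
loo = LeaveOneOut()

kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in kfold_n.split(X_demo)]
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in loo.split(X_demo)]

# Each split holds out exactly one row and trains on the other n - 1
for tr, te in loo_splits:
    print(f"train={tr} test={te}")
```

Both splitters should produce the same sequence of train/test index pairs here, with each test set containing a single row.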

<div align="center" style="font-size: 12px;"><img src="images/kfoldcv.png" width="900">
Image from <a href="http://karlrosaen.com/ml/learning-log/2016-06-20/">http://karlrosaen.com/ml/learning-log/2016-06-20/</a>
</div>

### K-folds the explicit way
To illustrate how k-folds cross-validation works, let's implement it ourselves. First, we create <i>k</i> different subsets of the original dataset. Then, we loop over the <i>k</i> subsets and, in each case, use the current subset to test the model trained on the remaining <i>k</i>-1 subsets. Finally, we average over the performance estimates obtained from all <i>k</i> folds to obtain our overall out-of-sample performance estimate. If you're not interested in wading through the code, you can skip to the next subsection, where we replace most of this with a single line.

In [5]:
# Number of folds
K = 5

# initialize results placeholders
train_r2 = []
test_r2 = []

# our humble steed
est = LinearRegression()

# create list of indexes and randomize order. if we don't
# do this, our folds may be unbalanced if row order is
# confounded with other factors.
from random import shuffle
inds = list(range(len(items)))
shuffle(inds)

# Loop over the k folds
for k in range(K):

    # assign every position i to one of K folds via i % K. the
    # conditional passes for (K-1)/K of the indices, which form
    # the training set; the remaining 1/K are held out for testing
    train = [x for (i, x) in enumerate(inds) if i % K != k]

    # any indices not in the training set must be in the test
    test = list(set(inds) - set(train))

    # assign X and y train/test subsets to new variables
    X_train, X_test = items.iloc[train], items.iloc[test]
    y_train, y_test = age.iloc[train], age.iloc[test]
    
    # fit the linear regression to only the training data
    est.fit(X_train, y_train)
    
    # compute scores separately for train and test
    _train_r2 = est.score(X_train, y_train)
    _test_r2 = est.score(X_test, y_test)

    # save the R2 scores for this fold
    train_r2.append(_train_r2)
    test_r2.append(_test_r2)

# compute the mean r2 values over all folds
train_mean = np.array(train_r2).mean()
test_mean = np.array(test_r2).mean()

# let's see...
print(f"Mean training R^2 over folds: {train_mean:.2f}")
print(f"Mean test R^2 over folds: {test_mean:.2f}")

Mean training R^2 over folds: 0.68
Mean test R^2 over folds: 0.15


Notice that our out-of-sample predictive performance is now much better than it was before. Why do you think this is?

### K-folds the easy way
K-folds is an extremely common validation strategy, so any machine learning package worth its salt should provide us with some friendly tools we can use to avoid having to reimplement the basic procedure over and over. In scikit-learn, the `model_selection` module contains several useful utilities. We've already seen `train_test_split`, which we could use to save us some time. But if all we want to do is get cross-validated scores for some estimator, it's even faster to use the `cross_val_score` function:

In [6]:
from sklearn.model_selection import cross_val_score

# number of folds
K = 10

est = LinearRegression()

# cross_val_score takes an estimator, our variables, and an
# optional specification of the cross-validation procedure.
# integers are interpreted as the number of folds to use
# in a k-folds partitioning.
r2_cv = cross_val_score(est, items, age, cv=K)

print("Individual fold scores:", r2_cv)
print(f"\nMean cross-validated R^2: {r2_cv.mean():.2f}")

Individual fold scores: [ 0.3072363   0.17566146  0.2888488  -0.02070723  0.30029865  0.26647716
  0.01426081  0.12537567  0.3904625   0.09826141]

Mean cross-validated R^2: 0.19


That's it! We were able to replace nearly all of our code above with one function call. If you find this a little *too* magical, scikit-learn also has a bunch of other utilities that offer an intermediate level of abstraction (e.g., the `sklearn.model_selection.KFold` class will generate the folds for you, but will return the training and test indices for you to loop over, rather than automatically cross-validating your estimator).
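For reference, here is a minimal sketch of that intermediate level of abstraction using `KFold`. It uses synthetic NumPy arrays (an assumption for illustration; with the pandas objects from the earlier cells you would index with `.iloc` instead), and it also passes the same `KFold` object to `cross_val_score` through `cv` to show that the two routes agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: an assumption for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# KFold generates the index sets; we do the fitting and scoring ourselves
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_r2 = []
for train_idx, test_idx in kf.split(X):
    est = LinearRegression().fit(X[train_idx], y[train_idx])
    test_r2.append(est.score(X[test_idx], y[test_idx]))

# The same splitter can be handed to cross_val_score via cv
auto_r2 = cross_val_score(LinearRegression(), X, y, cv=kf)

print("Manual loop:     ", np.round(test_r2, 3))
print("cross_val_score: ", np.round(auto_r2, 3))
```

Writing the loop yourself is useful when you want more than a score per fold, such as the fitted coefficients or the held-out predictions.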

Notice that, once again, our cross-validated estimate is a little bit higher than it was in the previous version. The explanation for this improvement is the same as before. (*Hint*: what happens to the training/test allocation as we increase the number of folds? And for bonus points, why might we not automatically want to use the largest value of *k*, i.e., leave-one-out cross-validation, just because it's likely to give us a more favorable estimate of performance?)