# n-Fold Cross Validation

When we have a small data set, we can use n-fold cross validation: we split the dataset into n folds and iteratively use one fold for validation and the remaining folds for training. The experiment is repeated n times, so n models are learned, and we average the evaluation metric over these n experiments.
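The procedure can be sketched without any library support. The following is a minimal illustration only: the helper `n_fold_indices`, the synthetic data, and the choice of five folds are assumptions made for this example and are not part of the notebook's data or code.

```python
import numpy as np

def n_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation set exactly once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

# Synthetic data (assumption, not the advertising set): y = 2*x + 1 + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2 * x + 1 + rng.normal(0, 0.5, size=30)
A = np.column_stack([x, np.ones_like(x)])  # design matrix with an intercept column

fold_mse = []
for train_idx, valid_idx in n_fold_indices(len(y), n_folds=5):
    # learn one least-squares model per fold, using only the training rows
    coef, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
    residuals = y[valid_idx] - A[valid_idx] @ coef
    fold_mse.append(np.mean(residuals ** 2))

cv_mse = float(np.mean(fold_mse))  # average of the n validation scores
print(len(fold_mse), cv_mse)
```

Every row is used for validation exactly once and for training n - 1 times, which is what makes the approach attractive for small data sets.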

# Data

We can use the `KFold` class in scikit-learn to split the data into k folds (the n folds described above). `KFold.split(X)` returns an iterator over the k splits; each item is a pair of arrays holding the row indices of the training set and the validation set.
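To see what the iterator yields, here is a small sketch on a made-up array of six rows (`X_toy` is an assumption for illustration, not the advertising data):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)  # 6 samples, 2 features (made-up values)

kf_toy = KFold(n_splits=3)  # no shuffling by default: folds are contiguous blocks of rows
splits = list(kf_toy.split(X_toy))
for fold, (train_ind, valid_ind) in enumerate(splits):
    print(fold, train_ind, valid_ind)

# prints:
# 0 [2 3 4 5] [0 1]
# 1 [0 1 4 5] [2 3]
# 2 [0 1 2 3] [4 5]
```

Because the default folds follow row order, data that is sorted in some way may call for `KFold(n_splits=3, shuffle=True, random_state=...)` instead.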

In [6]:
from ml import advertising_pd

In [7]:
df = advertising_pd()
y = df['Sales']
X = df[['TV', 'Radio']]

In [13]:
from sklearn.metrics import mean_squared_error

def sum_squared_error(model, X, y):
    """return the sum of squared errors over the datapairs in X, y"""
    y_pred = model.predict(X)
    return mean_squared_error(y, y_pred) * len(y)

In [15]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from math import sqrt

# a KFold generates train and validation splits. The split contains index numbers so we can apply
# the same split to both X and y.
kf = KFold(n_splits=3)

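train_error = 0
valid_error = 0
n_train = 0
n_valid = 0
for train_ind, valid_ind in kf.split(X):
    train_X = X.iloc[train_ind]
    valid_X = X.iloc[valid_ind]
    train_y = y.iloc[train_ind]
    valid_y = y.iloc[valid_ind]
    model = LinearRegression()
    model.fit(train_X, train_y)
    # accumulate the squared errors and the number of predictions over all folds
    train_error += sum_squared_error(model, train_X, train_y)
    valid_error += sum_squared_error(model, valid_X, valid_y)
    n_train += len(train_y)
    n_valid += len(valid_y)

# RMSE over all folds: divide by the total number of predictions, not by the size of one fold
print(sqrt(train_error / n_train), sqrt(valid_error / n_valid))

Dividing the accumulated errors by the size of the last fold instead, as in `sqrt(valid_error / len(valid_X))`, overstates the RMSE by a factor of about √3 with three folds. Computed that way, the training and validation values come out as:

2.8289825607751085 2.9294040867710303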
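The manual loop can also be delegated to scikit-learn's `cross_val_score`. The sketch below uses synthetic data from `make_regression` as a stand-in for the advertising set (an assumption, since `ml.advertising_pd` is a local helper), so its numbers are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the advertising data (assumption): 90 rows, 2 features
X_syn, y_syn = make_regression(n_samples=90, n_features=2, noise=5.0, random_state=0)

kf = KFold(n_splits=3)
# scikit-learn scorers follow "higher is better", so the MSE is returned negated
neg_mse = cross_val_score(
    LinearRegression(), X_syn, y_syn, cv=kf, scoring="neg_mean_squared_error"
)

fold_rmse = np.sqrt(-neg_mse)           # one validation RMSE per fold
pooled_rmse = np.sqrt(-neg_mse.mean())  # RMSE over all predictions; valid here because the folds are equal-sized
print(fold_rmse, pooled_rmse)
```

`cross_val_score` reports validation scores only; if the training error is also of interest, `cross_validate` with `return_train_score=True` is one option.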
