# n-Fold Cross Validation

When we have a small dataset, we can use n-fold cross validation: split the dataset into n folds, then iteratively use one fold for validation and the remaining n-1 folds for training. The experiment is repeated n times, so n models are learned, and we average the evaluation metric over these n experiments.
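The splitting scheme described above can be sketched by hand with NumPy. This is only an illustration: the helper name `fold_indices` is hypothetical and not part of any library, and the folds here are contiguous and unshuffled.

```python
import numpy as np

def fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    # array_split tolerates fold sizes that differ by one when
    # n_samples is not divisible by n_folds
    folds = np.array_split(indices, n_folds)
    for i, valid_idx in enumerate(folds):
        # every fold except fold i is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

splits = list(fold_indices(n_samples=10, n_folds=5))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
```

Each sample lands in the validation fold exactly once and in the training set n-1 times.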

# Data

We can use the `KFold` class in scikit-learn to split the data into k folds (the n folds described above). `KFold.split(X)` returns an iterator over the k folds; each iteration yields the row indices of the training set and of the validation set.
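To see what `KFold.split` yields, here is a small sketch on a toy array. The names `X_demo`, `kf_demo` and `demo_splits` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # six toy samples, two features
kf_demo = KFold(n_splits=3)           # shuffle=False by default

# each item is a (train indices, validation indices) pair of integer arrays
demo_splits = list(kf_demo.split(X_demo))
for train_ind, valid_ind in demo_splits:
    print("train:", train_ind, "valid:", valid_ind)
```

Because the split is expressed as positional indices, the same pair can be applied to both the features and the target, for example with `.iloc` on a pandas object.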

In [1]:
from ml import *

#### Load the Boston dataset in pandas. Use price as the target variable and lstat and age as the features.

In [2]:
df = boston_pd()
y = df['price']
X = df[['lstat', 'age']]

#### Complete the code to train a LinearRegression model in every pass and use its predictions on the validation fold to compute the R2 score. Store all the R2 scores in the list `r2`; the average is printed at the end. You should get an R2 score over 0.27.

In [10]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# KFold generates train and validation splits. Each split contains positional indices, so we can apply
# the same split to both X and y.
kf = KFold(n_splits=5)

r2 = []
for train_ind, valid_ind in kf.split(X):

    train_X = X.iloc[train_ind]
    valid_X = X.iloc[valid_ind]
    train_y = y.iloc[train_ind]
    valid_y = y.iloc[valid_ind]
    mean_y = sum(valid_y) / len(valid_y) # mean of the target over the VALIDATION SET
    model = LinearRegression()
    model.fit(train_X, train_y)
    y_pred = model.predict(valid_X)      # predictions over the VALIDATION SET
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    r2_numerator = sum([ (yp - yt)**2 for yp, yt in zip(y_pred, valid_y)])
    r2_denominator = sum([ (yt - mean_y)**2 for yt in valid_y])
    r2.append( 1 - r2_numerator / r2_denominator)
    # equivalent, using sklearn.metrics.r2_score: r2.append(r2_score(valid_y, y_pred))

print(sum(r2) / len(r2))

0.2744340774851159
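As a cross-check, scikit-learn's `cross_val_score` can run the same loop in one call. The sketch below uses synthetic data (the arrays `X_syn` and `y_syn` are invented for illustration, not the Boston data) and compares a manual loop against `cross_val_score` with `scoring="r2"`; it does not reproduce the score printed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data: two features, linear signal plus noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=100)

kf_syn = KFold(n_splits=5)

# manual loop: fit on the training folds, score on the validation fold
manual = []
for train_ind, valid_ind in kf_syn.split(X_syn):
    model = LinearRegression().fit(X_syn[train_ind], y_syn[train_ind])
    manual.append(r2_score(y_syn[valid_ind], model.predict(X_syn[valid_ind])))

# the same procedure in a single call, reusing the same splitter
auto = cross_val_score(LinearRegression(), X_syn, y_syn, cv=kf_syn, scoring="r2")

print(np.mean(manual), auto.mean())
```

Passing the same unshuffled `KFold` object as `cv` means both approaches see identical folds, so the per-fold scores should agree.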
