# Cross-Validation (CV)

The data is split into k equal-sized chunks (folds). In each run, one fold is held out as test data and the remaining folds are used as training data. With 5 folds this produces 5 R^2 values instead of a single one, which gives a more reliable estimate of model performance. From these values the mean, the median and a 95% confidence interval can be calculated. This procedure is called **k-fold** cross-validation, or more generally **n-fold CV**.

Cross-validation is a vital step in evaluating a model. It makes full use of the available data: across the runs, every observation is used for training in some folds and for testing in exactly one fold, so the model is both trained and tested on all of the data.
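The splitting scheme described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper name `k_fold_indices` is hypothetical, the folds are taken as contiguous blocks without shuffling, and any remainder is spread over the first folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Hypothetical helper for illustration: folds are contiguous blocks,
    and the first n_samples % k folds receive one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The current block is the test fold; everything else is training data.
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Toy example: 10 samples, 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: test={test} train={train}")

# Every sample index lands in exactly one test fold.
tested = sorted(idx for _, test in folds for idx in test)
print(tested == list(range(10)))
```

Each sample appears in the test set of exactly one fold and in the training set of the other k - 1 folds, which is why the procedure ends up testing on all of the available data.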

### CV on gapminder data

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

In [2]:
gapminder = pd.read_csv('regression/data/gm_2008_region.csv', header=0)
gapminder.head(2)

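| Unnamed: 0 | population | fertility | HIV | CO2 | BMI_male | GDP | BMI_female | life | child_mortality | Region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 34811059.0 | 2.73 | 0.1 | 3.328945 | 24.5962 | 12314.0 | 129.9049 | 75.3 | 29.5 | Middle East & North Africa |
| 1 | 19842251.0 | 6.43 | 2.0 | 1.474353 | 22.25083 | 7103.0 | 130.1247 | 58.3 | 192.0 | Sub-Saharan Africa |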


In [8]:
# data
X = gapminder.drop(['Region', 'life'], axis=1).values
y = gapminder.life.values

# init regression obj
reg = LinearRegression()

# compute 5-fold CV scores
cv_scores = cross_val_score(estimator=reg, X=X, y=y, cv=5)

# print the R^2 scores
print(cv_scores)

# compute mean R^2 score
print("\nThe average R^2 score is {:.2f}".format(np.mean(cv_scores)))

[0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]

The average R^2 score is 0.86
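As a rough sketch of the summary statistics mentioned at the top of this section, the five scores printed above can be turned into a mean, a median and an approximate 95% confidence interval using only the standard library. The interval here is a Student-t interval with 4 degrees of freedom, which is an assumption on my part: it treats the fold scores as independent draws, which is only approximately true because the training sets overlap.

```python
import math
import statistics

# The five fold scores printed by the cell above.
cv_scores = [0.81720569, 0.82917058, 0.90214134, 0.80633989, 0.94495637]

mean_score = statistics.mean(cv_scores)
median_score = statistics.median(cv_scores)

# Standard error of the mean, based on the sample standard deviation.
std_err = statistics.stdev(cv_scores) / math.sqrt(len(cv_scores))

# Two-sided 95% critical value of Student's t with n - 1 = 4 degrees of freedom.
# Hard-coded table value; it must be changed if the number of folds changes.
T_CRIT_DF4 = 2.776

ci_lower = mean_score - T_CRIT_DF4 * std_err
ci_upper = mean_score + T_CRIT_DF4 * std_err

print("mean   R^2: {:.2f}".format(mean_score))
print("median R^2: {:.2f}".format(median_score))
print("approx. 95% CI: [{:.2f}, {:.2f}]".format(ci_lower, ci_upper))
```

With only five scores the interval is wide, so it is best read as a rough indication of how much the R^2 varies between folds rather than as a precise bound.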


The more folds we use, the more computationally expensive cross-validation becomes, because the model is fit and evaluated once per fold.

In [9]:
%timeit cross_val_score(reg, X, y, cv=3)

3.35 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
%timeit cross_val_score(reg, X, y, cv=15)

15.8 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
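The timings above grow from about 3.35 ms to about 15.8 ms (roughly 4.7 times) when going from 3 to 15 folds, which is close to the five-fold increase in the number of model fits. A back-of-the-envelope sketch of that workload follows. The helper `cv_workload` and the sample count `n = 150` are hypothetical and chosen only for illustration; the actual number of rows in the dataset is not used here.

```python
def cv_workload(n_samples, k):
    """Return (number of model fits, total training rows processed) for k-fold CV.

    Hypothetical helper: it only counts work and does not train anything.
    """
    base, extra = divmod(n_samples, k)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k)]
    # In each run, every row outside the held-out fold is used for training.
    train_rows = sum(n_samples - size for size in fold_sizes)
    return k, train_rows


n = 150  # hypothetical sample count, not taken from the dataset
for k in (3, 15):
    fits, rows = cv_workload(n, k)
    print(f"cv={k}: {fits} fits, {rows} training rows processed in total")
```

On a small dataset the fixed overhead of each fit likely dominates, so the run time tracks the number of fits more closely than the number of training rows processed.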
