Cross-validation is one way to evaluate a machine learning model. It splits the dataset into multiple folds, trains the model on all but one fold, and validates it on the fold that was held out, repeating the process so that each fold serves as the validation set once. The result is a more reliable estimate of how the model is likely to generalize to an independent dataset. Cross-validation is widely used for estimating test error for the following reasons:

* Provides a less biased evaluation than a single train/test split, which in turn helps you detect overfitting.
* Provides a reliable way to validate a model when no explicit validation set is available.
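
To make the splitting concrete, here is a minimal sketch of how a 5-fold split partitions sample indices. It uses scikit-learn's `KFold` on a toy array of 10 indices; the toy array and the variable names are assumptions for illustration and are not part of the wine dataset used below.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (an assumption for illustration): 10 samples identified by index.
toy_samples = np.arange(10)

kfold = KFold(n_splits=5)  # no shuffling, so the folds are contiguous blocks
folds = list(kfold.split(toy_samples))

for round_number, (train_idx, val_idx) in enumerate(folds, start=1):
    # Each round trains on 4 folds (8 samples) and validates on 1 fold (2 samples).
    print(f"round {round_number}: train={train_idx}, validate={val_idx}")

# Across the 5 rounds, every sample is used for validation exactly once.
validated = np.sort(np.concatenate([val_idx for _, val_idx in folds]))
```

Each sample lands in the validation fold of exactly one round, which is why every data point contributes to the final estimate.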

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

import os, sys
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB

In [3]:
np.random.seed(18937)

In [4]:
datasource = "datasets/winequality-red.csv"

In [5]:
print(os.path.exists(datasource))

True


In [6]:
df = pd.read_csv(datasource).sample(frac = 1).reset_index(drop = True)
df.head()

,Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,1059,11.2,0.67,0.55,2.3,0.084,6.0,13.0,1.0,3.17,0.71,9.5,6
1,1561,7.0,0.51,0.09,2.1,0.062,4.0,9.0,0.99584,3.35,0.54,10.5,5
2,818,8.2,0.32,0.42,2.3,0.098,3.0,9.0,0.99506,3.27,0.55,12.3,6
3,19,7.0,0.36,0.21,2.4,0.086,24.0,69.0,0.99556,3.4,0.53,10.1,6
4,754,12.1,0.4,0.52,2.0,0.092,15.0,54.0,1.0,3.03,0.66,10.2,5


In [7]:
del df["Unnamed: 0"]

In [8]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,11.2,0.67,0.55,2.3,0.084,6.0,13.0,1.0,3.17,0.71,9.5,6
1,7.0,0.51,0.09,2.1,0.062,4.0,9.0,0.99584,3.35,0.54,10.5,5
2,8.2,0.32,0.42,2.3,0.098,3.0,9.0,0.99506,3.27,0.55,12.3,6
3,7.0,0.36,0.21,2.4,0.086,24.0,69.0,0.99556,3.4,0.53,10.1,6
4,12.1,0.4,0.52,2.0,0.092,15.0,54.0,1.0,3.03,0.66,10.2,5


In [9]:
df.describe()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


Let's use five feature columns as the input X and the quality column as the output y, then perform a 5-fold cross-validation using cross_val_score(). It splits the data into 5 folds (based on the cv argument) and, for each round, fits the model on 4 folds and scores it on the remaining fold. That yields 5 scores, from which you can calculate a mean and variance. You can use these both to estimate the model's score and to compare parameter settings when tuning. Because cross-validation fits the model itself, we don't need to fit the model beforehand.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

<i>sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')</i>

<b>cv :</b> int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:
* None, to use the default 3-fold cross-validation (the default changed to 5-fold in scikit-learn 0.22),
* an integer, to specify the number of folds in a (Stratified)KFold,
* an object to be used as a cross-validation generator,
* an iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
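
The rule above can be checked with a short sketch: because GaussianNB is a classifier, passing cv=5 should give the same scores as an explicit, unshuffled StratifiedKFold loop. The synthetic data from `make_classification` and the names `X_demo` and `y_demo` are assumptions for illustration; they stand in for the wine data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=0
)

model = GaussianNB()

# Integer cv with a classifier: cross_val_score picks StratifiedKFold itself.
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The same procedure written out by hand.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each round
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() is mean accuracy for classifiers, which is what scoring=None uses.
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

print(auto_scores)
print(manual_scores)
```

Writing the loop by hand is rarely necessary, but it is useful when you need access to the fitted model or the predictions from each round.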

In [10]:
m = GaussianNB()

In [11]:
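# Select five feature columns by position (the quality column is excluded first):
# volatile acidity, citric acid, total sulfur dioxide, sulphates, alcohol
X = np.array(df.iloc[:, :-1])[:, [1, 2, 6, 9, 10]]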

In [12]:
y = np.array(df["quality"])

In [13]:
sklearn.model_selection.cross_val_score(m, X, y, cv = 5)

array([ 0.58074534,  0.59190031,  0.57320872,  0.55660377,  0.61198738])

The array above displays the 5 scores from the 5-fold cross-validation. For each round, the model was fit on 4 of the folds and scored on the one held out. Because no scoring argument was passed, each score is the classifier's mean accuracy on the held-out fold.
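
The five scores are usually summarized by their mean and spread. A minimal sketch, with the array copied from the output above (the variable names are introduced here for illustration):

```python
import numpy as np

# Scores copied from the cross_val_score output above.
scores = np.array([0.58074534, 0.59190031, 0.57320872, 0.55660377, 0.61198738])

mean_score = scores.mean()
# ddof=1 gives the sample standard deviation, a common choice for a handful of folds.
std_score = scores.std(ddof=1)

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

The mean is the estimate of the model's accuracy on unseen data, and the standard deviation indicates how much that estimate varies from fold to fold.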