# K-Fold Cross Validation

In [16]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
#iris.data.shape
#iris.data.size
#iris.data[:15]

A single train/test split is easy to make with the train_test_split function from sklearn.model_selection:

In [11]:
# Split the iris data into train/test data sets with 40% reserved for testing and 60% for training.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

# Build an SVC model for predicting iris classifications using training data for both features and target.
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Now measure its performance with the test data for feature and target.
clf.score(X_test, y_test)   

# The iris dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris)
# has only 150 samples, so a score from one split may depend heavily on which samples landed in the test set.
# K-Fold cross validation, shown below, gives a more reliable estimate:

0.96666666666666667

K-Fold cross validation using a K of 5:

In [8]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 5 folds:
print(scores.mean())

[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
0.98
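
To make explicit what cross_val_score does with cv=5, here is a minimal sketch of the same loop written by hand. It assumes that, for a classifier and an integer cv, scikit-learn splits the data with an unshuffled StratifiedKFold; the names model, fold_model and manual_scores are illustrative and do not appear in the cells above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC(kernel='linear', C=1)

# Assumption: cv=5 on a classifier corresponds to unshuffled stratified folds.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Fit a fresh, unfitted copy of the model on the 4 training folds...
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    # ...and score it on the 1 held-out fold.
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)
print(manual_scores.mean())

# Compare against the library helper on the same model and data.
print(np.allclose(manual_scores, cross_val_score(model, X, y, cv=5)))
```

Every sample is used for testing exactly once and for training four times, which is why the mean is a steadier estimate than one 60/40 split.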


Using a different kernel (poly):

In [9]:
# cross_val_score fits a fresh copy of the model on each fold, so no prior call to fit is needed.
clf = svm.SVC(kernel='poly', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())

[ 1.          1.          0.9         0.93333333  1.        ]
0.966666666667


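The more complex polynomial kernel produced a lower mean accuracy (0.967) than the simple linear kernel (0.98). That suggests the polynomial kernel may be overfitting, although on 150 samples the gap amounts to only two more misclassified flowers. A single train/test split would not have revealed the difference: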

In [10]:
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)   

0.96666666666666667

That's the same score (0.9667) we got from the single train/test split with the linear kernel, so that one split cannot tell the two kernels apart.
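
One way to see how much a single split depends on chance is to repeat it with several random_state values. The sketch below is illustrative only: the seeds 0 through 4 and the name split_scores are arbitrary choices, and the exact numbers will depend on the scikit-learn version in use.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hypothetical experiment: repeat the same 60/40 split with different seeds.
split_scores = {}
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    # Fit and score both kernels on this particular split.
    split_scores[seed] = {
        kernel: svm.SVC(kernel=kernel, C=1).fit(X_train, y_train).score(X_test, y_test)
        for kernel in ('linear', 'poly')
    }

for seed, by_kernel in split_scores.items():
    print(seed, by_kernel)
```

If the ranking of the two kernels changes from seed to seed, that is a sign the single-split score is too noisy to choose between them.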

## To do:

Vary the degree of the polynomial kernel in SVC and see its effect on the cross-validated accuracy.
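
A possible starting point for this, as a rough sketch rather than a finished experiment: loop over the degree parameter of SVC (which defaults to 3 for the poly kernel) and record the mean 5-fold accuracy for each value. The range 1 to 5 and the name mean_accuracy are arbitrary choices made here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Assumed range of degrees to try; widen or narrow as needed.
mean_accuracy = {}
for degree in range(1, 6):
    clf = svm.SVC(kernel='poly', degree=degree, C=1)
    # cross_val_score refits a copy of clf on each of the 5 folds.
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    mean_accuracy[degree] = scores.mean()
    print(degree, scores, mean_accuracy[degree])
```

Comparing the per-fold scores as well as the means helps show whether a higher degree is consistently worse or just unlucky on one fold.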