# K-Fold Cross Validation

## Dependencies

In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm

## Load Some Data

In [2]:
irisData = datasets.load_iris()

## Split Data into Train & Test Data
A single train/test split is made easy with the `train_test_split` function in scikit-learn's `model_selection` module:

In [3]:
# Split the iris data into train/test data sets with 40% reserved for testing
# train_test_split
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = train_test_split(irisData.data, irisData.target, test_size=0.4, random_state=0)
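
A quick way to confirm what the split produced is to look at the array shapes. This is a minimal sketch that repeats the split above so it runs on its own; the name `iris` is just a local stand-in for `irisData`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# The iris data set has 150 samples with 4 features each,
# so a 40% test split leaves 90 rows for training and 60 for testing.
print(X_train.shape, X_test.shape)
```

With 60 test samples, each misclassified sample costs about 1.67 percentage points of accuracy, which is why the single-split scores in this notebook move in fairly coarse steps.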

## Build A Model From Training Data

In [4]:
# Build an SVC model for predicting irisData classifications using training data
# SVC:     https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

svcKernel = 'linear'
clf = svm.SVC(kernel=svcKernel, C=1).fit(X_train, y_train)

## Measure the Model's Performance

In [5]:
# Now measure its performance with the test data
clf.score(X_test, y_test)

0.9666666666666667

## Start K-Fold Cross Validation
For background, see [this article on k-fold cross validation](https://machinelearningmastery.com/k-fold-cross-validation/).

In [6]:
# Set "k", the number of folds to split the dataset into
k = 5

In [7]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
scores = cross_val_score(clf, irisData.data, irisData.target, cv=k)

# Print the accuracy for each fold:
print("scores:")
print(scores)

# And the mean accuracy of all 5 folds:
print("mean of scores:",scores.mean())

scores:
[0.96666667 1.         0.96666667 0.96666667 1.        ]
mean of scores: 0.9800000000000001
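
To make the folds less of a black box, the loop below sketches roughly what `cross_val_score` does for this case. It assumes the documented scikit-learn behaviour that an integer `cv` with a classifier uses an unshuffled `StratifiedKFold`; the names `model`, `folds`, and `manual_scores` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC(kernel="linear", C=1)

# Stratified folds keep the class proportions similar in every fold
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    # Score on the one fold that was held out of training
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

print(manual_scores)
print("matches cross_val_score:", np.allclose(manual_scores, library_scores))
```

Every sample is used for testing exactly once and for training in the other four folds, so the mean score depends less on one lucky or unlucky split.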


## Use K-Fold With Different Variables
Our linear model is doing well, with a mean accuracy of 0.98 across the 5 folds.  
Here is the same cross validation using a `poly` kernel:

In [8]:
polyKernel = 'poly'
clf = svm.SVC(kernel=polyKernel, C=1)
scores = cross_val_score(clf, irisData.data, irisData.target, cv=k)
print(scores)
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001



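In this run, the polynomial kernel matched the linear kernel under 5-fold cross validation: the per-fold accuracies are identical and the mean is 0.98 in both cases. A single train/test split tells a different story, and shows how much one split can depend on which samples end up in the test set: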

In [9]:
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel=polyKernel, C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)   

0.9
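
The score above comes from one particular split (`random_state=0`). One way to see how much a single-split score can move is to repeat the split with different seeds. This is a minimal sketch; the range of 10 seeds is an arbitrary choice for illustration and is not from the original notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(10):  # arbitrary number of repeats, for illustration only
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed
    )
    model = svm.SVC(kernel="poly", C=1).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

split_scores = np.array(split_scores)
print("min:", split_scores.min(), "max:", split_scores.max())
print("mean:", split_scores.mean())
```

The spread between the minimum and maximum gives a rough sense of how noisy any single split is, which is the motivation for averaging over folds.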

## Comparing Polynomial Degrees

In [11]:
clfTwo = svm.SVC(kernel=polyKernel, degree=2, C=1).fit(X_train, y_train)
clfFour = svm.SVC(kernel=polyKernel, degree=4, C=1).fit(X_train, y_train)

# Now measure their performance with the test data
print(f'clfTwoScore: {clfTwo.score(X_test, y_test)}')
print(f'clfFourScore: {clfFour.score(X_test, y_test)}')

clfTwoScore: 0.95
clfFourScore: 0.9166666666666666
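
The comparison above relies on a single split, so the same caveat applies. A possible follow-up is to compare the degrees with k-fold cross validation instead. This is a minimal sketch; including degree 3 (the `SVC` default for the `poly` kernel) and the name `mean_scores` are assumptions added for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

mean_scores = {}
for degree in (2, 3, 4):  # 3 is SVC's default degree for the poly kernel
    model = svm.SVC(kernel="poly", degree=degree, C=1)
    # cross_val_score fits a fresh copy of the model on each of the 5 folds
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[degree] = scores.mean()
    print(f"degree={degree}: mean accuracy {scores.mean():.4f}")
```

Because each degree is scored on the same five folds, the mean accuracies are easier to compare than the single-split numbers.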
