# K-Fold Cross Validation

- https://www.youtube.com/watch?v=kituDjzXwfE

K-Fold Cross-Validation is a technique for estimating how well a machine learning model generalizes to data it was not trained on. The dataset is split into k subsets (folds) and the model is trained k times; each time a different fold is held out as the validation set and the remaining k-1 folds form the training set. Averaging the performance metric across all k runs gives a more robust estimate than a single train/test split.

K-Fold Cross-Validation is a resampling procedure, which makes it well suited to evaluating models on a limited data sample. It has a single parameter, k, the number of groups the data sample is split into.

The process is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   - Take the group as a holdout or test data set.
   - Take the remaining groups as a training data set.
   - Fit a model on the training set and evaluate it on the test set.
   - Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
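
The steps above can be sketched without any library support. This is a minimal illustration, not how scikit-learn implements it: `k_fold_indices` and `majority_class_accuracy` are hypothetical helpers, the labels are made up, and the "model" simply predicts the most common training label so that the splitting logic stays in focus.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # step 1: shuffle
    # Step 2: split into k groups; the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    # Step 3: each fold serves as the test set exactly once
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

def majority_class_accuracy(y, train_idx, test_idx):
    """Toy 'model': predict the most common training label, return test accuracy."""
    train_labels = [y[j] for j in train_idx]
    prediction = max(set(train_labels), key=train_labels.count)
    return sum(y[j] == prediction for j in test_idx) / len(test_idx)

y = [0] * 6 + [1] * 4  # hypothetical labels, for illustration only
scores = [majority_class_accuracy(y, tr, te) for tr, te in k_fold_indices(len(y), k=5)]
mean_score = sum(scores) / len(scores)  # step 4: summarize the fold scores
print(scores, mean_score)
```

Each sample lands in the test set exactly once and in the training set k-1 times, which is what distinguishes k-fold from repeated random train/test splits.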

![K-fold cross-validation](kfold.png)

In [12]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

X.shape, y.shape

((150, 4), (150,))

In [15]:
# Create the four classifiers to compare: random forest, SVM, decision tree, Gaussian process
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svc_clf = SVC()
dt_clf = DecisionTreeClassifier()
gp_clf = GaussianProcessClassifier()

In [16]:
# Define the K-Fold Cross-Validation splitter
# shuffle=True shuffles the data before splitting into folds;
# random_state=42 makes that shuffle reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)
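
To see what a splitter configured like `kf` produces, the folds can be inspected directly with `KFold.split`. The sketch below is an illustration under stated assumptions: `X_demo` and `y_demo` are stand-ins (not the notebook's variables) that mimic the shape and label layout of the iris data, 150 rows with 50 samples per class in class order.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in labels with the same layout as iris.target: 50 samples per class, in class order
y_demo = np.repeat([0, 1, 2], 50)
X_demo = np.zeros((150, 4))  # placeholder features; split() only needs the number of rows

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_summary = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo)):
    # Count how many samples of each class ended up in this test fold
    class_counts = np.bincount(y_demo[test_idx], minlength=3)
    fold_summary.append((len(train_idx), len(test_idx), class_counts))
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"test class counts={class_counts}")
```

With 150 samples and 5 folds, every split has 120 training rows and 30 test rows. `KFold` ignores the labels, so the class counts per fold are not guaranteed to be equal; `StratifiedKFold` is the scikit-learn splitter that keeps class proportions roughly constant across folds.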

In [19]:
# Perform K-Fold Cross-Validation
# Each cv_scores_* array holds one score per fold (accuracy, the default for classifiers)
cv_scores_rf = cross_val_score(rf_clf, X, y, cv=kf)
cv_scores_svc = cross_val_score(svc_clf, X, y, cv=kf)
cv_scores_dt = cross_val_score(dt_clf, X, y, cv=kf)
cv_scores_gp = cross_val_score(gp_clf, X, y, cv=kf)

cv_scores_modelnames = {"rf": cv_scores_rf, "svc": cv_scores_svc, "dt": cv_scores_dt, "gp": cv_scores_gp}

In [21]:
# Print the results
for key, value in cv_scores_modelnames.items():
    print(f'{key} Cross-Validation Scores: {value}')
    print(f'{key} Mean CV Score: {np.mean(value):.2f}')
    print(f'{key} Standard Deviation of CV Scores: {np.std(value):.2f}\n')

rf Cross-Validation Scores: [1.         0.96666667 0.93333333 0.93333333 0.96666667]
rf Mean CV Score: 0.96
rf Standard Deviation of CV Scores: 0.02

svc Cross-Validation Scores: [1.         1.         0.93333333 0.93333333 0.96666667]
svc Mean CV Score: 0.97
svc Standard Deviation of CV Scores: 0.03

dt Cross-Validation Scores: [1.         0.96666667 0.93333333 0.93333333 0.93333333]
dt Mean CV Score: 0.95
dt Standard Deviation of CV Scores: 0.03

gp Cross-Validation Scores: [1.         1.         0.93333333 0.96666667 0.96666667]
gp Mean CV Score: 0.97
gp Standard Deviation of CV Scores: 0.02
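
When reading the standard deviations above, it helps to know that `np.std` divides by n by default (population standard deviation). With only five folds the choice of divisor is visible in the second decimal place. The following sketch recomputes the random forest summary from the printed fold scores using only the standard library; the variable names are illustrative.

```python
import statistics

# Random forest fold scores, copied from the printed output above
rf_scores = [1.0, 0.96666667, 0.93333333, 0.93333333, 0.96666667]

mean_score = statistics.mean(rf_scores)
pop_std = statistics.pstdev(rf_scores)    # divides by n, like np.std's default (ddof=0)
sample_std = statistics.stdev(rf_scores)  # divides by n - 1, like np.std(..., ddof=1)

print(f"mean={mean_score:.2f}, population std={pop_std:.2f}, sample std={sample_std:.2f}")
# prints: mean=0.96, population std=0.02, sample std=0.03
```

The mean scores of the four models differ by at most 0.02, which is comparable to the fold-to-fold spread, so this single 5-fold run does not clearly separate them.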

