# Cross Validation

This notebook implements cross validation, a technique used in machine learning to evaluate the performance of a model on a limited dataset. The goal of cross validation is to estimate how well the model is likely to perform on new, unseen data. The basic idea is to split the available data into several parts, or folds, and train and test the model on different combinations of these folds. By averaging the performance across the different folds, we can get a more reliable estimate of the model's performance than if we had only trained and tested it on a single split of the data.

The most common form of cross validation is k-fold cross validation, where the data is divided into k roughly equal parts. The model is trained on k-1 of these folds and tested on the remaining fold. This process is repeated k times, with each of the k folds used exactly once as the test set. The results of the k tests are averaged to produce an overall estimate of the model's performance. Because every observation is used for testing once, cross validation helps to detect overfitting, where the model fits too closely to the training data and performs poorly on new, unseen data.
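The splitting procedure described above can be sketched by hand with NumPy. This is only a minimal illustration, not the implementation scikit-learn uses; the helper name `k_fold_indices` and the sample size of 23 are hypothetical choices made for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # random order of the row indices
    folds = np.array_split(shuffled, k)     # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        # Train on every fold except the i-th one
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical example: 23 observations split into 5 folds
splits = list(k_fold_indices(n_samples=23, k=5))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```

Each observation lands in exactly one test fold, and the fold sizes differ by at most one when the number of observations is not divisible by k.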

---

First, load the required libraries.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Import functions for modeling and evaluating performance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

## The Data

The model will be trained using the [Hawks](https://github.com/kary5678/INDE-577/blob/main/Data/hawks.csv) dataset. This dataset contains observations for three species of hawks, and attributes such as age, sex, wing length, body weight, tail length, etc. 

The code block below reads the dataset into a pandas DataFrame object, subsets the DataFrame to the relevant variables, and drops any rows where there are missing values for these relevant variables.

In [2]:
# Read in the data and subset it to the relevant columns/observations
hawks = pd.read_csv("../../Data/hawks.csv")
hawks = hawks[["Species", "Wing", "Tail", "Weight", "Culmen", "Hallux"]].dropna(axis=0)
hawks

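| | Species | Wing | Tail | Weight | Culmen | Hallux |
|---|---|---|---|---|---|---|
| 0 | RT | 385.0 | 219 | 920.0 | 25.7 | 30.1 |
| 2 | RT | 381.0 | 235 | 990.0 | 26.7 | 31.3 |
| 3 | CH | 265.0 | 220 | 470.0 | 18.7 | 23.5 |
| 4 | SS | 205.0 | 157 | 170.0 | 12.5 | 14.3 |
| 5 | RT | 412.0 | 230 | 1090.0 | 28.5 | 32.2 |
| ... | ... | ... | ... | ... | ... | ... |
| 903 | RT | 380.0 | 224 | 1525.0 | 26.0 | 27.6 |
| 904 | SS | 190.0 | 150 | 175.0 | 12.7 | 15.4 |
| 905 | RT | 360.0 | 211 | 790.0 | 21.9 | 27.6 |
| 906 | RT | 369.0 | 207 | 860.0 | 25.2 | 28.0 |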


## Implementation

### Performance of decision tree

To have a point of reference when discussing the cross validation results, I will first fit a decision tree as I did in my [decision trees notebook](https://github.com/kary5678/INDE-577/blob/main/supervised-learning/decision_trees/decision_trees.ipynb), using wing and tail length as predictors of species and the hyperparameter `max_depth=3`.

The processed hawks data is randomly split into a training set and a testing set using the common 80-20 convention. The parameter `random_state=1` is used to ensure that we get the same observations in the training/testing set as in the Hawks exploratory analysis notebook [here](https://github.com/kary5678/INDE-577/blob/main/Data/hawks_analysis.ipynb). We know from the plots in `hawks_analysis.ipynb` that the split using this `random_state` produces training and testing sets that are representative of each other.

In [3]:
# Data preparation step
X = hawks[["Wing", "Tail"]]
y = hawks["Species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

In [4]:
decision_tree = DecisionTreeClassifier(max_depth = 3, random_state = 42)
decision_tree.fit(X_train, y_train)

print(f"Training accuracy: {decision_tree.score(X_train, y_train):.3f}")
print(f"Testing accuracy: {decision_tree.score(X_test, y_test):.3f}")

Training accuracy: 0.986
Testing accuracy: 0.972


The testing accuracy serves as the reference point against which the cross validation results are compared.

### Perform k-fold cross validation

I will perform k-fold cross validation with $k=10$, meaning the dataset will be split into 10 roughly equal parts, and the model will be trained and evaluated 10 times. The `shuffle = True` parameter randomly shuffles the data before splitting it into folds, so that the folds do not simply follow the order of the rows in the dataset, and `random_state = 42` makes the shuffle reproducible.
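To see why shuffling matters, the sketch below uses hypothetical toy labels that are sorted by class (an assumption for illustration; nothing here says the hawks file is ordered this way). Without shuffling, `KFold` cuts the rows into contiguous blocks, so each test fold contains a single class.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy labels sorted by class, standing in for an ordered file
y_toy = np.repeat(["A", "B", "C"], 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

def classes_per_test_fold(shuffle):
    """Return the distinct classes that appear in each test fold."""
    # random_state may only be set when shuffle=True
    kf = KFold(n_splits=3, shuffle=shuffle, random_state=42 if shuffle else None)
    return [sorted(set(y_toy[test_idx].tolist())) for _, test_idx in kf.split(X_toy)]

unshuffled = classes_per_test_fold(shuffle=False)
shuffled = classes_per_test_fold(shuffle=True)
print("shuffle=False:", unshuffled)
print("shuffle=True: ", shuffled)
```

With `shuffle=False`, a model would be tested on a class it never saw during training, which is exactly the kind of biased fold that shuffling avoids.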

In [5]:
# Define number of folds for cross validation
n_folds = 10

# Perform cross validation
kf = KFold(n_splits = n_folds, shuffle = True, random_state = 42)
scores = cross_val_score(decision_tree, X, y, cv = kf)
print(scores)

[0.96666667 0.98876404 0.97752809 0.97752809 0.98876404 0.93258427
 0.97752809 0.97752809 0.98876404 0.96629213]


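The array above holds the accuracy of the model on each of the 10 held-out folds. Note that `cross_val_score` refits a fresh copy of the decision tree for every fold, so these scores evaluate the model configuration rather than the single tree fitted in the previous section.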

To consolidate this into one metric, output the average accuracy of the model over the 10 folds, along with a margin of 2 $\times$ the standard deviation of the accuracy scores. This margin is a rough indication of the fold-to-fold spread, not a formal 95% confidence interval.

In [6]:
# Print the average score and twice the standard deviation
print("Accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std() * 2))

Accuracy: 0.974 (+/- 0.032)
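As a worked check, the summary can be recomputed from the printed fold scores using only the standard library. The variable names below are illustrative, and the fold accuracies are copied from the output above.

```python
from statistics import fmean, pstdev

# Fold accuracies copied from the printed cross validation output
fold_scores = [0.96666667, 0.98876404, 0.97752809, 0.97752809, 0.98876404,
               0.93258427, 0.97752809, 0.97752809, 0.98876404, 0.96629213]
test_accuracy = 0.972  # single-split testing accuracy reported earlier

mean_score = fmean(fold_scores)
# Population standard deviation, matching the default of NumPy's std()
margin = 2 * pstdev(fold_scores)

low, high = mean_score - margin, mean_score + margin
print(f"mean = {mean_score:.3f}, margin = {margin:.3f}")
print(f"band = [{low:.3f}, {high:.3f}]")
print("test accuracy inside band:", low <= test_accuracy <= high)
```

The upper end of the band exceeds 1, which is not a possible accuracy. That is a reminder that the margin is a rough measure of spread rather than a calibrated interval.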


If the score obtained from k-fold cross validation is close to the testing accuracy from the single split, the two estimates agree, which suggests that the single split was not unusually lucky or unlucky and that the model generalizes well to new data. Overfitting would instead show up as a training accuracy well above both the cross validation score and the testing accuracy.

For this decision tree model, the average cross validation score of 0.974 is slightly higher than the testing accuracy of 0.972, but the difference of 0.002 is far smaller than the 0.032 margin. The training accuracy of 0.986 is also only about 0.01 above the cross validation score. Cross validation therefore tells us that the fitted decision tree generalizes well!
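One way to look for overfitting more directly is to compare training and validation accuracy within each fold. The sketch below uses scikit-learn's `cross_validate` with `return_train_score=True`. To stay self-contained it runs on the built-in iris dataset as a stand-in, which is an assumption for illustration and not the hawks data, so the numbers it prints do not describe the model in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the iris dataset, NOT the hawks data used in this notebook
X_demo, y_demo = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# return_train_score=True also records accuracy on each training portion
results = cross_validate(tree, X_demo, y_demo, cv=kf, return_train_score=True)

train_mean = results["train_score"].mean()
valid_mean = results["test_score"].mean()
print(f"Mean training accuracy:   {train_mean:.3f}")
print(f"Mean validation accuracy: {valid_mean:.3f}")
print(f"Gap: {train_mean - valid_mean:.3f}")
```

A large positive gap between the two means would point to overfitting, while a small gap is consistent with a model that generalizes well.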

In conclusion, cross validation is a useful technique for evaluating how well a model generalizes: it gives a more reliable estimate of performance on new, unseen data than the traditional single training/testing split.