### Cross Validation

When adjusting models we are aiming to increase overall model performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters to the test set can lead to information leakage, causing the model to perform worse on unseen data. We can perform cross validation to correct this.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.
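
As a rough illustration of the distinction, the sketch below tunes one hyperparameter of a decision tree (`max_depth`) using cross validation on the training portion only, and keeps a separate test set untouched until the end. The candidate depths, the 25% test split, and the use of `GridSearchCV` are assumptions made for this example, not part of the walkthrough that follows.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Hold out a test set that the tuning procedure never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# max_depth is a hyperparameter: we choose it, it is not learned from the data.
# The candidate values are arbitrary and only for illustration.
param_grid = {"max_depth": [1, 2, 3, 5, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,  # each candidate is scored with 5-fold cross validation
)
search.fit(X_train, y_train)

print("Best max_depth:", search.best_params_["max_depth"])
print("Mean CV accuracy of best candidate:", search.best_score_)
print("Accuracy on the held-out test set:", search.score(X_test, y_test))
```

Because the candidates are compared only on validation folds drawn from the training data, the test set score remains an estimate on data that played no part in the tuning.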

To better understand cross validation, we will apply several different methods to the iris dataset. Let us first load in and separate the data.

In [1]:
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

#print (X)

#print (y)

There are many methods for cross validation. Let's start by looking at k-fold cross validation.

### K-Fold

The training data used in the model is split into k smaller sets, called folds. The model is then trained on k-1 of the folds, and the remaining fold is used as a validation set to evaluate the model. This is repeated k times so that each fold serves as the validation set exactly once.

The KFold cross-validator provides train/test indices to split the data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
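
To make the index splitting concrete, here is a minimal sketch on a toy array of 10 samples (the array `X_toy` is made up for this illustration). It prints which indices land in the training and validation sets for each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature (illustrative data)

kf = KFold(n_splits=5)  # no shuffling by default, so folds are consecutive
splits = list(kf.split(X_toy))

for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
```

With 10 samples and 5 folds, each validation set holds 2 consecutive indices, starting with 0 and 1 in the first fold, and every index is used for validation exactly once.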

As we will be trying to classify different species of iris flowers, we will need to import a classifier model. For this exercise we will be using a DecisionTreeClassifier. We will also need to import CV modules from sklearn.

In [2]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import KFold, cross_val_score

# With the data loaded we can now create and fit a model for evaluation.

clf = DecisionTreeClassifier(random_state=42) # To obtain a deterministic behaviour during fitting,
                                              # random_state has to be fixed to an integer.
    
# Now let's evaluate our model and see how it performs on each k-fold.

k_folds = KFold (n_splits = 5)

scores = cross_val_score (clf, X, y, cv = k_folds) # Evaluate a score by cross-validation.
                                                   
                        # clf is the estimator
                        # X: The data to fit. Can be a list, or an array.
                        # y: The target variable to try to predict in the case of supervised learning.
                        # cv: Determines the cross-validation splitting strategy.
                        # Returns an array of scores of the estimator for each run of the cross validation.
                    
# It is also good practice to see how cross validation performed overall by averaging the scores for all folds.

print("Cross Validation Scores: ", scores)
print("Average Cross Validation Score: ", scores.mean())
print("Number of Cross Validation Scores used in Average: ", len(scores))

Cross Validation Scores:  [1.         1.         0.83333333 0.93333333 0.8       ]
Average Cross Validation Score:  0.9133333333333333
Number of Cross Validation Scores used in Average:  5


The following link has useful information about cross validation:
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

### Stratified K-Fold

In cases where classes are imbalanced we need a way to account for the imbalance in both the train and validation sets. To do so we can stratify on the target classes, meaning that each set keeps approximately the same proportion of every class as the full dataset.

In [3]:
from sklearn.model_selection import StratifiedKFold

sk_folds = StratifiedKFold(n_splits = 5)

scores = cross_val_score(clf, X, y, cv = sk_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [0.96666667 0.96666667 0.9        0.93333333 1.        ]
Average CV Score:  0.9533333333333334
Number of CV Scores used in Average:  5


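While the number of folds is the same, the average CV score is higher than with basic k-fold once the folds are stratified by class. The iris samples are ordered by species, so unshuffled k-fold produces folds dominated by one or two classes, whereas stratified folds keep the class proportions of the full dataset.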
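
One way to see where the difference comes from is to count how many samples of each class end up in each validation fold. The sketch below does this for both splitters; the helper `validation_class_counts` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)

def validation_class_counts(splitter):
    """Return, for each fold, the number of samples per class in the validation set."""
    return [
        np.bincount(y[val_idx], minlength=3)
        for _, val_idx in splitter.split(X, y)
    ]

kfold_counts = validation_class_counts(KFold(n_splits=5))
stratified_counts = validation_class_counts(StratifiedKFold(n_splits=5))

print("KFold validation class counts per fold:")
for counts in kfold_counts:
    print(" ", counts)

print("StratifiedKFold validation class counts per fold:")
for counts in stratified_counts:
    print(" ", counts)
```

Since the iris targets are stored as 50 samples of each class in order, the plain k-fold validation sets are dominated by a single class, while every stratified validation set contains 10 samples of each class.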

### Leave-One-Out (LOO)

Instead of selecting the number of splits in the training data set like k-fold, LeaveOneOut uses 1 observation to validate and n-1 observations to train, repeating this for every observation. This method is an exhaustive technique.

LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. This cross-validation procedure does not waste much data as only one sample is removed from the training set.

In [4]:
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()

scores = cross_val_score(clf, X, y, cv = loo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Average CV Score:  0.94
Number of CV Scores used in Average:  150


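We can observe that the number of cross validation scores is equal to the number of observations in the dataset. In this case there are 150 observations in the iris dataset. Because each validation set contains a single observation, every individual score is either 0 (misclassified) or 1 (correctly classified).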

The average CV score is 94%.
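
The relationship between the number of splits and the number of observations can be checked directly on the splitter, without fitting any model. The following is a small sketch using the `get_n_splits` and `split` methods of `LeaveOneOut`.

```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

X, y = datasets.load_iris(return_X_y=True)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one split per observation

# Inspect the first split: all observations but one are used for training.
first_train, first_val = next(loo.split(X))

print("Number of splits:", n_splits)
print("Training set size in first split:", len(first_train))
print("Validation indices in first split:", first_val)
```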

### Leave-P-Out (LPO)

Leave-P-Out is a small variation on the Leave-One-Out idea: we can choose the number of observations, p, to use in our validation set.

LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing p samples from the complete set.

In [5]:
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=2)

scores = cross_val_score(clf, X, y, cv = lpo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [1. 1. 1. ... 1. 1. 1.]
Average CV Score:  0.9382997762863534
Number of CV Scores used in Average:  11175


As we can see, this is an exhaustive method with many more scores being calculated than Leave-One-Out, even with p = 2 (one score for each of the 11175 possible pairs of observations), yet it achieves roughly the same average CV score.
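
The number of splits is the number of ways to choose p observations out of n, which is why it grows so quickly. As a quick check, the sketch below compares the binomial coefficient from Python's `math.comb` with what `LeavePOut.get_n_splits` reports, and prints the counts for a few values of p.

```python
from math import comb

from sklearn import datasets
from sklearn.model_selection import LeavePOut

X, y = datasets.load_iris(return_X_y=True)
n_samples = len(X)

# Number of ways to choose 2 validation samples out of n_samples.
n_pairs = comb(n_samples, 2)
n_splits = LeavePOut(p=2).get_n_splits(X)

print("Pairs of observations:", n_pairs)
print("Splits reported by LeavePOut(p=2):", n_splits)

# The count grows rapidly as p increases.
for p in (1, 2, 3):
    print(f"p={p}: {comb(n_samples, p)} splits")
```

Each of these splits requires fitting the model once, so larger values of p quickly become impractical on anything but small datasets.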

### Shuffle Split

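Unlike KFold, ShuffleSplit can leave out a percentage of the data so that it is used in neither the train nor the validation set. Each split shuffles the data and draws the two sets at random, independently of the other splits. To do so we must decide what the train and test sizes are, as well as the number of splits. Here a train_size of 0.6 and a test_size of 0.3 leave 10% of the data unused in each split, and because no random_state is set, the scores will vary from run to run.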

In [6]:
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(train_size=0.6, test_size=0.3, n_splits = 5)

scores = cross_val_score(clf, X, y, cv = ss)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [0.91111111 0.91111111 0.93333333 0.97777778 0.97777778]
Average CV Score:  0.9422222222222223
Number of CV Scores used in Average:  5
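
To see how much data each split actually uses, we can count the indices that `ShuffleSplit` returns. This is a minimal sketch; the `random_state=42` argument is an addition for reproducibility and is not set in the cell above.

```python
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit

X, y = datasets.load_iris(return_X_y=True)

# random_state is added here only so that the splits are reproducible.
ss = ShuffleSplit(train_size=0.6, test_size=0.3, n_splits=5, random_state=42)

sizes = []
for i, (train_idx, test_idx) in enumerate(ss.split(X)):
    unused = len(X) - len(train_idx) - len(test_idx)
    sizes.append((len(train_idx), len(test_idx), unused))
    print(f"Split {i}: train={len(train_idx)}, test={len(test_idx)}, unused={unused}")
```

Unlike k-fold, the validation sets of different splits may overlap, since each split is drawn independently.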


### Ending Notes

These are just a few of the CV methods that can be applied to models. There are many more cross validation classes. Check out scikit-learn's cross validation documentation for more CV options at the following link:
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation