# Cross Validation

## A first way to train a supervised learning model

The data used in supervised learning tasks usually contains several features (or independent variables) and a single label (or dependent variable) for a set of observations. <br>

A supervised ML algorithm tries to model the relationship between the features and the label. <br>
We first train the model by providing both features and labels for some observations. We then test it by providing only the features and comparing its predictions with the known labels. Thus, we need to split the data into training and test subsets. <br>
We let the model learn on the training set and then measure its performance on the test set.<br>

The scikit-learn library provides several tools to split data into training and test sets.<br>
The most basic one is **`train_test_split()`**, which divides the data into two parts according to the specified partitioning ratio: <br><br>
`train_test_split(X, y, test_size=0.2)` will set aside 20% of the data for testing and keep 80% for training. 

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'feature':np.random.random(25),
                  'target':target})
X = df.feature
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("TRAIN:", X_train.index, "TEST:", X_test.index)
print("TRAIN:", X_train, "TEST:", X_test)

TRAIN: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
            19],
           dtype='int64') TEST: Int64Index([20, 21, 22, 23, 24], dtype='int64')
TRAIN: 0     0.066609
1     0.168764
2     0.245469
3     0.912774
4     0.806911
5     0.338475
6     0.771831
7     0.631860
8     0.496587
9     0.150872
10    0.222878
11    0.181049
12    0.040606
13    0.757292
14    0.465324
15    0.880543
16    0.491598
17    0.472164
18    0.048185
19    0.689647
Name: feature, dtype: float64 TEST: 20    0.327801
21    0.692204
22    0.651774
23    0.365881
24    0.828829
Name: feature, dtype: float64


If we set the **`shuffle`** parameter to `True` (the default value), the data will be shuffled before it is split. <br>

If we want the splits to be reproducible, we also need to pass an integer to the **`random_state`** parameter. Otherwise, each time we run `train_test_split()`, different indices will end up in the training and test sets.<br>
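
As a small sketch of the effect of `random_state`, the cell below splits a toy array of 25 indices three times. The array and the variable names (`test_a`, `test_b`, `test_c`) are only illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(25)      # 25 toy observations, identified by their index
y = np.ones(25)
y[-5:] = 0

# Same seed: the shuffled split is identical on every call
_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print("Same seed, same test set:", np.array_equal(test_a, test_b))

# No seed: the test indices will usually change from run to run
_, test_c, _, _ = train_test_split(X, y, test_size=0.2)
print("Unseeded test set:", np.sort(test_c))
```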

## How to improve the way the model is trained

Data is a valuable asset and we want to make use of every bit of it. If we split data using `train_test_split()`, we can only train a model with the portion set aside for training. <br>
Models generally get better as the amount of training data increases, so holding data back for testing has a cost. <br>

One way to overcome this issue is **cross validation**. With cross validation, the dataset is divided into N splits. <br>
N-1 splits are used for training and the remaining split is used for testing. This is repeated N times, each time with a different split used for testing. Thus, every data point is used for both training and testing. <br>
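
The idea can be sketched with plain NumPy, without any model: we cut 10 toy indices into N = 5 parts and count how often each point is used for testing and for training. The sizes and names below are illustrative assumptions.

```python
import numpy as np

n_points, n_splits = 10, 5
indices = np.arange(n_points)
splits = np.array_split(indices, n_splits)   # N roughly equal parts

test_count = np.zeros(n_points, dtype=int)
train_count = np.zeros(n_points, dtype=int)

for i, test_idx in enumerate(splits):
    # Every split except the i-th one is used for training
    train_idx = np.concatenate([s for j, s in enumerate(splits) if j != i])
    test_count[test_idx] += 1
    train_count[train_idx] += 1

print("times tested: ", test_count)
print("times trained:", train_count)
```

Each point is tested exactly once and used for training N-1 times.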

Cross validation is also useful to measure the performance of a model more accurately, especially on new, previously unseen data points.<br>

There are different methods to split data in cross validation: **KFold** and **StratifiedKFold** are commonly used.

## KFold

As the name suggests, **KFold** divides the dataset into `k` folds (`k` refers to the number of groups that a given data sample is to be split into). <br>

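If **`shuffle`** is set to `False` (the default for `KFold`), each test fold is a block of consecutive indices, shifted along from the previous one. <br>

In the example below, with 25 observations and `k = 4`, the test set of the first iteration is the first seven indices. `KFold` then moves the test block along the data until all `k` folds have been used. <br>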


In [3]:
from sklearn.model_selection import KFold

X = df.feature
y = df.target

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] TEST: [0 1 2 3 4 5 6]
TRAIN: [ 0  1  2  3  4  5  6 13 14 15 16 17 18 19 20 21 22 23 24] TEST: [ 7  8  9 10 11 12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 19 20 21 22 23 24] TEST: [13 14 15 16 17 18]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] TEST: [19 20 21 22 23 24]
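
The test sets above have sizes 7, 6, 6 and 6 because 25 is not divisible by 4. As documented for `KFold`, the first `n_samples % n_splits` folds receive one extra sample. The short sketch below recomputes these sizes by hand and compares them with what `KFold` produces (the variable names are illustrative).

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 25, 4

# Base size for every fold, plus one extra sample for the first few folds
expected = np.full(n_splits, n_samples // n_splits)
expected[: n_samples % n_splits] += 1

# Sizes of the test folds actually produced by KFold
actual = [len(test) for _, test in KFold(n_splits=n_splits).split(np.zeros(n_samples))]

print("expected:", expected.tolist())
print("actual:  ", actual)
```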


If **`shuffle`** is set to `True`, the splitting will be random.

In [None]:
kf = KFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

We can then use each train/test split to fit a model and evaluate the skill of a machine learning algorithm.<br>

Here, four models are trained and evaluated, with each fold given a chance to be the held-out test set.<br>

For example:

Model1: Trained on [ 0  1  2  4  5  6  7  8  9 11 12 15 16 19 20 22 23 24], Tested on [ 3 10 13 14 17 18 21]
<br>
Model2: Trained on [ 0  1  3  5  8  9 10 11 12 13 14 15 16 17 18 21 22 23 24], Tested on [ 2  4  6  7 19 20]<br>
etc ...

The models are then discarded after they are evaluated as they have served their purpose.<br>

The skill scores are collected for each model and summarized for use.<br>

In [1]:
#import datasets from sklearn library
from sklearn import datasets
data = datasets.load_iris()
 
#Import decision tree classification model and cross validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

 
#Get input and output datasets values in X and Y variables
X = data.data
y = data.target
 
#Initialize k-fold cross validation configurations
kf = KFold(n_splits=5, shuffle=True, random_state=42)
 
scores = []
dt = DecisionTreeClassifier(criterion='gini', max_depth = 2, 
                            min_samples_leaf = 0.10, random_state = 42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
 
    dt.fit(X_train, y_train)
    scores.append(dt.score(X_test, y_test))
    
print("\n") 
print("*" * 80)
print(f"The cross-validation scores using custom method are \n{scores}")
print("*" * 80)
 
import numpy as np
print("\n") 
print("*" * 80)
print(f"Mean of k-fold scores using custom method is {np.mean(scores)}")
print("*" * 80)
print("\n")



********************************************************************************
The cross-validation scores using custom method are 
[0.9666666666666667, 0.9666666666666667, 0.9333333333333333, 0.9333333333333333, 0.9333333333333333]
********************************************************************************


********************************************************************************
Mean of k-fold scores using custom method is 0.9466666666666667
********************************************************************************




## StratifiedKFold

**StratifiedKFold** takes the cross validation one step further. The class distribution in the dataset is preserved in the training and test splits.<br>

In our very simple DataFrame, there are 25 data points. <br>
20 of them belong to class 1 and the remaining 5 belong to class 0, so this is an "imbalanced" class distribution. <br>

**KFold** does not take this into consideration.<br> 
Therefore, in classification tasks with imbalanced class distributions, we should prefer **StratifiedKFold** over **KFold**.<br>


In [None]:
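from sklearn.model_selection import StratifiedKFold

# Use the toy DataFrame again (X and y were overwritten by the iris data above)
X = df.feature
y = df.target

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)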
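
To make the difference concrete, the sketch below counts how many class 0 samples land in each test fold, using the same toy labels as in the first section (20 ones followed by 5 zeros). The helper `class0_per_test_fold` and the placeholder feature matrix are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.ones(25)
y[-5:] = 0               # the 5 minority-class samples sit at the end
X = np.zeros((25, 1))    # placeholder features; only the labels matter here

def class0_per_test_fold(splitter):
    """Number of class 0 samples in each test fold."""
    return [int((y[test] == 0).sum()) for _, test in splitter.split(X, y)]

kfold_counts = class0_per_test_fold(KFold(n_splits=4))
strat_counts = class0_per_test_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", strat_counts)
```

Without shuffling, `KFold` puts all class 0 samples into the last test fold, so three of the four models are tested on class 1 only and the fourth is trained without ever seeing class 0. `StratifiedKFold` spreads the minority class across all folds.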

## k-fold cross-validation using built-in cross_val_score method

Instead of writing custom code, we can also use the built-in **`cross_val_score()`** function of the sklearn library. <br>

`cross_val_score()` automates the loop from our previous example: it takes an estimator, the features and the target, splits the data into `k` folds (`k` is set by the **`cv`** parameter, 5 by default), fits on `k-1` folds and evaluates on the remaining fold.<br> 
It does this `k` times and computes a performance measure (accuracy by default for a classifier) at each step.<br>
Note that for a classifier and an integer `cv`, the folds are built with `StratifiedKFold` without shuffling, so the scores differ slightly from those of the shuffled `KFold` loop above.<br>

The sample code below shows how we can use this function.

In [2]:
#import datasets from sklearn library
from sklearn import datasets
data = datasets.load_iris()
 
#Import decision tree classification model and cross validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

#Get input and output datasets values in X and Y variables
X = data.data
y = data.target
dt = DecisionTreeClassifier(criterion='gini', max_depth = 2, \
                        min_samples_leaf = 0.10, random_state = 42)
 
scores = cross_val_score(dt, X, y, cv = 5)

print("\n") 
print("*" * 100)
print(f"The cross-validation scores using cross_val_score are \n{scores}")
print("*" * 100)
 
import numpy as np
print("\n") 
print("*" * 100)
print(f"Mean of k-fold scores using cross_val_score is {np.mean(scores)}")
print("*" * 100)
print("\n")



****************************************************************************************************
The cross-validation scores using cross_val_score are 
[0.93333333 0.96666667 0.9        0.86666667 1.        ]
****************************************************************************************************


****************************************************************************************************
Mean of k-fold scores using cross_val_score is 0.9333333333333332
****************************************************************************************************
****************************************************************************************************
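
To check that `cross_val_score()` performs the same loop as the hand-written version, we can pass it an explicit splitter through `cv` instead of an integer. The sketch below reuses the same tree settings and the same shuffled `KFold` as the custom loop; the names `builtin_scores` and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
dt = DecisionTreeClassifier(criterion='gini', max_depth=2,
                            min_samples_leaf=0.10, random_state=42)

# An integer random_state makes the splitter yield the same folds on every call
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Built-in: the splitter object is passed through the cv parameter
builtin_scores = cross_val_score(dt, X, y, cv=kf)

# Hand-written loop over the same folds
manual_scores = []
for train_index, test_index in kf.split(X):
    dt.fit(X[train_index], y[train_index])
    manual_scores.append(dt.score(X[test_index], y[test_index]))

print("Identical scores:", np.allclose(builtin_scores, manual_scores))
```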




## How to choose the K value

The **k** value must be chosen carefully for your data sample: a poorly chosen value for k may give a misleading idea of the skill of the model.

Three common tactics for choosing a value for `k` are as follows:

1. The value for `k` is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
2. The value for `k` is fixed to **10**: this is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.
3. The value for `k` is fixed to **n**, where **n** is the size of the dataset, so that each sample is used once as a single-element hold-out set (this is known as leave-one-out cross-validation). 


**Note**: most of the time, one performs k-fold cross-validation using `k = 5` or `k = 10`, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

**Note**: If a value for `k` is chosen that does not evenly split the data sample, the groups cannot all have the same size and the remainder has to be spread over some of them: it is preferable to split the data sample into `k` groups with the same number of samples.

**Note**: Be cautious in choosing the value of hyperparameter `k`: as `k` increases, the procedure becomes more computationally expensive, because `k` separate models have to be trained.
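
For the third tactic (`k = n`), scikit-learn provides a dedicated `LeaveOneOut` splitter. The sketch below, on a toy array of 6 samples (an illustrative assumption), shows that it produces the same splits as `KFold(n_splits=n)` and that it requires one model fit per sample.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)   # 6 toy samples
n = len(X)

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=n).split(X)]

print("Number of model fits needed:", len(loo_splits))
print("Same splits as KFold(n_splits=n):", loo_splits == kfold_splits)
```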

## To summarize

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data.<br>

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
    1. Take the group as a hold-out or test data set.
    2. Take the remaining groups as a training data set.
    3. Fit a model on the training set and evaluate it on the test set.
    4. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.<br>

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set exactly once and is used to train the model k-1 times.
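
The steps above can be sketched as a single function. This is a minimal illustration written with NumPy rather than scikit-learn's splitters; the function name `manual_cross_validation`, its `make_model` argument and the seed are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def manual_cross_validation(make_model, X, y, k=5, seed=0):
    """Return one score per fold, following the four steps listed above."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))        # 1. shuffle the dataset
    groups = np.array_split(shuffled, k)      # 2. split into k groups
    scores = []
    for i, test_idx in enumerate(groups):     # 3. each group is held out once
        train_idx = np.concatenate(groups[:i] + groups[i + 1:])
        model = make_model()                  # a fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
        # the fitted model is discarded here; only its score is kept
    return np.array(scores)

X, y = load_iris(return_X_y=True)
scores = manual_cross_validation(
    lambda: DecisionTreeClassifier(max_depth=2, random_state=42), X, y, k=5)

# 4. summarize the skill of the model
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```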