# Cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt

Load the iris dataset and define a logistic regression classification model.

In [None]:
iris = load_iris()

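log_reg = LogisticRegression(
    random_state=1,
    C=200,           # inverse regularization strength: a large C means weak regularization
    solver='lbfgs',
    max_iter=10000,  # generous iteration budget so the solver can converge
)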

To estimate the accuracy of the model, we now compute a cross-validation score.

In [None]:
scores = cross_val_score(estimator=log_reg,           # model
                         X=iris.data, y=iris.target,  # features and labels
                         cv=5,       # number of folds (5-fold is also the default; see alternatives in the documentation)
                         n_jobs=-1,  # use all CPU cores
                         verbose=1,  # verbosity level
                         )

Note that for `LogisticRegression`, `score` returns the mean accuracy on the given test data and labels.

(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score)
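As a quick check of the note above, the following sketch requests accuracy explicitly through the `scoring` parameter and compares the result with the default. The variable names (`clf`, `default_scores`, `accuracy_scores`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(C=200, max_iter=10000)

# Default: cross_val_score falls back to the estimator's own score method
default_scores = cross_val_score(clf, X, y, cv=5)

# Explicit: ask for accuracy by name
accuracy_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

same_scores = np.allclose(default_scores, accuracy_scores)
print(same_scores)
```

If the two arrays agree, the default score for this classifier is indeed plain accuracy; other metrics can be requested by passing a different `scoring` string.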

In [None]:
scores

Average CV score: the mean over the folds is our estimate of how accurate we expect the model to be on new data.

In [None]:
scores.mean()

And the standard deviation

In [None]:
scores.std()

Having multiple splits of the data also provides some information about how sensitive our model is to the selection of the training dataset. For the iris dataset, we saw accuracies between ~90% and 100%. This is quite a range, and it provides us with an idea about how the model might perform in the worst case and best case scenarios when applied to new data.

The main disadvantage of cross-validation is increased computational cost. Because we now train $k$ models instead of a single one, cross-validation is roughly $k$ times slower than a single split of the data.
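To make the "$k$ models" point concrete, here is a minimal sketch of what `cross_val_score` does for a classifier, written as an explicit loop. It assumes a 5-fold stratified split; the names `model`, `splitter` and `manual_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(C=200, max_iter=10000)
splitter = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in splitter.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # one full training run per fold
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same per-fold accuracies
library_scores = cross_val_score(model, X, y, cv=splitter)
print(len(manual_scores), np.allclose(manual_scores, library_scores))
```

The loop body runs once per fold, so the total cost is about five fits here, which is where the factor of $k$ comes from.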

# Stratified k-Fold Cross-Validation and Other Strategies

## k-fold

Splitting the dataset into $k$ folds by taking the first $1/k$ of the data as the first fold, the next $1/k$ as the second, and so on,
as described in the previous section, is not always a good idea.

How would 3-fold CV work on a dataset whose samples are ordered by class, as in the iris dataset?

In [None]:
iris.target

Note that for classification, `cross_val_score` uses **stratified k-fold cross-validation** by default when `cv` is an integer: the data are split so that the proportions between classes in each fold are the same as in the whole dataset. This is why the earlier results were not as bad as they would have been with standard (unstratified) k-fold CV.

Let us see how bad it could be (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html):

In [None]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=3, 
              shuffle=False)

cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target, 
                cv=kfold #  int, cross-validation generator or an iterable, optional
               )

The accuracy is 0 on every fold! Can you guess why?
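One way to check your guess is to look at which classes end up in the training and test parts of each unshuffled fold. The sketch below is only an inspection aid; `fold_classes` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

fold_classes = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=False).split(X):
    train_classes = set(np.unique(y[train_idx]).tolist())
    test_classes = set(np.unique(y[test_idx]).tolist())
    fold_classes.append((train_classes, test_classes))
    print("train classes:", sorted(train_classes),
          "| test classes:", sorted(test_classes))
```

Because the iris samples are sorted by class (50 of each), every test fold consists of a class that the model never saw during training, so it cannot predict a single test label correctly.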

With 5 folds we see some improvement, but...

In [None]:
kfold = KFold(n_splits=5, 
              shuffle=False)

cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target, 
                cv=kfold #  int, cross-validation generator or an iterable, optional
               )

However, the `KFold` class allows us to shuffle the data before splitting
(https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

In [None]:
kfold = KFold(n_splits=3, 
              shuffle=True, 
              random_state=0)

cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target, 
                cv=kfold 
               )

## Stratified k-fold

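This repeats the first CV computation, but passes the stratified splitter explicitly instead of relying on the `cv=5` shortcut. `StratifiedKFold()` defaults to 5 folds without shuffling, so the scores should match the earlier ones.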

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold()

score = cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target, 
                cv=skf #  int, cross-validation generator or an iterable, optional
               )
score

## Leave-one-out CV

Leave-one-out cross-validation can be seen as $k$-fold cross-validation in which each fold is a single sample: for each split, a single data point is picked as the test set.

This can be very time consuming, particularly for large datasets, but it sometimes provides better estimates on small datasets.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut
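The relation to $k$-fold can be checked directly on the splits themselves, without training any model. This is a small sketch under the assumption that `KFold` is used without shuffling; the helper names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]

loo_splits = list(LeaveOneOut().split(X))
kfold_splits = list(KFold(n_splits=n_samples).split(X))  # one fold per sample

n_loo_splits = len(loo_splits)
all_single_test = all(len(test_idx) == 1 for _, test_idx in loo_splits)
same_as_kfold = all(
    np.array_equal(loo_test, kf_test)
    for (_, loo_test), (_, kf_test) in zip(loo_splits, kfold_splits)
)
print(n_loo_splits, all_single_test, same_as_kfold)
```

There is one split per sample, which is also why the cost grows with the size of the dataset: each split requires a separate fit.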


In [None]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

score = cross_val_score(estimator=log_reg,
                        X=iris.data,
                        y=iris.target,
                        cv=loo,  # int, cross-validation generator or an iterable, optional
                        verbose=True,
                        n_jobs=-1
                        )
score

In [None]:
len(score)

In [None]:
score.mean()

In [None]:
score.std()

In [None]:
plt.plot(score)
plt.show()

## Shuffle-split cross-validation

In shuffle-split cross-validation, each split samples `train_size` points for the training set and `test_size` (disjoint) points for the test set.
This splitting is repeated `n_splits` times.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
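The sketch below illustrates these three parameters on stand-in data (150 numbered samples, an assumption made only to keep the example small). Here `train_size` and `test_size` are given as absolute counts, and they deliberately do not add up to the full dataset.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(150).reshape(-1, 1)  # stand-in data with 150 samples

splitter = ShuffleSplit(n_splits=4, train_size=90, test_size=30, random_state=0)

split_sizes = []
disjoint = []
for train_idx, test_idx in splitter.split(X):
    split_sizes.append((len(train_idx), len(test_idx)))
    # training and test indices of one split never overlap
    disjoint.append(len(np.intersect1d(train_idx, test_idx)) == 0)

print(split_sizes, all(disjoint))
```

Each split uses only 120 of the 150 samples, which shows that shuffle-split can subsample the data. Unlike in k-fold, a sample may appear in the test sets of several splits or in none of them.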

In [None]:
from sklearn.model_selection import ShuffleSplit
ss = ShuffleSplit(n_splits=10, 
                  train_size=.75, 
                  # test_size will be the complement of train_size
                  random_state=1)

score = cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target,  
                cv=ss)
score

In [None]:
score.mean()

## Stratified Shuffle-split cross-validation
There is also a stratified variant of `ShuffleSplit`, aptly named `StratifiedShuffleSplit`, which can provide more reliable results for classification tasks.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit
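To see what stratification changes, the following sketch counts how many samples of each class land in the test set of every split, for both splitters. The helper `class_counts_per_test_set` and the choice of `test_size=30` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

def class_counts_per_test_set(splitter):
    """Return, for each split, the number of test samples per class."""
    return [np.bincount(y[test_idx], minlength=3).tolist()
            for _, test_idx in splitter.split(X, y)]

plain_counts = class_counts_per_test_set(
    ShuffleSplit(n_splits=5, test_size=30, random_state=1))
stratified_counts = class_counts_per_test_set(
    StratifiedShuffleSplit(n_splits=5, test_size=30, random_state=1))

print("ShuffleSplit:          ", plain_counts)
print("StratifiedShuffleSplit:", stratified_counts)
```

With the stratified variant every test set keeps the class balance of the iris data, while the plain variant lets the class counts fluctuate from split to split.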

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(
    n_splits=10, 
    train_size=.75, 
    # test_size will be the complement of train_size
    random_state=1)

score = cross_val_score(estimator=log_reg, 
                X=iris.data, 
                y=iris.target, 
                cv=sss)
score

In [None]:
score.mean()

See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection for other options