# Cross-validation

- If we split the data once using train_test_split, the model can only be trained on the portion set aside for training, yet models generally get better as the amount of training data increases. Cross-validation is one way to overcome this limitation.
- Cross-validation is a statistical method used to estimate the skill of machine learning models.
- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
- The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, so k=10 becomes 10-fold cross-validation.
- Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- There are different methods to split data in cross validation. KFold and StratifiedKFold are commonly used.
- k-fold cross-validation is popular because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
- The general procedure is as follows:

    * Shuffle the dataset randomly.
    * Split the dataset into k groups
    * For each unique group:
        * Take the group as a hold out or test data set
        * Take the remaining groups as a training data set
        * Fit a model on the training set and evaluate it on the test set
        * Retain the evaluation score and discard the model
    * Summarize the skill of the model using the sample of model evaluation scores
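
The steps above can be sketched by hand with NumPy alone. This is only an illustration under stated assumptions: the helper name `manual_k_fold_scores`, the synthetic data, the straight-line model, and the choice of mean squared error as the score are all made up for this example and are not part of the original notes.

```python
import numpy as np

def manual_k_fold_scores(x, y, k=5, seed=0):
    """Return one mean-squared-error score per fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))      # shuffle the dataset randomly
    folds = np.array_split(indices, k)     # split the dataset into k groups
    scores = []
    for i, test_idx in enumerate(folds):   # each group is held out once
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit a straight line on the remaining groups only.
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[test_idx] + intercept
        # Retain the evaluation score; the fitted line itself is discarded.
        scores.append(np.mean((y[test_idx] - predictions) ** 2))
    return np.array(scores)

# Synthetic data, invented for illustration: y = 2x + 1 plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

scores = manual_k_fold_scores(x, y, k=5)
print("per-fold MSE:", scores)
# Summarize the skill of the model using the sample of scores.
print("mean MSE:", scores.mean(), "std:", scores.std())
```

In practice the splitting is usually delegated to a library class rather than written by hand; the loop is spelled out here only to make each step visible.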


## K-Fold

- K-Fold CV splits a given data set into K sections (folds), and each fold is used as the testing set exactly once. Let's take 5-fold cross-validation (K=5) as an example. Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train it. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set.
- For Python code, see: https://machinelearningmastery.com/k-fold-cross-validation/
- In k-fold cross-validation we split our data into k different subsets (or folds). In each round we train on k-1 subsets and hold out the remaining subset as test data, so that every fold is held out once. We then average the model's scores across the folds and finalize our model. After that, we test the final model against the test set.
- We average the scores across the folds (remember this!)
- scikit-learn provides a class for this: **sklearn.model_selection.KFold**
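
To make the averaging step concrete, here is a minimal sketch that loops over `KFold` splits, scores a model on each held-out fold, and averages the results. The diabetes dataset, `LinearRegression`, and the values `n_splits=5` and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_index, test_index in kf.split(X):
    model = LinearRegression()  # a fresh model for every fold
    model.fit(X[train_index], y[train_index])
    # For regressors, score() returns the R^2 on the held-out fold.
    fold_scores.append(model.score(X[test_index], y[test_index]))

print("R^2 per fold:", np.round(fold_scores, 3))
print("average R^2:", np.mean(fold_scores))
```

The single averaged number is what gets reported as the cross-validated score; the spread of the per-fold values gives a rough idea of how stable that estimate is.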


In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

target = np.ones(24)
target[-5:] = 0
target

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0.])

In [60]:
df = pd.DataFrame({'col_a':np.random.random(24), 'target':target})
df

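|   | col_a | target |
|---|----------|-----|
| 0 | 0.005092 | 1.0 |
| 1 | 0.176771 | 1.0 |
| 2 | 0.841139 | 1.0 |
| 3 | 0.730846 | 1.0 |
| 4 | 0.880303 | 1.0 |
| 5 | 0.546661 | 1.0 |
| 6 | 0.409237 | 1.0 |
| 7 | 0.004604 | 1.0 |
| 8 | 0.142923 | 1.0 |
| 9 | 0.83248 | 1.0 |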


In [61]:
# Now we split our dataset
X = df.col_a
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

TRAIN: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype='int64') TEST: Int64Index([19, 20, 21, 22, 23], dtype='int64')


In [62]:
# Note on train_test_split above: shuffle defaults to True, so the data is split randomly unless we pass shuffle=False.
# If we want the splits to be reproducible, we also need to pass an integer to the random_state parameter.
# Otherwise, each run of train_test_split assigns different indices to the training and test sets.
# Please note that the numbers seen in the outputs are indices of data points, not the actual values.
from sklearn.model_selection import KFold
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] TEST: [0 1 2 3 4 5]
TRAIN: [ 0  1  2  3  4  5 12 13 14 15 16 17 18 19 20 21 22 23] TEST: [ 6  7  8  9 10 11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 18 19 20 21 22 23] TEST: [12 13 14 15 16 17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20 21 22 23]


In [63]:
# If shuffle is set to True, then the splitting will be random.
kf = KFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [ 0  1  2  4  5  6  7  8  9 10 11 12 15 16 19 21 22 23] TEST: [ 3 13 14 17 18 20]
TRAIN: [ 0  1  3  5  8  9 11 12 13 14 15 16 17 18 20 21 22 23] TEST: [ 2  4  6  7 10 19]
TRAIN: [ 2  3  4  5  6  7  8  9 10 11 12 13 14 17 18 19 20 22] TEST: [ 0  1 15 16 21 23]
TRAIN: [ 0  1  2  3  4  6  7 10 13 14 15 16 17 18 19 20 21 23] TEST: [ 5  8  9 11 12 22]


## StratifiedKFold


In [64]:
# A new, smaller dataset: 12 points of class 1 followed by 4 points of class 0
target = np.ones(16)
target[-4:] = 0
target
df1 = pd.DataFrame({'col_a':np.random.random(16), 'target':target})
df1

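|   | col_a | target |
|---|----------|-----|
| 0 | 0.232485 | 1.0 |
| 1 | 0.678654 | 1.0 |
| 2 | 0.248848 | 1.0 |
| 3 | 0.608552 | 1.0 |
| 4 | 0.570557 | 1.0 |
| 5 | 0.40432 | 1.0 |
| 6 | 0.368747 | 1.0 |
| 7 | 0.714599 | 1.0 |
| 8 | 0.200737 | 1.0 |
| 9 | 0.394102 | 1.0 |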


- StratifiedKFold takes the cross validation one step further. The class distribution in the dataset is preserved in the training and test splits.
- There are 16 data points. 12 of them belong to class 1 and the remaining 4 belong to class 0, so this is an imbalanced class distribution. KFold does not take this into consideration. Therefore, in classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold.
- The ratio of class 0 to class 1 is 1/3. If we set k=4, then each test set includes three data points from class 1 and one data point from class 0. Thus, each training set includes three data points from class 0 and nine data points from class 1.
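
The difference can be checked by counting class-0 points in each test fold. The sketch below rebuilds the same label layout (12 ones followed by 4 zeros); the helper `class0_counts` and the all-zero placeholder feature matrix are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Same layout as the example: 12 points of class 1, then 4 points of class 0.
y = np.array([1] * 12 + [0] * 4)
X = np.zeros((len(y), 1))  # placeholder features; only y matters for the split

def class0_counts(splitter):
    """Number of class-0 points in each test fold (hypothetical helper)."""
    return [int(np.sum(y[test] == 0)) for _, test in splitter.split(X, y)]

print("KFold:          ", class0_counts(KFold(n_splits=4)))            # [0, 0, 0, 4]
print("StratifiedKFold:", class0_counts(StratifiedKFold(n_splits=4)))  # [1, 1, 1, 1]
```

With unshuffled KFold, three of the four test folds contain no class-0 points at all and the fourth contains nothing else, whereas StratifiedKFold places exactly one class-0 point in every test fold.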


In [65]:
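# Now we split our dataset with a plain, unshuffled train_test_split for comparison.
# Note that the resulting test set (indices 12-15) contains only class 0.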
X = df1.col_a
y = df1.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

TRAIN: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64') TEST: Int64Index([12, 13, 14, 15], dtype='int64')


In [66]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y): #  Split happens here on both X and y
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 3  4  5  6  7  8  9 10 11 13 14 15] TEST: [ 0  1  2 12]
TRAIN: [ 0  1  2  6  7  8  9 10 11 12 14 15] TEST: [ 3  4  5 13]
TRAIN: [ 0  1  2  3  4  5  9 10 11 12 13 15] TEST: [ 6  7  8 14]
TRAIN: [ 0  1  2  3  4  5  6  7  8 12 13 14] TEST: [ 9 10 11 15]


In [67]:
# The indices of class 0 are 12, 13, 14, and 15. 
# As we can see, the class distribution of the dataset is preserved in the splits.
# We can also use shuffling with StratifiedKFold:

In [68]:
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 1  2  3  5  7  8  9 10 11 12 14 15] TEST: [ 0  4  6 13]
TRAIN: [ 0  3  4  5  6  7  8  9 10 12 13 15] TEST: [ 1  2 11 14]
TRAIN: [ 0  1  2  3  4  6  8  9 11 12 13 14] TEST: [ 5  7 10 15]
TRAIN: [ 0  1  2  4  5  6  7 10 11 13 14 15] TEST: [ 3  8  9 12]


## cross_val_score
- Evaluates a score by cross-validation
- **cross_val_score** takes the dataset and applies cross-validation to split the data. It then trains a model using the specified estimator (e.g. logistic regression, decision tree, …) on each split and measures the performance of the model according to the scoring parameter.
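
As a small sketch of how the `cv` and `scoring` parameters fit together, the example below passes an explicit `KFold` splitter and a named metric to `cross_val_score`, then summarizes the returned array. The shuffled 5-fold splitter, `random_state=0`, and the `"r2"` metric are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# cv accepts either an integer or a splitter object such as KFold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold is returned as a NumPy array.
scores = cross_val_score(Lasso(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("mean: %.3f  std: %.3f" % (scores.mean(), scores.std()))
```

Passing a splitter object instead of a bare integer makes the shuffling and the random seed explicit, which helps when results need to be reproduced.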

In [69]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()

In [70]:
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))

[0.33150734 0.08022311 0.03531764]


## Leave-One-Out cross-validator
- Provides train/test indices to split data in train/test sets. Each sample is used once as a test set (singleton) while the remaining samples form the training set.
- Note: LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1) where n is the number of samples.
- Due to the high number of test sets (which is the same as the number of samples) this cross-validation method can be very costly. For large datasets one should favor KFold, ShuffleSplit or StratifiedKFold.
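
The equivalence noted above can be verified on a tiny array by comparing the test indices each splitter produces. The five-sample array is made up for this check.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(10).reshape(5, 2)  # 5 made-up samples
n = len(X)

# Collect the test indices produced by each splitter, in order.
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X)]
kfold_tests = [test.tolist() for _, test in KFold(n_splits=n).split(X)]
lpo_tests = [test.tolist() for _, test in LeavePOut(p=1).split(X)]

print(loo_tests)                              # [[0], [1], [2], [3], [4]]
print(loo_tests == kfold_tests == lpo_tests)  # True
```

Each of the n splits requires fitting a model, which is why the cost grows with the number of samples.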


In [71]:
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
loo.get_n_splits(X)

2

In [72]:
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]


## ShuffleSplit

- ShuffleSplit randomly samples the entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since each iteration samples from the entire dataset, values selected during one iteration could be selected again during another iteration.
- It works iteratively.
- It is a random permutation cross-validator.
- Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
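
One way to see the overlap between iterations is to count how often each sample lands in a test set. The sketch below contrasts ShuffleSplit with KFold on six made-up samples; the helper name `count_test_appearances` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(12).reshape(6, 2)  # 6 made-up samples

def count_test_appearances(splitter):
    """How many times each sample index appears in a test set (hypothetical helper)."""
    counts = np.zeros(len(X), dtype=int)
    for _, test in splitter.split(X):
        counts[test] += 1
    return counts

kfold_counts = count_test_appearances(KFold(n_splits=3))
shuffle_counts = count_test_appearances(
    ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
)

print("KFold:       ", kfold_counts)    # every sample is tested exactly once
print("ShuffleSplit:", shuffle_counts)  # counts can differ from sample to sample
```

With KFold every count is 1, while ShuffleSplit draws each test set independently, so some samples may be tested several times and others not at all.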

In [73]:
from sklearn.model_selection import ShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([1, 2, 1, 2, 1, 2])
rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rs

ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)

In [74]:
rs.get_n_splits(X)

5

In [75]:
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]
