In [None]:
pip install watermark

In [7]:
%load_ext watermark
%watermark -a 'ishan dahal' -u -d -v -p numpy,mlxtend,matplotlib,sklearn

ishan dahal 
last updated: 2020-11-23 

CPython 3.6.9
IPython 5.5.0

numpy 1.18.5
mlxtend 0.14.0
matplotlib 3.2.2
sklearn 0.0


In [8]:
import numpy as np
import matplotlib.pyplot as plt

### K-fold Cross-Validation in Scikit-Learn
- Simple demonstration of using cross-validation iterator in scikit-learn

In [13]:
from sklearn.model_selection import KFold

rng = np.random.RandomState(123)

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X = rng.random_sample((y.shape[0], 4))

cv = KFold(n_splits=5)

for k in cv.split(X, y):
    print(k)

(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1]))
(array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3]))
(array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5]))
(array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7]))
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))


In practice we shuffle the dataset before splitting. If the class labels are ordered, as they are here, unshuffled folds can contain only a single class, so some classes end up poorly represented in the training and validation folds.
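
To make the problem concrete, the following minimal sketch rebuilds the sorted toy labels from above and collects the validation labels of each unshuffled fold. The zero-filled feature matrix is a placeholder introduced here, since only the indices matter for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Same sorted toy labels as in the cell above
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# Placeholder features (assumption): values are irrelevant for index splitting
X = np.zeros((y.shape[0], 4))

cv = KFold(n_splits=5)  # no shuffling

# Collect the validation labels of every fold
valid_labels = [y[valid_idx].tolist() for _, valid_idx in cv.split(X, y)]
for labels in valid_labels:
    print(f"validation labels {labels}")
```

Four of the five validation folds contain a single class, which is what shuffling (and, later, stratification) is meant to avoid.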

In [16]:
cv = KFold(n_splits=5, random_state=123, shuffle=True)

for k in cv.split(X, y):
    print(k)

(array([1, 2, 3, 5, 6, 7, 8, 9]), array([0, 4]))
(array([0, 1, 2, 3, 4, 6, 8, 9]), array([5, 7]))
(array([0, 1, 2, 4, 5, 6, 7, 9]), array([3, 8]))
(array([0, 2, 3, 4, 5, 7, 8, 9]), array([1, 6]))
(array([0, 1, 3, 4, 5, 6, 7, 8]), array([2, 9]))


The ```KFold``` iterator only provides indices; in practice we use them to select the corresponding feature values and labels.

In [21]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)

for train_idx, valid_idx in cv.split(X, y):
    print(f"shuffled validation labels {y[valid_idx]}")

shuffled validation labels [0 0]
shuffled validation labels [1 1]
shuffled validation labels [0 1]
shuffled validation labels [0 1]
shuffled validation labels [0 1]


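Especially for small datasets, it is important to stratify the splits. Stratification keeps the class proportions in each fold approximately the same as in the full dataset, whereas shuffling alone can still produce folds that contain only one class.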

In [24]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, random_state=123, shuffle=True)

for train_idx, valid_idx in cv.split(X, y):
    print(f"training labels {y[train_idx]} valid labels {y[valid_idx]}")

training labels [0 0 0 0 1 1 1 1] valid labels [0 1]
training labels [0 0 0 0 1 1 1 1] valid labels [0 1]
training labels [0 0 0 0 1 1 1 1] valid labels [0 1]
training labels [0 0 0 0 1 1 1 1] valid labels [0 1]
training labels [0 0 0 0 1 1 1 1] valid labels [0 1]
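
One way to check the balance programmatically is to count the classes in each validation fold. This is a small sketch on the same toy labels; `np.bincount` is used here only as a convenient counter, and the zero-filled features are a placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# Placeholder features (assumption): only the labels drive the stratification
X = np.zeros((y.shape[0], 4))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Per-fold class counts: one entry [n_class_0, n_class_1] per validation fold
fold_counts = [np.bincount(y[valid_idx], minlength=2).tolist()
               for _, valid_idx in cv.split(X, y)]
print(fold_counts)
```

With five samples per class and five splits, every validation fold receives one sample of each class.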


- Using the cross validation iterators to fit and evaluate learning algorithms 

In [35]:
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split

X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.15,
                                                    shuffle=True, stratify=y)
cv = StratifiedKFold(n_splits=10, random_state=123, shuffle=True)

kfold_acc = 0.
for train_idx, valid_idx in cv.split(X_train, y_train):
    clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train[train_idx],
                                                                    y_train[train_idx])
    y_pred = clf.predict(X_train[valid_idx])
    acc = np.mean(y_pred == y_train[valid_idx])*100
    kfold_acc += acc
kfold_acc /= cv.get_n_splits()  # average over the number of folds

clf = DecisionTreeClassifier(random_state=123, max_depth=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_acc = np.mean(y_pred == y_test)*100

print(f"Kfold Accuracy: {kfold_acc:.2f}%")
print(f"Test Accuracy: {test_acc:.2f}%")

Kfold Accuracy: 95.26%
Test Accuracy: 95.65%
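
Averaging hides how much the accuracy varies between folds. The sketch below keeps the per-fold accuracies in a list so that the spread can be reported as well. It assumes scikit-learn's `load_iris` as a stand-in for `mlxtend.data.iris_data`, so the numbers need not match the output above exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris replaces mlxtend's iris_data for self-containment
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123, shuffle=True, stratify=y)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

fold_accs = []
for train_idx, valid_idx in cv.split(X_train, y_train):
    clf = DecisionTreeClassifier(random_state=123, max_depth=3)
    clf.fit(X_train[train_idx], y_train[train_idx])
    # score() returns the fraction of correct predictions on the fold
    fold_accs.append(clf.score(X_train[valid_idx], y_train[valid_idx]) * 100)

fold_accs = np.array(fold_accs)
print(f"Kfold Accuracy: {fold_accs.mean():.2f}% +/- {fold_accs.std():.2f}%")
```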


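- ```cross_val_score``` is a more convenient way to do the above. For classifiers, an integer ```cv``` uses ```StratifiedKFold``` without shuffling by default, so the result differs slightly from the shuffled folds used above.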


In [49]:
from sklearn.model_selection import cross_val_score

cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)
print(f"Kfold Accuracy: {np.mean(cv_acc)*100:.2f}")

Kfold Accuracy: 96.09
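
The default behaviour can be checked by comparing an integer ```cv``` against an explicit, unshuffled ```StratifiedKFold```. This is a minimal sketch that assumes `load_iris` as the data source and runs on the full dataset rather than the training split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's iris data
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=123, max_depth=3)

# Integer cv: for a classifier this resolves to StratifiedKFold without shuffling
scores_int = cross_val_score(tree, X, y, cv=10)
# Explicit equivalent
scores_skf = cross_val_score(tree, X, y, cv=StratifiedKFold(n_splits=10))

print(np.allclose(scores_int, scores_skf))
```

Because the tree's `random_state` is fixed and the folds are identical, the two score arrays should agree fold by fold.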


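- We can also pass our own cross-validation iterator. With the same shuffled ```StratifiedKFold``` as in the manual loop, the result matches the 95.26% obtained there.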

In [50]:
from sklearn.model_selection import cross_val_score

cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=StratifiedKFold(n_splits=10, random_state=123, shuffle=True),
                         n_jobs=-1)
print(f"Kfold Accuracy: {np.mean(cv_acc)*100:.2f}")

Kfold Accuracy: 95.26


### Bootstrap
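Bootstrap out-of-bag sampling can be used analogously to ```KFold```: each round trains on a bootstrap sample drawn with replacement and evaluates on the samples that were not drawn.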

In [47]:
from mlxtend.evaluate import BootstrapOutOfBag

oob = BootstrapOutOfBag(n_splits=5, random_seed=99)
for train, test in oob.split(np.array([1, 2, 3, 4, 5])):
    print(train, test)

[1 3 1 0 1] [2 4]
[0 2 4 4 1] [3]
[3 1 1 0 3] [2 4]
[3 4 2 0 4] [1]
[0 0 4 1 3] [2]
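
The idea behind the out-of-bag split can be sketched in plain NumPy. The helper `oob_splits` below is hypothetical and written for illustration; it is not mlxtend's implementation and will not reproduce the exact indices printed above.

```python
import numpy as np

def oob_splits(n_samples, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs for out-of-bag bootstrap rounds."""
    rng = np.random.RandomState(seed)
    all_idx = np.arange(n_samples)
    for _ in range(n_splits):
        # Bootstrap sample: draw n_samples indices with replacement
        train_idx = rng.choice(all_idx, size=n_samples, replace=True)
        # Out-of-bag set: every index that was never drawn
        test_idx = np.setdiff1d(all_idx, train_idx)
        yield train_idx, test_idx

for train_idx, test_idx in oob_splits(n_samples=5, n_splits=3, seed=99):
    print(train_idx, test_idx)
```

Training indices may repeat, and the size of the test set varies from round to round. For very small datasets a round can even draw every index, which leaves the test set empty.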


Analogous to the ```KFold``` iterator, we can pass ```BootstrapOutOfBag``` to ```cross_val_score```.

In [52]:
cv_acc = cross_val_score(estimator=DecisionTreeClassifier(random_state=99, max_depth=3),
                         X=X_train,
                         y=y_train,
                         cv=BootstrapOutOfBag(n_splits=5, random_seed=99),
                         n_jobs=-1)
print(f"OOB Bootstrap Accuracy: {np.mean(cv_acc)*100:.2f}")

OOB Bootstrap Accuracy: 96.75


Similar to ```cross_val_score```, we can use ```bootstrap_point632_score```, which implements the .632 bootstrap method. It is less pessimistically biased than the plain out-of-bag bootstrap.

In [55]:
from mlxtend.evaluate import bootstrap_point632_score

cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  random_seed=99)
print(f".632 Bootstrap Accuracy: {np.mean(cv_acc)*100:.2f}")

.632 Bootstrap Accuracy: 96.15
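
The .632 estimate weights the out-of-bag accuracy by 0.632 and the training (resubstitution) accuracy by 0.368, where 0.632 is approximately 1 - 1/e, the expected fraction of distinct samples in a bootstrap sample. The following is a rough sketch of that weighting with NumPy and scikit-learn. It assumes `load_iris` as the data source and 50 rounds, and it is not claimed to mirror mlxtend's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's iris data
X, y = load_iris(return_X_y=True)
n = X.shape[0]
rng = np.random.RandomState(99)

train_accs, oob_accs, scores = [], [], []
for _ in range(50):  # number of bootstrap rounds (chosen for illustration)
    train_idx = rng.choice(n, size=n, replace=True)      # bootstrap sample
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # out-of-bag samples
    clf = DecisionTreeClassifier(random_state=123, max_depth=3)
    clf.fit(X[train_idx], y[train_idx])
    train_acc = clf.score(X[train_idx], y[train_idx])    # optimistic
    oob_acc = clf.score(X[test_idx], y[test_idx])        # pessimistic
    train_accs.append(train_acc)
    oob_accs.append(oob_acc)
    # .632 weighting of the two estimates
    scores.append(0.632 * oob_acc + 0.368 * train_acc)

print(f".632 Bootstrap Accuracy (sketch): {np.mean(scores)*100:.2f}")
```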


By default, ```bootstrap_point632_score``` uses ```method='.632'```.
Setting ```method='.632+'``` performs the .632+ bootstrap, which corrects the optimistic bias that the .632 estimate can show when the model overfits the training data.

In [56]:
cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  method='.632+',
                                  random_seed=99)
print(f".632+ Bootstrap Accuracy: {np.mean(cv_acc)*100:.2f}")

.632+ Bootstrap Accuracy: 96.02
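
The .632+ correction replaces the fixed 0.632 weight with one that grows toward 1 as the relative overfitting rate increases. The sketch below follows the usual textbook formulation in terms of error rates (accuracy = 1 - error). The helper names `no_information_rate` and `point632_plus_error` are hypothetical, and the code is not taken from mlxtend.

```python
import numpy as np

def no_information_rate(y_true, y_pred):
    """Error rate expected if predictions were independent of the labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    p = np.array([np.mean(y_true == c) for c in classes])  # label frequencies
    q = np.array([np.mean(y_pred == c) for c in classes])  # prediction frequencies
    return float(np.sum(p * (1.0 - q)))

def point632_plus_error(err_train, err_oob, gamma):
    """Combine training and out-of-bag error with the .632+ weight."""
    err_oob = min(err_oob, gamma)  # cap the OOB error at the no-information rate
    if err_oob > err_train and gamma > err_train:
        # Relative overfitting rate in [0, 1]
        r = (err_oob - err_train) / (gamma - err_train)
    else:
        r = 0.0
    weight = 0.632 / (1.0 - 0.368 * r)  # 0.632 if r == 0, 1.0 if r == 1
    return (1.0 - weight) * err_train + weight * err_oob

gamma = no_information_rate([0, 0, 1, 1], [0, 1, 0, 1])
print(gamma, point632_plus_error(err_train=0.0, err_oob=0.1, gamma=gamma))
```

With no overfitting the weight stays at 0.632 and the result equals the .632 estimate. With maximal overfitting the weight reaches 1 and the estimate falls back to the out-of-bag error.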


We can run the plain out-of-bag bootstrap by setting ```method='oob'```.

In [57]:
cv_acc = bootstrap_point632_score(estimator=DecisionTreeClassifier(random_state=123, max_depth=3),
                                  X=X_train,
                                  y=y_train,
                                  method='oob',
                                  random_seed=99)
print(f"OOB Bootstrap Accuracy: {np.mean(cv_acc)*100:.2f}")

OOB Bootstrap Accuracy: 94.77
