### Cross validation

**Cross validation** is a technique for assessing how well a statistical model will generalize to out-of-sample data. It is also used to search for and select model hyperparameters.

There are various **cross validation** techniques:
1. Holdout cross validation.
2. k-fold cross validation.
3. Nested cross validation.

#### Holdout cross validation
**Holdout cross validation** is a procedure where the data is split randomly into a train set and a test set. The train set is used to fit the model and the test set is used to assess its performance on out-of-sample data. A common choice for the holdout method is to use 80% of the data for training and the remaining 20% for testing. Because this technique relies on a single train-test split, the resulting performance estimate can change noticeably depending on which samples happen to land in the test set.
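
To make the dependence on the split concrete, the sketch below repeats the 80/20 holdout split with different random seeds and reports the spread of the test accuracy. It is only an illustration: the 20 seeds and the names `holdout_scores`, `X_tr`, `X_te` are choices made for this example, and the two iris features match the ones used later in this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Repeat the same 80/20 holdout procedure with 20 different seeds
# (the number of seeds is an arbitrary choice for this illustration).
holdout_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y)
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print(f"min = {min(holdout_scores):.3f}, max = {max(holdout_scores):.3f}, "
      f"std = {np.std(holdout_scores):.3f}")
```

Any gap between the minimum and maximum accuracy comes only from which samples ended up in the test set, since the model and the data are otherwise identical.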

#### k-fold cross validation
**k-fold cross validation** is a procedure where the training data is randomly divided into **k** groups (folds) of approximately equal size. The model is then fit **k** times; each time one fold is used as the validation set and the remaining **k - 1** folds are used as the train set. Afterwards the results of the **k** runs are averaged. The choice of **k** affects the performance estimate rather than the model itself: when **k** is high, each train set is nearly the full dataset, so the estimate has low bias but tends to have higher variance (and costs more to compute); when **k** is low, the estimate has higher bias but lower variance. Generally **k = 5 or 10** is chosen. When **k = N** (the number of samples), we get a special case called **leave-one-out cross validation**. It has low bias, but since the **k** train sets are almost identical, the estimate can potentially have high variance. One thing to note is that in a multi-step modeling procedure (for example, feature scaling followed by fitting a classifier), cross validation must be applied to the entire sequence of modeling steps; otherwise information from the validation fold leaks into training.
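
As a minimal sketch of the last point, wrapping the preprocessing and the classifier in one scikit-learn pipeline and passing that pipeline to `cross_val_score` means the scaler is re-fit on the train folds of every split. The choice of 5 shuffled folds and the variable names `pipe`, `cv` and `fold_scores` are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# The whole modeling sequence (scaling + classifier) is one estimator,
# so each fold fits the scaler only on that fold's training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))

# 5 stratified folds; shuffling with a fixed seed keeps the run repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

print(f"Fold accuracies: {fold_scores}")
print(f"Mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

Scaling the full dataset once before splitting would instead let statistics from the validation folds influence training, which is the leakage the paragraph above warns about.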

#### Nested cross validation
When the same **k-fold cross validation** run is used for both model selection and estimation of the test error, the hyperparameters are tuned to the very folds that are used for evaluation, so the reported error is overly optimistic. This problem is usually solved by nested cross validation. **Nested cross validation** has an inner loop nested in an outer loop. The inner loop is responsible for model selection/hyperparameter tuning (it plays the role of a validation set), while the outer loop is responsible for error estimation (it plays the role of a test set).
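
One common way to sketch nested cross validation in scikit-learn is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The grid of `C` values, the 5x5 fold layout and the names `inner_cv`, `outer_cv` and `nested_scores` below are illustrative assumptions, not settings prescribed by this notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

pipe = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

# Inner loop: picks C using only the training part of each outer split.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
                  scoring='accuracy', cv=inner_cv)

# Outer loop: scores the tuned model on folds the search never saw.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV accuracy = {nested_scores.mean():.3f} "
      f"+/- {nested_scores.std():.3f}")
```

Each outer fold may end up with a different best `C`; the outer scores estimate the performance of the whole tuning procedure rather than of one fixed hyperparameter value.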

Below we will demonstrate the use of the above methods.

In [1]:
# Let's apply the holdout method
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#Make the sheet width 100%
from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>" ))

In [2]:
# Plot decision regions - from Python Machine Learning - Sebastian Raschka & Vahid Mirjalili
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx], 
                    label=cl, 
                    edgecolor='black')
    # highlight test examples
    if test_idx is not None:
        # select the test examples so they can be circled
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='none',  # hollow markers; an empty string is rejected by recent matplotlib
                    edgecolor='black',
                    alpha=1.0,
                    linewidth=1,
                    marker='o',
                    s=100, 
                    label='test set')

In [3]:
#import iris dataset
from sklearn import datasets

iris = datasets.load_iris()
#load only two features
X = iris.data[:,[2,3]]
y = iris.target

from sklearn.model_selection import train_test_split
X_train_v, X_test, y_train_v, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)


# Now let's split the train set again to get a validation set
# (25% of the remaining 80% is 20% of the full data)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_v, y_train_v, test_size=0.25, random_state=1, stratify=y_train_v)

print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_test:', np.bincount(y_test))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_validation:', np.bincount(y_validation))

Labels counts in y: [50 50 50]
Labels counts in y_test: [10 10 10]
Labels counts in y_train: [30 30 30]
Labels counts in y_validation: [10 10 10]


In [4]:
# Now we will train the model on the train set,
# tune the regularization parameter C on the validation set,
# and then measure out-of-sample performance on the test set

# Let's first build a pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
y_train_pred = pipe_lr.predict(X_train)
print(f"Train accuracy = {pipe_lr.score(X_train, y_train):0.2}")
print(f"Validation accuracy = {pipe_lr.score(X_validation, y_validation):0.2}")
print(f"Test accuracy = {pipe_lr.score(X_test, y_test):0.2}")


param_range = [1e-4, 1e-3, 1e-2, 1e-1, 1, 2, 5, 10, 100, 1000, 1e5]
accuracy = dict()
for C in param_range:
    pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state=1, solver='lbfgs', C=C))
    pipe_lr.fit(X_train,y_train)
    accuracy[C] = pipe_lr.score(X_validation, y_validation)
    print(f"For C={C} accuracy={accuracy[C]}")
    
# Best validation accuracy is for C=1 (tied with C=2)
pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state=1, solver='lbfgs', C=1))
pipe_lr.fit(X_train, y_train)
y_train_pred = pipe_lr.predict(X_train)
print(f"Test accuracy = {pipe_lr.score(X_test, y_test):0.2}")


Train accuracy = 0.97
Validation accuracy = 0.93
Test accuracy = 0.97
For C=0.0001 accuracy=0.6666666666666666
For C=0.001 accuracy=0.6666666666666666
For C=0.01 accuracy=0.8
For C=0.1 accuracy=0.9
For C=1 accuracy=0.9333333333333333
For C=2 accuracy=0.9333333333333333
For C=5 accuracy=0.9
For C=10 accuracy=0.9
For C=100 accuracy=0.9
For C=1000 accuracy=0.9
For C=100000.0 accuracy=0.9
Test accuracy = 0.97


In [5]:
# Now let's repeat the process using k-fold cross validation
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10).split(X_train_v, y_train_v)
scores = []
pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state=1, solver='lbfgs', C=1))
for k, (train , test) in enumerate(kfold):
    pipe_lr.fit(X_train_v[train], y_train_v[train])
    score = pipe_lr.score(X_train_v[test], y_train_v[test])
    scores.append(score)
    print(f'Fold: {k}, Acc : {score :.3}')
print(f'Mean CV accuracy = {np.mean(scores):.2}, std = {np.std(scores):.2}')

Fold: 0, Acc : 0.917
Fold: 1, Acc : 0.917
Fold: 2, Acc : 1.0
Fold: 3, Acc : 0.917
Fold: 4, Acc : 1.0
Fold: 5, Acc : 0.917
Fold: 6, Acc : 0.917
Fold: 7, Acc : 1.0
Fold: 8, Acc : 1.0
Fold: 9, Acc : 0.917
Mean CV accuracy = 0.95, std = 0.041


In [6]:
# Now let's use GridSearchCV to tune the hyperparameter C
from sklearn.model_selection import GridSearchCV
pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state=1, solver='lbfgs'))
param_range = [1e-4, 1e-3, 1e-2, 1e-1, 1, 2, 5, 10, 100, 1000, 1e5]
param_grid = [{'logisticregression__C': param_range}]
gs = GridSearchCV(estimator=pipe_lr, param_grid=param_grid, scoring='accuracy', cv=10, refit=True)
gs = gs.fit(X_train_v, y_train_v)
print(f"Best score = {gs.best_score_:.3}")
print(gs.best_params_)

Best score = 0.958
{'logisticregression__C': 5}


In [7]:
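# With refit=True, GridSearchCV has already refit the best pipeline
# on all of X_train_v, so no extra call to fit is needed here
clf = gs.best_estimator_
print(f"Test accuracy {clf.score(X_test, y_test):.2}")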

Test accuracy 0.97


The best model was selected with C = 5.