# Cross Validation (CV)

 - Holdout cross-validation
 - k-fold cross-validation
 
A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.


In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k "folds":
 - A model is trained using k-1 of the folds as the training data;
 - the resulting model is validated on the remaining part of the data.
 
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
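The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the "model" is a trivial predictor that always returns the mean of its training targets, the function name k_fold_mse is hypothetical, and the toy data is made up.

```python
from statistics import mean

def k_fold_mse(y, k):
    """Average validation MSE of a predict-the-training-mean model over k folds."""
    n = len(y)
    fold_scores = []
    for i in range(k):
        # Fold i is the contiguous slice [start, stop); everything else is training data.
        start, stop = i * n // k, (i + 1) * n // k
        val = y[start:stop]
        train = y[:start] + y[stop:]
        prediction = mean(train)                        # "train" on the other k-1 folds
        mse = mean((v - prediction) ** 2 for v in val)  # validate on the held-out fold
        fold_scores.append(mse)
    # The reported measure is the average over the k folds.
    return mean(fold_scores), fold_scores

y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]  # toy targets, made up for illustration
avg_mse, per_fold = k_fold_mse(y, k=3)
print(per_fold, avg_mse)
```

Every sample is used for validation exactly once, and each model is validated only on data it was not trained on.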

### Holdout Method

 - Split the initial dataset into separate training and test datasets.
 - Training dataset: used for model training.
 - Test dataset: used to estimate the model's generalization performance.

A variation is to split the training set further into two parts: a training set and a validation set.

Training set: for fitting different models.

Validation set: for tuning and comparing different parameter settings, to further improve prediction performance on unseen data, and for final model selection.

This process is called model selection. We want to select the optimal values of tuning parameters.
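A minimal sketch of such a three-way split, working on sample indices only. The function name three_way_split and the 60/20/20 proportions are assumptions chosen for illustration; the notes do not prescribe specific fractions.

```python
import random

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the indices once, then cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(506)
print(len(train_idx), len(val_idx), len(test_idx))
```

Candidate models are fit on the training part and compared on the validation part. The test part is touched only once, for the final estimate.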

### k-fold

 - Randomly split the training dataset into k folds without replacement.
 - k - 1 folds are used for model training.
 - The remaining fold is used for performance evaluation.
 
The procedure is repeated k times.

Final outcomes: k models and k performance estimates.
 - Calculate the average performance of the models across the different, independent folds to obtain a performance estimate that is less sensitive to the sub-partitioning of the training data than the holdout method.
 - k-fold CV is used for model tuning: finding the optimal hyperparameter values that yield satisfactory generalization performance.
 - Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training set and obtain a final performance estimate using the independent test set. The rationale behind fitting a model to the whole training dataset after k-fold CV is that providing more training samples to a learning algorithm usually results in a more accurate and robust model.
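One way to sketch this tune-then-retrain workflow is with scikit-learn's GridSearchCV. The synthetic data from make_regression and the small grid of C values are assumptions chosen for illustration, not settings taken from these notes.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration, not the dataset used in these notes).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out an independent test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 10-fold CV on the training set only, over a small, arbitrary grid of C values.
search = GridSearchCV(SVR(kernel="linear"), param_grid={"C": [0.1, 1, 10]}, cv=10)
search.fit(X_train, y_train)  # refit=True (the default) retrains on all of X_train

best_C = search.best_params_["C"]
test_r2 = search.score(X_test, y_test)  # R^2 of the refit model on the test set
print(best_C, round(test_r2, 3))
```

The test set plays no part in choosing C, so the final score is an estimate on data the tuning procedure never saw.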

 - A common choice is k = 10.
 - For relatively small training sets, increase the number of folds so that more data is used for training in each iteration.
 
### Stratified k-fold cross-validation
 
  - A variation of k-fold in which each fold preserves the class proportions of the full dataset.
  - Can yield better bias and variance estimates, especially in cases of unequal class proportions.

***

## Illustration
### Cross-validation: evaluating estimator performance

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

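# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older release to run as written.
boston = datasets.load_boston()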
boston.data.shape, boston.target.shape

((506, 13), (506,))

We can now quickly sample a training set while holding out 40% of the data for testing our regressor:

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape

X_test.shape, y_test.shape

regression = svm.SVR(kernel='linear', C=1).fit(X_train, y_train)
regression.score(X_test, y_test)

0.6672554157940424

### Computing cross-validated metrics

In [4]:
from sklearn.model_selection import cross_val_score
regression = svm.SVR(kernel='linear', C=1)
scores = cross_val_score(regression, boston.data, boston.target, cv=5)
scores

array([0.77328953, 0.72833447, 0.53795481, 0.15209389, 0.07729196])

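For a regressor such as SVR, the default score is the coefficient of determination R^2, not accuracy. The mean score and a rough 95% interval for it (mean +/- 2 standard deviations) are hence given by:

In [5]:
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

R^2: 0.45 (+/- 0.58)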
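To make the arithmetic behind this summary explicit, here is a standard-library sketch that recomputes it from the five fold scores printed above. With only five scores, "mean +/- 2 standard deviations" is best read as a rough indication of spread rather than a rigorous confidence interval.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross_val_score output above.
scores = [0.77328953, 0.72833447, 0.53795481, 0.15209389, 0.07729196]

center = mean(scores)
spread = 2 * pstdev(scores)  # population std, matching NumPy's default scores.std()
low, high = center - spread, center + spread

print("%0.2f (+/- %0.2f), i.e. roughly [%0.2f, %0.2f]" % (center, spread, low, high))
```

The interval is wide relative to the mean, reflecting how much the score varies from fold to fold.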


By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

In [6]:
from sklearn import metrics
scores = cross_val_score(regression, boston.data, boston.target, cv=5, scoring='neg_mean_squared_error')
scores

array([ -7.82949025, -24.73154773, -37.00390719, -74.37141515,
       -24.53325372])
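The values are negative because scikit-learn scorers follow a "greater is better" convention, so the mean squared error is reported with its sign flipped. A short sketch, using the numbers printed above, of converting them back to MSE and to RMSE in the units of the target:

```python
import math

# Scores copied from the neg_mean_squared_error output above.
neg_mse = [-7.82949025, -24.73154773, -37.00390719, -74.37141515, -24.53325372]

mse = [-s for s in neg_mse]          # flip the sign to recover the MSE per fold
rmse = [math.sqrt(m) for m in mse]   # root of the MSE, in the target's own units
mean_rmse = sum(rmse) / len(rmse)

print(["%.2f" % r for r in rmse], "mean RMSE: %.2f" % mean_rmse)
```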

### K-fold

KFold divides all the samples into k groups of samples, called folds, of equal size (if possible). The prediction function is learned using k-1 folds, and the fold left out is used for testing.

Example of 2-fold CV on a dataset with 4 samples:

In [7]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]
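The index bookkeeping behind an unshuffled split like this one can be sketched in plain Python. The function kfold_indices is a hypothetical helper written for illustration, not part of scikit-learn. It follows the documented rule that the first n_samples % n_splits folds get one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for unshuffled k-fold."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

for train, test in kfold_indices(4, 2):
    print(train, test)
```

For 4 samples and 2 splits this reproduces the train/test indices shown above.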


### Stratified k-fold

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Example of stratified 3-fold CV on a dataset with 10 samples from two slightly unbalanced classes:

In [10]:
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
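To check the "approximately the same percentage" claim, the sketch below counts the classes in each test fold, using the indices exactly as printed above. They are hard-coded because the exact split may differ between scikit-learn versions.

```python
from collections import Counter

y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
test_folds = [[0, 1, 4, 5], [2, 6, 7], [3, 8, 9]]  # test indices as printed above

overall = Counter(y)
per_fold = [Counter(y[i] for i in fold) for fold in test_folds]

for fold, counts in zip(test_folds, per_fold):
    share = counts[1] / len(fold)
    print(fold, dict(counts), "class-1 share: %.2f" % share)
print("overall class-1 share: %.2f" % (overall[1] / len(y)))
```

Every test fold contains both classes, with class 1 at least as frequent as class 0, as in the full set. With only 10 samples the per-fold shares can only approximate the overall proportion.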


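The same holdout evaluation can be run on a pipeline that standardizes the features, projects them onto two principal components, and fits a linear SVR:

In [14]: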
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.pipeline import make_pipeline

pipe_svm = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        svm.SVR(kernel='linear', C=1))
pipe_svm.fit(X_train, y_train)
y_pred = pipe_svm.predict(X_test)
print('Test R^2: %.3f' % pipe_svm.score(X_test, y_test))

Test R^2: 0.391


In [16]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_svm,
                        X=X_train, y=y_train,
                        cv=10, n_jobs=1)
print('CV R^2 scores: %s' % scores)

CV R^2 scores: [0.63971176 0.43579197 0.46977821 0.25027246 0.5124364  0.26221374
 0.30877195 0.54528563 0.37810066 0.47313549]


In [18]:
print('CV R^2: %.3f +/- %.3f' % (np.mean(scores),
                                 np.std(scores)))

CV R^2: 0.428 +/- 0.121
