# Cross Validation

* Holdout Cross-Validation
* k-fold Cross-Validation

A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing cross-validation (CV).

In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following steps are repeated for each of the k “folds”:
* A model is trained using k-1 of the folds as training data;
* the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. 
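The loop described above can be sketched by hand. This is a minimal illustration, not the notebook's own code: the synthetic data from `make_regression`, the `LinearRegression` model and the variable names are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption, used for illustration only).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds ...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ... and validate on the remaining fold (score() is R^2 for a regressor).
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The reported CV performance is the average over the k folds.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```

Each sample is used for validation exactly once and for training k-1 times.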

## Holdout Method

* Split the initial dataset into separate training and test datasets
* Training dataset: used for model training
* Test dataset: used to estimate the model's generalisation performance

![img](holdout_method.PNG)
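A minimal sketch of the holdout method, assuming synthetic classification data and a logistic regression model (both are placeholders picked for the example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 25% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit on the training set only, then estimate generalisation on the test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```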

## Holdout Method with Validation

A variation is to split the training data further into two parts: a training set and a validation set.

Training set: used for fitting the different models.

Validation set: used for tuning and comparing different parameter settings, so that the chosen model makes better predictions on unseen data.

This process is called model selection: we want to select the optimal values of the tuning parameters (also called hyperparameters).
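The three-way split can be sketched as follows. The synthetic data, the candidate values of `C` and the split sizes are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (an assumption, used for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Split off the test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=60, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=60, random_state=0)

# Model selection: compare candidate hyperparameter values on the validation set only.
candidate_C = [0.01, 0.1, 1, 10]
val_scores = {
    C: SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_val, y_val)
    for C in candidate_C
}
best_C = max(val_scores, key=val_scores.get)

# The test set is touched once, for the final estimate.
final_model = SVC(kernel="linear", C=best_C).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_C, test_accuracy)
```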


<img style="float: left;" src="holdout_method_w_validation.PNG" height="75%" width="75%">

## K-fold Cross-Validation

* Randomly split the training dataset into k folds without replacement.

* k - 1 folds are used for model training.

* The remaining fold is used for performance evaluation.

This procedure is repeated k times, so that each fold is used for evaluation exactly once.

Final outcome: k models and k performance estimates.

* Calculate the average performance of the models over the different, independent folds to obtain a performance estimate that is less sensitive to how the training data happens to be partitioned than the holdout method.

* k-fold cross-validation is typically used for model tuning, that is, finding the hyperparameter values that yield a satisfying generalization performance.

* Once we have found satisfactory hyperparameter values, we can retrain the model on the complete training set and obtain a final performance estimate using the independent test set. The rationale behind fitting a model to the whole training dataset after k-fold cross-validation is that providing more training samples to a learning algorithm usually results in a more accurate and robust model.


* A common choice for k is 10.

* For relatively small training sets, increase the number of folds. 
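One way to sketch the tune, retrain and test workflow above is with scikit-learn's `GridSearchCV`, which by default (`refit=True`) refits the best configuration on the complete training set. The synthetic data and the grid of `C` values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data (an assumption, used for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, random_state=0)

# 10-fold CV on the training set for each candidate value of C.
search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=10,
)
search.fit(X_train, y_train)  # afterwards refits the best model on all of X_train

best_C = search.best_params_["C"]
# Final performance estimate on the independent test set.
test_accuracy = search.score(X_test, y_test)
print(best_C, search.best_score_, test_accuracy)
```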

<img style="float: left;" src="cross_validation.PNG" height="75%" width="75%">

## Stratified k-fold cross-validation



* A variation of k-fold in which the class proportions are preserved in each fold
* Can yield better bias and variance estimates, especially in cases of unequal class proportions
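To see why stratification matters with unequal class proportions, the sketch below compares the share of the minority class in each test fold for plain and stratified splitting. The toy labels (90 samples of class 0 followed by 10 of class 1) and the helper `minority_fraction_per_fold` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced, sorted labels: 90 samples of class 0 followed by 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the feature values are irrelevant for the split itself


def minority_fraction_per_fold(splitter):
    """Fraction of class-1 samples in each test fold (hypothetical helper)."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


plain = minority_fraction_per_fold(KFold(n_splits=5))
stratified = minority_fraction_per_fold(StratifiedKFold(n_splits=5))
print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, the plain split puts every minority sample into the last fold, while the stratified split keeps the 10% share in each fold.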

## Illustration

### Cross-validation: evaluating estimator performance

Adapted from the scikit-learn user guide section "Cross-validation: evaluating estimator performance".

**Overfitting**:
* It is a mistake to evaluate a machine learning model on the same data it was trained on
* A model that simply memorises the training samples overfits
* It will get a high score on that data
* Yet it can be utterly useless on unseen data
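A small sketch of this effect, assuming noisy synthetic data and an unpruned decision tree (chosen here only because it can memorise its training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y randomly reassigns a share of the labels, so the data is noisy (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=0)

# An unpruned tree keeps splitting until it reproduces every training label.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)  # score on data the model has seen
test_accuracy = tree.score(X_test, y_test)     # score on held-out data
print(train_accuracy, test_accuracy)
```

The score on the training data says nothing about generalisation; only the held-out score does.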

**Note**:

* Hyperparameters for estimators, such as the `C` for SVM, must be set manually
* There is still a risk of overfitting on the test set because one can continually tweak the parameters until the estimator performs well on it
* To avoid this, another part of the dataset should be held out as “validation set”
    1. Training proceeds on the training set
    2. Evaluation is done on the validation set
    3. Final evaluation can be done on the test set.

* This raises another issue: splitting the data into three sets drastically reduces the number of samples that can be used for learning the model
* To get around this, we utilise a procedure called cross-validation (CV). 
* A test set should still be held out for final evaluation
* The validation set is no longer needed when doing CV. 
* In k-fold CV, the training set is split into k smaller sets
* The following procedure is followed for each of the k “folds”:
    * A model is trained using k-1 of the folds as training data
    * The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
* The performance measure reported by k-fold cross-validation is then the average of the values computed over the k folds.
* Can be computationally expensive
* Does not waste too much data 


**Best Practice**:
* Hold out part of the available data as a **test set** `X_test`, `y_test`. 

In **scikit-learn** a random split into training and test sets can be quickly computed with the `train_test_split` helper function. 



In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

boston = datasets.load_boston()  # deprecated in scikit-learn 1.0 and removed in 1.2
boston.data.shape, boston.target.shape


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

((506, 13), (506,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our regressor:

In [2]:
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, 
                                                    test_size=0.4, random_state=0)
regression = svm.SVR(kernel='linear', C=1).fit(X_train, y_train)
print(regression.score(X_test, y_test))

0.6674306943408028


### Computing cross-validated metrics

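The simplest way to use cross-validation is to call the `cross_val_score` helper on the estimator and the dataset. Here we use 5-fold cross-validation: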

In [3]:
from sklearn.model_selection import cross_val_score
regression = svm.SVR(kernel='linear', C=1)
scores = cross_val_score(regression, boston.data, boston.target, cv=5)
scores        

array([0.77279739, 0.72778206, 0.56131914, 0.15056404, 0.08212111])

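The mean score and a rough 95% interval (the mean plus or minus two standard deviations) are given below. For a regressor such as `SVR` the default score is the coefficient of determination R², so the "Accuracy" label in the printout is a misnomer: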

In [4]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.46 (+/- 0.58)


By default, the score computed at each CV iteration is the one returned by the estimator's `score` method. It is possible to change this by using the `scoring` parameter:

In [5]:
from sklearn import metrics
scores = cross_val_score(
    regression, boston.data, boston.target, cv=5, scoring='neg_mean_squared_error')
scores 

array([ -7.84648637, -24.78183773, -35.13272326, -74.50560108,
       -24.40485462])

See [The scoring parameter: defining model evaluation rules](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for details. Scorers whose names start with `neg_` return the negated metric so that higher values are always better; the values above are therefore negative mean squared errors.
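Scores from `neg_mean_squared_error` are easier to read once converted back to an error in the units of the target. A minimal sketch, assuming synthetic data and a linear model in place of the dataset used above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption, used for illustration only).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")

mse = -neg_mse        # flip the sign to recover the mean squared error per fold
rmse = np.sqrt(mse)   # root mean squared error, in the units of y
print(rmse, rmse.mean())
```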

When the `cv` argument is an integer, `cross_val_score` uses the `KFold` or `StratifiedKFold` strategy by default; the latter is used if the estimator derives from `ClassifierMixin`.
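Instead of an integer, `cv` also accepts a splitter object, which gives control over options such as shuffling. The sketch below, on assumed synthetic data, checks that `cv=5` matches an explicit unshuffled `KFold` for a regressor, and then passes a shuffled splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic regression data (an assumption, used for illustration only).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = SVR(kernel="linear", C=1)

# For a regressor, an integer cv means KFold without shuffling.
scores_int = cross_val_score(model, X, y, cv=5)
scores_kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5))
same_as_kfold = np.allclose(scores_int, scores_kfold)

# A shuffled splitter can help when the rows of the dataset are ordered.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores_shuffled = cross_val_score(model, X, y, cv=shuffled_cv)
print(same_as_kfold, scores_shuffled)
```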

### K-fold

`KFold` divides all the samples in k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). 

The prediction function is learned using k - 1 folds, and the fold left out is used for testing.

Example of 2-fold cross-validation on a dataset with 4 samples:

In [6]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]
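The remark that k = n corresponds to the Leave One Out strategy can be checked directly. A small sketch reusing the same four-sample list:

```python
from sklearn.model_selection import KFold, LeaveOneOut

X = ["a", "b", "c", "d"]

# k-fold with as many folds as samples: every test fold holds one sample.
kfold_splits = [
    (train.tolist(), test.tolist())
    for train, test in KFold(n_splits=len(X)).split(X)
]
loo_splits = [
    (train.tolist(), test.tolist())
    for train, test in LeaveOneOut().split(X)
]

same = kfold_splits == loo_splits
print(same)
for train, test in loo_splits:
    print(train, test)
```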


### Stratified k-fold

`StratifiedKFold` is a variation of k-fold that returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Example of stratified 3-fold cross-validation on a dataset with 10 samples from two slightly unbalanced classes:

In [7]:
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]


## Pipeline

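The pipeline below chains three steps:

* `StandardScaler`
* `PCA`
* `svm.SVR`

Because the final step is a regressor, the values labelled "accuracy" in the outputs below are R² scores.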

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
#from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.pipeline import make_pipeline
#pipe_lr = make_pipeline(StandardScaler(),
#                        PCA(n_components=2),
#                        LogisticRegression(random_state=1))
pipe_svm = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        svm.SVR(kernel='linear', C=1))
pipe_svm.fit(X_train, y_train)
y_pred = pipe_svm.predict(X_test)
print('Test Accuracy: %.3f' % pipe_svm.score(X_test, y_test))

Test Accuracy: 0.391


In [9]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_svm,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV accuracy scores: %s' % scores)


CV accuracy scores: [0.63971176 0.43579197 0.46977821 0.25027246 0.5124364  0.26221374
 0.30877195 0.54528563 0.37810066 0.47313549]


In [10]:
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
                                      np.std(scores)))

CV accuracy: 0.428 +/- 0.121
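Passing the whole pipeline to `cross_val_score` means that the scaler and PCA are refitted on the training part of every fold, so no information from the validation fold leaks into the preprocessing. The sketch below reproduces this by hand on assumed synthetic data and compares the result with `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (an assumption, used for illustration only).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                     SVR(kernel="linear", C=1))
cv = KFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in cv.split(X):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for every fold
    # Scaler, PCA and SVR are all fitted on the training folds only.
    fold_pipe.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

library_scores = cross_val_score(pipe, X, y, cv=cv)
matches = np.allclose(manual_scores, library_scores)
print(matches)
```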


***