## Cross Validation

Cross-validation is a technique for estimating how well an ML model performs on unseen data. It varies which parts of the data set are used for training and for testing, so that the model is evaluated on more of the data and the resulting performance estimate is more reliable.

### Types of Cross Validation
- Hold Out Method
- Leave One Out Cross-Validation
- K-Fold Cross Validation
- Stratified k-Fold Cross Validation

Let us now look at them one by one.

## Hold Out Method

The simplest cross-validation method: the data is split into two sets, a training set and a test set. The split is based on a percentage, e.g. 70-30, 60-40, 75-25, or even 50-50, depending on the use case.

The implementation below uses `train_test_split` with two parameters:

- `test_size` -> fraction of the data to be used as test data. Here we use 30% as test data and 70% as training data
- `random_state` -> controls the shuffling applied to the data before the split; a fixed value makes the split reproducible

In [4]:
from sklearn.model_selection import train_test_split
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)
print('Train:', X_train, 'Test:', X_test)

Train: [50, 10, 40, 20, 80, 90, 60] Test: [30, 100, 70]
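In practice the features and the labels are usually split together. The following is a minimal sketch of that, assuming hypothetical labels `y` (1 when the value is above 50) that are not part of the original example. It also illustrates that reusing the same `random_state` reproduces the same split.

```python
from sklearn.model_selection import train_test_split

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Hypothetical labels for illustration only: 1 if the value is above 50, else 0.
y = [int(value > 50) for value in X]

# Passing X and y together keeps each sample paired with its label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Repeating the call with the same random_state gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print("Train:", X_train, "Labels:", y_train)
print("Test:", X_test, "Labels:", y_test)
print("Same test set on rerun:", X_test == X_test_again)
```

With 10 samples and `test_size=0.3`, the test set holds 3 samples and the training set holds the other 7.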


## Leave One Out Cross-Validation

Similar to the Hold Out Method, except that a single sample is used as the test data and all remaining samples as training data. The process is repeated over the entire data set so that each sample acts as the test data exactly once, which means the model is fitted n times for n samples.

As seen in the output below, each test set contains only a single sample.


In [5]:
from sklearn.model_selection import LeaveOneOut
X = [10,20,30,40,50,60,70,80,90,100]
l = LeaveOneOut()
for train, test in l.split(X):
    print("%s %s"% (train,test))

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
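The splits above only yield indices. As a rough sketch of how they might be used to evaluate a model, the example below fits a `LinearRegression` once per split and averages the error on the held-out sample. The target `y` is synthetic (an exact line, `2x + 1`) and is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

X = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float).reshape(-1, 1)
# Synthetic target, assumed for illustration: an exact linear relationship.
y = 2.0 * X.ravel() + 1.0

loo = LeaveOneOut()
abs_errors = []
for train_idx, test_idx in loo.split(X):
    # Fit on the 9 training samples, then predict the single held-out sample.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    prediction = model.predict(X[test_idx])[0]
    abs_errors.append(abs(prediction - y[test_idx][0]))

loo_mae = float(np.mean(abs_errors))
print("Number of model fits:", len(abs_errors))
print("Leave-one-out mean absolute error:", loo_mae)
```

Because the synthetic data lies exactly on a line, the error is essentially zero; with real data the same loop gives a nearly unbiased but computationally expensive error estimate.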


## K-Fold Cross Validation

The data is divided into k subsets of nearly equal size, called folds. In the first iteration, the first fold is held out as the test set and the model is trained on the remaining k-1 folds; the test error is then computed by evaluating the fitted model on the held-out fold. The process is repeated k times so that each fold serves as the test set exactly once, and the k error estimates are typically averaged.

- `n_splits` -> Number of folds

In [6]:
from sklearn.model_selection import KFold
X = ["a",'b','c','d','e','f']
kf = KFold(n_splits=3, shuffle=False, random_state=None)
for train, test in kf.split(X):
    print("Train data",train,"Test data",test)

Train data [2 3 4 5] Test data [0 1]
Train data [0 1 4 5] Test data [2 3]
Train data [0 1 2 3] Test data [4 5]
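To make the "nearly equal size" part concrete, here is a small pure-Python sketch of how unshuffled folds can be formed. The helper `kfold_indices` is a hypothetical name written for this illustration, not a scikit-learn function; it follows the rule that the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for unshuffled k-fold splitting."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base_size, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        stop = start + base_size + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


for train, test in kfold_indices(6, 3):
    print("Train data", train, "Test data", test)

# With 10 samples and 3 folds the test folds cannot all be the same size.
fold_sizes = [len(test) for _, test in kfold_indices(10, 3)]
print("Test fold sizes for 10 samples, 3 folds:", fold_sizes)
```

For 6 samples and 3 folds this produces the same index groups as the `KFold` cell above.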


## Stratified K-Fold Cross Validation

A slight variation of K-Fold Cross Validation that uses stratified sampling: each fold keeps approximately the same proportion of each class label as the full data set. This reduces the bias that arises when a fold over- or under-represents a class, which matters most for imbalanced data.

In [7]:
from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y = np.array([0, 0, 1, 0, 1, 1])

skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False)

for train_index,test_index in skf.split(X,y):
    print("Train:",train_index,'Test:',test_index)
    X_train,X_test = X[train_index], X[test_index]
    y_train,y_test = y[train_index], y[test_index]

Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]
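The effect of stratification is easier to see by counting the class labels in each test fold. The sketch below reuses the same `X` and `y` and compares `KFold` with `StratifiedKFold`; the helper `class_counts_per_test_fold` is a hypothetical name introduced only for this comparison.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 0, 1, 1])


def class_counts_per_test_fold(splitter):
    """Return [count of class 0, count of class 1] for each test fold."""
    return [
        np.bincount(y[test_idx], minlength=2).tolist()
        for _, test_idx in splitter.split(X, y)
    ]


plain_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold test-fold class counts:          ", plain_counts)
print("StratifiedKFold test-fold class counts:", stratified_counts)
```

With plain `KFold`, the first test fold contains only class 0 and the last only class 1, while every stratified test fold contains one sample of each class, matching the 50-50 balance of `y`.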


## Bias-Variance Tradeoff

| Cross Validation Method  |  Variance |  Bias  |
|---|---|---|
| Hold Out | High | High |
| Leave One Out | Moderate  | Lowest  |
| K-Fold  |  Low  | Low  |
| Stratified K-Fold  | Low  | Low  |


## Pro Tip

`sklearn.model_selection` provides a function, `cross_val_score`, which simplifies the process of cross-validation. Instead of iterating over the splits manually with `split`, we can pass an estimator, the data, and the chosen cross-validation method to `cross_val_score`, which returns one score per fold (accuracy by default for classifiers).
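A minimal sketch of `cross_val_score` in use is shown here. The iris data set and the `LogisticRegression` model are assumptions chosen only to keep the example self-contained; any estimator and any of the splitters discussed above could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Example data and model, assumed for illustration.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Any splitter (KFold, LeaveOneOut, StratifiedKFold, ...) can be passed as cv.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One fit and one score per fold; the default score for a classifier is accuracy.
scores = cross_val_score(model, X, y, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The mean of the returned scores is the usual single-number summary, and their spread gives a rough sense of how much the estimate depends on the particular split.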

## Implementing Cross Validation on used car price dataset