### Cross Validation

Cross-validation is a resampling technique built on one idea: hold part of the dataset out of training and use it to test the model. The training data is used to fit the model, and the unseen test data is used to evaluate its predictions. If the model scores well on the test data, it has likely not overfitted the training data and can be expected to generalize to new data. Cross-validation repeats this split several times so that every observation is used for testing once.

#### Types of Cross Validation

### 1) KFold CV

In this resampling technique, the whole dataset is divided into k sets (folds) of almost equal size. The first fold is selected as the test set and the model is trained on the remaining k-1 folds. The test error rate is then calculated by evaluating the fitted model on the held-out test fold.

In the second iteration, the 2nd fold is selected as the test set, the remaining k-1 folds are used to train the model, and the error is calculated again. This process continues until each of the k folds has served as the test set once, and the k error rates are then averaged.

<img src=  "kfold_cv.png" height="350" width="400">
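The loop described above can be written out by hand. The following is a minimal sketch, not the notebook's own workflow: the data is a hypothetical random toy set, and the names `X_demo`, `y_demo`, `kf_demo`, and `fold_errors` are introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical toy data: one feature and a noisy binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=60) > 0).astype(int)

kf_demo = KFold(n_splits=5)
fold_errors = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = LogisticRegression()
    # Fit on the k-1 training folds only.
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate on the held-out fold; error rate = 1 - accuracy.
    accuracy = model.score(X_demo[test_idx], y_demo[test_idx])
    fold_errors.append(1 - accuracy)

# The cross-validated error is the average over the k folds.
cv_error = float(np.mean(fold_errors))
print(fold_errors)
print(cv_error)
```

`cross_val_score`, used later in this notebook, runs the same fit-and-evaluate loop in a single call.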

In [16]:
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
import pandas as pd

In [6]:
X = np.array([10,20,30,40,50,60,70,80,90,100])

kf1 = KFold(n_splits=3)  # k = 3 folds, so 3 train/test iterations
for train, test in kf1.split(X):
    print("Train data",train,"Test data",test)

Train data [4 5 6 7 8 9] Test data [0 1 2 3]
Train data [0 1 2 3 7 8 9] Test data [4 5 6]
Train data [0 1 2 3 4 5 6] Test data [7 8 9]


In [7]:
kf1 = KFold(n_splits=5)
for train, test in kf1.split(X):
    print("Train data",train,"Test data",test)

Train data [2 3 4 5 6 7 8 9] Test data [0 1]
Train data [0 1 4 5 6 7 8 9] Test data [2 3]
Train data [0 1 2 3 6 7 8 9] Test data [4 5]
Train data [0 1 2 3 4 5 8 9] Test data [6 7]
Train data [0 1 2 3 4 5 6 7] Test data [8 9]


In [10]:
kf1 = KFold(n_splits=7)
for train, test in kf1.split(X):
    print("Train data",train,"Test data",test)

Train data [2 3 4 5 6 7 8 9] Test data [0 1]
Train data [0 1 4 5 6 7 8 9] Test data [2 3]
Train data [0 1 2 3 6 7 8 9] Test data [4 5]
Train data [0 1 2 3 4 5 7 8 9] Test data [6]
Train data [0 1 2 3 4 5 6 8 9] Test data [7]
Train data [0 1 2 3 4 5 6 7 9] Test data [8]
Train data [0 1 2 3 4 5 6 7 8] Test data [9]
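The outputs above show that the folds are not always the same size when k does not divide the number of samples. scikit-learn documents the rule: the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` samples and the rest get `n_samples // n_splits`. The sketch below checks that rule against `KFold`; `expected_fold_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold


def expected_fold_sizes(n_samples, n_splits):
    """Hypothetical helper: test-fold sizes implied by the documented KFold rule."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample each.
    return [base + 1] * extra + [base] * (n_splits - extra)


X_demo = np.arange(10)
actual_sizes = {}
for k in (3, 5, 7):
    # Collect the size of each test fold produced by KFold.
    actual_sizes[k] = [len(test) for _, test in KFold(n_splits=k).split(X_demo)]
    print(k, actual_sizes[k], expected_fold_sizes(len(X_demo), k))
```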


In [13]:
kf1 = KFold(n_splits=4,shuffle=True)
for train, test in kf1.split(X):
    print("Train data",train,"Test data",test)

Train data [0 2 3 4 5 7 9] Test data [1 6 8]
Train data [0 1 2 6 7 8 9] Test data [3 4 5]
Train data [0 1 3 4 5 6 7 8] Test data [2 9]
Train data [1 2 3 4 5 6 8 9] Test data [0 7]
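With `shuffle=True` and no seed, the folds change on every run, so the output above will not be reproduced exactly. A small sketch of making the shuffled split repeatable by passing `random_state`; the seed value 42 and the helper `shuffled_test_folds` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])


def shuffled_test_folds(seed):
    """Hypothetical helper: return the test indices of each shuffled fold."""
    kf = KFold(n_splits=4, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf.split(X_demo)]


first_run = shuffled_test_folds(42)
second_run = shuffled_test_folds(42)
# The same seed gives the same folds on both runs.
print(first_run == second_run)
```

Fixing the seed is useful when comparing models, since every model is then scored on identical folds.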


In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
df1  = pd.read_excel('insurance_data.xlsx')
print(df1.shape)
print(df1.columns)

(27, 2)
Index(['age', 'bought_insurance'], dtype='object')


In [17]:
x = df1[['age']]
y = df1['bought_insurance']
kf = KFold(n_splits=7)
m1 = LogisticRegression()
scores = cross_val_score(m1, x, y, scoring='accuracy',cv=kf)
print(scores)
print(scores.mean())

[0.75 0.75 1.   1.   0.75 0.75 1.  ]
0.8571428571428571


#### Score Metrics
https://scikit-learn.org/stable/modules/model_evaluation.html
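
The `scoring` argument of `cross_val_score` accepts the metric names listed on that page. A minimal sketch, assuming a synthetic dataset from `make_classification` rather than the insurance data used above; the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for metric in ("accuracy", "precision", "recall", "f1"):
    # Same model and folds each time; only the scoring metric changes.
    scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             scoring=metric, cv=kf_demo)
    mean_scores[metric] = float(scores.mean())
    print(metric, round(mean_scores[metric], 3))
```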

### 2) Stratified KFold

Suppose your data contains reviews of a cosmetic product used by both male and female customers. If we split the data into train and test sets by random sampling, most of the reviews from male customers might end up in the test set and be under-represented in the training set. A model trained on a sample that does not represent the actual population will not predict the test data accurately.

This is where stratified sampling comes to the rescue. The data is split so that every train and test set preserves the class proportions of the whole population.

Let's consider the example above: cosmetic product reviews from 1000 customers, of which 60% are female and 40% are male. I want to split the data into train and test sets in the proportion 80:20. The training set is 80% of the 1000 customers, i.e. 800, chosen so that 480 reviews come from female customers and 320 from male customers. Similarly, the test set is the remaining 20%, i.e. 200 customers, with the same female-to-male ratio (120 female and 80 male).
<img src="startified_kf_cv.png" height="350" width="400">
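
The 80:20 arithmetic above can be checked with a single stratified split. This is a sketch under stated assumptions: the `gender` and `reviews` arrays are hypothetical stand-ins for the 1000 customers, and `train_test_split` with `stratify=` is used here as one way to get a stratified split (it is not used elsewhere in this notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 600 female and 400 male customers.
gender = np.array(["female"] * 600 + ["male"] * 400)
reviews = np.arange(1000)  # placeholder IDs standing in for the reviews

train_reviews, test_reviews, train_gender, test_gender = train_test_split(
    reviews, gender, test_size=0.2, stratify=gender, random_state=0
)

# Count how many customers of each gender landed in each set.
train_counts = {g: int((train_gender == g).sum()) for g in ("female", "male")}
test_counts = {g: int((test_gender == g).sum()) for g in ("female", "male")}
print(train_counts)  # {'female': 480, 'male': 320}
print(test_counts)   # {'female': 120, 'male': 80}
```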

In [18]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [26]:
x = np.array([5,10,15,20,25,30,35,40,45,50,60,70])
y = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1])
print(x.shape)
print(y.shape)

# 1 - [5,20,25,50,70,40]
# 0 - [10,15,30,35,45,60]

(12,)
(12,)


In [28]:
skf = StratifiedKFold(n_splits=2,shuffle=False)
for train,test in skf.split(x,y):
    print('Train',x[train],'Test',x[test])
    
# 1 - [5,20,25,50,70,40]
# 0 - [10,15,30,35,45,60]
# Case1 - Train[35(0) 40(1) 45(0) 50(1) 60(0) 70(1)]
# Case2 - Train[5(1) 10(0) 15(0) 20(1) 25(1) 30(0)]

Train [35 40 45 50 60 70] Test [ 5 10 15 20 25 30]
Train [ 5 10 15 20 25 30] Test [35 40 45 50 60 70]


In [29]:
skf = StratifiedKFold(n_splits=2,shuffle=True)
for train,test in skf.split(x,y):
    print('Train',x[train],'Test',x[test])
# 1 - [5,20,25,50,70,40]
# 0 - [10,15,30,35,45,60]
# Case1 - [10(0) 35(0) 40(1) 50(1) 60(0) 70(1)]

Train [10 35 40 50 60 70] Test [ 5 15 20 25 30 45]
Train [ 5 15 20 25 30 45] Test [10 35 40 50 60 70]
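
To confirm that each fold keeps the 50:50 class balance of `y`, the class labels in every test fold can be counted. A minimal sketch reusing the same arrays under new names; `random_state=0` is an arbitrary seed added so the shuffled folds are repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

x_demo = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70])
y_demo = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1])

skf_demo = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
fold_class_counts = []
for train_idx, test_idx in skf_demo.split(x_demo, y_demo):
    # np.bincount gives [number of 0s, number of 1s] in the test fold.
    counts = np.bincount(y_demo[test_idx], minlength=2).tolist()
    fold_class_counts.append(counts)
    print('Test', x_demo[test_idx], 'class counts [0s, 1s]:', counts)
```

Each test fold holds three samples of class 0 and three of class 1, whichever samples the shuffle picks.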


### 3) LOOCV (Leave One Out Cross Validation)
Instead of holding out a whole subset of the data, we select a single observation as the test data and train the model on all the remaining observations. In the next iteration the 2nd observation is selected as the test data and the model is trained on the rest. This repeats until every observation has been used as the test data once, so a dataset with n observations gives n iterations.


In [21]:
from sklearn.model_selection import LeaveOneOut

In [23]:
n = np.array([5,10,15,20,25,30,35,40,45,50])
loo = LeaveOneOut()
for train,test in loo.split(n):
    print('Train',n[train],'Test',n[test])

Train [10 15 20 25 30 35 40 45 50] Test [5]
Train [ 5 15 20 25 30 35 40 45 50] Test [10]
Train [ 5 10 20 25 30 35 40 45 50] Test [15]
Train [ 5 10 15 25 30 35 40 45 50] Test [20]
Train [ 5 10 15 20 30 35 40 45 50] Test [25]
Train [ 5 10 15 20 25 35 40 45 50] Test [30]
Train [ 5 10 15 20 25 30 40 45 50] Test [35]
Train [ 5 10 15 20 25 30 35 45 50] Test [40]
Train [ 5 10 15 20 25 30 35 40 50] Test [45]
Train [ 5 10 15 20 25 30 35 40 45] Test [50]
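
`LeaveOneOut` can also be passed as `cv` to `cross_val_score`. The sketch below assumes a hypothetical random toy dataset (`X_demo`, `y_demo`), not data from this notebook. Because each test set is a single observation, every per-fold accuracy is either 0 or 1, and their mean is the LOOCV accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical toy data: one feature and a noisy binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=30) > 0).astype(int)

# One model fit per observation: 30 fits for 30 samples.
loo_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             scoring='accuracy', cv=LeaveOneOut())
print(len(loo_scores))
print(loo_scores.mean())
```

LOOCV fits the model n times, so it is mainly practical for small datasets.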
