# Cross Validation Techniques 

## K-Fold Cross Validation

KFold divides the samples into k groups (folds) of approximately equal size. In each iteration, k-1 folds are used for training and the remaining fold is used for testing. This process is repeated k times, so every fold serves as the test set exactly once.

#### sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)

n_splits --> int, default=5. Number of folds.

shuffle --> bool, default=False. Whether to shuffle the data before splitting it into batches. Samples within each split are not shuffled.

random_state --> int, default=None. Controls the randomness of each fold. It affects the ordering of the indices only when shuffle=True; otherwise it has no effect.
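The interaction between shuffle and random_state can be checked with a small sketch. The six-element array and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)  # six toy samples, standing in for any dataset

# Two splitters built with the same seed should produce identical folds.
kf_a = KFold(n_splits=3, shuffle=True, random_state=42)
kf_b = KFold(n_splits=3, shuffle=True, random_state=42)

folds_a = [test.tolist() for _, test in kf_a.split(X)]
folds_b = [test.tolist() for _, test in kf_b.split(X)]
print("Same folds for the same seed:", folds_a == folds_b)

# Shuffled or not, every sample lands in exactly one test fold.
all_test_indices = sorted(i for fold in folds_a for i in fold)
print("Test indices across all folds:", all_test_indices)
```

Fixing random_state is what makes a shuffled split repeatable from one run to the next.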

In [3]:
import numpy as np
from sklearn.model_selection import KFold
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [4]:
X = ["a",'b','c','d','e','f']
kf = KFold(n_splits=3,shuffle=False,random_state=None)

In [6]:
print(kf)

KFold(n_splits=3, random_state=None, shuffle=False)


In [7]:
# Each iteration yields the index arrays for one train/test split
for train, test in kf.split(X):
    print("Train:",train,"Test:",test)

Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]


## Stratified K-Fold

This technique is a variation of K-Fold that returns stratified folds: each fold preserves the percentage of samples of each class present in the data.

It generates test sets such that all sets contain the same distribution of classes, or as close to it as possible.

#### sklearn.model_selection.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

In [8]:
from sklearn.model_selection import StratifiedKFold

In [9]:
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y= np.array([0,0,1,0,1,1])
skf = StratifiedKFold(n_splits=3,random_state=None,shuffle=False)

for train_index,test_index in skf.split(X,y):
    print("Train:",train_index,'Test:',test_index)
    X_train,X_test = X[train_index], X[test_index]
    y_train,y_test = y[train_index], y[test_index]

Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]
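To confirm that the splits above are stratified, the class counts in each test fold can be tallied. This is a minimal sketch that reuses the same toy X and y; the variable name fold_class_counts is introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 0, 1, 1])  # three samples per class

skf = StratifiedKFold(n_splits=3)

# Count how many samples of class 0 and class 1 fall into each test fold.
fold_class_counts = [
    np.bincount(y[test_index], minlength=2).tolist()
    for _, test_index in skf.split(X, y)
]
for fold, counts in enumerate(fold_class_counts):
    print("Fold", fold, "test class counts [class 0, class 1]:", counts)
```

With three samples per class and three folds, each test fold should receive one sample of each class, matching the 50/50 class balance of the full data.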


## Leave One Out Cross Validation

This is a simple technique in which the training data includes all observations except one, which is held out for testing.

For n samples, we have n different training sets and n test sets of one sample each.

Although each model is trained on almost all of the data, fitting n models on n different training sets makes this method computationally very expensive.

Because n-1 of the n samples are used to build each model, the models are nearly identical to each other, and the resulting error estimate tends to have high variance compared with KFold.
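The cost argument can be made concrete by counting splits. The sketch below also checks that, on this toy list, LeaveOneOut behaves like an unshuffled KFold with n_splits equal to the number of samples; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

loo = LeaveOneOut()
n_loo_splits = loo.get_n_splits(X)  # one split, and one model fit, per sample
test_sizes = [len(test) for _, test in loo.split(X)]
print("Number of splits:", n_loo_splits)
print("Test set sizes:", test_sizes)

# Compare against KFold with as many folds as there are samples.
kf = KFold(n_splits=len(X))
same_splits = all(
    np.array_equal(loo_test, kf_test)
    for (_, loo_test), (_, kf_test) in zip(loo.split(X), kf.split(X))
)
print("Matches KFold(n_splits=n):", same_splits)
```

For a dataset with thousands of rows this means thousands of model fits, which is why KFold with 5 or 10 folds is often preferred in practice.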

#### sklearn.model_selection.LeaveOneOut()

In [10]:
from sklearn.model_selection import LeaveOneOut

In [11]:
X = [10,20,30,40,50,60,70,80,90,100]
l = LeaveOneOut()

for train, test in l.split(X):
    print("%s %s"% (train,test))

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]


## Hold Out Cross Validation

In [12]:
from sklearn.model_selection import train_test_split

In [14]:
X = [10,20,30,40,50,60,70,80,90,100]

train, test = train_test_split(X, test_size=0.3, random_state=1)
# train holds 70% of the samples and test holds the remaining 30%
print("Train:", train)
print("Test:", test)


# Cross validation methods on Cancer data

In [15]:
df = pd.read_csv('data.csv')

In [16]:
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [18]:
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [19]:
df = df.drop(['Unnamed: 32'], axis=1)

In [20]:
df['diagnosis'].nunique()

2

In [21]:
df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64
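The counts above imply a moderately imbalanced target. As a rough reference point, the sketch below derives the class proportions and the accuracy of a hypothetical classifier that always predicts the majority class; only the two counts come from the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"B": 357, "M": 212}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class would reach this accuracy.
majority_baseline = max(proportions.values())

print("Total samples:", total)
print("Class proportions:", proportions)
print("Majority-class baseline accuracy:", majority_baseline)
```

Any cross-validated accuracy should be read against this baseline of roughly 0.63 rather than against 0.5.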

In [23]:
X = df.drop(['id','diagnosis'], axis=1)
y = df['diagnosis']

## Hold Out Method

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
dt = DecisionTreeClassifier()

In [26]:
dtmodel = dt.fit(X_train,y_train)

In [27]:
dt.score(X_train,y_train)

1.0

In [28]:
hoo_result = dtmodel.score(X_test,y_test)

In [29]:
print("The accuracy score for the hold-out method is:",hoo_result)

The accuracy score for the hold-out method is: 0.935672514619883
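A single hold-out score depends on which rows happen to land in the test set. The sketch below repeats the split with several seeds to show that spread. It assumes scikit-learn's bundled load_breast_cancer data as a stand-in for data.csv, which is not available here, and the seeds 0 to 4 are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled breast cancer data stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)

holdout_scores = []
for seed in range(5):
    # Only the split seed changes; the tree's own seed stays fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    holdout_scores.append(tree.score(X_test, y_test))

print("Hold-out accuracy per split seed:", holdout_scores)
print("Spread (max - min):", max(holdout_scores) - min(holdout_scores))
```

This sensitivity to the split is the main motivation for averaging over several folds, as the methods in the following sections do.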


## K-Fold Method

In [30]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score

In [31]:
dt = DecisionTreeClassifier()

In [32]:
kf= KFold(n_splits=5)

In [33]:
kfold_score = cross_val_score(dt,X,y,cv=kf)

In [34]:
print("The cross validation scores of the k-fold method with 5 folds are",kfold_score)

The cross validation scores of the k-fold method with 5 folds are [0.87719298 0.90350877 0.93859649 0.92105263 0.85840708]


In [35]:
kfold_score_mean = kfold_score.mean()

In [36]:
print("The min accuracy from k-fold CV is",min(kfold_score))
print("The max accuracy from k-fold CV is", max(kfold_score))
print("The mean cross validation scores of k-fold method with 5 folds is",kfold_score_mean)

The min accuracy from k-fold CV is 0.8584070796460177
The max accuracy from k-fold CV is 0.9385964912280702
The mean cross validation scores of k-fold method with 5 folds is 0.8997515913678
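Each fold score above is a fraction over a small test fold, so it helps to know the fold sizes. The sketch below recovers them for 569 samples, the sum of the two class counts reported earlier; the placeholder feature array is an assumption, since only the number of rows matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 569  # 357 benign + 212 malignant, per the value_counts() output
placeholder_X = np.zeros((n_samples, 1))  # only the row count matters here

kf = KFold(n_splits=5)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(placeholder_X)]
print("Test fold sizes:", fold_sizes)

# One misclassified sample shifts a fold's accuracy by about this much.
print("Accuracy step per sample:", 1 / fold_sizes[0])
```

The reported scores appear consistent with these sizes: 0.87719298 is about 100/114 and 0.85840708 is about 97/113, so the gap between the best and worst fold corresponds to only a handful of samples.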


## Stratified K-Fold method

In [37]:
from sklearn.model_selection import StratifiedKFold

In [38]:
skfold = StratifiedKFold(n_splits=10)

In [39]:
skfold_score = cross_val_score(dt,X,y,cv=skfold)

In [40]:
print("The accuracy of Stratified k-fold method with 10 folds is",skfold_score)

The accuracy of Stratified k-fold method with 10 folds is [0.92982456 0.85964912 0.92982456 0.87719298 0.94736842 0.89473684
 0.87719298 0.94736842 0.92982456 0.98214286]


In [41]:
skfold_score_mean = skfold_score.mean()
print("The mean accuracy of the Stratified k-fold method with 10 folds is",skfold_score_mean)

The mean accuracy of the Stratified k-fold method with 10 folds is 0.9175125313283209
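To see what stratification does on data of this shape, the sketch below rebuilds a label vector from the reported class counts and counts the malignant samples in each of the 10 test folds. The row order and the placeholder feature array are hypothetical; only the counts 357 and 212 come from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels rebuilt from the reported counts; the row order is hypothetical.
y = np.array(["B"] * 357 + ["M"] * 212)
placeholder_X = np.zeros((len(y), 1))  # features do not affect the split

skfold = StratifiedKFold(n_splits=10)
malignant_per_fold = [
    int(np.sum(y[test_idx] == "M"))
    for _, test_idx in skfold.split(placeholder_X, y)
]
print("Malignant samples per test fold:", malignant_per_fold)
```

Every fold should hold 21 or 22 malignant samples, so each test fold keeps close to the overall 37% malignant share.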


## Leave One Out Method

In [42]:
from sklearn.model_selection import LeaveOneOut

In [43]:
loocv = LeaveOneOut()

In [44]:
loocv_score = cross_val_score(dt,X,y,cv=loocv)

In [45]:
print("The accuracy of Leave one out method is",loocv_score)

The accuracy of Leave one out method is [1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1

In [46]:
loocv_score_mean = loocv_score.mean()
print("The average accuracy of the Leave one out method is",loocv_score_mean)

The average accuracy of the Leave one out method is 0.9226713532513181
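Because every leave-one-out score is either 0 or 1, the mean is simply the fraction of samples classified correctly. The sketch below backs out the counts from the reported mean, assuming the 569 samples implied by the class counts earlier.

```python
n_samples = 569  # 357 benign + 212 malignant
reported_mean = 0.9226713532513181  # mean leave-one-out accuracy printed above

# Each score is 0 or 1, so mean * n is the number of correct predictions.
n_correct = round(reported_mean * n_samples)
n_wrong = n_samples - n_correct

print("Correctly classified samples:", n_correct)
print("Misclassified samples:", n_wrong)
print("Recomputed mean accuracy:", n_correct / n_samples)
```

Reading the result as a count of misclassified samples is often easier to interpret than the long array of individual scores.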
