<h1 style='color:blue;' align='center'>KFold Cross Validation Python Tutorial</h1>

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

**Logistic Regression**

In [3]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9555555555555556

**SVM**

In [4]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.37037037037037035

**Random Forest**

In [5]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9518518518518518

<h2 style='color:purple'>KFold cross validation</h2>

In [None]:
# The score changes on every run because train_test_split draws a different random
# sample each time the cell is executed.
# A single split therefore gives an unreliable estimate of how good a model is.
# K-fold cross validation splits the data into k folds, trains on k-1 folds, tests on the
# remaining fold, repeats this for every fold, and then averages the k scores.
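
The variation described above can be made visible with a small experiment. The sketch below is illustrative only: it repeats the 70/30 split with five arbitrary seeds (0 to 4, chosen here purely for repeatability) and uses the hypothetical names `split_scores`, `X_tr`, `X_te`, `y_tr`, `y_te` so it does not overwrite the variables from the cells above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

# Repeat the same train/test experiment with five different random splits.
# The seeds 0..4 are arbitrary; they only make this sketch repeatable.
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    # Fix the forest's own randomness so only the split changes between runs.
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(split_scores)
print("spread between best and worst split:", max(split_scores) - min(split_scores))
```

The spread between the best and worst split is the uncertainty that averaging over folds is meant to reduce.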



**Basic example**

In [6]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [19]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
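
With the default `shuffle=False`, the folds are consecutive blocks, as the output above shows. As a hedged variation, the sketch below turns on shuffling; `random_state=0` is an arbitrary seed and `kf_shuffled` and `seen_in_test` are names introduced only for this illustration.

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=True randomizes which samples land in each fold;
# random_state=0 is an arbitrary seed that keeps the split repeatable.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

seen_in_test = []
for train_index, test_index in kf_shuffled.split(data):
    print(train_index, test_index)
    seen_in_test.extend(test_index.tolist())

# Across the three folds, every index is used as test data exactly once.
print(sorted(seen_in_test))
```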


**Use KFold for our digits example**

In [9]:
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [11]:
# StratifiedKFold works like KFold; the only difference is that each fold keeps roughly the same
# proportion of every class (target label) as the full dataset.
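
To see the difference in practice, here is a minimal sketch on made-up toy data (six samples of class 0 followed by three of class 1). The arrays `X_toy` and `y_toy` are assumptions for illustration and are not part of the digits example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (illustrative only): six samples of class 0, then three of class 1.
X_toy = np.zeros((9, 1))
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

# Count how many samples of each class end up in every test fold.
plain_counts = [np.bincount(y_toy[test], minlength=2).tolist()
                for _, test in KFold(n_splits=3).split(X_toy)]
strat_counts = [np.bincount(y_toy[test], minlength=2).tolist()
                for _, test in StratifiedKFold(n_splits=3).split(X_toy, y_toy)]

print("KFold          :", plain_counts)
print("StratifiedKFold:", strat_counts)
```

Plain KFold on ordered labels produces test folds that contain a single class, while StratifiedKFold keeps the 2:1 class ratio in every fold.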

In [20]:
list(digits.data)

[array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]),
 array([ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
         9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
        15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
         0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
        16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.]),
 array([ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
        14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
         1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
         0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,

In [21]:
list(digits.target)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
 0, 9, 5, 5, 6, 5, 0, 9, 8, 9,
 ...


In [28]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index],\
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [23]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [29]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [30]:
scores_rf

[0.9332220367278798, 0.9365609348914858, 0.9248747913188647]
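
The per-fold scores are easier to compare once they are averaged. The sketch below copies the fold scores printed above and takes their mean; `mean_scores` is a name introduced here for illustration.

```python
import numpy as np

# Fold scores copied from the outputs above.
scores_logistic = [0.8948247078464107, 0.9532554257095158, 0.9098497495826378]
scores_svm = [0.3806343906510851, 0.41068447412353926, 0.5125208681135225]
scores_rf = [0.9332220367278798, 0.9365609348914858, 0.9248747913188647]

# Average the three folds to get one number per model.
mean_scores = {
    "logistic regression": float(np.mean(scores_logistic)),
    "svm": float(np.mean(scores_svm)),
    "random forest": float(np.mean(scores_rf)),
}

for name, mean in mean_scores.items():
    print(f"{name}: {mean:.3f}")
```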

<h2 style='color:purple'>cross_val_score function</h2>

In [31]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [32]:
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

array([0.89482471, 0.95325543, 0.90984975])

**SVM model performance using cross_val_score**

In [33]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

array([0.38063439, 0.41068447, 0.51252087])

**Random forest performance using cross_val_score**

In [34]:
cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)

array([0.92654424, 0.95492487, 0.9115192 ])

For classifiers, cross_val_score uses stratified k-fold (StratifiedKFold) by default when cv is an integer.
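
One way to check this is to pass an explicit `StratifiedKFold` object as `cv` and compare the result with `cv=3`. This is a sketch under the assumption that the model is deterministic, which holds for `SVC` as used here; the variable names are introduced only for this comparison.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Integer cv: for a classifier, scikit-learn builds the stratified splitter itself.
scores_int_cv = cross_val_score(SVC(gamma='auto'), digits.data, digits.target, cv=3)

# Explicit splitter: the same folds, written out by hand.
scores_explicit_cv = cross_val_score(
    SVC(gamma='auto'), digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(scores_int_cv)
print(scores_explicit_cv)
print("same scores:", np.allclose(scores_int_cv, scores_explicit_cv))
```

Passing a splitter object is also how you would switch to a different strategy, for example `KFold(n_splits=3, shuffle=True)`.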

<h2 style='color:purple'>Parameter tuning using k fold cross validation</h2>

In [35]:
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8770018621973928

In [36]:
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9421135940409682

In [37]:
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9482371198013656

In [38]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9426877715704531

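Here we used cross_val_score to tune the n_estimators parameter of our random forest classifier.
In this run, 30 trees gave the highest average score (about 0.948), with 20 and 40 trees close behind (about 0.942 and 0.943) and 5 trees clearly worse (about 0.877). Because a random forest is randomized, these numbers change slightly from run to run, so roughly 30 to 40 trees is a reasonable choice.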
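
The four cells above can be folded into one loop. The following is a minimal sketch, not the only way to tune a parameter; `random_state=0` is an arbitrary seed added here so the comparison is repeatable, and `mean_score_by_trees` and `best_n_trees` are names introduced for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

# Try the same candidate values as above and record the mean 10-fold score of each.
mean_score_by_trees = {}
for n_trees in [5, 20, 30, 40]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    mean_score_by_trees[n_trees] = float(scores.mean())

# Pick the candidate with the highest mean score.
best_n_trees = max(mean_score_by_trees, key=mean_score_by_trees.get)

print(mean_score_by_trees)
print("best n_estimators:", best_n_trees)
```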

<h2 style='color:purple'>Exercise</h2>

Use the iris flower dataset from the sklearn library and run cross_val_score on the following
models to measure the performance of each. At the end, figure out which model performs best:
1. Logistic Regression
2. SVM
3. Decision Tree
4. Random Forest