<a href="https://colab.research.google.com/github/jibintom/Machine-Learning-Codebasics-/blob/main/a12.%20K%20Fold%20Cross%20Validation/K_Fold_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

from sklearn.datasets import load_digits
digits=load_digits()

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [2]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(digits.data, digits.target, test_size=.2, random_state=10)

**use a function to obtain the score**

In [3]:
def get_score(model,x_train,x_test,y_train,y_test):
  model.fit(x_train,y_train)
  return model.score(x_test,y_test)

In [4]:
get_score(LogisticRegression(),x_train,x_test,y_train,y_test)

0.95

In [5]:
get_score(SVC(),x_train,x_test,y_train,y_test)

0.9833333333333333

In [6]:
get_score(RandomForestClassifier(n_estimators=10),x_train,x_test,y_train,y_test)

0.9444444444444444

**Demonstration of K Fold**

In [10]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=3) #indicate the no of folds required

In [12]:
for train_index,test_index in kf.split([1,2,3,4,5,6,7,8,9]):
  print(train_index,test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


### Stratified K-Fold

In [13]:
def get_score(model,x_train,x_test,y_train,y_test):
  model.fit(x_train,y_train)
  return model.score(x_test,y_test)

In [17]:
from sklearn.model_selection import StratifiedKFold
skf=StratifiedKFold(n_splits=3)

In [22]:
score_lr=[]   #creating array for storing scores
score_svc=[]
score_rf=[]

for train_index,test_index in skf.split(digits.data,digits.target):
  x_train,x_test,y_train,y_test=digits.data[train_index], digits.data[test_index],digits.target[train_index],digits.target[test_index]
                                

  score_lr.append(get_score(LogisticRegression(),x_train,x_test,y_train,y_test))   #find the score and append the same to array
  score_svc.append(get_score(SVC(),x_train,x_test,y_train,y_test))
  score_rf.append(get_score(RandomForestClassifier(n_estimators=40),x_train,x_test,y_train,y_test))

print(score_lr)
print(score_svc)
print(score_rf)

[0.9215358931552587, 0.9415692821368948, 0.9165275459098498]
[0.9649415692821369, 0.9799666110183639, 0.9649415692821369]
[0.9482470784641068, 0.9515859766277128, 0.9298831385642737]


## cross_val_score function
***The above code look messy so in order to simplify this we have special library sklearn called cross_val_score***

In [24]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [25]:
cross_val_score(LogisticRegression(), digits.data, digits.target)

array([0.92222222, 0.86944444, 0.94150418, 0.93871866, 0.89693593])

**svm model performance using cross_val_score**

In [26]:
cross_val_score(SVC(), digits.data, digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

**random forest performance using cross_val_score**

In [27]:
cross_val_score(RandomForestClassifier(n_estimators=10), digits.data, digits.target)

array([0.87222222, 0.85      , 0.93314763, 0.93314763, 0.89972145])

**Parameter tunning using k fold cross validation**

In [30]:
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8741992551210428

In [31]:
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9382402234636871

In [33]:
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9360055865921787

In [34]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9421104903786468

Here we used cross_val_score to fine tune our random forest classifier and figured that having around 40 trees in random forest gives best result.

### Exercise

Use iris flower dataset from sklearn library and use cross_val_score against following models to measure the performance of each. In the end figure out the model with best performance,

1.   Logistic Regression
2.   SVM
3.   Decision Tree
4.   Random Forest






In [35]:
from sklearn.datasets import load_iris
iris=load_iris()

In [36]:
from sklearn.model_selection import cross_val_score

1. Logistic Regression

In [39]:
score_lr=cross_val_score(LogisticRegression(),iris.data, iris.target, cv=10)
np.average(score_lr)

0.9733333333333334

2. SVM

In [41]:
score_svc=cross_val_score(SVC(),iris.data, iris.target, cv=10)
np.average(score_svc)

0.9733333333333334

3. Decision Tree

In [43]:
from sklearn.tree import DecisionTreeClassifier

score_tree=cross_val_score(DecisionTreeClassifier(),iris.data, iris.target, cv=10)
np.average(score_tree)

0.9533333333333334

4. Random Forest

In [44]:
score_rf=cross_val_score(RandomForestClassifier(),iris.data, iris.target, cv=10)
np.average(score_tree)

0.9533333333333334