# K-Fold Cross-Validation

* **K-fold cross-validation divides the input dataset into K groups of samples of (approximately) equal size. These groups are called folds.**
* **It is a very popular cross-validation (CV) approach because it is easy to understand, and its performance estimate is typically less biased than that of simpler methods such as a single train/test split.**
* **The steps for k-fold cross-validation are:**

    1) **Split the input dataset into K groups.**
    2) **For each group:**
        1) **Take that group as the held-out (test) data set.**
        2) **Use the remaining groups as the training data set.**
        3) **Fit the model on the training set and evaluate its performance on the test set.**
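
The steps above can be sketched with NumPy alone. This is a minimal illustration rather than scikit-learn's implementation; the helper name `k_fold_indices` is hypothetical, and the folds are taken in consecutive order without shuffling.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds (hypothetical helper)."""
    indices = np.arange(n_samples)
    # array_split allows n_samples that are not divisible by k;
    # fold sizes then differ by at most one sample.
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]  # one group is held out as the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # the rest train
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each sample lands in the test set exactly once, so every observation is used for both training and evaluation across the K rounds.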

![image.png](attachment:b4888773-5f4c-4e83-8b70-f1486df44ccd.png)

In [4]:
import pandas as pd 
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
dir(digits)

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

### Evaluating the score of different classifiers on a single train/test split

In [16]:
# For Logistic Regression
le_model = LogisticRegression(max_iter=165)
le_model.fit(x_train,y_train)
print(f"Score of Logistic Regression Model = {le_model.score(x_test,y_test)}")

# For SVM
svm_model = SVC()
svm_model.fit(x_train,y_train)
print(f"Score of SVM Model = {svm_model.score(x_test,y_test)}")

# For Random Forest
forest_model = RandomForestClassifier()
forest_model.fit(x_train,y_train)
print(f"Score of Random Forest Model = {forest_model.score(x_test,y_test)}")

Score of Logistic Regression Model = 0.9722222222222222
Score of SVM Model = 0.9962962962962963
Score of Random Forest Model = 0.9833333333333333


### A basic KFold example

In [19]:
from sklearn.model_selection import KFold

fold = KFold(n_splits=3)
for train_index, test_index in fold.split([1,2,3,4,5,6,7,8,9]):
    print(train_index,test_index)

# KFold produced 3 folds of consecutive indices here; without shuffle=True the folds are not random

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
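
By default `KFold` assigns consecutive indices to each fold, as the output above shows. If the data is ordered in some way, it may help to shuffle before splitting. The sketch below uses the `shuffle` and `random_state` parameters of `KFold`; the seed value `0` is an arbitrary choice made here only so that reruns give the same folds.

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=True randomizes which samples land in each fold;
# random_state=0 is an arbitrary seed used only for reproducibility.
shuffled_fold = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_splits = list(shuffled_fold.split(data))

for train_index, test_index in shuffled_splits:
    print(train_index, test_index)
```

Every index still appears in exactly one test fold; only the assignment of samples to folds changes.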


### Using StratifiedKFold on the digits example to score all 3 models on each split

In [23]:
from sklearn.model_selection import StratifiedKFold        # StratifiedKFold keeps the class proportions roughly equal in every fold

def getScore(model, x_train, x_test, y_train, y_test):
    model.fit(x_train,y_train)
    return model.score(x_test,y_test)

sfold = StratifiedKFold(n_splits=3)

le_scores = []
svm_scores = []
forest_scores = []

for train_index,test_index in sfold.split(digits.data,digits.target):
    x_train, x_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]

    le_scores.append( getScore(LogisticRegression(max_iter=165) , x_train, x_test, y_train, y_test) )
    svm_scores.append( getScore(SVC() , x_train, x_test, y_train, y_test))
    forest_scores.append( getScore(RandomForestClassifier() , x_train, x_test, y_train, y_test) )

print("Scores of Logistic Regression Model = ",le_scores)
print("Score of SVM Model = ",svm_scores)
print("Score of Random Forest Model = ",forest_scores)

Scores of Logistic Regression Model =  [0.9198664440734557, 0.9415692821368948, 0.9165275459098498]
Score of SVM Model =  [0.9649415692821369, 0.9799666110183639, 0.9649415692821369]
Score of Random Forest Model =  [0.9315525876460768, 0.9582637729549248, 0.9265442404006677]


### With the cross_val_score() function, the manual loop above is not needed

In [26]:
from sklearn.model_selection import cross_val_score

le_score = cross_val_score(LogisticRegression(max_iter=165), digits.data, digits.target, cv=3)
svm_score = cross_val_score(SVC(), digits.data, digits.target, cv=3)
forest_score = cross_val_score(RandomForestClassifier(), digits.data, digits.target, cv=3)

print("Scores of Logistic Regression Model = ",le_score)
print("Score of SVM Model = ",svm_score)
print("Score of Random Forest Model = ",forest_score)

Scores of Logistic Regression Model =  [0.91986644 0.94156928 0.91652755]
Score of SVM Model =  [0.96494157 0.97996661 0.96494157]
Score of Random Forest Model =  [0.94156928 0.95325543 0.93155259]


<h4>Averaging each model's fold scores gives a single number for comparing the models and choosing the best one.</h4>
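
The averaging step can be sketched as follows. This is one possible way to do it, under a few assumptions: the `models` dictionary and the variable names are introduced here for illustration, and `random_state=0` is added to the random forest only so that its scores are repeatable between runs.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Same three classifiers as above; the dictionary is an illustrative convenience.
models = {
    "Logistic Regression": LogisticRegression(max_iter=165),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),  # seed is an assumption
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, digits.data, digits.target, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean = {scores.mean():.4f}, std = {scores.std():.4f}")

# Pick the model whose average fold score is highest.
best_model = max(mean_scores, key=mean_scores.get)
print("Best model by mean CV score:", best_model)
```

Reporting the standard deviation alongside the mean gives a rough sense of how much each model's score varies from fold to fold.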