# K-Fold Cross Validation (Simplified):
Imagine you’re practicing for a test. Instead of just solving one practice test, you split your study material into k parts and test yourself multiple times, using different parts as the "test" and the rest as "study material" each time. This way, you ensure you practice with all the material and don’t leave anything out.



## What is K-Fold Cross Validation?
It’s a technique to evaluate a model’s performance by splitting the dataset into k equal parts (folds):

- The model is trained on $k-1$ folds (training data).
- The model is tested on the remaining 1 fold (validation data).
- This process is repeated $k$ times, with each fold being used as the test set once.
- The final result is the average performance across all folds.
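
The averaging in the last step can be written as a formula. The symbols are notation introduced here for illustration, not taken from the original: $s_i$ is the score measured when fold $i$ serves as the test set.

```latex
\text{CV score} = \frac{1}{k} \sum_{i=1}^{k} s_i
```

For example, fold scores of 0.90, 0.92, and 0.94 with $k = 3$ average to 0.92.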



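## Why Use K-Fold Cross Validation?
- Detect Overfitting: The model is tested on several different held-out splits, so a model that only memorized its training data shows up with poor validation scores. Cross-validation does not prevent overfitting by itself; it reveals it.
- Efficient Use of Data: Every data point is used for testing exactly once and for training in all other iterations, making the most of your dataset.
- Reliable Evaluation: Averaging over several folds gives an estimate of the model’s performance that depends less on one lucky or unlucky split than a single train-test split does.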
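
The point about reliability can be checked with a small experiment. This is a minimal sketch, assuming the scikit-learn digits dataset and an `SVC` model (both appear later in this notebook); the seeds 0 to 4 and the name `single_split_scores` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Score the same model on five different random train/test splits
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed)
    single_split_scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))

# Score the same model with 5-fold cross-validation and average the folds
cv_scores = cross_val_score(SVC(), digits.data, digits.target, cv=5)

print(single_split_scores)  # one number per split; they typically differ
print(cv_scores.mean())     # a single averaged estimate
```

The single-split scores depend on which points happened to land in the test set, which is why the averaged figure is usually the safer one to report.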



## How it Works (Step-by-Step):
Let’s say $k = 5$ (5-fold cross-validation):

1. Split the dataset into 5 equal parts (folds).
2. Train the model on 4 folds and test it on the 5th fold.
3. Rotate the test fold and repeat until every fold has been the test set once.
4. Compute the average accuracy (or other metrics) across all 5 iterations.
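
The procedure above can be sketched by hand with NumPy. This is an illustrative implementation rather than what scikit-learn does internally: `manual_k_fold_scores` is a hypothetical helper, and shuffling with `seed=0` and using `SVC` on the digits dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC


def manual_k_fold_scores(model_factory, X, y, k=5, seed=0):
    """Return one accuracy score per fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))    # shuffle once so folds are not ordered by class
    folds = np.array_split(indices, k)   # split the indices into k nearly equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]                                     # current test fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])   # the other k-1 folds
        model = model_factory()          # fresh, unfitted model for every iteration
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores


digits = load_digits()
fold_scores = manual_k_fold_scores(SVC, digits.data, digits.target, k=5)
print(fold_scores)           # five per-fold accuracies
print(np.mean(fold_scores))  # the averaged cross-validation score
```

Creating a new model inside the loop keeps each iteration independent, so nothing learned from one test fold leaks into the next.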



## Real-World Example:
You have 100 data points and want to use 5-fold cross-validation.
The dataset is split into 5 parts of 20 points each:

- In the 1st iteration, fold 1 is for testing, and folds 2-5 are for training.
- In the 2nd iteration, fold 2 is for testing, and folds 1 and 3-5 are for training.
- Repeat until each of the 5 folds has been the test fold once.
- Calculate the average performance.
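
The 100-point example can be reproduced with scikit-learn's `KFold`. This is a small sketch in which `np.arange(100)` is a stand-in for real data and the names `kf5` and `fold_sizes` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100)      # stand-in for 100 data points
kf5 = KFold(n_splits=5)    # no shuffling, so each fold is a contiguous block

fold_sizes = []
for i, (train_idx, test_idx) in enumerate(kf5.split(data), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"iteration {i}: test indices {test_idx[0]}-{test_idx[-1]}, "
          f"{len(train_idx)} training points")
```

Each iteration holds out 20 points for testing and trains on the remaining 80.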

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
digits = load_digits()

In [2]:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9611111111111111
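
The warning printed above suggests either raising `max_iter` or scaling the data. One possible way to follow both suggestions is sketched below; the pipeline, `max_iter=1000`, and `random_state=0` are assumptions chosen for illustration, and the resulting score will not match the output above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# Standardize the features, then fit logistic regression with a larger iteration budget
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)
scaled_score = scaled_lr.score(X_test, y_test)
print(scaled_score)
```

Putting the scaler inside a pipeline means it is fitted on the training split only, which also keeps it safe to pass to cross-validation helpers later.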

In [3]:
svc = SVC()
svc.fit(X_train, y_train)
svc.score(X_test, y_test)

0.9907407407407407

In [6]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9777777777777777

In [7]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)  # n_splits sets the number of folds

In [8]:
for train_index,test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index,test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
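
The output above shows that `KFold` takes contiguous blocks of indices by default. If the data is ordered (for example, sorted by class), shuffling before splitting may be preferable. A minimal sketch, where `random_state=0` and the variable names are arbitrary choices:

```python
from sklearn.model_selection import KFold

items = [1, 2, 3, 4, 5, 6, 7, 8, 9]
shuffled_kf = KFold(n_splits=3, shuffle=True, random_state=0)

seen_test_indices = []
for train_index, test_index in shuffled_kf.split(items):
    print(train_index, test_index)   # test indices are no longer contiguous blocks
    seen_test_indices.extend(test_index.tolist())
```

Even with shuffling, every index still appears in exactly one test fold.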


In [9]:
def get_score(model, X_train, X_test, y_train, y_test):
    # Fit on the training split, then return the accuracy on the test split
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [10]:
get_score(LogisticRegression(),X_train,X_test,y_train,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9611111111111111

In [11]:
get_score(SVC(),X_train,X_test,y_train,y_test)

0.9907407407407407

In [12]:
get_score(RandomForestClassifier(),X_train,X_test,y_train,y_test)

0.9740740740740741

In [14]:
# StratifiedKFold is usually a better choice than plain KFold for classification:
# it splits the data so that each fold keeps roughly the same class proportions.
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=3)
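
The difference between `KFold` and `StratifiedKFold` is easiest to see on labels that are sorted by class. The toy data below (`X_toy`, `y_toy`) is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 6 samples each of classes 0, 1 and 2
y_toy = np.repeat([0, 1, 2], 6)
X_toy = np.zeros((len(y_toy), 1))   # feature values do not affect the split

# Count how many samples of each class land in every test fold
plain_counts = [np.bincount(y_toy[test], minlength=3).tolist()
                for _, test in KFold(n_splits=3).split(X_toy)]
strat_counts = [np.bincount(y_toy[test], minlength=3).tolist()
                for _, test in StratifiedKFold(n_splits=3).split(X_toy, y_toy)]

print(plain_counts)   # [[6, 0, 0], [0, 6, 0], [0, 0, 6]]
print(strat_counts)   # [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
```

With plain `KFold`, each test fold contains a single class, so the model is always tested on a class it never saw during training. `StratifiedKFold` needs the labels as a second argument to `split` so that it can balance them.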

In [21]:
scores_l = []
scores_svm = []
scores_rf = []
# Note: this loop splits with kf (the plain KFold from In [7]), not the StratifiedKFold in folds.
# A stratified split would be folds.split(digits.data, digits.target).
for train_index, test_index in kf.split(digits.data):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    scores_l.append(get_score(LogisticRegression(), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(), X_train, X_test, y_train, y_test))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [22]:
scores_l

[0.9232053422370617, 0.9415692821368948, 0.9148580968280468]

In [23]:
scores_svm

[0.9666110183639399, 0.9816360601001669, 0.9549248747913188]

In [24]:
scores_rf

[0.9432387312186978, 0.9549248747913188, 0.9248747913188647]
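
To compare the three models, the per-fold scores can be averaged. The sketch below copies the score lists from the outputs above; the dictionary `mean_scores` and the name `best_model` are introduced here for illustration.

```python
import numpy as np

# Per-fold scores copied from the outputs above
scores_l = [0.9232053422370617, 0.9415692821368948, 0.9148580968280468]
scores_svm = [0.9666110183639399, 0.9816360601001669, 0.9549248747913188]
scores_rf = [0.9432387312186978, 0.9549248747913188, 0.9248747913188647]

mean_scores = {
    "LogisticRegression": np.mean(scores_l),
    "SVC": np.mean(scores_svm),
    "RandomForestClassifier": np.mean(scores_rf),
}
for name, mean in mean_scores.items():
    print(f"{name}: {mean:.4f}")

# Pick the model with the highest average score
best_model = max(mean_scores, key=mean_scores.get)
print(best_model)
```

On these three folds SVC has the highest average, followed by the random forest and then logistic regression.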

The loop above is verbose. The sklearn.model_selection module provides cross_val_score, which performs the splitting, fitting, and scoring in a single call.
For example:

In [25]:
from sklearn.model_selection import cross_val_score
cross_val_score(LogisticRegression(), digits.data, digits.target)  # returns one score per fold (5 folds by default)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.92222222, 0.86944444, 0.94150418, 0.93871866, 0.89693593])

In [26]:
cross_val_score(SVC(),digits.data,digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

In [28]:
cross_val_score(RandomForestClassifier(), digits.data, digits.target, cv=2)  # cv=2 means 2 folds

array([0.91991101, 0.922049  ])
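
In practice, the per-fold arrays are usually reduced to a mean (and often a standard deviation) before models are compared. The sketch below passes an explicit splitter through the `cv` parameter so that every model is scored on the same folds; the shuffling, `random_state=0`, and `n_estimators=50` are assumptions chosen for illustration, so the numbers will differ from the outputs above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# One shared splitter so that every model is evaluated on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

mean_cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, digits.data, digits.target, cv=cv)
    mean_cv_scores[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

When `cv` is an integer and the estimator is a classifier, `cross_val_score` already uses stratified folds; passing a splitter object is mainly useful for controlling shuffling and reproducibility.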