<h1 style='color:blue;' align='center'>K Fold Cross Validation</h1>

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.svm import SVC
import numpy as np

In [3]:
digits = load_digits()

In [4]:
print(type(digits.data))

<class 'numpy.ndarray'>


In [5]:
print(type(digits.target))

<class 'numpy.ndarray'>


In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)

**Logistic Regression**

In [7]:
lr = LogisticRegression(solver='liblinear', multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)



0.9703703703703703

**SVM**

In [8]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.30925925925925923

**Random Forest**

In [9]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.987037037037037

<h2 style='color:purple'>KFold cross validation</h2>

**Basic example**

In [10]:
from sklearn.model_selection import KFold

In [11]:
kf = KFold(n_splits=3)

In [12]:
for train_index, test_index in kf.split([10,21,39,14,57,66,27,18,99]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


**Helper function**

In [13]:
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

KFold and StratifiedKFold are both techniques used for cross-validation in machine learning, but they are suited for different scenarios and datasets.

KFold Cross-Validation:

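KFold is a simple cross-validation technique that divides the dataset into 'k' folds (subsets) of equal or nearly equal size.
By default it does not shuffle the data: the folds are consecutive blocks of samples, as the index output of the basic example above shows. Pass shuffle=True to randomize the order before splitting.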
Each fold is used as the validation set once while the rest of the folds are used for training.
It's useful when the distribution of classes in the dataset is roughly uniform, and class imbalances are not a concern.
KFold does not take into account the distribution of class labels when creating folds.
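
The shuffling behaviour described above can be checked with a small sketch. It reuses the nine-element list from the basic example; the `random_state=0` value is an arbitrary choice made here only so that the shuffled split is repeatable.

```python
from sklearn.model_selection import KFold

data = [10, 21, 39, 14, 57, 66, 27, 18, 99]

# Default (shuffle=False): each test fold is a contiguous block of indices.
plain_folds = [test.tolist() for _, test in KFold(n_splits=3).split(data)]
print(plain_folds)

# Opt-in shuffling; random_state=0 is an arbitrary seed chosen for repeatability.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_folds = [test.tolist() for _, test in shuffled.split(data)]
print(shuffled_folds)
```

Either way, every sample index appears in exactly one test fold; shuffling only changes which indices end up together.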

StratifiedKFold Cross-Validation:

StratifiedKFold is an extension of KFold that takes class distribution into consideration.
It aims to maintain the same class distribution in each fold as in the original dataset.
It's particularly useful when dealing with imbalanced datasets where certain classes have significantly fewer samples.
By ensuring that each fold retains the same class distribution, StratifiedKFold helps prevent a fold from being dominated by a single class.
This can lead to more accurate and representative cross-validation results, especially when class imbalances are a concern.
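
A minimal sketch of the difference, assuming a toy label array that is deliberately imbalanced and sorted (nine samples of class 0 followed by three of class 1). The array, the placeholder features, and the `minority_per_fold` helper are illustrative and not part of the digits example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: nine 0s followed by three 1s (imbalanced and sorted on purpose).
y = np.array([0] * 9 + [1] * 3)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def minority_per_fold(splitter):
    """Count how many class-1 samples land in each test fold."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]


kfold_counts = minority_per_fold(KFold(n_splits=3))
stratified_counts = minority_per_fold(StratifiedKFold(n_splits=3))

print(kfold_counts)       # [0, 0, 3]: two folds never see class 1 at test time
print(stratified_counts)  # [1, 1, 1]: each fold keeps the 3:1 class ratio
```

With plain KFold, all minority samples fall into the last test fold, so the model trained for that fold sees no class-1 examples at all. StratifiedKFold avoids this by splitting each class separately.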

In [14]:
from sklearn.model_selection import StratifiedKFold

In [15]:
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))



In [16]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [17]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [18]:
scores_rf

[0.9382303839732888, 0.9532554257095158, 0.9282136894824707]
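
To compare models, the per-fold scores are usually averaged. A short sketch using the fold scores printed above (copied here as literals, so the numbers reflect that particular run only):

```python
import numpy as np

# Fold scores copied from the outputs above; a fresh run may differ slightly,
# especially for the random forest, which is not seeded.
fold_scores = {
    "logistic regression": [0.8948247078464107, 0.9532554257095158, 0.9098497495826378],
    "svm": [0.3806343906510851, 0.41068447412353926, 0.5125208681135225],
    "random forest": [0.9382303839732888, 0.9532554257095158, 0.9282136894824707],
}

# Mean accuracy across the three folds for each model.
mean_scores = {name: float(np.mean(vals)) for name, vals in fold_scores.items()}

for name, mean in mean_scores.items():
    print(f"{name}: {mean:.3f}")

best_model = max(mean_scores, key=mean_scores.get)
print("best:", best_model)
```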

<h2 style='color:purple'>cross_val_score function</h2>

In [19]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [20]:
cross_val_score(LogisticRegression(solver='liblinear', multi_class='ovr'), digits.data, digits.target, cv=3)
# Note: with an integer cv and a classifier, cross_val_score uses StratifiedKFold



array([0.89482471, 0.95325543, 0.90984975])

**SVM model performance using cross_val_score**

In [21]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target, cv=3)

array([0.38063439, 0.41068447, 0.51252087])

**Random forest performance using cross_val_score**

In [22]:
cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=3)

array([0.92821369, 0.93989983, 0.92988314])

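cross_val_score uses stratified k-fold by default when the estimator is a classifier and cv is an integer (or left unset); for other estimators it falls back to plain KFold. This is why the logistic regression and SVM scores above match the manual StratifiedKFold loop exactly.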
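
One way to check this default is to pass a StratifiedKFold object explicitly and compare the result with `cv=3`. The sketch below uses the SVC model from earlier because it is deterministic, so the two runs can be compared directly; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Integer cv: the splitter is chosen implicitly by scikit-learn.
implicit_scores = cross_val_score(SVC(gamma='auto'), digits.data, digits.target, cv=3)

# Explicit splitter: the same three stratified folds, stated outright.
explicit_scores = cross_val_score(
    SVC(gamma='auto'), digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(implicit_scores)
print(explicit_scores)
print(np.allclose(implicit_scores, explicit_scores))
```

Passing the splitter explicitly is also the way to change the behaviour, for example `StratifiedKFold(n_splits=3, shuffle=True, random_state=...)` for shuffled stratified folds.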