***<span style="color:purple">K-Fold Cross Validation</span>***

<p style="color:maroon">K-fold cross-validation is a technique for evaluating the performance of a machine learning model on a dataset. The basic idea is to split the data into K folds (subsets) of approximately equal size, where K is a user-specified parameter.</p>

<p style="color:maroon">The model is then trained on K-1 folds of the data and evaluated on the remaining fold. This process is repeated K times, with each fold used exactly once as the validation data. The results from each fold are averaged to provide an estimate of the model's performance on the full dataset.</p>
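
The procedure described above can be sketched with plain NumPy. This is only an illustration of the idea, not scikit-learn's implementation; the helper name `manual_kfold_indices` and the toy sizes (10 samples, 3 folds) are assumptions made for the example.

```python
import numpy as np

def manual_kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        # the remaining k-1 folds form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical toy sizes, chosen only to keep the printout short
for train_idx, val_idx in manual_kfold_indices(n_samples=10, k=3):
    print("train:", train_idx, "validate:", val_idx)
```

Every sample index appears in exactly one validation fold, which is what lets the K scores be averaged into a single estimate.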

<p style="color:maroon">K-fold cross-validation gives a more reliable estimate of a model's performance than a single train/test split, because every sample is used for validation exactly once and the estimate does not depend on one particular split. It does not make the model itself generalize better, but it makes overfitting to the training data easier to detect.</p>

In [1]:
import pandas as pd
import numpy as np
import sklearn

<p>The <b>"Forest Cover Type"</b> dataset is a popular dataset used for classification tasks in machine learning. It contains 54 input features and a target variable "Cover_Type", which represents the forest cover type. There are 7 different forest cover types in the dataset, represented by integers from 1 to 7.</p> 
<p>Here's a brief overview of the 7 forest cover types:</p>
<ol>
<li style="color:green">Spruce/Fir</li>
<li style="color:green">Lodgepole Pine</li>
<li style="color:green">Ponderosa Pine</li>
<li style="color:green">Cottonwood/Willow</li>
<li style="color:green">Aspen</li>
<li style="color:green">Douglas Fir</li>
<li style="color:green">Krummholz</li>
</ol>

**Main objective**

Because the full dataset is very large and training three different supervised algorithms on it takes a long time,
we limit the sample to 5000 rows.
The resulting scores are therefore lower, but the main objective is to understand how k-fold cross-validation is used.

In [2]:
df_s=pd.read_csv("covtype.csv")
df = df_s.sample(n=5000, random_state=42)
df

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
250728,3351,206,27,726,124,3813,192,252,180,2271,...,0,0,0,0,0,0,1,0,0,1
246788,2732,129,7,212,1,1082,231,236,137,912,...,0,0,0,0,0,0,0,0,0,2
407714,2572,24,9,201,25,957,216,222,142,2191,...,0,0,0,0,0,0,0,0,0,2
25713,2824,69,13,417,39,3223,233,214,110,6478,...,0,0,0,0,0,0,0,0,0,2
21820,2529,84,5,120,9,1092,227,231,139,4983,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163049,2918,31,7,150,12,1727,218,225,143,1935,...,0,0,0,0,0,0,0,0,0,2
576428,2583,21,20,247,108,2191,203,194,122,595,...,0,0,0,0,0,0,0,0,0,2
562765,2581,70,15,150,62,1570,235,210,103,1873,...,0,0,0,0,0,0,0,0,0,2
412372,3052,33,16,371,107,1694,215,202,118,3408,...,0,1,0,0,0,0,0,0,0,1


**Setting Input and Target**

In [3]:
inputs=df.drop('Cover_Type',axis=1).values
target=df.Cover_Type.values

**<span style="color:purple">Importing K-Fold Cross Validation</span>**

In [4]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=3)

*splitting*

In [5]:
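# kf.split yields one (train_index, test_index) pair of index arrays per fold.
# This loop only demonstrates the split: the variables are overwritten on each
# iteration, so after the loop they hold the data of the last fold.
for train_index, test_index in kf.split(inputs,target):
    X_train, X_test, y_train, y_test = inputs[train_index], inputs[test_index], \
                                       target[train_index], target[test_index]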
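
To see what `kf.split` returns, it can help to run it on a tiny array first. The sketch below uses a made-up 6-row array (`X_toy` is an illustrative name, not part of the notebook's data) with the same `KFold(n_splits=3)` setting.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 samples, 2 features (illustrative only)
X_toy = np.arange(12).reshape(6, 2)

kf_toy = KFold(n_splits=3)  # same setting as above, no shuffling
splits = list(kf_toy.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```

Without shuffling, each test fold is a contiguous block of rows. `KFold` also accepts `shuffle=True` together with a `random_state` if the row order of the data is not random.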

In [6]:
def get_score(model,x_train,x_test,y_train,y_test):
    # Fit the model on the training fold and return its accuracy on the test fold
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

**<p style="color:blue">This loop shows how k-fold really works internally</p>**

In [8]:
scores_logistic = []
scores_svm = []
scores_rf = []

# For every fold, fit each of the three models on the training part
# and record its accuracy on the held-out part
for train_index, test_index in kf.split(inputs,target):
    X_train, X_test, y_train, y_test = inputs[train_index], inputs[test_index], \
                                       target[train_index], target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

**We now have 3 scores for each model (one per fold) and show only their average**

In [25]:
np.average(scores_logistic)

0.7033991761071555

In [26]:
np.average(scores_svm)

0.4794005184557326

In [27]:
np.average(scores_rf)

0.7573991804280201
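
An average alone hides how much the scores vary from fold to fold. As a small sketch, the spread can be reported next to the mean. The helper `summarize_scores` and the values in `example_scores` are hypothetical and used only for illustration; they are not results from this notebook.

```python
import numpy as np

def summarize_scores(scores):
    """Return (mean, standard deviation) of the per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()

# Hypothetical per-fold accuracies, for illustration only
example_scores = [0.70, 0.72, 0.68]
mean_score, std_score = summarize_scores(example_scores)
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

The same helper could be applied to `scores_logistic`, `scores_svm` and `scores_rf` from the loop above.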

**<span style="color:purple">Using cross_val_score</span>**

This is how the same evaluation is done more easily with the scikit-learn library

In [40]:
from sklearn.model_selection import cross_val_score

In [32]:
loreg=cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), inputs, target,cv=3)
np.average(loreg)

0.7053990162351683

In [33]:
svm=cross_val_score(SVC(gamma='auto'), inputs, target,cv=3)
np.average(svm)

0.47940003835967504

In [34]:
rfc=cross_val_score(RandomForestClassifier(n_estimators=40),inputs, target,cv=3)
np.average(rfc)

0.7479996197639224
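
The averages from `cross_val_score` are close to, but not identical with, the ones from the manual loop. One likely reason is that, for a classifier, an integer `cv` makes `cross_val_score` use stratified folds (`StratifiedKFold`) rather than plain `KFold`; the random forest additionally varies from run to run because no `random_state` is set. The sketch below, on synthetic stand-in data from `make_classification` (an assumption for the example, not the Forest Cover Type sample), shows that passing the `KFold` object itself as `cv` reproduces the manual loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

kf_demo = KFold(n_splits=3)

# Manual loop: fit on k-1 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

# Same folds handed to cross_val_score through the cv argument
auto_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kf_demo
)

print(np.allclose(manual_scores, auto_scores))
```

With `cv=3` instead of `cv=kf_demo`, the folds would be stratified by class, so the individual scores may differ from the manual ones.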