**In this notebook we implement K-fold cross-validation and see why it gives a more reliable performance estimate than a single train-test split, especially on small datasets.**

Importing the dependencies:

In [73]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

Next we import the models used in this notebook, so that we can compare how each one scores under K-fold cross-validation versus a plain train_test_split.

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

Data Collection and Pre-Processing:

In [75]:
#loading the dataset into pandas dataframe

df = pd.read_csv('/content/drive/MyDrive/Datasets/heart.csv')

In [76]:
#printing the first five rows

df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [77]:
#no of rows and columns in the dataset

df.shape

(303, 14)

As we can see, there are only 303 data points, so this is a very small dataset.

In [78]:
#checking the null values

df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

There are no null values in the dataset, which saves us a step. Otherwise we would first have to handle the missing values, either by dropping the affected rows or by replacing the missing entries with substitute values such as the column mean or median. The latter is known as **imputation**.
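The two ways of handling missing values mentioned above can be sketched on a small made-up frame. The `toy` DataFrame and its values are hypothetical (heart.csv has no missing entries), and filling with the median is only one possible imputation choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing 'chol' values (not taken from heart.csv)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, np.nan],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: impute, here by filling the gaps with the column median
imputed = toy.fillna({"chol": toy["chol"].median()})

print(dropped.shape)
print(imputed["chol"].tolist())
```

Dropping is simple but discards rows, which hurts on a dataset this small; imputation keeps every row at the cost of introducing estimated values.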

In [79]:
#checking the number of values in the label column

df['target'].value_counts()

1    165
0    138
Name: target, dtype: int64

The two classes are of similar size (165 vs 138), so the dataset is reasonably balanced.

Here:

0 -----> Healthy Heart

1 -----> Defective Heart
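To put a number on the class balance, the counts can be turned into proportions. The sketch below rebuilds a label column from the counts printed above rather than reading heart.csv, so the `target` Series here is a stand-in.

```python
import pandas as pd

# Stand-in label column with the class counts printed above (165 ones, 138 zeros)
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns fractions of the total instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data the same call would be `df['target'].value_counts(normalize=True)`.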


In [80]:
#splitting the features and the target

X = df.drop(columns = 'target')
Y = df['target']

In [81]:
print(X)

     age  sex  cp  trestbps  chol  ...  exang  oldpeak  slope  ca  thal
0     63    1   3       145   233  ...      0      2.3      0   0     1
1     37    1   2       130   250  ...      0      3.5      0   0     2
2     41    0   1       130   204  ...      0      1.4      2   0     2
3     56    1   1       120   236  ...      0      0.8      2   0     2
4     57    0   0       120   354  ...      1      0.6      2   0     2
..   ...  ...  ..       ...   ...  ...    ...      ...    ...  ..   ...
298   57    0   0       140   241  ...      1      0.2      1   0     3
299   45    1   3       110   264  ...      0      1.2      1   0     3
300   68    1   0       144   193  ...      0      3.4      1   2     3
301   57    1   0       130   131  ...      1      1.2      1   1     3
302   57    0   1       130   236  ...      0      0.0      1   1     2

[303 rows x 13 columns]


In [82]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64


# Train Test Split:

In [83]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 4)

In [84]:
print(X.shape, X_train.shape, X_test.shape)

(303, 13) (242, 13) (61, 13)
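The `stratify = Y` argument asks train_test_split to keep the class proportions roughly the same in both parts. A quick check, using stand-in labels with the same class counts as the target column and a placeholder feature matrix (`X_dummy` is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the target column (165 ones, 138 zeros)
y = np.array([1] * 165 + [0] * 138)
# Placeholder feature matrix; only the labels matter for this check
X_dummy = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y, test_size=0.2, stratify=y, random_state=4
)

# Fraction of class 1 in the full data, the training part and the test part
print(y.mean(), y_tr.mean(), y_te.mean())
```

The three fractions should come out close to one another; without stratification a 61-row test set can drift noticeably from the overall class ratio.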


Comparing the performance of the various models:


In [85]:
#list of models

models = [LogisticRegression(max_iter = 1000), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier()]

In [86]:
def compare_model_train_test():
  #fit each model on the training split and report its accuracy on the test split

  for model in models:

    #training the model
    model.fit(X_train, Y_train)

    #evaluating the model on the held-out test data
    test_data_prediction = model.predict(X_test)
    accuracy = accuracy_score(Y_test, test_data_prediction)

    print('The accuracy of the', model, 'is', accuracy)

In [87]:
compare_model_train_test()

The accuracy of the LogisticRegression(max_iter=1000) is 0.8688524590163934
The accuracy of the SVC(kernel='linear') is 0.9016393442622951
The accuracy of the KNeighborsClassifier() is 0.639344262295082
The accuracy of the RandomForestClassifier() is 0.8524590163934426
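A single split yields one number per model, and that number depends on which rows happen to land in the 61-row test set. The sketch below illustrates this on a synthetic stand-in dataset of the same size (generated with make_classification, not heart.csv) by repeating the split with different random_state values; the names `X_demo` and `y_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small dataset (303 rows, 13 features); not heart.csv
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

accuracies = []
for seed in range(5):
    # Same model and data; only the seed that decides the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print(accuracies)
print('spread:', max(accuracies) - min(accuracies))
```

Any spread between the five accuracies comes purely from the choice of split, which is the uncertainty that averaging over several folds is meant to reduce.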


# Cross Validation:
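With `cv = 5`, cross_val_score splits the data into five folds, trains on four of them, scores on the remaining one, and repeats until every fold has served as the test set once. For a classifier and an integer `cv`, scikit-learn uses stratified folds. A minimal sketch of that loop written out by hand, on synthetic stand-in data (`X_demo`, `y_demo` are not heart.csv):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with the same shape as the feature matrix
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000)

manual_scores = []
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # A fresh, unfitted copy of the model for every fold
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns the accuracy on the held-out fold
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The one-line equivalent used in this notebook
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
```

Because every row is used for testing exactly once, the mean of the five scores draws on all 303 rows instead of a single 61-row test set.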

Logistic Regression:

In [88]:
cv_score_lr = cross_val_score(LogisticRegression(max_iter = 1000), X, Y, cv = 5)
print(cv_score_lr)

mean_accuracy_lr = sum(cv_score_lr)/len(cv_score_lr)
mean_accuracy_lr = mean_accuracy_lr*100
print (mean_accuracy_lr)



[0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
82.82513661202186
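The fold accuracies above range from 0.75 to about 0.87, so it can help to report the spread next to the mean. A small sketch using the five values printed above; reporting the standard deviation is an addition for illustration, not something the original cell computes.

```python
import numpy as np

# Fold accuracies copied from the logistic regression output above
fold_scores = np.array([0.80327869, 0.86885246, 0.85245902, 0.86666667, 0.75])

mean_pct = fold_scores.mean() * 100
std_pct = fold_scores.std() * 100  # population standard deviation across the folds

print(f'{mean_pct:.2f} +/- {std_pct:.2f}')
```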


Support Vector Classifier:

In [89]:
cv_score_svc = cross_val_score(SVC(kernel = 'linear'), X, Y, cv = 5)
print(cv_score_svc)

mean_accuracy_svc = sum(cv_score_svc)/len(cv_score_svc)
mean_accuracy_svc = mean_accuracy_svc*100
print (mean_accuracy_svc)


[0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
82.83060109289619


K-Nearest Neighbors Classifier:

In [90]:
cv_score_kn = cross_val_score(KNeighborsClassifier(), X, Y, cv = 5)
print(cv_score_kn)

mean_accuracy_kn = sum(cv_score_kn)/len(cv_score_kn)
mean_accuracy_kn = mean_accuracy_kn*100
print (mean_accuracy_kn)


[0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
64.38797814207649


Random Forest Classifier:

In [91]:
cv_score_rf = cross_val_score(RandomForestClassifier(), X, Y, cv = 5)
print(cv_score_rf)

mean_accuracy_rf = sum(cv_score_rf)/len(cv_score_rf)
mean_accuracy_rf = mean_accuracy_rf*100
print (mean_accuracy_rf)

[0.83606557 0.90163934 0.80327869 0.8        0.75      ]
81.81967213114754


In [92]:
def cv_score():
  #run 5-fold cross-validation for every model and print the mean accuracy

  for model in models:
    #'scores' holds one accuracy per fold; it is named differently from the function to avoid shadowing it
    scores = cross_val_score(model, X, Y, cv = 5)
    mean_accuracy = sum(scores)/len(scores)
    mean_accuracy = mean_accuracy*100
    mean_accuracy = round(mean_accuracy, 2)

    print('Cross Validation accuracies for',  model, 'is', scores)
    print('Accuracy % of the', model, 'is', mean_accuracy)
    print('-----------------------------------------')

In [93]:
cv_score()

Cross Validation accuracies for LogisticRegression(max_iter=1000) is [0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
Accuracy % of the LogisticRegression(max_iter=1000) is 82.83
-----------------------------------------
Cross Validation accuracies for SVC(kernel='linear') is [0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
Accuracy % of the SVC(kernel='linear') is 82.83
-----------------------------------------
Cross Validation accuracies for KNeighborsClassifier() is [0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
Accuracy % of the KNeighborsClassifier() is 64.39
-----------------------------------------
Cross Validation accuracies for RandomForestClassifier() is [0.81967213 0.86885246 0.85245902 0.78333333 0.78333333]
Accuracy % of the RandomForestClassifier() is 82.15
-----------------------------------------
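The random forest scores printed by the loop (mean 82.15) differ from those of the standalone random forest cell earlier (mean 81.82), while the other three models give identical results in both places. A likely reason is that RandomForestClassifier() is created without a random_state, so every fit draws different bootstrap samples and feature subsets. A sketch of how fixing the seed makes repeated runs agree, on synthetic stand-in data (`X_demo`, `y_demo` are not heart.csv, and the seed value 0 is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with the same shape as the feature matrix
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Two independent runs with the same seed for the forest
run_1 = cross_val_score(RandomForestClassifier(random_state=0), X_demo, y_demo, cv=5)
run_2 = cross_val_score(RandomForestClassifier(random_state=0), X_demo, y_demo, cv=5)

print(run_1)
print(run_2)
# Check whether the two runs produced the same fold scores
print(np.array_equal(run_1, run_2))
```

Fixing the seed does not make the model better; it only makes the comparison between runs and between models repeatable.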
