<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/PERFORMANCE_OF_MACHINE_LEARNING_SYSTEMS_%E2%80%93_K_FOLD_CROSS_VALIDATION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PERFORMANCE OF MACHINE LEARNING SYSTEMS – K-FOLD CROSS-VALIDATION


In this notebook, we demonstrate how to use K-fold cross-validation to evaluate Random Forest models. We work with a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data).

# Import Libraries

First, we import pandas, which is used to load and prepare the data. The scikit-learn modules used to create and evaluate the Random Forest model are imported later, in the cells where they are first needed.

In [None]:
import pandas as pd

# Data Preparation

**Clone the dataset Repository**

The prepared dataset (after cleaning, outlier removal, and feature engineering) can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as shown below.

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_EDA.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_EDA.csv",sep=";")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 53659 records with 14 features for each record. Twelve features are numeric and the rest are objects (strings).

# Clean Data and Remove Outliers

This data was already processed in previous notebooks:
- Data Cleaning: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb
- Feature Selection and Feature Engineering: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb

As the sample of the dataset above shows, some features are highly correlated, such as the age and the age_year features, so we need to drop one of them. We will also drop features that are not needed, such as the 'id' feature.

In [None]:
df.drop(['id','age'],axis=1, inplace=True)
df.head()

# Encode Categorical Data

We will use one-hot encoding through the get_dummies() method in pandas to encode the data in the 'gender' and 'smoke' features.

In [None]:
df = pd.get_dummies(df)
df.head()

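Remember to drop one of the columns that resulted from the one-hot encoding of each feature, because it is redundant given the other. Also, make sure that the original features ('gender' and 'smoke') are no longer present; get_dummies() replaces them with the encoded columns.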

In [None]:
df.drop(['gender_female','smoke_No'],axis=1,inplace=True)
df.head()
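
As a side note, the encode-then-drop steps above can also be done in one call. The sketch below is only an illustration on a hypothetical toy dataframe (the values are made up and merely mimic the 'gender' and 'smoke' columns); it relies on the `drop_first` argument of `get_dummies()`, which drops the first category of each encoded feature.

```python
import pandas as pd

# Hypothetical toy frame; the values only mimic the 'gender' and 'smoke' columns.
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "smoke": ["No", "Yes", "No"],
    "weight": [80.0, 62.5, 70.0],
})

# drop_first=True removes the first (sorted) category of each encoded feature,
# so 'gender_female' and 'smoke_No' are never created.
encoded = pd.get_dummies(toy, columns=["gender", "smoke"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the columns explicitly, as done in this notebook, makes it more visible which category is treated as the baseline.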

# Train And Evaluate Random Forest Classifier

**Train Random Forest Classifier**

We will start by specifying the independent variables and the dependent variable. The independent variables are the features used to predict the target feature (class, label), and the dependent variable is the target feature itself.

In [None]:
# independent variables
X=df.drop(['cardio'],axis=1)
X.head()

In [None]:
# dependent variable (target feature, class, label)
Y=df.cardio
Y.head()

Then we will split the dataset into training and testing splits. A commonly used split ratio is 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=200)
print('Size of the dataset = {}'.format(len(X)))
print('Size of the training dataset = {} ({}%)'.format(len(x_train), 100*len(x_train)/len(X)))
print('Size of the testing dataset = {} ({}%)'.format(len(x_test), 100*len(x_test)/len(X)))

Notice that we used a random_state so that the results are reproducible. You should avoid setting this argument in your production code so that the split is random at every run.
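
To make the effect of random_state concrete, here is a minimal sketch on hypothetical toy arrays (not the cardiovascular data). It checks that two calls with the same seed return exactly the same testing split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature, binary labels.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 1] * 5)

# The same seed gives the same shuffle, and therefore the same split.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)

same_split = np.array_equal(a_test, b_test)
print('Same testing split with the same seed: {}'.format(same_split))
```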

Now, we will import the random forest model from sklearn and train the model using the training split of the dataset.

In [None]:
from sklearn import ensemble
model_rf = ensemble.RandomForestClassifier()
model_rf.fit(x_train,y_train)

**Evaluate Random Forest Model**

To evaluate the model, we will compute the training and testing accuracy using the training and testing splits of the dataset

In [None]:
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Random Forest(%)'])
t.add_row(['Training', Acc_train_rf*100])
t.add_row(['Testing', Acc_test_rf*100])
print(t)

However, the results depend on how the data happens to be divided between the training and testing splits. Try running the code below several times and see how the accuracy values change.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)
model_rf = ensemble.RandomForestClassifier()
model_rf.fit(x_train,y_train)
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

t = PrettyTable(['Accuracy', 'Random Forest(%)'])
t.add_row(['Training', Acc_train_rf*100])
t.add_row(['Testing', Acc_test_rf*100])
print(t)

So which values of training and testing accuracy are the correct ones? To resolve this, we use the cross-validation method, which evaluates the model on several different splits and reports a score for each one.

In [None]:
from sklearn.model_selection import cross_validate
cv_value = 10
score_rf = cross_validate(model_rf,X,Y,cv = cv_value, return_train_score=True)

The average performance measures of the model are

In [None]:
print('fit_time = {}'.format(score_rf['fit_time'].mean()))
print('score_time = {}'.format(score_rf['score_time'].mean()))
print('train_score = {}'.format(score_rf['train_score'].mean()))
print('test_score = {}'.format(score_rf['test_score'].mean()))
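
To see what cross_validate does internally, the following sketch performs the folds by hand. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained, not the cardiovascular dataset) and uses small, arbitrary settings to keep it fast. For a classifier and an integer `cv`, scikit-learn uses stratified K-fold, so `StratifiedKFold` is used here as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

skf = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # A fresh model is trained on K-1 folds and scored on the held-out fold.
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print('per-fold accuracy = {}'.format(np.round(fold_scores, 3)))
print('mean accuracy = {}'.format(np.mean(fold_scores)))
```

Every record is used for testing exactly once, which is why the averaged score is less sensitive to one lucky or unlucky split.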

# Manual Hyperparameter Tuning

Let us try to fine-tune the model parameters to improve the performance of the random forest model. We will do this without cross-validation. Let us try increasing the number of decision trees in the algorithm (n_estimators). The default value is 100.

In [None]:
model_rf = ensemble.RandomForestClassifier()
model_rf.fit(x_train,y_train)
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

model_rf_ne200 = ensemble.RandomForestClassifier(n_estimators=200)
model_rf_ne200.fit(x_train,y_train)
Acc_train_rf_ne200 = model_rf_ne200.score(x_train, y_train)
Acc_test_rf_ne200 = model_rf_ne200.score(x_test, y_test)

t = PrettyTable(['Accuracy (RF)', 'n_estimators = 100','n_estimators = 200'])
t.add_row(['Training', Acc_train_rf*100, Acc_train_rf_ne200*100])
t.add_row(['Testing', Acc_test_rf*100, Acc_test_rf_ne200*100])
print(t)

A very small improvement in model accuracy can be achieved. Averaging over more trees reduces the variance of the ensemble's predictions, so the gain is usually small and levels off as more trees are added. Next, let us try changing the split criterion in the random forest. We will use 'entropy' instead of the default value 'gini'.

In [None]:
model_rf = ensemble.RandomForestClassifier(random_state=40)
model_rf.fit(x_train,y_train)
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

model_rf_entropy = ensemble.RandomForestClassifier(criterion='entropy', random_state=40)
model_rf_entropy.fit(x_train,y_train)
Acc_train_rf_entropy = model_rf_entropy.score(x_train, y_train)
Acc_test_rf_entropy = model_rf_entropy.score(x_test, y_test)

t = PrettyTable(['Accuracy (RF)', 'criterion=gini','criterion=entropy'])
t.add_row(['Training', Acc_train_rf*100, Acc_train_rf_entropy*100])
t.add_row(['Testing', Acc_test_rf*100, Acc_test_rf_entropy*100])
print(t)

Again, we achieved little or no improvement in accuracy.

It seems that the model suffers from overfitting because the training accuracy is much higher than the testing accuracy. So let us try to improve the testing accuracy by tuning the parameters related to overfitting, such as the number of features to consider when looking for the best split (max_features) and the number of samples to draw from the training split to train each base estimator (max_samples). We will start by tuning max_features. The loop below tries the values 2, 3, 4, ... 11.

In [None]:
for max_features in range(2,12,1):
    model_rf = ensemble.RandomForestClassifier(max_features=max_features)
    model_rf.fit(x_train,y_train)
    Acc_train_rf = model_rf.score(x_train, y_train)
    Acc_test_rf = model_rf.score(x_test, y_test)
    print('max_features = {}, Acc_train_rf = {}, Acc_test_rf = {}'.format(max_features,Acc_train_rf,Acc_test_rf))

So the maximum testing accuracy is achieved when max_features is 2 or 3. The number of records in x_train is 42927, so let us try different values for the max_samples.

In [None]:
for max_samples in range(1000,20000,1000):
    model_rf = ensemble.RandomForestClassifier(max_features=3,max_samples=max_samples,n_estimators=200)
    model_rf.fit(x_train,y_train)
    Acc_train_rf = model_rf.score(x_train, y_train)
    Acc_test_rf = model_rf.score(x_test, y_test)
    print('max_samples = {}, Acc_train_rf = {}, Acc_test_rf = {}'.format(max_samples,Acc_train_rf,Acc_test_rf))
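
Reading the best setting off printed lines gets tedious as the number of candidates grows. One option, sketched below on synthetic data with arbitrary small values (assumptions for illustration, not the notebook's settings), is to collect the results of the loop in a dataframe and sort it. The `gap` column is a name introduced here for the difference between training and testing accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rows = []
for max_samples in (50, 100, 200):
    model = RandomForestClassifier(n_estimators=20, max_samples=max_samples, random_state=0)
    model.fit(xtr, ytr)
    rows.append({'max_samples': max_samples,
                 'train_acc': model.score(xtr, ytr),
                 'test_acc': model.score(xte, yte)})

results = pd.DataFrame(rows)
# A large gap between training and testing accuracy hints at overfitting.
results['gap'] = results['train_acc'] - results['test_acc']
print(results.sort_values('test_acc', ascending=False))
```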

# Automate Hyperparameter Tuning with Cross-validation 

**Grid Search with Cross Validation**

Instead of searching manually for good parameter values, we can use GridSearchCV, which tries every combination of the listed parameter values and scores each one with cross-validation.

In [None]:
#default cv value is 5
from sklearn.model_selection import GridSearchCV
parameters = {'max_features':range(2,8,1),'max_samples':range(1000,10000,1000),'n_estimators':[100,200]}
model_rf = ensemble.RandomForestClassifier()
clf = GridSearchCV(model_rf, parameters)
clf.fit(x_train, y_train)
clf.best_params_
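
Besides best_params_, a fitted GridSearchCV object exposes the mean cross-validated score of the winning combination (best_score_) and a model already refitted on the full data passed to fit() (best_estimator_, available because refit=True is the default). The sketch below shows these attributes on synthetic data with a deliberately tiny grid; the data and grid values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Tiny illustrative grid: 2 x 2 = 4 combinations, each scored with 3-fold CV.
grid = {'max_features': [2, 4], 'n_estimators': [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X_demo, y_demo)

print('best parameters = {}'.format(search.best_params_))
print('best mean CV accuracy = {}'.format(search.best_score_))

# The best combination, already refitted on all of X_demo and y_demo.
best_model = search.best_estimator_
print('n_estimators of the refitted model = {}'.format(best_model.n_estimators))
```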

After deciding on the best parameter values, we fit the model again using these parameters.

In [None]:
model_rf = ensemble.RandomForestClassifier(max_features=2, max_samples=2000, n_estimators=200)
model_rf.fit(x_train,y_train)

# Saving and Loading Models

We will use the joblib library, as described in the scikit-learn model persistence guide (https://scikit-learn.org/stable/modules/model_persistence.html), to save and load the models. To save the model, we use the dump() method:

In [None]:
import joblib as jb
jb.dump(model_rf, './Model_rf.joblib')

And to load the trained random forest model, we will use the load() method

In [None]:
model_rf_joblib = jb.load('./Model_rf.joblib')
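
A quick sanity check after loading is to confirm that the restored model gives the same predictions as the original one. The following sketch does this on synthetic data and writes to a temporary directory; the data and the file name model_demo.joblib are hypothetical and only serve the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print('Predictions identical after reload: {}'.format(round_trip_ok))
```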

# Predict New Values Using Models

To predict the target values for new data, we will use the loaded model

In [None]:
x_test.head()

In [None]:
y_predict = model_rf_joblib.predict(x_test)
dfnew=x_test.copy()
dfnew['cardio_predict']=y_predict

For the test split, we have the actual value of the 'cardio', so we can add it to the new dataframe for comparison purposes.

In [None]:
dfnew['cardio_actual']=y_test
dfnew.head()

Based on the accuracy measured above, cardio_predict and cardio_actual should match in ~97% (testing accuracy) of the records.

In [None]:
dfnew[dfnew['cardio_predict'] != dfnew['cardio_actual']]
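
The share of matching records can also be computed directly, and a confusion matrix shows which kind of mistake is more common. The sketch below uses short hypothetical label arrays in place of the cardio_actual and cardio_predict columns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for the cardio_actual and cardio_predict columns.
actual = np.array([0, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 0, 0, 1, 1])

# The fraction of matching records is the accuracy.
match_rate = (actual == predicted).mean()
print('match rate = {}'.format(match_rate))
print('accuracy_score = {}'.format(accuracy_score(actual, predicted)))

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted)
print(cm)
```

With the notebook's dataframe, the same calls could be applied to dfnew['cardio_actual'] and dfnew['cardio_predict'].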