<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/SUPERVISED_LEARNING_DECISION_TREE_AND_RANDOM_FOREST_CLASSIFICATION_TECHNIQUES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## SUPERVISED LEARNING - DECISION TREE AND RANDOM FOREST CLASSIFICATION TECHNIQUES


In this notebook, we will demonstrate how to build and evaluate Decision Tree and Random Forest models. We will work on the Heart Disease dataset from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset).

# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of the Decision Tree and Random Forest models.

In [None]:
import pandas as pd
import seaborn as sns

# Data Preparation

**Clone the dataset Repository**

The dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the heart.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/heart.csv",sep=",")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 1025 records with 14 columns each: 13 input features plus the target. All columns are numeric.

# Clean Data and Remove Outliers

**Check Missing Values**

Check if there are any missing values in the dataset

In [None]:
df.isnull().sum()

As can be observed, there is no missing data in the dataset.

**Remove Outliers**

Let us get the statistical description of the dataset and check whether anything looks abnormal.

In [None]:
df.describe()

The statistics can be read against the meaning of each column:

- age: the range is normal, between 29 and 77 years.
- sex: takes two values; 1 = male; 0 = female.
- cp (chest pain type): takes only specific values; 0, 1, 2, and 3.
- trestbps (resting blood pressure): the ideal value is between 90 and 120. Above 140 is considered high and below 90 is considered low.
- chol: normal is less than 200 mg/dL, borderline high is 200 to 239 mg/dL, and high is at or above 240 mg/dL.
- fbs (fasting blood sugar): fbs = 1 implies a fasting blood sugar greater than 120 mg/dL, otherwise fbs = 0. For reference, a fasting blood sugar test of 99 mg/dL or lower is normal, 100 to 125 mg/dL indicates prediabetes, and 126 mg/dL or higher indicates diabetes.
- restecg (resting electrocardiographic results): takes two values; 1 = detected some heart conditions, 0 = did not detect certain heart conditions.
- thalach: the person's maximum heart rate achieved.
- exang (exercise-induced angina): takes two values; 1 = yes; 0 = no.
- oldpeak: the ST depression induced by exercise relative to rest.
- slope: the slope of the peak exercise ST segment; 0: downsloping; 1: flat; 2: upsloping.
- ca: the number of major vessels; takes four values; 0, 1, 2, and 3.
- thal: a blood disorder called thalassemia; 0: NULL (dropped from the dataset previously), 1: fixed defect (no blood flow in some part of the heart), 2: normal blood flow, 3: reversible defect (a blood flow is observed but it is not normal).
- target: the target feature, which indicates heart disease; 1 = no, 0 = yes.

The dataset has no outliers; we leave it to you to verify this as explained in a previous session.
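
One possible way to run such a check is the interquartile-range (IQR) rule. The sketch below is only an illustration: the helper `count_iqr_outliers`, the multiplier `k = 1.5`, and the toy values are assumptions introduced here, not the method from the previous session.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the dataframe columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (invented for illustration, not taken from heart.csv)
toy = pd.DataFrame({
    "age": [45, 50, 52, 55, 58, 60, 61, 63],
    "chol": [200, 210, 215, 220, 225, 230, 235, 600],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data you could call `count_iqr_outliers(df)`. The rule is only meaningful for continuous columns such as trestbps or chol; for integer-coded categories such as sex or fbs it may flag the less frequent category, which is not an outlier.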




# Train And Evaluate Decision Tree Algorithm

**Train Decision Tree Algorithm**

We will start by specifying the independent variables and the dependent variable. The independent variables are the features used to predict the target. The dependent variable is the target feature (class, label).

In [None]:
# independent variables
X=df.drop(['target'],axis=1)
X.head()

In [None]:
# dependent variable (target feature, class, label)
Y=df.target
Y.head()

Then we will split the dataset into training and testing sets. The split ratio is usually 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=200)
print('Size of the dataset = {}'.format(len(X)))
print('Size of the training dataset = {} ({}%)'.format(len(x_train), 100*len(x_train)/len(X)))
print('Size of the testing dataset = {} ({}%)'.format(len(x_test), 100*len(x_test)/len(X)))

Notice that we used a random_state so that the results are reproducible. You should avoid setting this argument in your production code so that the split is random at every run.
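
The effect of random_state can be checked with a small sketch. The toy arrays below are invented for illustration. The `stratify` argument is an optional extra that the notebook does not use; it is shown only because it keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features, balanced binary labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# The same random_state produces the identical split on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)
same_split = np.array_equal(a_test, b_test)

# Optional: stratify=y_toy preserves the class proportions in train and test
_, _, s_train_y, s_test_y = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=200, stratify=y_toy
)
print(same_split, s_test_y.mean())
```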

Now, we will import the decision tree model from sklearn and train the model using the training split of the dataset.

In [None]:
from sklearn import tree
model_dt = tree.DecisionTreeClassifier()
model_dt.fit(x_train,y_train)

**Evaluate Decision Tree Model**

To evaluate the model, we will compute the training and testing accuracy using the training and testing splits of the dataset

In [None]:
Acc_train_dt = model_dt.score(x_train, y_train)
Acc_test_dt = model_dt.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Decision Tree (%)'])
t.add_row(['Training', Acc_train_dt*100])
t.add_row(['Testing', Acc_test_dt*100])
print(t)
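
For a classifier, score() returns the mean accuracy, that is, the fraction of samples whose predicted label equals the true label. The sketch below checks this on synthetic data from `make_classification`, which stands in for the heart dataset; all variable names are illustrative.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data (a stand-in for the heart dataset, for illustration only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

clf = tree.DecisionTreeClassifier(random_state=0).fit(xs_train, ys_train)

# Accuracy from score() versus accuracy computed by hand
acc_from_score = clf.score(xs_test, ys_test)
acc_manual = np.mean(clf.predict(xs_test) == ys_test)
print(acc_from_score, acc_manual)
```

A tree grown without depth limits usually fits the training split almost perfectly, so the testing accuracy is the more informative of the two numbers.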

# Train And Evaluate Random Forest Algorithm

**Train Random Forest Algorithm**

We will use the same splits generated during the training of the Decision Tree model. So now, we will directly import the random forest model from sklearn and train it using the training split of the dataset.

In [None]:
from sklearn import ensemble
model_rf = ensemble.RandomForestClassifier()
model_rf.fit(x_train,y_train)

**Evaluate Random Forest Model**

To evaluate the model, we will compute the training and testing accuracy using the training and testing splits of the dataset

In [None]:
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

t = PrettyTable(['Accuracy', 'Decision Tree (%)', 'Random Forest (%)'])
t.add_row(['Training', Acc_train_dt*100, Acc_train_rf*100])
t.add_row(['Testing', Acc_test_dt*100, Acc_test_rf*100])
print(t)

As can be observed, the training and testing accuracies of the decision tree and the random forest are the same for the current split of the dataset. The random forest usually achieves better predictions on larger and more complex datasets. We can also compare both models with the logistic regression model from the previous sessions as follows:

In [None]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression(max_iter=1000)
logreg.fit(x_train,y_train)
Acc_train_logreg = logreg.score(x_train, y_train)
Acc_test_logreg = logreg.score(x_test, y_test)

t = PrettyTable(['Accuracy', 'Decision Tree (%)','Random Forest (%)','Logistic (%)'])
t.add_row(['Training', Acc_train_dt*100, Acc_train_rf*100,Acc_train_logreg*100])
t.add_row(['Testing', Acc_test_dt*100, Acc_test_rf*100, Acc_test_logreg*100])
print(t)

The training and testing accuracy of the random forest is better than that of the logistic regression.

**Manual Hyperparameter Tuning**

Let us try to fine-tune the model parameters to improve the performance of the random forest model. We will increase the number of decision trees in the algorithm (n_estimators). The default value is 100.

In [None]:
model_rf = ensemble.RandomForestClassifier()
model_rf.fit(x_train,y_train)
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

model_rf_ne200 = ensemble.RandomForestClassifier(n_estimators=200)
model_rf_ne200.fit(x_train,y_train)
Acc_train_rf_ne200 = model_rf_ne200.score(x_train, y_train)
Acc_test_rf_ne200 = model_rf_ne200.score(x_test, y_test)

t = PrettyTable(['Accuracy (RF)', 'n_estimators = 100','n_estimators = 200'])
t.add_row(['Training', Acc_train_rf*100, Acc_train_rf_ne200*100])
t.add_row(['Testing', Acc_test_rf*100, Acc_test_rf_ne200*100])
print(t)

A very small improvement in model accuracy can be achieved. Averaging over more trees reduces the variance of the ensemble's predictions, so the gain typically levels off as n_estimators grows. Because random_state is not fixed in this comparison, part of the difference may also come from run-to-run randomness. Next, let us change the split criterion of the random forest. We will use 'entropy'; the default value is 'gini'.

In [None]:
model_rf = ensemble.RandomForestClassifier(random_state=40)
model_rf.fit(x_train,y_train)
Acc_train_rf = model_rf.score(x_train, y_train)
Acc_test_rf = model_rf.score(x_test, y_test)

model_rf_entropy = ensemble.RandomForestClassifier(criterion='entropy', random_state=40)
model_rf_entropy.fit(x_train,y_train)
Acc_train_rf_entropy = model_rf_entropy.score(x_train, y_train)
Acc_test_rf_entropy = model_rf_entropy.score(x_test, y_test)

t = PrettyTable(['Accuracy (RF)', 'criterion=gini','criterion=entropy'])
t.add_row(['Training', Acc_train_rf*100, Acc_train_rf_entropy*100])
t.add_row(['Testing', Acc_test_rf*100, Acc_test_rf_entropy*100])
print(t)

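No or almost no improvement is achieved in this case. You can also try oversampling in case of class imbalance. Feature scaling and/or normalization is another common step, although it mainly helps models such as logistic regression; tree-based models are largely insensitive to it.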
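
The manual comparisons in this section can be made less sensitive to randomness by fixing random_state and averaging over a few seeds. The following is a minimal sketch on synthetic data; the grid values, the three seeds, and the variable names are assumptions chosen for illustration, not recommended settings.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data (a stand-in for the heart dataset, for illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=1)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

results = {}
for n in (100, 200):
    for criterion in ("gini", "entropy"):
        scores = []
        for seed in range(3):  # average over seeds to smooth out randomness
            rf = ensemble.RandomForestClassifier(
                n_estimators=n, criterion=criterion, random_state=seed
            )
            rf.fit(xs_train, ys_train)
            scores.append(rf.score(xs_test, ys_test))
        results[(n, criterion)] = sum(scores) / len(scores)

for (n, criterion), mean_acc in results.items():
    print(f"n_estimators={n}, criterion={criterion}: mean test accuracy = {mean_acc:.3f}")
```

If the averaged accuracies differ only in the third decimal place, the hyperparameter change is probably not meaningful for this dataset.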

# Saving and Loading Models

We will use the joblib library, which the scikit-learn documentation on model persistence (https://scikit-learn.org/stable/modules/model_persistence.html) describes, to save and load the models. To save the model, we use the dump() method:

In [None]:
import joblib as jb
jb.dump(model_rf, './Model_rf.joblib')

To load the trained random forest model, we will use the load() method:

In [None]:
model_rf_joblib = jb.load('./Model_rf.joblib')
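
A quick sanity check after loading is to confirm that the restored model predicts exactly like the original one. The sketch below does this with a small forest trained on synthetic data and a temporary directory; the file name and variable names are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification

# Synthetic data and a small forest (for illustration only)
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
rf = ensemble.RandomForestClassifier(n_estimators=10, random_state=0).fit(X_syn, y_syn)

# Save to a temporary file, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_rf.joblib")
    joblib.dump(rf, path)
    rf_loaded = joblib.load(path)

# The loaded model should reproduce the original predictions exactly
round_trip_ok = np.array_equal(rf.predict(X_syn), rf_loaded.predict(X_syn))
print(round_trip_ok)
```

In this notebook, the equivalent check would compare `model_rf.predict(x_test)` with `model_rf_joblib.predict(x_test)`.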

# Predict New Values Using Models

To predict the target values for new data, we will use the loaded model

In [None]:
x_test.head()

In [None]:
y_predict = model_rf_joblib.predict(x_test)
dfnew=x_test.copy()
dfnew['target_predict']=y_predict

For the test split, we have the actual value of the 'target', so we can add it to the new dataframe for comparison purposes.

In [None]:
dfnew['target_actual']=y_test
dfnew.head()

Based on the measured accuracy above, target_predict and target_actual should match in ~96% (testing accuracy) of the records. Below are the misclassified records.

In [None]:
dfnew[dfnew['target_predict'] != dfnew['target_actual']]
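
The share of misclassified rows is one minus the testing accuracy, which can be verified directly from the two columns. A small sketch with an invented comparison frame (the values are made up; only the column names follow the notebook):

```python
import pandas as pd

# Toy comparison frame (values invented for illustration)
compare_df = pd.DataFrame({
    "target_predict": [1, 0, 1, 1, 0],
    "target_actual": [1, 0, 0, 1, 0],
})

# Boolean mask of rows where prediction and actual label disagree
mismatch = compare_df["target_predict"] != compare_df["target_actual"]
accuracy = 1 - mismatch.mean()
print(f"{mismatch.sum()} of {len(compare_df)} rows misclassified, accuracy = {accuracy:.0%}")
```

Applied to `dfnew`, the same two lines should reproduce the testing accuracy reported by `score()`.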