
## Fitting and Evaluating Multi-layer Perceptron


In this notebook, we will demonstrate how to fit and evaluate a Multi-layer Perceptron (MLP). We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data).

# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of an MLP model.

In [None]:
import pandas as pd

# Data Preparation

**Clone the dataset Repository**

The prepared dataset (after cleaning, outlier removal, and feature engineering) can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as follows:

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_EDA.csv file. We read it into a dataframe using the Pandas library (https://pandas.pydata.org/). Note that the file uses ';' as the column separator.

In [None]:
df = pd.read_csv("/content/AIData/cardio_EDA.csv",sep=";")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 53659 records with 15 features for each record. Twelve features are numeric and the rest are objects (strings).
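The split between numeric and string columns reported by info() can also be counted directly. The sketch below uses a small made-up dataframe (the values are illustrative and not taken from cardio_EDA.csv) together with the pandas select_dtypes() method.

```python
import pandas as pd

# Toy dataframe standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "age_year": [50, 55, 61],
    "height": [168, 156, 170],
    "gender": ["male", "female", "female"],
    "smoke": ["No", "No", "Yes"],
})

# Columns that hold numbers vs. columns that hold strings
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
object_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", object_cols)
```

The same two calls can be applied to df to confirm the counts stated above.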

# Clean Data and Remove Outliers

This data has been processed in previous notebooks:
- Data Cleaning: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb
- Feature Selection and Feature Engineering: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb

As the sample of the dataset above shows, some features are highly correlated, such as age and age_year, so we keep only one of them (age_year). We also drop features that are not needed for prediction, such as 'id'.

In [None]:
df.drop(['id','age'],axis=1, inplace=True)
df.head()
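To see why keeping both age columns adds nothing, one can measure their correlation. The sketch below is illustrative only: it assumes 'age' is recorded in days and 'age_year' is derived from it, and it uses made-up values rather than the real data.

```python
import pandas as pd

# Toy data: 'age' in days and 'age_year' derived from it (values are made up)
toy = pd.DataFrame({"age": [18250, 20075, 22265, 14600, 23725]})
toy["age_year"] = (toy["age"] / 365).round().astype(int)

# Pearson correlation between the two columns
corr = toy["age"].corr(toy["age_year"])
print(round(corr, 3))

# Keep only one of the two redundant columns
toy = toy.drop(columns=["age"])
print(toy.columns.tolist())
```

A correlation close to 1 means the two columns carry essentially the same information.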

# Encode Categorical Data

We will use one-hot encoding, through the get_dummies() function in pandas, to encode the 'gender' and 'smoke' features.

In [None]:
df = pd.get_dummies(df)
df.head()

Remember to drop one of the columns produced by the one-hot encoding of each feature, since it is redundant given the others. Also, make sure that the original features ('gender' and 'smoke') are no longer present; get_dummies() removes them automatically.

In [None]:
df.drop(['gender_female','smoke_No'],axis=1,inplace=True)
df.head()
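As an alternative to dropping the redundant columns by hand, get_dummies() accepts a drop_first argument. The sketch below uses a made-up dataframe whose column names mirror the ones above; it is an illustration, not the notebook's own workflow.

```python
import pandas as pd

# Toy dataframe with the same categorical columns (values are made up)
toy = pd.DataFrame({
    "age_year": [50, 55, 61],
    "gender": ["male", "female", "male"],
    "smoke": ["No", "Yes", "No"],
})

# drop_first=True drops the first level (in sorted order) of each encoded
# feature, here 'gender_female' and 'smoke_No'
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping columns explicitly, as done above, has the advantage of making it obvious which category serves as the baseline.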

# Train And Evaluate MLP Classifier

**Train MLP Classifier**

We will start by specifying the independent variables and the dependent variable. The independent variables are the features used to predict the target, and the dependent variable is the target feature itself (also called the class or label), which here is 'cardio'.

In [None]:
# independent variables
X=df.drop(['cardio'],axis=1)
X.head()

In [None]:
# dependent variable (target feature, class, label)
Y=df.cardio
Y.head()

Now, we will import the MLP classifier model from sklearn and use the Cross-Validation method to evaluate the performance of the model

In [None]:
from sklearn.neural_network import MLPClassifier
model_nn = MLPClassifier()

from sklearn.model_selection import cross_validate
cv_value = 10

Score_nn = cross_validate(model_nn,X,Y,cv = cv_value, return_train_score=True)


The average performance measures of the model are

In [None]:
import numpy as np
ACC_test_nn = np.mean(Score_nn['test_score'])
ACC_train_nn = np.mean(Score_nn['train_score'])
fit_time_nn = np.mean(Score_nn['fit_time'])
score_time_nn = np.mean(Score_nn['score_time'])

print('fit_time = {}'.format(fit_time_nn))
print('score_time = {}'.format(score_time_nn))
print('train_score = {}'.format(ACC_train_nn))
print('test_score = {}'.format(ACC_test_nn))
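MLPs are generally sensitive to the scale of their input features, so a common variation is to standardize the features inside each fold. The sketch below is not part of the workflow above: it uses synthetic data from make_classification as a stand-in for X and Y, and wraps StandardScaler and MLPClassifier in a pipeline so that scaling is fit on the training folds only. It also reports the spread across folds alongside the mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and Y (assumption: not the cardio data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is refit on each training fold, so no test-fold statistics leak in
pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))

scores = cross_validate(pipe, X_demo, y_demo, cv=5, return_train_score=True)

print("test accuracy: {:.3f} +/- {:.3f}".format(
    np.mean(scores["test_score"]), np.std(scores["test_score"])))
print("train accuracy: {:.3f}".format(np.mean(scores["train_score"])))
```

Whether scaling helps on the cardio data would need to be checked by running the same pipeline on X and Y.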

Next, we will compare the MLP with the Random Forest classifier.

In [None]:
## Random Forest
from sklearn.ensemble import RandomForestClassifier
# Use the same number of folds as for the MLP so the results are comparable
Score_rf = cross_validate(RandomForestClassifier(),X,Y,cv = cv_value, return_train_score=True)
ACC_test_rf = np.mean(Score_rf['test_score'])
ACC_train_rf = np.mean(Score_rf['train_score'])
fit_time_rf = np.mean(Score_rf['fit_time'])
score_time_rf = np.mean(Score_rf['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Metric', 'RF', 'MLP'])
t.add_row(['Training accuracy (%)', ACC_train_rf*100, ACC_train_nn*100])
t.add_row(['Testing accuracy (%)', ACC_test_rf*100,  ACC_test_nn*100])
t.add_row(['fit_time (s)', fit_time_rf, fit_time_nn])
t.add_row(['score_time (s)', score_time_rf, score_time_nn])
print(t)
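When two models are compared through cross-validation, it helps to evaluate both on identical folds. One way to do that, sketched below on synthetic data (a stand-in for X and Y, with illustrative model settings), is to create a single StratifiedKFold splitter and pass it as the cv argument for every model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X and Y (assumption: not the cardio data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One splitter object, reused for every model, so each sees identical folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}

results = {}
for name, model in models.items():
    cv_out = cross_validate(model, X_demo, y_demo, cv=folds, return_train_score=True)
    results[name] = cv_out["test_score"].mean()
    print(name, round(results[name], 3))
```

Fixing random_state in the splitter makes the fold assignment reproducible across runs.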

In the K-fold Cross-Validation notebook, we ran a grid search to improve the performance of the RF. We found that setting max_features=3, max_samples=2000, and n_estimators=200 improves its performance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Same number of folds as for the MLP
Score_rf = cross_validate(RandomForestClassifier(max_features=3,max_samples=2000,n_estimators=200),X,Y,cv = cv_value, return_train_score=True)
ACC_test_rf_optimized = np.mean(Score_rf['test_score'])
ACC_train_rf_optimized = np.mean(Score_rf['train_score'])
fit_time_rf_optimized = np.mean(Score_rf['fit_time'])
score_time_rf_optimized = np.mean(Score_rf['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Metric', 'RF', 'RF Optimized','MLP'])
t.add_row(['Training accuracy (%)', ACC_train_rf*100, ACC_train_rf_optimized*100, ACC_train_nn*100])
t.add_row(['Testing accuracy (%)', ACC_test_rf*100, ACC_test_rf_optimized*100,  ACC_test_nn*100])
t.add_row(['fit_time (s)', fit_time_rf, fit_time_rf_optimized, fit_time_nn])
t.add_row(['score_time (s)', score_time_rf, score_time_rf_optimized, score_time_nn])
print(t)

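To optimize the MLP, we can tune its hyperparameters, such as hidden_layer_sizes. Note that hidden_layer_sizes=(100,) is also the scikit-learn default, so the run below should perform about the same as the baseline MLP above; other values can be substituted to explore different architectures.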

In [None]:
Score_nn = cross_validate(MLPClassifier(hidden_layer_sizes=(100,)),X,Y,cv = cv_value, return_train_score=True)
ACC_test_nn_optimized = np.mean(Score_nn['test_score'])
ACC_train_nn_optimized = np.mean(Score_nn['train_score'])
fit_time_nn_optimized = np.mean(Score_nn['fit_time'])
score_time_nn_optimized = np.mean(Score_nn['score_time'])

from prettytable import PrettyTable
t = PrettyTable(['Metric', 'RF', 'RF Optimized','MLP', 'MLP Optimized'])
t.add_row(['Training accuracy (%)', ACC_train_rf*100, ACC_train_rf_optimized*100, ACC_train_nn*100, ACC_train_nn_optimized*100])
t.add_row(['Testing accuracy (%)', ACC_test_rf*100, ACC_test_rf_optimized*100,ACC_test_nn*100,ACC_test_nn_optimized*100])
t.add_row(['fit_time (s)', fit_time_rf, fit_time_rf_optimized, fit_time_nn,fit_time_nn_optimized])
t.add_row(['score_time (s)', score_time_rf, score_time_rf_optimized, score_time_nn,score_time_nn_optimized])
print(t)
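A grid search, like the one used earlier for the RF, is one way to look for better MLP settings. The sketch below runs GridSearchCV on synthetic data; the parameter grid is hypothetical and chosen only to keep the example fast, so the values are not recommendations for the cardio data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X and Y (assumption: not the cardio data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: layer sizes and L2 penalty (alpha) are illustrative values
param_grid = {
    "hidden_layer_sizes": [(20,), (50,), (20, 20)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the full dataset, each extra grid value multiplies the number of fits, so it is worth starting with a small grid.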

Now, we will fit an MLP model using all the data we have

In [None]:
from sklearn.neural_network import MLPClassifier
model_nn = MLPClassifier(hidden_layer_sizes=(100,))
model_nn.fit(X,Y)

# Saving and Loading Models

We will use the joblib library, described in the scikit-learn model persistence guide (https://scikit-learn.org/stable/modules/model_persistence.html), to save and load models. To save the model, we use the dump() function:

In [None]:
import joblib as jb
jb.dump(model_nn, './Model_nn.joblib')

To load the trained MLP model, we use the load() function:

In [None]:
model_nn_joblib = jb.load('./Model_nn.joblib')
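A quick sanity check after loading is to confirm that the restored model predicts exactly what the original does. The sketch below does this on synthetic data with a temporary file; the file name model_demo.joblib is a placeholder chosen for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X and Y (assumption: not the cardio data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = MLPClassifier(max_iter=300, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Saved models are best loaded with the same scikit-learn version that produced them, and only from sources you trust.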

# Predict New Values Using Models

To predict the target values for new data, we pass the data to the loaded model's predict() method. As an example, we reuse the first 20 rows of the dataset here.

In [None]:
x_test = X.head(20)
y_test = Y.head(20)

In [None]:
y_predict = model_nn_joblib.predict(x_test)
dfnew=x_test.copy()
dfnew['cardio_predict']=y_predict

Because these rows come from the labeled dataset, we know the actual value of 'cardio' and can add it to the new dataframe for comparison purposes.

In [None]:
dfnew['cardio_actual']=y_test
dfnew.head()

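Based on the accuracy measured above, cardio_predict and cardio_actual should match in roughly 70% of the records. Because these 20 rows were also part of the data used to fit the final model, the match rate here is closer to the training accuracy than to the testing accuracy. The rows where the two columns differ are: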

In [None]:
dfnew[dfnew['cardio_predict'] != dfnew['cardio_actual']]
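The match rate itself can be computed from the two columns. The sketch below uses a made-up comparison table with the same column names as dfnew, so the numbers it prints say nothing about the real model.

```python
import pandas as pd

# Toy comparison table (values are made up)
demo = pd.DataFrame({
    "cardio_predict": [0, 1, 1, 0, 1],
    "cardio_actual":  [0, 1, 0, 0, 0],
})

matches = demo["cardio_predict"] == demo["cardio_actual"]
match_rate = matches.mean()          # fraction of rows that agree
n_mismatch = int((~matches).sum())   # number of rows that disagree

print("match rate = {:.0%}, mismatches = {}".format(match_rate, n_mismatch))
```

With only 20 rows, the observed rate on dfnew can deviate noticeably from the cross-validated accuracy.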