<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/LOGISTIC_REGRESSION_AND_ITS_APPLICATION_TO_MULTI_CLASS_CLASSIFICATION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## LOGISTIC REGRESSION AND ITS APPLICATION TO MULTI-CLASS CLASSIFICATION


In this notebook, we will demonstrate how to build and evaluate logistic regression models. We will work on a modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data).

# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of logistic regression models.

In [None]:
import pandas as pd
import warnings
#warnings.filterwarnings('ignore')

# Data Preparation

**Clone the dataset Repository**

The dataset, already prepared through cleaning, outlier removal, and feature engineering, can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as shown below.

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the cardio_EDA.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/cardio_EDA.csv",sep=";")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 53659 records with 14 features for each record. Twelve features are numeric and the rest are objects (strings).
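
The split between numeric and non-numeric features reported by info() can also be counted programmatically. The following is a minimal sketch on a made-up three-column frame (the name `toy` and its values are assumptions, not the real dataset); on the notebook's data, the same two calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "age_year": [50, 61, 47],
    "weight": [62.0, 85.0, 64.0],
    "gender": ["female", "male", "female"],
})

# Integer and float columns count as numeric; everything else is listed separately.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric:", numeric_cols)
print(len(other_cols), "non-numeric:", other_cols)
```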

# Clean Data and Remove Outliers

This data has been processed in previous notebooks:
- Data Cleaning: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_DATA_CLEANING.ipynb
- Feature Selection and Feature Engineering: https://github.com/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/EXPLORATORY_DATA_ANALYSIS_%E2%80%93_FEATURE_SELECTION_AND_FEATURE_ENGINEERING.ipynb

As the sample of the dataset above shows, some features are highly correlated, such as the age and age_year features, so we need to drop one of them. We will also drop features that are not needed for prediction, such as the 'id' feature.

In [None]:
df.drop(['id','age'],axis=1, inplace=True)
df.head()
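
Why two such features are redundant can be checked with a correlation coefficient. The sketch below uses synthetic values and assumes that 'age' is measured in days and 'age_year' is the same quantity rounded to years; the frame `toy` and the conversion factor are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumption: 'age' is in days and 'age_year' is the same quantity in years.
age_days = rng.integers(10_000, 24_000, size=200)
toy = pd.DataFrame({
    "age": age_days,
    "age_year": np.round(age_days / 365.25),
})

# A Pearson correlation close to 1 means one column adds almost no information.
corr = toy["age"].corr(toy["age_year"])
print(f"Pearson correlation = {corr:.4f}")
```

On the real dataframe, the same check would be `df['age'].corr(df['age_year'])`, run before the drop.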

# Encode Categorical Data

We will use one-hot encoding through the get_dummies() method in pandas to encode the data in the 'gender' and 'smoke' features.

In [None]:
df = pd.get_dummies(df)
df.head()

Remember to drop one of the columns produced by the one-hot encoding of each feature. Note that get_dummies() already replaces the original features ('gender' and 'smoke') with their encoded columns, so they do not need to be dropped separately.

In [None]:
df.drop(['gender_female','smoke_No'],axis=1,inplace=True)
df.head()
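
The encode-then-drop steps above can also be done in one call with the `drop_first` argument of get_dummies(). This is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions). Because pandas drops the first category in sorted order, the result here keeps the same columns as the manual drop.

```python
import pandas as pd

# Toy frame with the two categorical features used in this notebook (made-up values).
toy = pd.DataFrame({
    "age_year": [50, 61, 47, 55],
    "gender": ["female", "male", "female", "male"],
    "smoke": ["No", "Yes", "No", "No"],
})

# drop_first=True removes the first category of each encoded feature
# ('female' and 'No'), leaving one indicator column per binary feature.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```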

# Perform And Evaluate Logistic Regression

**Performing Logistic Regression**

We will start by specifying the independent variables and the dependent variable. The independent variables are the features used to predict the target, and the dependent variable is the target feature itself (also called the class or label).

In [None]:
# independent variables
X=df.drop(['cardio'],axis=1)
X.head()

In [None]:
# dependent variable (target feature, class, label)
Y=df.cardio
Y.head()

Then we will split the dataset into training and testing splits. A common split ratio is 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=200)
print('Size of the dataset = {}'.format(len(X)))
print('Size of the training dataset = {} ({}%)'.format(len(x_train), 100*len(x_train)/len(X)))
print('Size of the testing dataset = {} ({}%)'.format(len(x_test), 100*len(x_test)/len(X)))

Notice that we used a random_state so that the results are reproducible. You should avoid setting this argument in your production code so that the split is random at every run.
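
The effect of random_state can be seen on a small synthetic array: two calls with the same seed return identical splits. This is a minimal sketch; the arrays `X_toy` and `y_toy` are made-up stand-ins for the notebook's X and Y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each, and alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=200)

same_split = np.array_equal(a_test, b_test)
print("test size:", len(a_test), "identical splits:", same_split)
```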

Now, we will import the logistic regression model from sklearn and train the model using the training split of the dataset.

In [None]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(x_train,y_train)

**Evaluate Logistic Regression**

To evaluate the model, we will compute the training and testing accuracy using the training and testing splits of the dataset

In [None]:
Acc_train = logreg.score(x_train, y_train)
Acc_test = logreg.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic Regression (%)'])
t.add_row(['Training', Acc_train*100])
t.add_row(['Testing', Acc_test*100])
print(t)
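
For a classifier, score() returns the mean accuracy, that is, the fraction of records whose predicted label equals the true label. The sketch below checks this on synthetic data generated with make_classification; all variable names in it are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the cardio dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
xtr, xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_model = LogisticRegression().fit(xtr, ytr)

acc_score = toy_model.score(xte, yte)                      # built-in scoring
acc_manual = accuracy_score(yte, toy_model.predict(xte))   # explicit metric
acc_by_hand = (toy_model.predict(xte) == yte).mean()       # fraction of matches

print(acc_score, acc_manual, acc_by_hand)
```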

**Manual Hyperparameter Tuning**

Let us try to fine-tune the model parameters to improve the performance of the logistic regressor. We will increase the maximum number of iterations (max_iter). The default value is 100.

In [None]:
logreg = linear_model.LogisticRegression(max_iter=2000)
logreg.fit(x_train,y_train)
Acc_train_max_iter = logreg.score(x_train, y_train)
Acc_test_max_iter = logreg.score(x_test, y_test)

t = PrettyTable(['Accuracy', 'Logistic Regression (%)', 'Logistic Regression (%) (max_iter)'])
t.add_row(['Training', Acc_train*100, Acc_train_max_iter*100])
t.add_row(['Testing', Acc_test*100, Acc_test_max_iter*100])
print(t)

A small improvement in model accuracy is achieved by increasing the maximum number of iterations. Let us try changing the solver next. We will use 'liblinear' instead of the default 'lbfgs'.

In [None]:
logreg = linear_model.LogisticRegression(solver='liblinear')
logreg.fit(x_train,y_train)
Acc_train_solver = logreg.score(x_train, y_train)
Acc_test_solver = logreg.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic Regression (%)', 'Logistic Regression (%) (max_iter)', 'Logistic Regression (%) (solver = liblinear)'])
t.add_row(['Training', Acc_train*100, Acc_train_max_iter*100, Acc_train_solver*100])
# The testing row must report the testing accuracy of the liblinear model
t.add_row(['Testing', Acc_test*100, Acc_test_max_iter*100, Acc_test_solver*100])
print(t)

Again, some more improvement in performance is achieved.
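
One way to judge whether max_iter is the limiting factor is to inspect the fitted model's n_iter_ attribute and to watch for scikit-learn's ConvergenceWarning. The sketch below does this on synthetic data with one deliberately unscaled feature; the data, the tested values of max_iter, and the variable names are assumptions chosen for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data; one feature is inflated to mimic unscaled inputs.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_toy[:, 0] *= 1000

results = {}
for max_iter in (5, 2000):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        toy_model = LogisticRegression(solver="lbfgs", max_iter=max_iter).fit(X_toy, y_toy)
    # True if the solver stopped because it ran out of iterations.
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    results[max_iter] = (int(toy_model.n_iter_[0]), hit_limit)
    print(f"max_iter={max_iter}: iterations used={results[max_iter][0]}, warning={hit_limit}")
```

If the number of iterations used equals max_iter and a warning is raised, the solver did not converge, and a larger max_iter or a different solver may help.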

# Feature Scaling and/or Normalization

Let us try to use feature normalization to improve the performance of the logistic regressor. Here, we will use the MinMaxScaler from sklearn as below

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0,1))

scaler.fit(x_train)
x_train_normalized = scaler.transform(x_train)
x_test_normalized = scaler.transform(x_test)
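
MinMaxScaler with feature_range (0, 1) maps each feature to (x - min) / (max - min), where min and max are taken from the data passed to fit(). The sketch below verifies this on a few made-up rows (the arrays and the name `toy_scaler` are assumptions). It also shows that test values outside the training range fall outside [0, 1], which is expected when the scaler is fitted on the training split only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and testing rows with two features each.
train = np.array([[150.0, 50.0], [170.0, 70.0], [190.0, 110.0]])
test = np.array([[160.0, 80.0], [200.0, 40.0]])

# Fit on the training rows only, then transform the test rows.
toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)
scaled_test = toy_scaler.transform(test)

# The same transformation written out by hand.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
by_hand = (test - col_min) / (col_max - col_min)

print(scaled_test)
```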

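Then we will fit a logistic model using the scaled features. A separate model (logreg_normalized) is used so that the previously trained logreg is left unchanged.

In [None]:
# Fit a new model on the normalized training features
logreg_normalized = linear_model.LogisticRegression()
logreg_normalized.fit(x_train_normalized, y_train)
Acc_train_normalized = logreg_normalized.score(x_train_normalized, y_train)
Acc_test_normalized = logreg_normalized.score(x_test_normalized, y_test)

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic Regression (%)','Logistic Regression with Normalization (%)'])
t.add_row(['Training', Acc_train*100, Acc_train_normalized*100])
t.add_row(['Testing', Acc_test*100, Acc_test_normalized*100])
print(t)

Compare the two columns of the table: scaling is only worth keeping if it improves the accuracy. The remaining sections of this notebook use the unscaled features.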

# Oversampling of Features - Class Imbalance 

We will also try to oversample the data so that we have a balanced dataset, aiming to improve the performance of the logistic regressor. This technique is usually useful when the dataset is not balanced. We will try it for illustration although the dataset is already balanced. We will use the SMOTE technique for oversampling.

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 2)
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)

# Fit a separate model on the resampled training data; the test split is left untouched
logreg_res = linear_model.LogisticRegression()
logreg_res.fit(x_train_res, y_train_res)
Acc_train_res = logreg_res.score(x_train_res, y_train_res)
Acc_test_res = logreg_res.score(x_test, y_test)
print('The number of records with cardio = 0 before oversampling is {}'.format(sum(y_train==0)))
print('The number of records with cardio = 1 before oversampling is {}\n'.format(sum(y_train==1)))

print('The number of records with cardio = 0 after oversampling is {}'.format(sum(y_train_res==0)))
print('The number of records with cardio = 1 after oversampling is {}\n'.format(sum(y_train_res==1)))

from prettytable import PrettyTable
t = PrettyTable(['Accuracy', 'Logistic Regression (%)','Logistic Regression with resample (%)'])
t.add_row(['Training', Acc_train*100, Acc_train_res*100])
t.add_row(['Testing', Acc_test*100, Acc_test_res*100])
print(t)
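
Whether oversampling is worthwhile depends on how imbalanced the classes are. A quick check is to look at the class shares of the training labels, as in the minimal sketch below. The label series is made up for illustration; on the notebook's data, the same call would be applied to `y_train`.

```python
import pandas as pd

# Made-up labels: 7 records of class 0 and 3 records of class 1.
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="cardio")

# Share of each class, and the ratio of the majority to the minority class.
shares = toy_labels.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"majority/minority ratio = {imbalance_ratio:.2f}")
```

A ratio close to 1 indicates a balanced dataset, in which case oversampling is unlikely to change the class counts much.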

# Saving and Loading Models

We will use the joblib library, as described in the scikit-learn documentation on model persistence (https://scikit-learn.org/stable/modules/model_persistence.html), to save and load the models. To save the model, we use the dump() method:

In [None]:
import joblib as jb
jb.dump(logreg, './Model_logreg.joblib')

To load the trained logistic model, we will use the load() method:

In [None]:
logreg_joblib = jb.load('./Model_logreg.joblib')
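
A simple sanity check after loading is to confirm that the restored model gives the same predictions as the one that was saved. The sketch below does this round trip with a small model trained on synthetic data and written to a temporary directory; the file name `toy_model.joblib` and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data (not the cardio dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("predictions identical after reload:", round_trip_ok)
```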

# Predict New Values Using Models

To predict the target values for new data, we will use the loaded model

In [None]:
x_test.head()

In [None]:
y_predict = logreg_joblib.predict(x_test)
# Work on a copy so that the prediction columns are not added to x_test itself
dfnew = x_test.copy()
dfnew['cardio_predict']=y_predict

For the test split, we have the actual value of 'cardio', so we can add it to the new dataframe for comparison purposes.

In [None]:
dfnew['cardio_actual']=y_test
dfnew.head()

Based on the accuracy measured above, cardio_predict and cardio_actual should match in ~72% (the testing accuracy) of the records.
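
The fraction of matching records can be computed directly from the two columns. Below is a minimal sketch on a made-up frame with the same column names (the values and the helper column `match` are assumptions); on the notebook's data, the same two lines would be applied to `dfnew`.

```python
import pandas as pd

# Made-up predictions and actual labels for eight records.
toy = pd.DataFrame({
    "cardio_predict": [1, 0, 1, 1, 0, 0, 1, 0],
    "cardio_actual":  [1, 0, 0, 1, 0, 1, 1, 0],
})

# A boolean column marking agreement; its mean is the accuracy.
toy["match"] = toy["cardio_predict"] == toy["cardio_actual"]
match_rate = toy["match"].mean()

print(f"matching records: {match_rate:.1%}")  # matching records: 75.0%
```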