## Practical Activity: Classification Using Neural Networks

### 1 Practical Activity (Week 8)

#### 1.1 Classification Using Neural Networks

This notebook is an exercise in developing a Neural Network (NN) classifier that predicts the presence of diabetes in patients.

#### 1.2 Task

Our aim is to build a classification model to predict diabetes. We will use the Pima Indians Diabetes dataset, which contains 768 observations and 9 variables (8 predictors and 1 target):
- Pregnancies: Number of times pregnant.
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- BloodPressure: Diastolic blood pressure (mm Hg).
- SkinThickness: Triceps skinfold thickness (mm).
- Insulin: 2-hour serum insulin (mu U/ml).
- BMI: Body mass index (weight in kg / (height in m)^2).
- DiabetesPedigreeFunction: Diabetes pedigree function.
- Age: Age in years.
- Outcome: "1" represents the presence of diabetes while "0" represents the absence of it.

The dataset is available here:

👉 [kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

#### 1.3 Evaluation Metric

We will evaluate the performance of the model using accuracy, the proportion of samples that are classified correctly (`accuracy_score` reports it as a value between 0 and 1).
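As a minimal sketch of what the accuracy metric computes, the snippet below counts matching labels by hand. The helper `accuracy` and the toy labels are hypothetical and only illustrate the definition; the notebook itself relies on `sklearn.metrics.accuracy_score`.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Toy labels (not taken from the diabetes data): 4 of the 5 predictions match.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))
```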

##### 1.3.1 Step 1 - Loading the required libraries and modules

In [9]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

##### 1.3.2 Step 2 - Reading the data and performing basic data checks.

In [None]:
# Reading the data and performing basic data checks.
df = pd.read_csv("diabetes.csv")

print(df.shape)

df.describe()

(768, 9)


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [3]:
df.head()

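| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |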


In the `describe()` summary above, the mean of the 'Outcome' variable is 0.35. Because 'Outcome' is coded as 0 or 1, this means that around 35% of the observations in the dataset have diabetes.
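The class balance also gives a useful reference point for accuracy. The sketch below uses a hypothetical stand-in for `df["Outcome"]`: the count of 268 positives is derived from the reported mean (0.348958 × 768 ≈ 268), not read from the file. It shows the accuracy a model would reach by always predicting the majority class.

```python
import pandas as pd

# Hypothetical stand-in for df["Outcome"]: 268 positives and 500 negatives,
# a split inferred from the mean of 0.348958 over 768 rows.
outcome = pd.Series([1] * 268 + [0] * 500, name="Outcome")

# Share of each class; the larger share is the majority-class baseline accuracy.
class_share = outcome.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(round(baseline_accuracy, 3))
```

Under this assumption, any classifier should be compared against a baseline of roughly 0.65 rather than against 0.5.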

##### 1.3.3 Step 3 - Creating the training and test datasets.

In [None]:
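# Hold out 30% of the rows for testing; stratify keeps the share of diabetic
# patients roughly the same in both splits. No random_state is set, so the
# split and the scores reported below will vary from run to run.
train, test = train_test_split(df, test_size=0.3, stratify=df["Outcome"])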

X_train = train.drop("Outcome", axis=1)
y_train = train["Outcome"]

X_test = test.drop("Outcome", axis=1)
y_test = test["Outcome"]

print(X_train.shape)
print(X_test.shape)

(537, 8)
(231, 8)
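To illustrate what `stratify` does, here is a small sketch on synthetic data with the same size and an assumed class balance (268 positives of 768 rows, inferred from the summary statistics). The `toy` frame, its single `feature` column, and the fixed `random_state` are illustrative choices and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the diabetes frame: one random feature plus the target.
toy = pd.DataFrame(
    {
        "feature": rng.normal(size=768),
        "Outcome": [1] * 268 + [0] * 500,
    }
)

# random_state is fixed here only so that the sketch is repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["Outcome"], random_state=42
)

# Both splits should have a positive rate close to the overall 0.349.
print(toy_train.shape, toy_test.shape)
print(toy_train["Outcome"].mean(), toy_test["Outcome"].mean())
```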


##### 1.3.4 Step 4 - Building the neural network model.

In this step, we will build the neural network model using the `MLPClassifier` (Multi-Layer Perceptron classifier) class from `sklearn.neural_network`.

___
More information here:

👉 [sklearn.neural_network: MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

We will use three hidden layers, each with 8 neurons, which matches the number of features in the dataset.
___

We will also select 'relu' as the activation function and 'adam' as the solver for weight optimisation.

In [None]:
mlp = MLPClassifier(
    hidden_layer_sizes=(8, 8, 8), activation="relu", solver="adam", max_iter=500
)

mlp.fit(X_train, y_train)
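To make the architecture concrete, the sketch below fits the same configuration on synthetic 8-feature data (generated with `make_classification`, an assumption made so the example runs without `diabetes.csv`) and inspects the fitted weight matrices through the `coefs_` and `intercepts_` attributes. The name `mlp_demo` is hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem with 8 input features, standing in for the real data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

mlp_demo = MLPClassifier(
    hidden_layer_sizes=(8, 8, 8),
    activation="relu",
    solver="adam",
    max_iter=500,
    random_state=0,
)

# Silence a possible convergence warning; only the layer shapes matter here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp_demo.fit(X_demo, y_demo)

# One weight matrix per layer transition: input->h1, h1->h2, h2->h3, h3->output.
weight_shapes = [w.shape for w in mlp_demo.coefs_]
n_parameters = sum(w.size for w in mlp_demo.coefs_) + sum(
    b.size for b in mlp_demo.intercepts_
)

print(weight_shapes)
print(n_parameters)
```

For a binary target the output layer has a single unit, so this network is small: three 8×8 weight matrices, one 8×1 matrix, and the corresponding bias vectors.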



##### 1.3.5 Step 5 - Evaluating the neural network model.

In [None]:
predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)

print("Train accuracy: ", accuracy_score(y_train, predict_train))

Train accuracy:  0.7579143389199255


In [None]:
print("Test accuracy: ", accuracy_score(y_test, predict_test))

Test accuracy:  0.7229437229437229


### 2 Practical Activity Task (Week 8)

Try to find the best set of hyperparameters for the NN model using a cross-validated grid search.

In [None]:
# Set up a list of values for each parameter for cross-validation
parameter_space = {
    "activation": ["tanh", "relu"],
    "solver": ["lbfgs", "adam"],
}
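As a rough illustration of how a grid like this expands, the sketch below enumerates the combinations with `ParameterGrid` from `sklearn.model_selection`. The fold count of 5 is an assumption based on the default `cv` of `GridSearchCV` in scikit-learn 0.22 and later.

```python
from sklearn.model_selection import ParameterGrid

parameter_space = {
    "activation": ["tanh", "relu"],
    "solver": ["lbfgs", "adam"],
}

# Every combination of the listed values becomes one candidate model.
candidates = list(ParameterGrid(parameter_space))
for params in candidates:
    print(params)

# Assumed default of 5 folds: each candidate is fitted once per fold.
n_folds = 5
print(len(candidates) * n_folds)
```

With 2 activations and 2 solvers there are 4 candidates, so the search fits 20 models before refitting the best one on the full training set.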

In [10]:
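# Exhaustive search over parameter_space with cross-validation
# (5 folds by default in scikit-learn >= 0.22); n_jobs=-1 uses all CPU cores.
clf = GridSearchCV(mlp, parameter_space, n_jobs=-1)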

# Fitting the model for grid search
clf.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
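The warning above suggests raising `max_iter` or scaling the data. One possible way to do the latter, sketched here on synthetic data, is to put a `StandardScaler` in front of the classifier inside a `Pipeline` so that the scaling is refitted within each cross-validation fold. The step names `scale` and `mlp`, the synthetic data, `max_iter=1000`, and `cv=3` are illustrative assumptions, not settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train with 8 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Standardise the features, then train the MLP on the scaled values.
scaled_mlp = Pipeline(
    [
        ("scale", StandardScaler()),
        ("mlp", MLPClassifier(hidden_layer_sizes=(8, 8, 8), max_iter=1000, random_state=0)),
    ]
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "mlp__activation": ["tanh", "relu"],
    "mlp__solver": ["lbfgs", "adam"],
}

search = GridSearchCV(scaled_mlp, param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training.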


In [None]:
# Best parameters
print("Best parameters found: ", clf.best_params_)

# All results: mean cross-validated accuracy with +/- two standard deviations across folds
means = clf.cv_results_["mean_test_score"]
stds = clf.cv_results_["std_test_score"]

for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Best parameters found:  {'activation': 'relu', 'solver': 'lbfgs'}
0.670 (+/-0.037) for {'activation': 'tanh', 'solver': 'lbfgs'}
0.665 (+/-0.024) for {'activation': 'tanh', 'solver': 'adam'}
0.719 (+/-0.109) for {'activation': 'relu', 'solver': 'lbfgs'}
0.665 (+/-0.054) for {'activation': 'relu', 'solver': 'adam'}
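When reading these scores, it helps to check whether the best configuration is clearly separated from the others. The sketch below copies the four printed results into a DataFrame and tests whether each interval (mean ± the printed spread, which is two standard deviations) overlaps with that of the top scorer. The column names and the overlap rule are illustrative choices, not a formal statistical test.

```python
import pandas as pd

# Values copied from the grid-search output above; "spread" is the printed +/- value.
results = pd.DataFrame(
    [
        ("tanh", "lbfgs", 0.670, 0.037),
        ("tanh", "adam", 0.665, 0.024),
        ("relu", "lbfgs", 0.719, 0.109),
        ("relu", "adam", 0.665, 0.054),
    ],
    columns=["activation", "solver", "mean", "spread"],
)

results["low"] = results["mean"] - results["spread"]
results["high"] = results["mean"] + results["spread"]

# Row with the highest mean score.
best = results.loc[results["mean"].idxmax()]

# A configuration overlaps the best one if its upper bound reaches the best lower bound.
results["overlaps_best"] = results["high"] >= best["low"]

print(results)
```

In this run every interval overlaps with that of `relu` + `lbfgs`, whose spread is also the widest, so the ranking should be treated as tentative.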
