# Using neural networks

In this tutorial, we will use a classic neural network called the Multilayer Perceptron, or MLP.

This is a very generic network built by composing several single perceptrons into layers, as shown in the image below. The input data flows forward through the layers until it reaches the output nodes. Learning is performed with the same Gradient Descent technique that we have seen during the course! (The formula is a bit more complex, because the gradients of the stacked layers have to be composed.)


<img src="img/mlp.png" width="600">
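
To make the forward flow described above concrete, here is a minimal NumPy sketch of a forward pass. It is only an illustration, not how sklearn implements its MLP: the weights are random, the helper `mlp_forward` is hypothetical, and the ReLU and sigmoid activations are assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Propagate a batch x of shape (n_samples, n_features) through the network."""
    activation = x
    # Hidden layers: affine transformation followed by a non-linearity
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    # Output layer: a single sigmoid unit giving one probability per sample
    return sigmoid(activation @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
layer_sizes = [8, 8, 8, 8, 1]  # 8 inputs, three hidden layers of 8 units, 1 output
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.random((5, 8))  # five made-up samples with 8 features each
probabilities = mlp_forward(x, weights, biases)
print(probabilities.shape)  # (5, 1)
```

Training would then adjust `weights` and `biases` by gradient descent on a loss computed from these outputs.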

First, let's import the necessary libraries.

In [1]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

The aim of this guide is to build a classification model to detect diabetes. For this, we will be using [Kaggle's diabetes dataset](https://www.kaggle.com/datasets/mathchi/diabetes-data-set). 

P.S. Don't know what Kaggle is? Ask your instructor!

Load the dataset, contained in the `data/` folder, and show the first 5 records. You can use the `read_csv` function for this: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [2]:
df = pd.read_csv('data/diabetes.csv') 
df.head(5)

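| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |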


How many observations and variables does the dataset contain?

In [3]:
df.shape

(768, 9)

The different variables for this dataset are described as follows:
* Pregnancies - Number of times pregnant.
* Glucose - Plasma glucose concentration.
* BloodPressure - Diastolic blood pressure (mm Hg).
* SkinThickness - Skinfold thickness (mm).
* Insulin - 2-Hour serum insulin (mu U/ml).
* BMI - Body mass index (weight in kg/(height in m)^2).
* DiabetesPedigreeFunction - Diabetes pedigree function.
* Age - Age in years.
* Outcome - “1” represents the presence of diabetes while “0” represents the absence of it. This is the variable we want to create a predictor on.

Show some basic statistics for the dataset variables. You can use pandas' `describe()` for this purpose: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

In [4]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


Looking at the summary for the 'Outcome' variable, the mean value is 0.35, which means that around 35 percent of the observations in the dataset have diabetes. A trivial classifier that always predicts the majority class (no diabetes) would therefore be right about 65 percent of the time. This is our baseline accuracy, and our neural network model should beat it.
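
The baseline figure can be checked with a few lines of NumPy. This is a small sketch: the label array is reconstructed from the summary statistics above (768 rows with a mean of 0.348958 correspond to 268 positives and 500 negatives), so the row order is made up.

```python
import numpy as np

# Labels reconstructed from the summary above: 500 negatives and 268 positives.
# The ordering is arbitrary; only the class counts matter here.
y = np.array([0] * 500 + [1] * 268)

# A "model" that always predicts the most frequent class
majority_class = np.bincount(y).argmax()
baseline_predictions = np.full_like(y, majority_class)

baseline_accuracy = (baseline_predictions == y).mean()
print(majority_class, round(baseline_accuracy, 3))  # 0 0.651
```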

Create 2 lists: one containing a single element, the name of the target variable, and the other containing the names of the 8 predictor variables. We will use these lists to select columns with pandas' indexing operators.

In [5]:
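target_column = ['Outcome']
# Keep the DataFrame's column order; a set difference would not guarantee a stable order between runs
predictors = [col for col in df.columns if col not in target_column]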

Normalize the predictor variables so that each has a minimum value of 0 and a maximum value of 1. For this, you can write your own implementation, or use sklearn's `MinMaxScaler` class: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Use pandas' `describe` function again to verify the correctness of your approach.

In [6]:
df[predictors] = (df[predictors]-df[predictors].min())/(df[predictors].max()-df[predictors].min())
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 0.22618 | 0.19821 | 0.0 | 0.058824 | 0.176471 | 0.352941 | 1.0 |
| Glucose | 768.0 | 0.60751 | 0.160666 | 0.0 | 0.497487 | 0.58794 | 0.704774 | 1.0 |
| BloodPressure | 768.0 | 0.566438 | 0.158654 | 0.0 | 0.508197 | 0.590164 | 0.655738 | 1.0 |
| SkinThickness | 768.0 | 0.207439 | 0.161134 | 0.0 | 0.0 | 0.232323 | 0.323232 | 1.0 |
| Insulin | 768.0 | 0.094326 | 0.136222 | 0.0 | 0.0 | 0.036052 | 0.150414 | 1.0 |
| BMI | 768.0 | 0.47679 | 0.117499 | 0.0 | 0.406855 | 0.4769 | 0.545455 | 1.0 |
| DiabetesPedigreeFunction | 768.0 | 0.168179 | 0.141473 | 0.0 | 0.070773 | 0.125747 | 0.234095 | 1.0 |
| Age | 768.0 | 0.204015 | 0.196004 | 0.0 | 0.05 | 0.133333 | 0.333333 | 1.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
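
The manual formula used in the cell above should give the same result as sklearn's `MinMaxScaler`. The sketch below checks this on a toy frame that reuses three rows from the `df.head()` output; the variable names `toy`, `manual` and `scaled` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame reusing values from the first three rows shown earlier
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
})

# Manual min-max normalization, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with sklearn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual.values, scaled.values))  # True
```

One practical difference is that a fitted `MinMaxScaler` remembers the minimum and maximum of each column, so the same transformation can later be applied to new data with `transform()`.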


Slice the dataset using the previously created lists to build your model's input and target.

In [7]:
X = df[predictors].values
y = df[target_column].values

Use sklearn's `train_test_split` to split your dataset into a training set and a test set. The test set should comprise 30% of the data. Use a _random\_state_ of 40: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)

Print the sizes of both your training and test sets to verify that the split was done properly.

In [9]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(537, 8)
(537, 1)
(231, 8)
(231, 1)
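
A plain random split does not guarantee that both sets keep the same share of diabetic patients. In the run shown in this tutorial, the training set ends up with 179 positives out of 537 (33%) and the test set with 89 out of 231 (39%). As an optional variation, not used in the rest of this guide, `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below uses made-up features and the dataset's class counts, so `X_demo` and `y_demo` are stand-ins rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features; labels use the dataset's class counts (500 negatives, 268 positives)
rng = np.random.default_rng(0)
X_demo = rng.random((768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# stratify=y_demo keeps the share of positives (about 35%) in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```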


Time to model our Multilayer Perceptron! For this, you can use sklearn's `MLPClassifier` class: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
Use the reference documentation to set 3 hidden layers of 8 neurons each, and a maximum of 500 iterations.

To train the model you can use the `fit()` function, as seen during the course.

If you encounter a sklearn warning about lack of convergence, you can slightly increase the _max\_iter_ argument. But beware: training for too long can lead to overfitting!

In [10]:
# If you encounter a sklearn warning that recommends you to use ravel(), you probably should write y_train.ravel()
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), max_iter=500)
mlp.fit(X_train, y_train.ravel())



MLPClassifier(hidden_layer_sizes=(8, 8, 8), max_iter=500)
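
One way to limit the overfitting risk mentioned above, offered here as an optional variation rather than the approach this guide follows, is the `early_stopping` option of `MLPClassifier`. It holds out part of the training data and stops when the validation score no longer improves. The sketch uses synthetic data from `make_classification` as a stand-in for the diabetes dataset, and also sets `random_state` so that the weight initialization is reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the diabetes data (same number of rows and features)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

mlp_demo = MLPClassifier(
    hidden_layer_sizes=(8, 8, 8),
    max_iter=500,
    early_stopping=True,      # hold out part of the training data for validation
    validation_fraction=0.1,  # fraction held out (0.1 is the sklearn default)
    n_iter_no_change=10,      # stop after 10 epochs without improvement
    random_state=40,          # fix the weight initialization for reproducibility
)
mlp_demo.fit(X_demo, y_demo)

# Number of epochs actually run before training stopped
print(mlp_demo.n_iter_)
```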

Use the model's `predict()` function to obtain the predictions for the training set.

In [11]:
predict_train = mlp.predict(X_train)

Once the predictions are generated, we can evaluate the performance of the model. Since this is a classification problem, we would like to check the accuracy. However, because the classes are imbalanced (only about 35 percent of the observations are positive), the precision, recall, and f1 metrics are also very informative.

Let's use sklearn's `confusion_matrix` function to obtain the confusion matrix from the training predictions: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [12]:
confusion_matrix(y_train, predict_train)

array([[318,  40],
       [ 62, 117]])

Sklearn also provides a function to conveniently verify the performance of our model. Use this function, `classification_report`, to see our performance: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [13]:
# Use a print statement for a better visualization
print(classification_report(y_train, predict_train))

              precision    recall  f1-score   support

           0       0.84      0.89      0.86       358
           1       0.75      0.65      0.70       179

    accuracy                           0.81       537
   macro avg       0.79      0.77      0.78       537
weighted avg       0.81      0.81      0.81       537
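
The figures in this report can be derived by hand from the training confusion matrix shown earlier. The sketch below recomputes the metrics for class 1 (diabetes), relying on sklearn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Training confusion matrix from the output above:
# rows are true classes, columns are predicted classes
cm = np.array([[318, 40],
               [62, 117]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.81 precision=0.75 recall=0.65 f1=0.70
```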



While the results look promising, let's remember that they were obtained on data the model has already seen, since this is the data we trained with. Repeat the same process with the test predictions, and verify that the performance is still good.

In [14]:
predict_test = mlp.predict(X_test)
print(confusion_matrix(y_test, predict_test))
print(classification_report(y_test, predict_test))

[[123  19]
 [ 41  48]]
              precision    recall  f1-score   support

           0       0.75      0.87      0.80       142
           1       0.72      0.54      0.62        89

    accuracy                           0.74       231
   macro avg       0.73      0.70      0.71       231
weighted avg       0.74      0.74      0.73       231



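The model also beats the 65 percent baseline on unseen data, with a test accuracy of 74 percent. That is very good news! Note, however, that accuracy dropped from 81 percent on the training set, and recall for the diabetic class fell from 0.65 to 0.54, so there is still room for improvement.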

The model can be further improved by doing cross-validation, feature engineering, or changing the arguments in the neural network estimator. Try to iterate and beat these results!
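
As a starting point for the cross-validation idea, here is a hedged sketch using `cross_val_score` with a pipeline that rescales the features inside each fold. It runs on synthetic data from `make_classification`, so the scores it prints say nothing about the diabetes dataset; replace `X_demo` and `y_demo` with your own unscaled features and target.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the diabetes data (same number of rows and features)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# The scaler is fitted on the training part of each fold only
model = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(8, 8, 8), max_iter=500, random_state=40),
)

# 5-fold cross-validation: five accuracy values, one per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Averaging over several folds gives a steadier estimate than a single train/test split, which makes it easier to tell whether a change of arguments really helps.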

You can also compare your work with the notebooks shared in the [code section of the dataset's Kaggle page](https://www.kaggle.com/datasets/mathchi/diabetes-data-set/code).