# Predict the Onset of Diabetes Based on Diagnostic Measures (The Pima Indians Diabetes Database)

**Reference: https://www.kaggle.com/uciml/pima-indians-diabetes-database**

## Step 1: Verify that all required libraries can be imported  
**np.random.seed is set for reproducibility of results**

In [0]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
np.random.seed(7)

Using TensorFlow backend.


## Step 2: Load the data 
**Questions:**  
**- How many rows and columns does the data have?**

In [0]:
# load pima indians dataset
dataset = np.loadtxt("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", delimiter=",")
print(dataset.shape)

(768, 9)


## Step 3: Preview first 5 rows and all columns  
**dataset[0:5, :] = Rows 0 to 4 with all columns**

In [0]:
print(dataset[0:5, :])

[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01 1.000e+00]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01 0.000e+00]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01 1.000e+00]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01 0.000e+00]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01 1.000e+00]]
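
The raw CSV has no header row, so the columns printed above are unlabeled. As a reading aid, the sketch below pairs the first previewed row with the column names listed on the Kaggle dataset page. The names come from that page, not from the file itself, so treat the mapping as an assumption about column order.

```python
# Column names as listed on the Kaggle dataset page (assumption: the raw CSV
# used in this notebook has no header but follows the same column order).
COLUMN_NAMES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# First row of the dataset, copied from the preview above.
first_row = [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]

# Pair each value with its (assumed) column name.
labeled_row = dict(zip(COLUMN_NAMES, first_row))
for name, value in labeled_row.items():
    print(f"{name:>26}: {value}")
```

Read this way, the last value in each row is the label (1 = diabetes, 0 = no diabetes) and the first eight values are the diagnostic measurements.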


## Step 4: Split the data (768 rows) into Training Set (first 700 rows) and Validation Set (remaining 68 rows)  
**- The first 8 columns (0 to 7) are our features used as input to the model**  
**- The last column (8) is the true label (diabetes or not) or the ground truth**    
#### Questions:
**- What are our input and output?**  
**- Why "0:8" in X and "8" in Y?**


In [0]:
XTRAIN = dataset[:700,0:8]
YTRAIN = dataset[:700,8]
XVALIDATION = dataset[700:,0:8]
YVALIDATION = dataset[700:,8]
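
To see why the slices are written as "0:8" and "8", it can help to run the same pattern on a tiny array. The sketch below uses a made-up 10 x 9 array (`toy`, a hypothetical stand-in for the real data) and splits it at row 7 instead of row 700.

```python
import numpy as np

# Hypothetical stand-in for the real data: 10 rows, 9 columns, values 0..89.
toy = np.arange(90, dtype=float).reshape(10, 9)

# Same slicing pattern as above, with the split at row 7 instead of row 700.
x_train = toy[:7, 0:8]   # rows 0-6, columns 0-7 (the stop index 8 is excluded)
y_train = toy[:7, 8]     # rows 0-6, column 8 only, which gives a 1-D array
x_val = toy[7:, 0:8]     # remaining rows 7-9, columns 0-7
y_val = toy[7:, 8]       # remaining rows 7-9, column 8 only

print(x_train.shape, y_train.shape)  # (7, 8) (7,)
print(x_val.shape, y_val.shape)      # (3, 8) (3,)
print(y_train[:3])                   # [ 8. 17. 26.]
```

The slice `0:8` stops before index 8, so it selects the eight feature columns, while the plain index `8` picks out the single label column.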

## Step 5: Review the dimensions of our Training Dataset and Validation Dataset
**Also preview some of the "input features" and "correct labels" for the datasets**

In [0]:
print(XTRAIN.shape)
print(YTRAIN.shape)
print(XVALIDATION.shape)
print(YVALIDATION.shape)
print(XTRAIN[0:3,])
print(YTRAIN[0:3])
print(XVALIDATION[0:3,])
print(YVALIDATION[0:3])

(700, 8)
(700,)
(68, 8)
(68,)
[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]
 [  8.    183.     64.      0.      0.     23.3     0.672  32.   ]]
[1. 0. 1.]
[[  2.    122.     76.     27.    200.     35.9     0.483  26.   ]
 [  6.    125.     78.     31.      0.     27.6     0.565  49.   ]
 [  1.    168.     88.     29.      0.     35.      0.905  52.   ]]
[0. 1. 1.]


## Step 6: Create a neural network model with 12 neurons in layer 1, 8 neurons in layer 2, and 1 neuron as the last layer
**Questions:**  
**- Why is input_dim = 8? It can also be written as XTRAIN.shape[1] (the number of feature columns)**


In [0]:
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                108       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 104       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
Total params: 221
Trainable params: 221
Non-trainable params: 0
_________________________________________________________________
None
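
The "Param #" column in the summary can be checked by hand. A fully connected (Dense) layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic for the three layers; `dense_params` is a helper name made up for this illustration.

```python
# (number of inputs, number of units) for each Dense layer in the model above.
layers = [(8, 12), (12, 8), (8, 1)]

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
print(counts)       # [108, 104, 9]
print(sum(counts))  # 221
```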


## Step 6 (continued): Compile the model to set its loss function, optimizer, and metrics

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
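
As a rough illustration of what the 'binary_crossentropy' loss measures, here is a plain-Python sketch of the textbook formula. It is not the Keras implementation; the clipping value `eps` and the function name are assumptions made for this example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Two samples with true labels 1 and 0.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Predictions that are confident and correct give a loss near 0, while confident mistakes are penalized heavily. The optimizer adjusts the weights to push this number down.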

## Step 7: Do the Training (i.e. Fit the model)
**- We feed XTRAIN into the model, and the model calculates its error against YTRAIN**  
**- In one epoch the model passes through every row in XTRAIN once**  
**- Increasing the number of epochs usually improves accuracy on the training data, though too many epochs can overfit**  
**- To observe the accuracy on the VALIDATION data during training, add ", validation_data=(XVALIDATION, YVALIDATION)" to the fit call**

In [0]:
model.fit(XTRAIN, YTRAIN, epochs=15, batch_size=10)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f4229beb4a8>
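
The epochs and batch_size arguments together determine how many times the weights are updated. A small back-of-the-envelope sketch, assuming one weight update per batch (the usual behavior for mini-batch training):

```python
import math

n_rows = 700      # rows in XTRAIN
batch_size = 10   # from model.fit(..., batch_size=10)
epochs = 15       # from model.fit(..., epochs=15)

# One update per batch; the last batch may be smaller if the division is uneven.
updates_per_epoch = math.ceil(n_rows / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 70
print(total_updates)      # 1050
```

With only about a thousand updates, the model has had relatively little opportunity to learn, which is consistent with the modest accuracy reported in the next steps.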

## Step 8: Evaluate the model on the Training data (the same data we used to train the model)

In [0]:
scores = model.evaluate(XTRAIN, YTRAIN)
print(model.metrics_names)
print(scores)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

['loss', 'acc']
[0.6603473850658962, 0.6885714292526245]

acc: 68.86%


## Step 9: The real test of the model we trained
**We will evaluate the model on data it has not seen during training (i.e. the validation dataset)**

In [0]:
scores = model.evaluate(XVALIDATION, YVALIDATION)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))


acc: 60.29%


## Step 10: Look into what the model actually predicted
**An example of what the model predicted, compared with the true classes**

In [0]:
print(XVALIDATION[0:5])
print(YVALIDATION[0:5])

[[2.00e+00 1.22e+02 7.60e+01 2.70e+01 2.00e+02 3.59e+01 4.83e-01 2.60e+01]
 [6.00e+00 1.25e+02 7.80e+01 3.10e+01 0.00e+00 2.76e+01 5.65e-01 4.90e+01]
 [1.00e+00 1.68e+02 8.80e+01 2.90e+01 0.00e+00 3.50e+01 9.05e-01 5.20e+01]
 [2.00e+00 1.29e+02 0.00e+00 0.00e+00 0.00e+00 3.85e+01 3.04e-01 4.10e+01]
 [4.00e+00 1.10e+02 7.60e+01 2.00e+01 1.00e+02 2.84e+01 1.18e-01 2.70e+01]]
[0. 1. 1. 0. 0.]


In [0]:
prediction = model.predict(XVALIDATION)

In [0]:
print(prediction[0:5])

[[0.27730972]
 [0.16372836]
 [0.14317933]
 [0.33861685]
 [0.19883561]]


In [0]:
print(prediction[0:5].round())

[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
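
Rounding a sigmoid output amounts to applying a cut-off at 0.5. The sketch below makes that threshold explicit for the five probabilities printed above and compares the result with the true labels; the variable names are chosen for this illustration only.

```python
# Predicted probabilities and true labels for the first five validation rows,
# copied from the outputs above.
probabilities = [0.27730972, 0.16372836, 0.14317933, 0.33861685, 0.19883561]
true_labels = [0, 1, 1, 0, 0]

THRESHOLD = 0.5  # rounding a probability is effectively a cut at 0.5

predicted = [1 if p > THRESHOLD else 0 for p in probabilities]
n_correct = sum(int(p == t) for p, t in zip(predicted, true_labels))

print(predicted)  # [0, 0, 0, 0, 0]
print(n_correct)  # 3
```

All five probabilities fall below 0.5, so the model predicts "no diabetes" for every row and misses both positive cases in this sample.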


## Step 11: Accuracy is not sufficient to evaluate our model's ability to do binary classification  
**We can further evaluate the model using precision, recall, and F1-score**

In [0]:
accuracy = accuracy_score(YVALIDATION, prediction.round())
precision = precision_score(YVALIDATION, prediction.round())
recall = recall_score(YVALIDATION, prediction.round())
f1score = f1_score(YVALIDATION, prediction.round())
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("Precision: %.2f%%" % (precision * 100.0))
print("Recall: %.2f%%" % (recall * 100.0))
print("F1-score: %.2f" % (f1score))

Accuracy: 60.29%
Precision: 50.00%
Recall: 7.41%
F1-score: 0.13
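
To make these numbers concrete, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported metrics for 68 validation rows (2 true positives, 2 false positives, 25 false negatives, 39 true negatives), so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the metrics printed above
# (assumption: the notebook itself does not print these counts).
tp, fp, fn, tn = 2, 2, 25, 39

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of the predicted positives, how many are right
recall = tp / (tp + fn)             # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict "no diabetes" (class 0).
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.2%}")           # 60.29%
print(f"Precision: {precision:.2%}")          # 50.00%
print(f"Recall:    {recall:.2%}")             # 7.41%
print(f"F1-score:  {f1:.2f}")                 # 0.13
print(f"Baseline:  {baseline_accuracy:.2%}")  # 60.29%
```

Under these counts, a model that always answers "no diabetes" reaches the same accuracy, which is why recall and F1-score give a more honest picture here.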


## Step 12: How can the model's performance be improved?  
**- Increase the number of epochs to 100 or 150**  
**- Add more layers to the neural network**  
**- Change how many rows go into the training set and the validation set**