# (Original) Predict the Onset of Diabetes Based on Diagnostic Measures (The Pima Indians Diabetes Database)
# (This example) Predict Age from all other features

**Reference: https://www.kaggle.com/uciml/pima-indians-diabetes-database**

## Step 1: Verify that all required libraries can be imported  
**np.random.seed fixes NumPy's random seed to help make results reproducible**

In [1]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
np.random.seed(7)

Using TensorFlow backend.


## Step 2: Load the data 
**Questions:**  
**- How many rows and columns does the data have?**

In [2]:
# load pima indians dataset
dataset = np.loadtxt("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", delimiter=",")
print(dataset.shape)

(768, 9)


## Step 3: Preview first 5 rows and all columns  
**dataset[0:5, :] = Rows 0 to 4 with all columns**

In [0]:
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})

In [4]:
print(dataset[0:5, :])

[[6.00 148.00 72.00 35.00 0.00 33.60 0.63 50.00 1.00]
 [1.00 85.00 66.00 29.00 0.00 26.60 0.35 31.00 0.00]
 [8.00 183.00 64.00 0.00 0.00 23.30 0.67 32.00 1.00]
 [1.00 89.00 66.00 23.00 94.00 28.10 0.17 21.00 0.00]
 [0.00 137.00 40.00 35.00 168.00 43.10 2.29 33.00 1.00]]


## Step 4: Split the data (768 rows) into Training Set (first 700 rows) and Validation Set (remaining 68 rows)  
**- Columns 0 to 6 plus column 8 (the diabetes label) are the 8 features used as input to the model**  
**- Column 7 (Age) is the target the model predicts, i.e. the ground truth**    
#### Questions:
**- What is our input and output?**   
**- Why "[0,1,2,3,4,5,6,8]" in X and "7" in Y?**


In [0]:
XTRAIN = dataset[:700,[0,1,2,3,4,5,6,8]]
YTRAIN = dataset[:700,7]
XVALIDATION = dataset[700:,[0,1,2,3,4,5,6,8]]
YVALIDATION = dataset[700:,7]
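
The column selection above can be checked on a small sample. The sketch below is illustrative only: it reuses the first two rows from the Step 3 preview (values as rounded there), and the names `AGE_COLUMN`, `feature_columns`, `x_sample` and `y_sample` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
import numpy as np

# First two rows of the dataset, copied from the Step 3 preview (rounded values).
sample = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.63, 50.0, 1.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.35, 31.0, 0.0],
])

AGE_COLUMN = 7  # index of the Age column, the prediction target in this example

# Every column index except the target becomes an input feature.
feature_columns = [c for c in range(sample.shape[1]) if c != AGE_COLUMN]

x_sample = sample[:, feature_columns]  # all rows, the 8 feature columns
y_sample = sample[:, AGE_COLUMN]       # all rows, the Age column only

print(feature_columns)
print(x_sample.shape, y_sample)
```

Building the list from `AGE_COLUMN` gives the same indices as the hand-written `[0,1,2,3,4,5,6,8]`, and makes it harder to leak the target into the inputs by accident.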

## Step 5: Review the dimensions (and values) of our Training Dataset and Validation Dataset
**Also preview some of the "input features" and "target values" (ages) for the datasets**

In [6]:
print(XTRAIN.shape)
print(YTRAIN.shape)
print(XVALIDATION.shape)
print(YVALIDATION.shape)
print(XTRAIN[0:3,])
print(YTRAIN[0:3])
print(XVALIDATION[0:3,])
print(YVALIDATION[0:3])

(700, 8)
(700,)
(68, 8)
(68,)
[[6.00 148.00 72.00 35.00 0.00 33.60 0.63 1.00]
 [1.00 85.00 66.00 29.00 0.00 26.60 0.35 0.00]
 [8.00 183.00 64.00 0.00 0.00 23.30 0.67 1.00]]
[50.00 31.00 32.00]
[[2.00 122.00 76.00 27.00 200.00 35.90 0.48 0.00]
 [6.00 125.00 78.00 31.00 0.00 27.60 0.56 1.00]
 [1.00 168.00 88.00 29.00 0.00 35.00 0.91 1.00]]
[26.00 49.00 52.00]


## Step 6: Create a neural network model with 12 neurons in layer 1, 8 neurons in layer 2, and 1 neuron as the last layer
**Questions:**  
**- Why is input_dim = 8? It can also be written as XTRAIN.shape[1]**


In [7]:
model = Sequential()
model.add(Dense(12, input_dim=8, activation='sigmoid'))
model.add(Dense(8, activation='sigmoid'))
model.add(Dense(1, activation='linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                108       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 104       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
Total params: 221
Trainable params: 221
Non-trainable params: 0
_________________________________________________________________
None
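
The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The following is a small sketch of that arithmetic; `dense_params` and `layer_sizes` are hypothetical names used only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer:
    one weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the sizes of the three Dense layers.
layer_sizes = [8, 12, 8, 1]

# Pair each layer's input width with its number of units.
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # per-layer parameter counts
print(sum(counts))  # total trainable parameters
```

The per-layer values match the `Param #` column (108, 104, 9) and the total of 221 reported by `model.summary()`.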


## Step 6 (continued): Compile the model to set the loss function and optimizer

In [0]:
model.compile(loss='mse', optimizer='adam', metrics=['mse']) # mse = mean squared error

## Step 7: Do the Training (i.e. Fit the model)
**- We feed XTRAIN into the model and the model calculates errors using YTRAIN**  
**- In one epoch the model passes through every row of XTRAIN once**  
**- Increasing the number of epochs usually lowers the training error (here, the mean squared error)**  
**- To observe the error on the VALIDATION data during training, add ", validation_data=(XVALIDATION, YVALIDATION)"**

In [9]:
model.fit(XTRAIN, YTRAIN, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f4fddeb7710>
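
To make the epoch and batch settings concrete, the sketch below works out how many weight updates the `fit` call above performs, assuming one update per batch. The variable names are hypothetical and the numbers are taken from this notebook (700 training rows, `batch_size=10`, `epochs=10`).

```python
import math

n_rows = 700      # rows in XTRAIN
batch_size = 10   # rows processed per weight update
epochs = 10       # full passes over XTRAIN

# The last batch of an epoch may be smaller, hence the ceiling.
updates_per_epoch = math.ceil(n_rows / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)
```

Raising `epochs` to 100 with the same batch size would therefore mean ten times as many updates, which is one reason more epochs tend to lower the training error.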

## Step 8: Evaluate the model on the Training data (the same data we used to train the model)

In [10]:
scores = model.evaluate(XTRAIN, YTRAIN)
print(model.metrics_names)
print(scores)
print("\n%s: %.2f" % (model.metrics_names[1], scores[1]))

['loss', 'mean_squared_error']
[951.827782156808, 951.827782156808]

mean_squared_error: 951.83


## Step 9: The real test of the model we trained
**We will evaluate the model on the "Unknown" dataset (i.e. validation dataset)**

In [11]:
scores = model.evaluate(XVALIDATION, YVALIDATION)
print("\n%s: %.2f" % (model.metrics_names[1], scores[1]))


mean_squared_error: 1020.77
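
Mean squared error is measured in squared years, which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE) in years. The sketch below applies this to the two values printed in Steps 8 and 9; the variable names are hypothetical.

```python
import math

train_mse = 951.827782156808   # value printed in Step 8
validation_mse = 1020.77       # value printed in Step 9 (rounded as shown)

# RMSE is expressed in the same unit as the target, i.e. years of age.
train_rmse = math.sqrt(train_mse)
validation_rmse = math.sqrt(validation_mse)

print("train RMSE: %.2f years" % train_rmse)
print("validation RMSE: %.2f years" % validation_rmse)
```

Both values come out at roughly 31 to 32 years, which suggests that after 10 epochs the model is still far from a useful age predictor.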


## Step 10: Look into what the model actually predicted
**An example of what the model has predicted and a comparison with the true ages**

In [12]:
print(XVALIDATION[0:5])
print ('')
print(YVALIDATION[0:5])

[[2.00 122.00 76.00 27.00 200.00 35.90 0.48 0.00]
 [6.00 125.00 78.00 31.00 0.00 27.60 0.56 1.00]
 [1.00 168.00 88.00 29.00 0.00 35.00 0.91 1.00]
 [2.00 129.00 0.00 0.00 0.00 38.50 0.30 0.00]
 [4.00 110.00 76.00 20.00 100.00 28.40 0.12 0.00]]

[26.00 49.00 52.00 41.00 27.00]


In [0]:
prediction = model.predict(XVALIDATION)

In [14]:
print(prediction[0:5])

[[4.60]
 [4.61]
 [4.61]
 [4.58]
 [4.60]]
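
A quick way to compare these predictions with the true ages is to compute the absolute error per row. The sketch below uses only the five values printed above (rounded as shown); `y_true`, `y_pred`, `abs_errors` and `spread` are hypothetical names.

```python
import numpy as np

# True ages and predictions for the first five validation rows, as printed above.
y_true = np.array([26.0, 49.0, 52.0, 41.0, 27.0])
y_pred = np.array([4.60, 4.61, 4.61, 4.58, 4.60])

abs_errors = np.abs(y_true - y_pred)    # error in years for each row
spread = y_pred.max() - y_pred.min()    # how much the predictions vary

print(abs_errors)
print("mean absolute error: %.2f" % abs_errors.mean())
print("spread of predictions: %.2f" % spread)
```

The predictions are almost identical for every row, while the true ages range from 26 to 52, so at this stage the model output barely depends on the input features.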


## Step 11: How can the model's performance be improved?  
**- Increase the number of epochs to 100 or 150**  
**- Add more layers to the neural network**  
**- Increase/Decrease the number of rows in the training/validation set**
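
One further option, not part of the original list and offered here only as an assumption, is to standardize the input features before training, since the columns have very different ranges. Whether this helps on this dataset is not shown in this notebook. The sketch below uses the three XTRAIN rows printed in Step 5; `fit_scaler` and `apply_scaler` are hypothetical helpers, not Keras or NumPy functions.

```python
import numpy as np

# The three XTRAIN rows printed in Step 5 (rounded as shown there).
train_rows = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.63, 1.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.35, 0.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.67, 1.0],
])

def fit_scaler(x):
    """Compute per-column mean and standard deviation from training data."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_scaler(x, mean, std):
    """Standardize columns using statistics computed on the training set."""
    return (x - mean) / std

mean, std = fit_scaler(train_rows)
scaled = apply_scaler(train_rows, mean, std)

print(scaled.mean(axis=0))  # each column is centered near zero
```

In a full run, the statistics would be computed on XTRAIN only and then applied to both XTRAIN and XVALIDATION, so that no information from the validation rows leaks into training.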