# Predictive Modeling # 
We will use Keras to fit a logistic regression model. In this notebook, instead of dropping every instance with missing values, we replace each missing value with the mean of its feature. Previously, when we dropped all instances with missing values, our dataset shrank to 392 instances. With this simple form of imputation (a basic kind of data augmentation), we keep all 768 instances.

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sb
import numpy as np 
import itertools

plt.style.use('fivethirtyeight')

from subprocess import check_output 

from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
diabetes = pd.read_csv('../data/pima-indians-diabetes.data.csv', header=None)

## Cleaning out the missing values ## 

In [3]:
diabetes.head(3)

   0    1   2   3  4     5      6   7  8
0  6  148  72  35  0  33.6  0.627  50  1
1  1   85  66  29  0  26.6  0.351  31  0
2  8  183  64   0  0  23.3  0.672  32  1


In [4]:
diabetes_nan = diabetes.copy(deep=True)
# A zero in columns 1-5 is treated as a missing measurement, so mark it as NaN.
diabetes_nan[[1,2,3,4,5]] = diabetes_nan[[1,2,3,4,5]].replace(0, np.nan)
# print(diabetes_nan.isnull().sum())
print(diabetes_nan.head(3))

   0      1     2     3   4     5      6   7  8
0  6  148.0  72.0  35.0 NaN  33.6  0.627  50  1
1  1   85.0  66.0  29.0 NaN  26.6  0.351  31  0
2  8  183.0  64.0   NaN NaN  23.3  0.672  32  1
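To see how many entries each column loses to the zero-to-NaN replacement, one can count the nulls per column. The snippet below is a minimal sketch on a toy frame (the name `toy` is hypothetical; its values are just the first three rows of columns 1 to 5 shown above), not the full dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: first three rows of columns 1-5 from the output above.
toy = pd.DataFrame({1: [148, 85, 183],
                    2: [72, 66, 64],
                    3: [35, 29, 0],
                    4: [0, 0, 0],
                    5: [33.6, 26.6, 23.3]})

# Treat zeros as missing, as in the cell above.
toy_nan = toy.replace(0, np.nan)

# Number of missing entries per column.
missing_counts = toy_nan.isnull().sum()
print(missing_counts)
```

Running the same `isnull().sum()` on `diabetes_nan` would give the per-column counts for the real data.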


In [5]:
diabetes.head(3)

   0    1   2   3  4     5      6   7  8
0  6  148  72  35  0  33.6  0.627  50  1
1  1   85  66  29  0  26.6  0.351  31  0
2  8  183  64   0  0  23.3  0.672  32  1


In [6]:
means = diabetes_nan.mean(axis=0)
print(means)

0      3.845052
1    121.686763
2     72.405184
3     29.153420
4    155.548223
5     32.457464
6      0.471876
7     33.240885
8      0.348958
dtype: float64


## Data Augmentation to replace NaN ## 

In [7]:
diabetes_nan_mean = diabetes_nan.fillna(means)

In [8]:
diabetes_nan_mean.head(10)

    0      1          2         3           4          5      6   7  8
0   6  148.0  72.000000  35.00000  155.548223  33.600000  0.627  50  1
1   1   85.0  66.000000  29.00000  155.548223  26.600000  0.351  31  0
2   8  183.0  64.000000  29.15342  155.548223  23.300000  0.672  32  1
3   1   89.0  66.000000  23.00000   94.000000  28.100000  0.167  21  0
4   0  137.0  40.000000  35.00000  168.000000  43.100000  2.288  33  1
5   5  116.0  74.000000  29.15342  155.548223  25.600000  0.201  30  0
6   3   78.0  50.000000  32.00000   88.000000  31.000000  0.248  26  1
7  10  115.0  72.405184  29.15342  155.548223  35.300000  0.134  29  0
8   2  197.0  70.000000  45.00000  543.000000  30.500000  0.158  53  1
9   8  125.0  96.000000  29.15342  155.548223  32.457464  0.232  54  1
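The fill step relies on two pandas behaviours: `mean` skips NaN by default, and `fillna` given a Series matches each mean to its column by label. A small sketch on made-up data (the frame `toy` and its column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [np.nan, 10.0, 20.0]})

# Column means; NaN entries are skipped (skipna=True is the default).
col_means = toy.mean(axis=0)

# Each NaN is replaced by the mean of its own column.
filled = toy.fillna(col_means)
print(filled)
```

Here the missing entry in `a` becomes 2.0 and the one in `b` becomes 15.0, which mirrors how values such as 155.548223 appear in column 4 above.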


# Begin Training # 

In [9]:
train, test = train_test_split(diabetes_nan_mean, 
                               test_size=.20, 
                               random_state=0, 
                               stratify=diabetes_nan_mean[8])

train_X = train[train.columns[:8]]
test_X = test[test.columns[:8]]

train_Y = train[8]
test_Y = test[8]

print(train_X.shape)
print(test_X.shape)

(614, 8)
(154, 8)


In [10]:
# Count positive (1) and negative (0) labels in the training set.
count_1 = int((train_Y == 1).sum())
count_0 = int((train_Y == 0).sum())

print(count_1)
print(count_0)
print(len(train_Y))

214
400
614
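The counts above show that the classes are imbalanced, which gives a useful reference point for the accuracy reported later: a model that always predicts the majority class would already be right on most training rows. A quick sketch of that arithmetic, using only the three numbers printed above (the variable names are illustrative):

```python
# Class counts printed by the cell above.
count_0, count_1 = 400, 214
total = count_0 + count_1

# Share of positive labels in the training split.
positive_fraction = count_1 / float(total)

# Accuracy of always predicting the majority class.
majority_baseline = max(count_0, count_1) / float(total)

print(round(positive_fraction, 3))
print(round(majority_baseline, 3))
```

The positive share of roughly 0.349 matches the overall mean of column 8 shown earlier, as expected with a stratified split.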


In [11]:
test_X = test_X.values
train_X = train_X.values
test_Y = test_Y.values
train_Y = train_Y.values

In [12]:
# One-hot encode the training labels: 0 -> [1, 0], 1 -> [0, 1].
train_Y_format = np.zeros(shape=(len(train_Y), 2))
for i in range(len(train_Y)):
    if train_Y[i] == 0: 
        train_Y_format[i] = [1,0]
    elif train_Y[i] == 1: 
        train_Y_format[i] = [0,1]
    else: 
        print("uh oh")

In [13]:
# One-hot encode the test labels the same way.
test_Y_format = np.zeros(shape=(len(test_Y), 2))
for i in range(len(test_Y)):
    if test_Y[i] == 0: 
        test_Y_format[i] = [1,0]
    elif test_Y[i] == 1: 
        test_Y_format[i] = [0,1]
    else: 
        print("uh oh")
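The two loops above build one-hot label matrices row by row. The same encoding can be written without a loop by indexing an identity matrix with the integer labels. A sketch with hypothetical labels (the array `labels` is made up; in the notebook it would be `train_Y` or `test_Y` cast to integers):

```python
import numpy as np

# Hypothetical integer class labels.
labels = np.array([0, 1, 1, 0])

# Row k of the 2x2 identity matrix is the one-hot vector for class k.
one_hot = np.eye(2)[labels]
print(one_hot)
```

Keras also ships a `to_categorical` utility that serves the same purpose, if an extra import is preferred over the NumPy idiom.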

# Let's attempt to train a logistic regression model # 

In [14]:
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.regularizers import L1L2

output_dim = nb_classes = 2
input_dim = 8

model = Sequential() 
model.add(Dense(output_dim, 
                input_dim=input_dim, 
                activation='softmax',
                kernel_regularizer=L1L2(l1=0.0, l2=0.1),
                )) 
batch_size = 10
nb_epoch = 60

Using TensorFlow backend.
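A single `Dense` layer with two softmax outputs behaves as a logistic regression because, for two classes, the softmax probability of class 1 equals a sigmoid applied to the difference of the two logits. The NumPy sketch below checks this numerically; the `logits` values are arbitrary and the helper functions are written here for illustration, not taken from Keras:

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary two-class logits, one row per sample.
logits = np.array([[0.2, 1.5],
                   [-1.0, 0.3],
                   [2.0, -0.5]])

# Probability of class 1 computed two ways.
p_class1_softmax = softmax(logits)[:, 1]
p_class1_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])

print(np.allclose(p_class1_softmax, p_class1_sigmoid))
```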


In [15]:
model.compile(optimizer='adagrad', 
              loss='binary_crossentropy', 
              metrics=['accuracy']) 

model.fit(train_X, train_Y_format, 
                    batch_size=batch_size, 
                    epochs=nb_epoch,
                    verbose=1, 
                    validation_data=(test_X, test_Y_format)) 

score = model.evaluate(test_X, test_Y_format, verbose=0) 

print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Train on 614 samples, validate on 154 samples
Epoch 1/60 through Epoch 60/60 (per-epoch metrics not captured)
('Test score:', 0.5758569194124891)
('Test accuracy:', 0.7337662306698886)
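Accuracy alone hides which class the errors fall on. One way to look closer is to turn predicted probabilities into class labels with `argmax` and tabulate a confusion matrix. The sketch below uses made-up arrays: `probs` stands in for what `model.predict(test_X)` would return and `true_one_hot` for `test_Y_format`, so the numbers are not results from this notebook:

```python
import numpy as np

# Hypothetical predicted class probabilities and one-hot true labels.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4],
                  [0.1, 0.9]])
true_one_hot = np.array([[1, 0],
                         [0, 1],
                         [0, 1],
                         [0, 1]])

# Convert both to integer class labels.
pred = probs.argmax(axis=1)
true = true_one_hot.argmax(axis=1)

# Fraction of rows where prediction and truth agree.
accuracy = (pred == true).mean()

# confusion[t, p] counts rows with true class t predicted as class p.
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(true, pred):
    confusion[t, p] += 1

print(accuracy)
print(confusion)
```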


# RESULTS #

In this notebook, we used a simple form of data augmentation that preserves all of the instances with missing values by filling them with the average of the feature. Although the filled-in values do not reflect each individual exactly, this is a step in the right direction. With this, our accuracy on the test set so far is about 73% (0.734 in the run above).