# Predictive Modeling # 
We use Keras to fit a logistic regression model. In this notebook, we drop every row that contains a missing value. The training run recorded at the bottom of this notebook reaches a test accuracy of about 64.6%.

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sb
import numpy as np 
import itertools

plt.style.use('fivethirtyeight')

from subprocess import check_output 

from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
diabetes = pd.read_csv('../data/pima-indians-diabetes.data.csv', header=None)

## Cleaning out the missing values ## 

In [3]:
diabetes.head(3)

   0    1   2   3  4     5      6   7  8
0  6  148  72  35  0  33.6  0.627  50  1
1  1   85  66  29  0  26.6  0.351  31  0
2  8  183  64   0  0  23.3  0.672  32  1


In [4]:
# Zeros in columns 1-5 stand in for missing measurements; mark them as NaN
diabetes[[1,2,3,4,5]] = diabetes[[1,2,3,4,5]].replace(0, np.nan)
print(diabetes.isnull().sum())

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
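
As a small illustration of the zero-to-NaN step, here is a sketch on a hypothetical toy frame (the values and the choice of columns are assumptions, not the real dataset). It shows that only the selected columns are touched, so a legitimate zero in another column survives.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a zero in column 0 is a valid count,
# while a zero in columns 1 and 2 is assumed to mean "not measured".
toy = pd.DataFrame({0: [6, 0, 8], 1: [148, 0, 183], 2: [72, 66, 0]})

cols = [1, 2]  # columns where zero is treated as missing (assumption)
toy[cols] = toy[cols].replace(0, np.nan)

missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Restricting the replacement to a column list is what keeps valid zeros (such as the label column) from being discarded later.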


In [5]:
diabetes.head(3)

   0      1     2     3   4     5      6   7  8
0  6  148.0  72.0  35.0 NaN  33.6  0.627  50  1
1  1   85.0  66.0  29.0 NaN  26.6  0.351  31  0
2  8  183.0  64.0   NaN NaN  23.3  0.672  32  1


## Drop all rows with NaN ##

In [6]:
diabetes.dropna(inplace=True)
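
Dropping rows is simple but can be costly, since one missing field removes the whole row. The sketch below measures that cost on a hypothetical toy frame; the column names `glucose`, `insulin`, and `label` are illustrative only and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values scattered across two columns.
toy = pd.DataFrame({
    "glucose": [148.0, 85.0, np.nan, 89.0],
    "insulin": [np.nan, 94.0, 94.0, 168.0],
    "label": [1, 0, 0, 1],
})

rows_before = len(toy)
complete = toy.dropna()      # returns a new frame; toy itself is unchanged
rows_after = len(complete)

print("kept %d of %d rows" % (rows_after, rows_before))
```

Comparing the row count before and after `dropna` is a quick way to see how much of the dataset remains for training.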

In [7]:
train, test = train_test_split(diabetes, 
                               test_size=.20, 
                               random_state=0, 
                               stratify=diabetes[8])

train_X = train[train.columns[:8]]
test_X = test[test.columns[:8]]

train_Y = train[8]
test_Y = test[8]

print(train_X.shape)
print(test_X.shape)

(313, 8)
(79, 8)
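
To see what `stratify` buys us, the sketch below runs the same kind of split on synthetic labels. The 130/262 class counts are an assumption chosen to total 392 rows (313 + 79); only the split sizes come from the output above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 392 rows, about one third positive (assumed counts).
labels = np.array([1] * 130 + [0] * 262)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, random_state=0, stratify=labels)

print(X_train.shape)
print(X_test.shape)
# With stratification, the positive rate is nearly equal in both splits.
print(y_train.mean())
print(y_test.mean())
```

With a dataset this small, an unstratified split could leave the test set with a noticeably different class balance, which would make the accuracy harder to interpret.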


In [8]:
# Count positive (1) and negative (0) labels in the training set
count_1 = int((train_Y == 1).sum())
count_0 = int((train_Y == 0).sum())

print(count_1)
print(count_0)
print(len(train_Y))

104
209
313
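
These counts also give a useful reference point: a classifier that always predicts the majority class. The sketch below rebuilds a label array from the printed counts (104 and 209) and computes that baseline; the variable names are illustrative.

```python
import numpy as np

# Class counts printed by the cell above.
count_1, count_0 = 104, 209

# Rebuild a label array with those counts and count it in one call.
train_labels = np.array([1] * count_1 + [0] * count_0)
counts = np.bincount(train_labels)   # counts[0] = negatives, counts[1] = positives

# Accuracy of always predicting the most common class.
majority_baseline = counts.max() / float(counts.sum())
print(round(majority_baseline, 3))
```

On the training labels this baseline is about 66.8%, so any model accuracy should be read against that figure rather than against 50%.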


In [9]:
test_X = test_X.values
train_X = train_X.values
test_Y = test_Y.values
train_Y = train_Y.values

In [10]:
# One-hot encode the training labels: 0 -> [1, 0], 1 -> [0, 1]
train_Y_format = np.zeros(shape=(len(train_Y), 2))
for i in range(len(train_Y)):
    if train_Y[i] == 0: 
        train_Y_format[i] = [1,0]
    elif train_Y[i] == 1: 
        train_Y_format[i] = [0,1]
    else: 
        print("unexpected label at index %d" % i)

In [11]:
# One-hot encode the test labels: 0 -> [1, 0], 1 -> [0, 1]
test_Y_format = np.zeros(shape=(len(test_Y), 2))
for i in range(len(test_Y)):
    if test_Y[i] == 0: 
        test_Y_format[i] = [1,0]
    elif test_Y[i] == 1: 
        test_Y_format[i] = [0,1]
    else: 
        print("unexpected label at index %d" % i)
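
The two loops above perform one-hot encoding. For reference, here is a vectorized sketch of the same mapping using NumPy indexing, shown on a hypothetical label array. It assumes the labels are already integers in {0, 1}.

```python
import numpy as np

# Hypothetical integer labels; each must be 0 or 1.
labels = np.array([0, 1, 1, 0])

# Row k of the 2x2 identity matrix is the one-hot vector for class k,
# so indexing with the label array encodes every label at once.
one_hot = np.eye(2)[labels]
print(one_hot)
```

Keras also ships a `to_categorical` utility for this purpose, which may be preferable when the number of classes is not fixed.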

# Let's attempt to train a logistic regression model # 

In [12]:
from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.regularizers import L1L2

output_dim = nb_classes = 2
input_dim = 8

model = Sequential() 
model.add(Dense(output_dim, 
                input_dim=input_dim, 
                activation='softmax',
                kernel_regularizer=L1L2(l1=0.0, l2=0.1),
                )) 
batch_size = 5
nb_epoch = 40

Using TensorFlow backend.
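
Why a single softmax `Dense` layer counts as logistic regression: with two outputs, the softmax probability of class 1 equals a sigmoid applied to the difference of the two logits. The NumPy sketch below checks this numerically on random, hypothetical weights; it does not use the Keras model itself.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.RandomState(0)
X = rng.normal(size=(4, 8))   # 4 hypothetical rows, 8 features
W = rng.normal(size=(8, 2))   # kernel shaped (input_dim, output_dim)
b = rng.normal(size=2)        # one bias per output unit

# Two-unit softmax layer.
probs = softmax(X.dot(W) + b)

# Equivalent single-output logistic regression on the logit difference.
w_diff = W[:, 1] - W[:, 0]
b_diff = b[1] - b[0]
p_class1 = sigmoid(X.dot(w_diff) + b_diff)

print(np.allclose(probs[:, 1], p_class1))
```

The two-unit form carries redundant parameters, since only the difference between the two weight columns affects the prediction.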


In [13]:
model.compile(optimizer='adagrad', 
              loss='categorical_crossentropy', 
              metrics=['accuracy']) 

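# Note: the test split is also passed as validation data here
model.fit(train_X, train_Y_format,
          batch_size=batch_size,
          epochs=nb_epoch,  # Keras 2 renamed the nb_epoch argument to epochs
          verbose=1,
          validation_data=(test_X, test_Y_format))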

score = model.evaluate(test_X, test_Y_format, verbose=0) 

print('Test score:', score[0]) 
print('Test accuracy:', score[1])



Train on 313 samples, validate on 79 samples
Epoch 1/40
...
Epoch 40/40
('Test score:', 5.978144537044477)
('Test accuracy:', 0.6455696247800996)
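
To make the accuracy figure concrete, the sketch below shows how accuracy can be computed from softmax outputs and one-hot targets, using small hypothetical arrays. It then converts the reported accuracy back into a count of correct predictions, assuming all 79 test rows were evaluated.

```python
import numpy as np

def accuracy_from_one_hot(probs, targets_one_hot):
    # The predicted class is the column with the highest probability.
    predicted = probs.argmax(axis=1)
    actual = targets_one_hot.argmax(axis=1)
    return float((predicted == actual).mean())

# Hypothetical predictions and targets for four rows.
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]])
targets = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
toy_accuracy = accuracy_from_one_hot(probs, targets)
print(toy_accuracy)

# Reported test accuracy times the number of test rows.
implied_correct = int(round(0.6455696247800996 * 79))
print(implied_correct)
```

With only 79 test rows, each prediction is worth about 1.3 percentage points, so small differences between runs should be interpreted with care.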


# RESULTS #

In this notebook, we dropped every row with a missing value. After training for 40 epochs with a batch size of 5, the run recorded above reached a test accuracy of about 64.6%. In the next notebook, we try variations of data augmentation in hopes of increasing the usable dataset.