# Overview

This notebook walks through how to build a neural network model using `scikit-learn` for a binary target. The data is medical information about a number of adult women of Pima Indian heritage. The goal of the model is to help predict whether a woman has diabetes.

@misc{Dua:2019 ,
author = "Dua, Dheeru and Graff, Casey",
year = "2017",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences" }

# Setup

This is where all the packages needed for the exercise are imported. If you get a `ModuleNotFoundError`, install the missing package (with pip or conda) before continuing.

In [1]:
# All the needed imports
import pandas as pd 
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier, Perceptron
from sklearn.metrics import classification_report, confusion_matrix
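# Note: plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# these two imports require an older release (this notebook was run in 2020).
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import plot_confusion_matrix, accuracy_score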
import statsmodels.api as sm
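
Before running the imports, it can help to check which packages are missing. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the notebook. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used in this notebook (scikit-learn is imported as "sklearn").
required = ["pandas", "sklearn", "statsmodels"]
print("Missing packages:", missing_packages(required))
```

Any name that is printed needs to be installed before the import cell will run.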

## Load the data and create a dataframe

pandas can read data locally or from a URL. In this case you'll read data from the data directory and create a dataframe named `diabetes` that has health information on 768 women who are at least 21 years old and of Pima Indian heritage.

After reading the data you'll use the `shape` attribute to get a count of the rows and columns. There should be 768 rows and 9 columns.

Source: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.



In [2]:
diabetes = pd.read_csv('../data/diabetes.csv')
diabetes.shape

(768, 9)
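
As a minimal sketch of the same read-then-check pattern, the example below loads three rows from an in-memory string instead of `../data/diabetes.csv`. The rows are copied from this notebook's `head()` output; the names `csv_text` and `sample` are placeholders.

```python
import io

import pandas as pd

# A few rows copied from the notebook's head() output, standing in for diabetes.csv.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
```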

## Look at the data for sanity check

After reading the data you'll print the first 5 rows using the `head` method to ensure the data appears correct.

In [3]:
diabetes.head()

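| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |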
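
A sanity check can go a little further than `head`. The sketch below counts missing values and zeros per column on the five rows shown above. Treating a zero in a column such as `Insulin` or `SkinThickness` as a possible placeholder for a missing measurement is an assumption about this data, not something the notebook states.

```python
import pandas as pd

# Measurement columns copied from the head() output above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count explicit missing values (NaN) and zero values per column.
missing_counts = sample.isna().sum()
zero_counts = (sample == 0).sum()
print(pd.DataFrame({"missing": missing_counts, "zeros": zero_counts}))
```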


## Separate the inputs from the full data set

For modeling in scikit-learn it is standard practice to create different objects for the inputs (X variables, independent variables) and the target (Y variable, dependent variable).

In [4]:
# first seven columns of data (Pregnancies through DiabetesPedigreeFunction); Age is not included
inputs = diabetes.iloc[:, 0:7]
target = diabetes.iloc[:, -1]

print(target[45:52], target.shape)

45    1
46    0
47    0
48    1
49    0
50    0
51    0
Name: Outcome, dtype: int64 (768,)
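
Selecting by position works, but selecting by column name makes it easier to see which inputs are included. The sketch below shows that both approaches give the same seven columns on the five rows from `head()`. The names `sample`, `input_columns`, and `sample_target` are placeholders.

```python
import pandas as pd

# The five rows shown by head(), with all nine columns.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Positional selection, as in the cell above: columns 0 through 6.
by_position = sample.iloc[:, 0:7]

# The same selection written with explicit column names.
input_columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction",
]
by_name = sample[input_columns]
sample_target = sample["Outcome"]

print(by_position.equals(by_name), list(by_name.columns))
```

With the slice `0:7`, `Age` is not part of the inputs; adding `"Age"` to the list (or slicing `0:8`) would include it.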


## Split the data into training and test

Creating separate `training` and `validation` (sometimes called `test`) sets helps guard against overfitting: the model is fit on one partition and evaluated on data it has not seen. A model that is overfit will not be useful in predicting future behavior, which is the point of this modeling in the first place.

In [5]:
input_train, input_test, target_train, target_test = train_test_split(inputs, target, test_size = 0.30, random_state=9878)
print(input_train.shape, input_test.shape, target_train.shape, target_test.shape)

(537, 7) (231, 7) (537,) (231,)
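
The 537/231 split follows from the 30% test size: 0.30 × 768 = 230.4, which scikit-learn rounds up to 231 test rows. The sketch below reproduces the sizes with placeholder arrays of the same shape (the values in `X_demo` and `y_demo` are not the diabetes data).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 768
X_demo = np.arange(n_samples * 7).reshape(n_samples, 7)  # placeholder inputs
y_demo = np.arange(n_samples) % 2                        # placeholder binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=9878
)

# The test partition gets ceil(0.30 * 768) rows; the rest go to training.
expected_test = math.ceil(0.30 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)
```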


## Scale the inputs

We need to scale the inputs to improve model performance. Scaling the inputs will subtract the mean and scale to unit variance. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more information.

You will fit the scaler on the training partition only and then apply that same transformation to the validation partition. Keeping the validation data out of the fitting step avoids bias and data leakage.
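
A minimal sketch of the scaling step, assuming the scaler is fit on the training partition and then reused for the validation partition. The arrays here are random placeholders rather than the diabetes data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # placeholder training inputs
X_te = rng.normal(loc=50.0, scale=10.0, size=(40, 3))   # placeholder validation inputs

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean and std from training data only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The scaled validation columns will generally not have a mean of exactly 0, because they are centered with the training statistics.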


## Training the model

To train a model, you must create an instance of the estimator (a neural network in this case) and then use the `fit` method. The way I remember the name is that I'm going to "fit" the inputs to the target.

Below you will create a network with 10 hidden units in one layer. Feel free to experiment with different numbers of hidden units in one or more layers. To create two hidden layers with 5 and 10 units respectively, use `hidden_layer_sizes=(5, 10)`.
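
A minimal sketch of the network described above, using scikit-learn's `MLPClassifier`. The data comes from `make_classification` as a stand-in for the scaled training inputs, and the `max_iter` and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 200 rows, 7 numeric inputs, binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# One hidden layer with 10 units; use hidden_layer_sizes=(5, 10) for two layers.
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
nn.fit(X_demo, y_demo)

# coefs_ holds one weight matrix per connection between consecutive layers.
print(nn.n_layers_, [w.shape for w in nn.coefs_])
```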

## Using the statsmodels package

In [6]:
# Note: sm.Logit does not add an intercept; wrap the inputs in sm.add_constant() to include one.
sm_reg = sm.Logit(target_train, input_train)
results = sm_reg.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.603167
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | Outcome | No. Observations: | 537.0 |
| Model: | Logit | Df Residuals: | 530.0 |
| Method: | MLE | Df Model: | 6.0 |
| Date: | Mon, 13 Jul 2020 | Pseudo R-squ.: | 0.04239 |
| Time: | 13:48:20 | Log-Likelihood: | -323.9 |
| converged: | True | LL-Null: | -338.24 |
| Covariance Type: | nonrobust | LLR p-value: | 7.015e-05 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Pregnancies | 0.0975 | 0.029 | 3.321 | 0.001 | 0.040 | 0.155 |
| Glucose | 0.0117 | 0.003 | 3.718 | 0.000 | 0.006 | 0.018 |
| BloodPressure | -0.0320 | 0.006 | -5.759 | 0.000 | -0.043 | -0.021 |
| SkinThickness | 0.0016 | 0.007 | 0.233 | 0.816 | -0.012 | 0.016 |
| Insulin | 0.0003 | 0.001 | 0.367 | 0.714 | -0.001 | 0.002 |
| BMI | -0.0118 | 0.013 | -0.900 | 0.368 | -0.038 | 0.014 |
| DiabetesPedigreeFunction | 0.3340 | 0.279 | 1.195 | 0.232 | -0.214 | 0.882 |
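
The coefficients in a logistic regression are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds for a one-unit increase in that input. The sketch below applies this to three coefficients copied from the summary. Because the model above was fit without an intercept, treat the numbers as an illustration of the arithmetic rather than a clinical interpretation.

```python
import math

# Coefficients copied from the summary table above (log-odds scale).
coefficients = {
    "Pregnancies": 0.0975,
    "Glucose": 0.0117,
    "BloodPressure": -0.0320,
}

# exp(coef) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 means the odds increase with the input; a ratio below 1 means they decrease.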


In [7]:
log_reg = LogisticRegression()
log_reg.fit(input_train, target_train)

LogisticRegression()

In [8]:
reg = Perceptron()
reg.fit(input_train, target_train)
# model_report is defined in the Model Efficacy section; call model_report(reg) after running that cell.

## Prediction

To use the model created by the `fit` method, you must predict values. The code below uses the values from the validation partition to predict whether the patient has diabetes. This prediction will then be compared to the actual values so you can assess the efficacy of the model.

In [None]:
predictions = log_reg.predict(input_test)
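
`predict` returns hard class labels. For a logistic regression you can also ask for the estimated probabilities with `predict_proba`. The sketch below, on synthetic stand-in data, shows that for a binary target the labels correspond to thresholding the class-1 probability at 0.5. All variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 7 numeric inputs, binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

demo_labels = demo_model.predict(X_demo)             # hard 0/1 class labels
demo_probs = demo_model.predict_proba(X_demo)[:, 1]  # estimated probability of class 1

# For a binary problem the labels correspond to thresholding the probability at 0.5.
thresholded = (demo_probs > 0.5).astype(int)
print(np.array_equal(demo_labels, thresholded))
```

Working with the probabilities lets you pick a different threshold when false negatives and false positives have different costs.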

## Model Efficacy

With predictions complete on the validation partition, you can calculate the quality of the model. The confusion matrix, ROC chart, and classification report are a few ways to evaluate a model.

In [None]:
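def model_report(model_obj):
    """Print and plot evaluation metrics for a fitted classifier on the validation partition."""
    # Relies on the notebook-level input_test and target_test variables.
    pred = model_obj.predict(input_test)
    print("Class: {}".format(model_obj.__class__))
    print(confusion_matrix(target_test, pred))
    print(classification_report(target_test, pred))
    print("Accuracy: {:0.4f}".format(accuracy_score(target_test, pred)))
    # Trailing semicolons suppress the returned display objects in notebook output.
    plot_roc_curve(model_obj, input_test, target_test);
    plot_confusion_matrix(model_obj, input_test, target_test, values_format='');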

In [None]:
print(confusion_matrix(target_test,predictions))
print(classification_report(target_test,predictions))
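
To see where the numbers in the classification report come from, the sketch below computes precision, recall, and accuracy by hand from a confusion matrix. The `actual` and `predicted` lists are made-up labels for illustration, not results from the diabetes model.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)                  # of the predicted positives, how many were right
recall = tp / (tp + fn)                     # of the actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were right
print(tn, fp, fn, tp, precision, recall, accuracy)
```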

In [None]:
plot_roc_curve(log_reg, input_test, target_test);
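
The ROC chart is usually summarized by the area under the curve (AUC). The sketch below computes the curve points and the AUC numerically with `roc_curve` and `roc_auc_score`, on synthetic stand-in data rather than the diabetes partitions. The variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 70/30 like the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

roc_model = LogisticRegression().fit(X_tr, y_tr)
scores = roc_model.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# One (false positive rate, true positive rate) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC: {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.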

In [None]:
plot_confusion_matrix(log_reg, input_test, target_test, values_format ='');

In [None]:
print("Accuracy: {:0.4f}".format(accuracy_score(target_test,predictions)))