# Overview

This notebook walks through building a neural network model with `scikit-learn` for a binary target. The data is medical information about a number of adult women of Pima Indian heritage. The goal of the model is to help predict whether a woman has diabetes.

@misc{Dua:2019 ,
author = "Dua, Dheeru and Graff, Casey",
year = "2017",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences" }

# Setup

This is where all the packages needed for the exercise are imported. If you get a `ModuleNotFoundError`, install the missing package (with pip or conda) before continuing.

In [None]:
# All the needed imports
import pandas as pd 
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
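# Note: plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2,
# so this import needs an earlier release.
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, accuracy_score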

## Load the data and create a dataframe

pandas can read data from a local file or from a URL. In this case you'll read data from the data directory and create a dataframe named `diabetes` that has health information on 768 women who are at least 21 years old and of Pima Indian heritage.

After reading the data you'll use the `shape` attribute to get a count of the rows and columns. There should be 768 rows and 9 columns.

Source: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.



In [None]:
diabetes = pd.read_csv('../data/diabetes.csv')
diabetes.shape
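
The same read-then-check pattern can be tried without the data file. The sketch below reads a tiny in-memory CSV; the `toy` name and its values are made up for illustration and are not part of the notebook's data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV held in memory instead of a file on disk.
csv_text = "Glucose,BMI,Outcome\n148,33.6,1\n85,26.6,0\n"

# read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute (no parentheses): (number of rows, number of columns).
print(toy.shape)
```

If the shape of the real dataframe does not match the expected 768 rows and 9 columns, the file path or the file contents are the first things to check.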

## Look at the data as a sanity check

After reading the data you'll print the first 5 rows using the `head` method to ensure the data appears correct.

In [None]:
diabetes.head()

## Separate the inputs from the full data set

For modeling in scikit-learn it is standard practice to create separate objects for the inputs (X variables, independent variables) and the target (Y variable, dependent variable).

In [None]:
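# first eight columns of data (the slice stop index is exclusive,
# so 0:8 selects columns 0 through 7 and leaves out the target in column 8)
inputs = diabetes.iloc[:, 0:8]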
# the last column is the target
target = diabetes.iloc[:, -1]

print(target[45:52], target.shape)
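
Positional slicing with `iloc` is easy to get off by one, because the stop index of a slice is excluded. The sketch below shows the behavior on a small made-up frame; the `toy` dataframe and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame: three input columns followed by a target column.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

first_two = toy.iloc[:, 0:2]    # stop index is exclusive: columns 0 and 1 only
all_inputs = toy.iloc[:, :-1]   # every column except the last
toy_target = toy.iloc[:, -1]    # only the last column, returned as a Series

print(list(first_two.columns))
print(list(all_inputs.columns))
print(toy_target.name, toy_target.shape)
```

Slicing with `:-1` keeps working if input columns are added or removed, as long as the target stays in the last position.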

## Split the data into training and test

Creating a `training` and a `validation` (sometimes called a `test`) set helps you detect overfitting of the model. A model that is overfit will not be useful in predicting future behavior, which is the point of this modeling in the first place.

Below we use the `train_test_split` function to separate the data into a 70/30 split (70 percent for training and 30 percent for validation). We will use a random number seed so that we get consistent results from run to run.

In [None]:
input_train, input_test, target_train, target_test = train_test_split(
    inputs, target, test_size=0.30, random_state=9878)
print(input_train.shape, input_test.shape, target_train.shape, target_test.shape)
print(input_train[:5])
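
To see what the seed buys you, the sketch below splits a small made-up array twice with the same `random_state` and checks that both splits are identical. The arrays `X` and `y` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 input columns, and a binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 14 + [1] * 6)

# Two splits with the same seed.
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.30, random_state=9878)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=0.30, random_state=9878)

print(a_train.shape, a_test.shape)      # 70/30 split of the 20 rows
print(np.array_equal(a_test, b_test))   # same seed, same rows in the test set
```

Without `random_state`, each run would shuffle the rows differently, and the metrics reported later would change from run to run.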

## Scale the inputs

We need to scale the inputs to improve model performance. There are several scaling methods that can be used. The most popular are min-max scaling and standard scaling. Scaling matters for models that are sensitive to the magnitude of the inputs, such as regularized regression and neural networks. It does not need to be done for tree-based methods.

Standard scaling subtracts the mean and scales to unit variance. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more information.

MinMaxScaling is described [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) 
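
For reference, the two transformations can be written as follows. This assumes the default `feature_range=(0, 1)` for min-max scaling. The symbols are notation introduced here: $x_{\min}$ and $x_{\max}$ are the smallest and largest values of a feature in the data the scaler was fitted on, and $\mu$ and $\sigma$ are that feature's mean and standard deviation.

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad z = \frac{x - \mu}{\sigma}
$$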

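You will fit the scaler on the training partition only and then apply that same transformation to both the training and validation partitions. This keeps information from the validation data out of the scaling step and avoids data leakage.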

In [None]:
scaler = MinMaxScaler()
scaler.fit(input_train)

input_train = scaler.transform(input_train)
input_test = scaler.transform(input_test)
print(input_train[:5])
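
A small worked example makes the fit-on-train, transform-both pattern concrete. The one-column arrays below are made up for illustration. Because the scaler only knows the training minimum and maximum, a validation value outside that range lands outside the interval from 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature partitions.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Learn min and max from the training partition only.
minmax = MinMaxScaler().fit(train)
train_scaled = minmax.transform(train)
test_scaled = minmax.transform(test)   # 40 is above the training max of 30

print(train_scaled.ravel())
print(test_scaled.ravel())

# Standard scaling on the same data: zero mean and unit variance on train.
standard = StandardScaler().fit(train)
train_standard = standard.transform(train)
print(train_standard.ravel())
```

Values slightly outside the training range are normal for validation data and are not a sign that the scaler was applied incorrectly.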

## Training the model

To train a model, you create an instance of the estimator (a neural network in this case) and then call its `fit` method. The way I remember the name is that I'm going to "fit" the inputs to the target.

Below you will create a network with 10 hidden units in one layer. Feel free to experiment with different numbers of hidden units in one or more layers. To create two hidden layers with 5 and 10 units respectively, use `hidden_layer_sizes=(5, 10)`.

In [None]:
# (10,) is a one-element tuple: a single hidden layer with 10 units
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, verbose=False)
mlp.fit(input_train, target_train)
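
To check how `hidden_layer_sizes` maps onto the fitted network, you can inspect the shapes of the weight matrices in `coefs_`. The sketch below does this on random made-up data with 8 inputs. The data, the `net` name, and the short `max_iter` are assumptions for a quick demonstration, and the convergence warning is silenced only because the fit quality does not matter here.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# A bare (10) is just the integer 10; the trailing comma makes it a tuple.
print(type((10)).__name__, type((10,)).__name__)

# Hypothetical data: 60 rows, 8 inputs, binary target.
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Two hidden layers with 5 and 10 units.
net = MLPClassifier(hidden_layer_sizes=(5, 10), max_iter=200, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    net.fit(X, y)

# One weight matrix per connection: inputs to layer 1, layer 1 to layer 2,
# and layer 2 to the single output unit used for a binary target.
layer_shapes = [w.shape for w in net.coefs_]
print(layer_shapes)
print(net.n_layers_)   # input layer + hidden layers + output layer
```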

## Prediction

To use the model created by the `fit` method, you predict values. The code below uses the inputs from the validation partition to predict whether each patient has diabetes. These predictions are then compared to the actual values so you can assess the efficacy of the model.

In [None]:
predictions = mlp.predict(input_test)
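
`predict` returns hard class labels. If you also want the estimated class probabilities behind those labels, `MLPClassifier` provides `predict_proba`. The sketch below, on made-up data, shows that for a binary target the predicted label matches a 0.5 cutoff on the probability of the second class. The data and the `net` name are assumptions for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 80 rows, 4 inputs, binary target.
rng = np.random.default_rng(1)
X = rng.random((80, 4))
y = (X[:, 0] > 0.5).astype(int)

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=1)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    net.fit(X, y)

proba = net.predict_proba(X[:5])   # one column per class, ordered as net.classes_
labels = net.predict(X[:5])

# Rebuild the labels by hand from the probability of the second class.
manual = net.classes_[(proba[:, 1] > 0.5).astype(int)]

print(proba.round(3))
print(labels, manual)
```

Keeping the probabilities lets you choose a cutoff other than 0.5 if, for example, missing a positive case is more costly than a false alarm.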

## Model Efficacy

With predictions complete on the validation partition, you can calculate the quality of the model. The confusion matrix, ROC chart, and classification report are a few ways to evaluate a model.

In [None]:
print(confusion_matrix(target_test,predictions))
print(classification_report(target_test,predictions))
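
To read the confusion matrix, it helps to recompute a few of the reported metrics by hand. In scikit-learn the rows are the actual classes and the columns are the predicted classes, so for a binary target the flattened matrix is ordered true negatives, false positives, false negatives, true positives. The labels below are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels for ten patients.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed directly from the four counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)

# The library functions give the same values.
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))
```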

In [None]:
plot_roc_curve(mlp, input_test, target_test);
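
The ROC chart is built from predicted scores rather than hard labels, and it is often summarized by the area under the curve (AUC). The sketch below computes the curve's points and the AUC numerically for four made-up scores. In this notebook the scores could come from something like `mlp.predict_proba(input_test)[:, 1]`, which is an assumption about usage and not a cell from the original.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(fpr)
print(tpr)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.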

In [None]:
plot_confusion_matrix(mlp, input_test, target_test);

In [None]:
accuracy_score(target_test,predictions)

In [None]:
plot_roc_curve(mlp, input_train, target_train);

In [None]:
def model_report(model_obj):
    """Print and plot validation metrics for a fitted classifier.

    Relies on the global input_test and target_test partitions created above.
    """
    pred = model_obj.predict(input_test)
    print("Class: {}".format(model_obj.__class__))
    print(confusion_matrix(target_test, pred))
    print(classification_report(target_test, pred))
    print("Accuracy: {:0.4f}".format(accuracy_score(target_test, pred)))
    plot_roc_curve(model_obj, input_test, target_test)
    plot_confusion_matrix(model_obj, input_test, target_test, values_format='')

In [None]:
model_report(mlp)