# Imports

Import the required packages and load the data.

In [None]:
# Import relevant packages
import numpy as np 
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns # Heatmap visualization
from sklearn.preprocessing import StandardScaler # Standardize features

In [None]:
# Import dataset
dataset = pd.read_csv('diabetes.csv')

# Exploratory Data Analysis
We will start by reviewing the metadata, sample rows, and descriptive statistics for our dataset. <br>
This section also identifies missing data and replaces it with NaN.

In [None]:
# Review dataframe metadata
dataset.info()

# Note: All features are numeric. No null values. Sample size relatively small (767 patients)

In [None]:
# Review first few rows of dataset
dataset.head()

# Note: Zeroes are present in fields that should be non-zero (e.g. Insulin = 0, SkinThickness = 0). We will correct this in the next step

Consulting the original paper, it seems zeroes are used to indicate missing values. Let's replace zeroes for Glucose, BloodPressure, SkinThickness, Insulin, and BMI variables with NaNs:

In [None]:
columns = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
dataset[columns] = dataset[columns].replace(0, np.nan)  # np.NaN alias was removed in NumPy 2.0
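
As a quick check on the replacement, the sketch below counts the NaNs per column afterwards. It uses a small made-up frame (`toy`) instead of `diabetes.csv` so that it runs on its own, and the helper name `zeros_to_nan` is hypothetical rather than part of this notebook.

```python
import numpy as np
import pandas as pd

def zeros_to_nan(df, columns):
    """Return a copy of df where zeroes in `columns` are replaced with NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out

# Made-up values for illustration only (not taken from diabetes.csv)
toy = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'Insulin': [0, 0, 94],
    'Pregnancies': [6, 0, 8],  # zero is a valid value here, so it is left alone
})

cleaned = zeros_to_nan(toy, ['Glucose', 'Insulin'])

# Number of missing values per corrected column
missing_counts = cleaned[['Glucose', 'Insulin']].isna().sum()
print(missing_counts)
```

In the notebook itself, `dataset[columns].isna().sum()` would give the equivalent per-column count for the real data.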

In [None]:
# Plot histograms for 9 variables.
fig, axis = plt.subplots(3,3,figsize=(15, 15))
dataset.hist(ax=axis)

It's worth noting that the dataset is imbalanced: there are almost twice as many non-diabetics as diabetics (Outcome == 0 vs. Outcome == 1). <br>
After splitting the data into train and test sets, we will subsample the train set to correct the imbalance.

In [None]:
dataset.describe(include='all')

# Note: All values are reasonable - I will opt against outlier removal, and StandardScaler should work fine for preprocessing

In [None]:
# Review correlations
corr = dataset.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

In [None]:
corr

Top correlation pairs: 
* BMI and SkinThickness (0.64)
* Insulin and Glucose (0.58)
* Age and Pregnancies (0.54)
* Glucose and Outcome (0.49)

Because all pairwise correlation coefficients are well below 0.95, I feel comfortable moving forward without variable removal/manipulation.
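
The ranking of pairs can also be read off the correlation matrix programmatically instead of by eye. The following is a minimal sketch; the helper `top_correlation_pairs` and the demo frame are hypothetical and use made-up numbers, not values from `diabetes.csv`.

```python
import numpy as np
import pandas as pd

def top_correlation_pairs(corr, n=4):
    """Return the n feature pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value so strong negative correlations rank high too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up data for illustration: b is an exact multiple of a, c is unrelated
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [1, -1, 1, -1, 1],
})
demo_corr = demo.corr()
top_pairs = top_correlation_pairs(demo_corr, n=1)
print(top_pairs)
```

Applied to this notebook, `top_correlation_pairs(corr)` would list the strongest pairs, and `(top_correlation_pairs(corr).abs() >= 0.95).any()` would flag whether any pair reaches the 0.95 threshold used above.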

# Prepare Datasets

Next, we will standardize our features (all numeric) using StandardScaler - this removes the mean and scales to unit variance. <br>
We will also divide the data into train and test datasets, and correct for Outcome class imbalance.
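
One way these three steps could fit together is sketched below. The ordering (split first, undersample only the train set, fit the scaler on train only), the 80/20 split, the seed, and the helper name `prepare_datasets` are all assumptions for illustration; the demo frame is synthetic so the block runs without `diabetes.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare_datasets(df, target='Outcome', test_size=0.2, seed=42):
    """Split, undersample the majority class in train, then standardize."""
    X = df.drop(columns=[target])
    y = df[target]

    # Stratify so train and test keep the original class proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)

    # Undersample: draw as many rows from each class as the smallest class has
    n_min = y_train.value_counts().min()
    keep = y_train.groupby(y_train).sample(n=n_min, random_state=seed).index
    X_train, y_train = X_train.loc[keep], y_train.loc[keep]

    # Fit the scaler on train only so no test statistics leak into training
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test

# Synthetic stand-in with a 70/30 class imbalance (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Glucose': rng.normal(120, 30, size=100),
    'BMI': rng.normal(32, 7, size=100),
    'Outcome': [0] * 70 + [1] * 30,
})
X_train, X_test, y_train, y_test = prepare_datasets(demo)
print(y_train.value_counts())
```

StandardScaler ignores NaNs when fitting and leaves them in place when transforming, so the missing values introduced earlier would still need to be handled before fitting a model that does not accept NaNs.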

# Modeling

We will test a few candidate approaches:
* Logistic Regression
* Neural Networks
* XGBoost
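
A comparison of the first two candidates might look like the sketch below. Everything in it is an assumption for illustration: the median imputation (needed because of the NaNs introduced earlier), the ROC AUC metric, the 5-fold cross-validation, the hyperparameters, and the helper name `compare_models`. XGBoost is left out because it lives in the separate `xgboost` package; its `XGBClassifier` follows the scikit-learn estimator interface and could be added to the same dictionary. The demo data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_models(X, y, cv=5, seed=0):
    """Return the mean cross-validated ROC AUC for each candidate model."""
    candidates = {
        'logistic_regression': LogisticRegression(max_iter=1000),
        'neural_network': MLPClassifier(hidden_layer_sizes=(16,),
                                        max_iter=500, random_state=seed),
    }
    results = {}
    for name, model in candidates.items():
        # Imputation and scaling sit inside the pipeline so they are
        # re-fit on the training part of every fold
        pipeline = make_pipeline(SimpleImputer(strategy='median'),
                                 StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
        results[name] = scores.mean()
    return results

# Synthetic stand-in for the diabetes features (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)
scores = compare_models(X_demo, y_demo)
print(scores)
```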
