# Predicting Diabetes

### Background: 
Diabetes is one of the most common and most expensive chronic diseases worldwide. In 2004 it was estimated that in the US alone, approximately 5 million people unknowingly had the disease while another 13 million were aware of their diagnosis. 

### Problem Statement:  
Early detection of the disease can help reduce the risk of serious, life-changing complications such as premature heart disease, stroke, blindness, limb amputations, and kidney failure. Models that help identify individuals with diabetes could be a useful tool to support a physician’s decision-making process when working with patients. They could also be used to screen populations of patient data to identify patients most likely to have undiagnosed diabetes, so that clinicians can intervene with further testing and monitoring. This can be framed as a binary classification problem: separating those who will develop diabetes from those who will not.

### Dataset

> **Citation for the data:** The Pima Indian Diabetes Dataset used here originally came from this paper:
>
> Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.
>
> The dataset is now available for download via Kaggle [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

* Note that because the dataset is hosted on Kaggle, it can be downloaded by generating an API token for your user id and installing the Kaggle-cli in your notebook environment. 
* For simplicity in this notebook, I downloaded and extracted the data on my local machine and then uploaded it into my notebook environment. The file 'diabetes.csv' is the unmodified extracted download file from Kaggle.
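
For reference, the API-token route mentioned in the first bullet might look roughly like the sketch below. It assumes the `kaggle` Python package is installed and that an API token is stored in `~/.kaggle/kaggle.json`. The helper name `download_pima_dataset` and the default `data/raw` target directory are illustrative choices; this code was not run in this notebook.

```python
import os


def download_pima_dataset(target_dir="data/raw"):
    """Download and unzip the Pima dataset from Kaggle (hypothetical helper)."""
    # imported lazily so the rest of the notebook runs without the kaggle package
    from kaggle.api.kaggle_api_extended import KaggleApi

    os.makedirs(target_dir, exist_ok=True)

    api = KaggleApi()
    api.authenticate()  # reads the API token from ~/.kaggle/kaggle.json or env vars
    api.dataset_download_files("uciml/pima-indians-diabetes-database",
                               path=target_dir, unzip=True)

    # the extracted archive is expected to contain diabetes.csv
    return os.path.join(target_dir, "diabetes.csv")
```

Calling `download_pima_dataset()` would then stand in for the manual download and upload step.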

In [39]:
# confirm that the file is accessible
!ls -al data/raw/diabetes.csv

-rw-rw-r-- 1 ec2-user ec2-user 23873 Aug  5 06:00 data/raw/diabetes.csv


In [58]:

# Load the data into a dataframe
import pandas as pd
import os

source_data_dir = 'data/raw'
clean_data_dir = 'data/train-test'

diabetes_df = pd.read_csv(os.path.join(source_data_dir, 'diabetes.csv'))
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [59]:
diabetes_df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


### Discussion of missing values:

There are missing values in the dataset, although they're not all immediately apparent because they are coded as zeros rather than NaN or NA values. Because zero can be a valid measurement for some of the variables, we'll need to consider them one by one:

**Pregnancies:** - 0 can be a valid measurement

**Glucose:** - 0 is unlikely to be a valid measurement

**Blood Pressure:** - 0 is unlikely to be a valid measurement

**Skin Thickness:** - 0 is unlikely to be a valid measurement

**Insulin:** - 0 is unlikely to be a valid measurement

**BMI:** - 0 is unlikely to be a valid measurement

**DiabetesPedigreeFunction:** - 0 can be a valid measurement; the score represents hereditary risk of diabetes based on family history and the closeness of the genetic relationships.

In [60]:
# look at missing values -- indicated with a zero

print("Counts of zero as value:")
print("\t Glucose: {}".format(sum(diabetes_df.Glucose == 0)))
print("\t Blood Pressure: {}".format(sum(diabetes_df.BloodPressure == 0)))
print("\t SkinThickness: {}".format(sum(diabetes_df.SkinThickness == 0)))
print("\t Insulin: {}".format(sum(diabetes_df.Insulin == 0)))
print("\t BMI: {}".format(sum(diabetes_df.BMI == 0)))

print("\t Glucose + BloodPressure + BMI all 0: {}".format(sum((diabetes_df.Glucose == 0) &
                                                          (diabetes_df.BloodPressure == 0) &
                                                          (diabetes_df.BMI == 0))))

Counts of zero as value:
	 Glucose: 5
	 Blood Pressure: 35
	 SkinThickness: 227
	 Insulin: 374
	 BMI: 11
	 Glucose + BloodPressure + BMI all 0: 0
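
Another way to make the zero-coded gaps explicit is to recode them as NaN, so that pandas' own missing-value tools apply. The sketch below is illustrative only: it uses a tiny hand-made frame in place of `diabetes_df`, and the `zero_as_missing` list is an assumption based on the variable-by-variable discussion. It is not applied to the data used for modeling in this notebook.

```python
import numpy as np
import pandas as pd

# columns where a zero is treated as "not recorded" (assumption, see lead-in)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# small illustrative sample; in the notebook this would be diabetes_df
sample_df = pd.DataFrame({
    'Pregnancies':   [6, 1, 8, 0],
    'Glucose':       [148, 85, 0, 137],
    'BloodPressure': [72, 66, 64, 0],
    'SkinThickness': [35, 29, 0, 35],
    'Insulin':       [0, 0, 0, 168],
    'BMI':           [33.6, 26.6, 23.3, 0.0],
})

# recode zeros as NaN only in the selected columns; Pregnancies keeps its zeros
recoded_df = sample_df.copy()
recoded_df[zero_as_missing] = recoded_df[zero_as_missing].replace(0, np.nan)

# per-column count of missing values
missing_counts = recoded_df.isna().sum()
print(missing_counts)
```

Once the gaps are NaN, calls such as `isna()`, `dropna()` or `fillna()` work on them directly, which is harder to do safely while they are mixed in with genuine zeros.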


My intention is to use tree-based models because they can offer more straightforward explainability than many other non-linear models, and in a healthcare context that can be important to those using the output of the model. Tree-based models split on feature thresholds, so they don't necessarily need normalized data values. Likewise, they should be able to make cuts around the missing values indicated as zeros. Based on these factors, in my initial modeling I'm not going to normalize the data or remove the missing values. Depending on model performance, this is something I will revisit if needed. However, in a real-world situation where some of these data elements will likely be missing at prediction time, a model that is robust enough to make good predictions even in their absence would have a lot of value.
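
The point that tree models do not need normalized inputs can be checked with a small experiment. The sketch below uses synthetic data and a scikit-learn `DecisionTreeClassifier`; the feature ranges, labeling rule, and tree depth are all made up for illustration and are not the model trained later in this notebook. It fits one tree on raw features and one on standardized features and compares their held-out predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# two synthetic features on very different scales
X = np.column_stack([
    rng.uniform(0, 200, size=300),   # glucose-like range
    rng.uniform(0, 10, size=300),    # small range
])
y = (X[:, 0] + 10 * X[:, 1] > 140).astype(int)

# standardize each feature (an order-preserving rescaling)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:200], y[:200])
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled[:200], y[:200])

# splits depend only on the ordering of values within a feature,
# so both trees should partition the samples the same way
same_predictions = np.array_equal(tree_raw.predict(X[200:]),
                                  tree_scaled.predict(X_scaled[200:]))
print(same_predictions)
```

The same reasoning does not carry over to distance-based or gradient-based models, which is why normalization may need revisiting if other model families are tried.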

In [61]:
# preprocess data - wrap in a function in case add'l preprocessing steps are needed
# to begin with, just splitting to train/test sets
# hold out 1/4 of data for testing

def preprocess_data(df):
    from sklearn.model_selection import train_test_split
    
    labels = df.Outcome
    data = df.drop(columns = ['Outcome'])
    
    random_state = 27
    X_train, X_test, y_train, y_test = train_test_split(data, labels, 
                                                        random_state = random_state,
                                                        test_size = 0.25)

    return X_train, X_test, y_train, y_test

In [62]:
X_train, X_test, y_train, y_test = preprocess_data(diabetes_df)

print("Training data shape: {} Training label shape: {}".format(X_train.shape, y_train.shape))
print("Testing data shape: {} Testing label shape: {}".format(X_test.shape, y_test.shape))

Training data shape: (576, 8) Training label shape: (576,)
Testing data shape: (192, 8) Testing label shape: (192,)
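
With roughly 35% positive outcomes (the `Outcome` mean of 0.349 over 768 rows corresponds to 268 positives), a purely random split can leave the train and test sets with slightly different class balances. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on stand-in labels with the same class counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data: 268 positives and 500 negatives, matching the dataset summary
labels = np.array([1] * 268 + [0] * 500)
features = np.arange(len(labels)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=27, stratify=labels)

print("train positive rate: {:.3f}".format(y_tr.mean()))
print("test positive rate:  {:.3f}".format(y_te.mean()))
```

Stratifying keeps the positive rate in both sets close to the overall rate, which makes test metrics a little easier to compare against the training data.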


In [63]:
# structure data for processing by training model and store as csv
def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''

    # make data dir, if it does not exist
    os.makedirs(data_dir, exist_ok=True)

    df_X = pd.DataFrame(x)
    df_y = pd.DataFrame(y)

    # labels first, then features; rows are aligned on the shared index
    df_all = pd.concat([df_y, df_X], axis=1)

    out_path = os.path.join(data_dir, filename)
    df_all.to_csv(out_path, index=False, header=False)

    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: ' + out_path)

In [65]:
make_csv(X_train, y_train, 'train.csv', clean_data_dir)
make_csv(X_test, y_test, 'test.csv', clean_data_dir)

Path created: data/train-test/train.csv
Path created: data/train-test/test.csv
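
Because the files are written without a header row, whatever reads them has to rely on column position. A quick round-trip check on a toy frame shows the label landing in column 0. The names `toy_X` and `toy_y` and the temporary directory are illustrative and not part of the notebook's data.

```python
import os
import tempfile

import pandas as pd

# toy features and labels with a shuffled index, as after a train/test split
toy_X = pd.DataFrame({'Glucose': [148, 85, 183], 'BMI': [33.6, 26.6, 23.3]},
                     index=[10, 4, 7])
toy_y = pd.Series([1, 0, 1], index=[10, 4, 7], name='Outcome')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.csv')
    # same layout as make_csv: labels first, no index, no header
    pd.concat([toy_y, toy_X], axis=1).to_csv(path, index=False, header=False)
    round_trip = pd.read_csv(path, header=None)

labels_back = round_trip.iloc[:, 0]      # first column holds the labels
features_back = round_trip.iloc[:, 1:]   # remaining columns hold the features
print(labels_back.tolist())
print(features_back.shape)
```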


In [66]:
# copy data to s3
import boto3
import sagemaker

# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

# set prefix, a descriptive name for a directory  
prefix = 'capstone'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=clean_data_dir, bucket=bucket, key_prefix=prefix)
print(input_data)

s3://sagemaker-us-west-2-501454055284/capstone


In [79]:
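# build model
from sagemaker.sklearn.estimator import SKLearn

# S3 location for model artifacts (defined here but not passed to the estimator below)
output_path = 's3://{}/{}'.format(bucket, prefix)

# train_instance_type is the SageMaker Python SDK v1 parameter name
# (renamed to instance_type in SDK v2)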
sklearn_estimator = SKLearn(source_dir='source',
                            entry_point='train.py',
                            role=role,
                            train_instance_type='ml.c4.xlarge',
                            framework_version='0.20.0',
                            sagemaker_session=sagemaker_session)
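
The `source/train.py` entry point is not shown in this notebook. A minimal sketch of what such a script could look like is given below, under several assumptions: the classifier (`DecisionTreeClassifier` with `max_depth=4`) is a guess based on the stated intention to use tree-based models, the file name `model.joblib` is arbitrary, and the `train` helper is hypothetical. The `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables and the `model_fn` hook follow the SageMaker scikit-learn container conventions. With `framework_version='0.20.0'` the joblib import would likely need to be `from sklearn.externals import joblib` instead.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def train(train_dir, model_dir):
    """Fit a classifier on train.csv (label in column 0) and save it to model_dir."""
    train_data = pd.read_csv(os.path.join(train_dir, 'train.csv'), header=None)
    train_y = train_data.iloc[:, 0]
    train_x = train_data.iloc[:, 1:]

    # assumption: a shallow decision tree; the notebook does not say which model was used
    model = DecisionTreeClassifier(max_depth=4, random_state=27)
    model.fit(train_x.values, train_y.values)

    joblib.dump(model, os.path.join(model_dir, 'model.joblib'))
    return model


def model_fn(model_dir):
    """Called by the SageMaker scikit-learn container to load the model for serving."""
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# only run when launched inside a SageMaker training container,
# where these environment variables are set
if __name__ == '__main__' and 'SM_CHANNEL_TRAIN' in os.environ:
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    args, _ = parser.parse_known_args()  # ignore any extra arguments
    train(args.train, args.model_dir)
```

The `'train'` key passed to `fit` is what makes the uploaded files appear under `SM_CHANNEL_TRAIN` inside the container.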

In [80]:
%%time


sklearn_estimator.fit({'train': input_data})

2019-08-10 05:29:05 Starting - Starting the training job...
2019-08-10 05:29:08 Starting - Launching requested ML instances......
2019-08-10 05:30:13 Starting - Preparing the instances for training...
2019-08-10 05:31:03 Downloading - Downloading input data...
2019-08-10 05:31:34 Training - Training image download completed. Training in progress..
2019-08-10 05:31:34,320 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-08-10 05:31:34,324 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-08-10 05:31:34,344 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-08-10 05:31:34,599 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2019-08-10 05:31:34,600 sagemaker-containers INFO     Generating setup.cfg
2019-08-10 05:31:34,600 sagemaker-containers INFO     Generating MANIFEST.in

In [81]:
%%time

# deploy model to create a predictor
predictor = sklearn_estimator.deploy(initial_instance_count=1, 
                                     instance_type='ml.t2.medium')

---------------------------------------------------------------------------------------------------!
CPU times: user 493 ms, sys: 22.2 ms, total: 515 ms
Wall time: 8min 20s


In [82]:
# evaluate model
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(clean_data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

# Generate predicted, class labels
test_y_preds = predictor.predict(test_x)

# make sure the right number of labels are returned
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Correct number of results returned.')

Correct number of results returned.


In [83]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score


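# note: test_y_preds are hard class labels, not probabilities, so this ROC AUC
# reflects a single operating point rather than the full ROC curve
print("ROC_AUC_Score: {}".format(roc_auc_score(test_y, test_y_preds)))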
print("Accuracy Score: {}".format(accuracy_score(test_y, test_y_preds)))
print()


ROC_AUC_Score: 0.6765508684863524
Accuracy Score: 0.7447916666666666
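
One caveat when reading the ROC AUC figure: it was computed from predicted class labels. With hard labels the ROC curve has a single operating point, and the resulting AUC equals the balanced accuracy (the mean of sensitivity and specificity). The toy example below, using made-up values and not model output, contrasts that with an AUC computed from scores.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# toy example (made-up values, not output from the deployed model)
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.10, 0.40, 0.60, 0.45, 0.70, 0.90]   # e.g. positive-class probabilities
y_label = [int(s >= 0.5) for s in y_score]       # hard labels at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_scores = roc_auc_score(y_true, y_score)
balanced_acc = balanced_accuracy_score(y_true, y_label)

print("AUC from hard labels: {:.4f}".format(auc_from_labels))
print("AUC from scores:      {:.4f}".format(auc_from_scores))
print("Balanced accuracy:    {:.4f}".format(balanced_acc))
```

Getting scores from the endpoint would require the serving code to return `predict_proba` output, which this notebook does not show.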



In [84]:
# clean up: delete the endpoint so it stops incurring charges
predictor.delete_endpoint()