### Capstone Project: Using Machine Learning to Reduce Human/Medical Costs Associated with Diabetes

##### Paul Taiwo-Adeyemo, July 2022, BrainStation

##### Notebook #4: Model Deployment

In this notebook the Logistic Regression model from Dataset 2 is prepared for deployment. The deployed model takes lifestyle choices as input and predicts the incidence of diabetes.

In [6]:
#import relevant packages
import pickle
import warnings
import pandas as pd
#import packages for Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
#ignore FutureWarnings only, so other warnings (e.g. convergence) stay visible
warnings.filterwarnings('ignore', category=FutureWarning)


In [7]:
#load dataset
dataset2_df = pd.read_csv('dataset2_cleaned.csv')

In [8]:
#define X and Y
X = dataset2_df.drop('Diabetes_binary', axis=1)
Y = dataset2_df['Diabetes_binary']

In [9]:
#split into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=3)

In [10]:
#scale the data: fit the scaler on the training set only
scaler = MinMaxScaler()
scaler.fit(X_train)

#transform both sets using the training-set minimum and maximum
X_train_mmscalar = scaler.transform(X_train)
X_test_mmscalar = scaler.transform(X_test)
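
The scaling step above can be illustrated without scikit-learn. The following is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), where the minimum and maximum come from the training values only. The helper names and the BMI values are made up for illustration and are not taken from the dataset.

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the training values only."""
    return min(train_column), max(train_column)


def apply_min_max(values, col_min, col_max):
    """Rescale values with x' = (x - min) / (max - min)."""
    span = col_max - col_min
    if span == 0:
        # Constant column: return zeros to avoid dividing by zero
        # (a simplification for this sketch).
        return [0.0 for _ in values]
    return [(v - col_min) / span for v in values]


# Made-up BMI values, for illustration only.
train_bmi = [20.0, 25.0, 30.0, 40.0]
test_bmi = [22.0, 45.0]

bmi_min, bmi_max = fit_min_max(train_bmi)
scaled_train = apply_min_max(train_bmi, bmi_min, bmi_max)
scaled_test = apply_min_max(test_bmi, bmi_min, bmi_max)

print(scaled_train)  # [0.0, 0.25, 0.5, 1.0]
print(scaled_test)   # [0.1, 1.25]
```

The second test value lands above 1 because 45 is larger than any training value. That is expected when the scaler is fit on the training set alone, and it is the same behaviour a deployed model will see for inputs outside the training range.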

In [11]:
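#Instantiate the model
logit = LogisticRegression()
#fit the model on the scaled training data; inputs at prediction time
#must be passed through the same fitted scaler
logit.fit(X_train_mmscalar, y_train)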

In [12]:
#export as pickle file for model deployment
with open("model_to_deploy.pkl", "wb") as model_file:
    pickle.dump(logit, model_file)
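
When a model is trained on scaled features, the deployed app has to apply the same scaling before calling the model. One possible way to keep the two together, sketched below, is to wrap the scaler and the classifier in a scikit-learn pipeline and pickle the pipeline as a single object. This is an alternative to the cells above rather than what the notebook does, and it uses synthetic data (`X_demo`, `y_demo`) in place of `dataset2_cleaned.csv` so that it runs on its own.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the real notebook would use X_train and y_train.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The pipeline scales the inputs, then fits the classifier on the scaled values.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression())
pipeline.fit(X_demo, y_demo)

# Serialize and restore in memory; a file opened in "wb"/"rb" mode works the same way.
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored pipeline accepts raw (unscaled) rows and scales them internally.
new_rows = rng.normal(size=(5, 4))
original_predictions = pipeline.predict(new_rows)
restored_predictions = restored.predict(new_rows)
print(np.array_equal(original_predictions, restored_predictions))
```

With this layout the app only needs to load one file and can pass raw user inputs directly to `predict` or `predict_proba`.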

### Making a Viable Product

###### The exported pickle model is deployed using Heroku and a Python app file.

### Discussion and Conclusion

>Medications turned out to be a poor predictor of whether a patient would be readmitted for diabetes-related illness. Lifestyle choices, on the other hand, were a good predictor of developing diabetes. Income, physical activity, and BMI were each correlated with diabetes.
For income, the proportion of patients with diabetes decreased as income increased. This negative relationship makes sense: fruits and vegetables are expensive, so people with lower incomes are more likely to opt for unhealthy food options, which are much cheaper.
The proportion of patients with diabetes spiked when BMI was between 39 and 41. This makes sense, as a BMI of 40 or above is classified as class III (morbid) obesity.
Another interesting finding from the EDA is that a large proportion of the patients readmitted for diabetes came to the hospital as emergency admissions, and most of them came in for blood-related illnesses.
The first subset of dataset 1 had the lowest accuracy across all the models tested, peaking at 62% with the AdaBoost model. Accuracy still did not exceed 63% when a Keras Sequential deep learning model was used.
The second subset of dataset 1 performed much better, with accuracy peaking at 76% using the Logistic Regression model. Its only problem was an unusually high recall, with some models showing over 90%, which may be a sign of overfitting.
The second dataset had the best results, with accuracy peaking at 85% with the AdaBoost model.
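
To make the relationship between accuracy and recall in the discussion above concrete, the sketch below computes both from confusion-matrix counts. The counts are hypothetical, chosen only to show how recall above 90% can coexist with accuracy near 76%; they are not the project's actual results.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual positives that the model finds."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical counts for a balanced test set of 1000 patients.
tp, fn = 460, 40    # 500 actual positives
fp, tn = 200, 300   # 500 actual negatives

print(accuracy(tp, fp, tn, fn))        # 0.76
print(recall(tp, fn))                  # 0.92
print(round(precision(tp, fp), 3))     # 0.697
```

In this made-up case the high recall comes with many false positives, so precision is noticeably lower. A pattern like this can also mean the model simply favours the positive class, so comparing train and test scores is a more direct check for overfitting.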

### Recommendation

>The saying 'prevention is the best medicine' holds true for diabetes. The best predictors of diabetes are the lifestyle choices made before its onset. Medical history says very little about the progression of diabetes.

#### Improvements Over Prior Works on the Same Dataset
Much work has been done on the UCI Diabetes dataset, with accuracy peaking at 65% in work by researchers at Oklahoma State University. I was able to improve on this, achieving an accuracy of 76.1% on the second subset of the dataset. With biological data, extrapolations are hard to make: there are too many differences between individuals, and health predictors rarely generalize. In short, biology is a science of exceptions. This makes it hard to come up with strong predictors in diabetes research. However, based on this project we can conclude that lifestyle choices are a good predictor of the likelihood of developing diabetes, with BMI and physical activity being the two best predictors, while medical history says very little about the chances of rehospitalization.


#### Model Deployment and Practical Use of Machine Learning Product
The interactive web app from the model deployment lets users make preemptive predictions about their risk of diabetes. The Logistic Regression model for the second dataset was saved as a pickle file. That pickle file, together with a Python app file containing instructions on how to execute the model in a web application, was used to create an interactive web application on Heroku.
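
The app file itself is not shown in this notebook, so the following is only a sketch of the prediction step such an app might contain. It is framework-agnostic: it turns the raw strings a web form would submit into a numeric row and asks the model for a probability. The feature names in `FEATURE_ORDER`, the function names, and the stub model are all assumptions made for illustration; the real app must use the column order of `X` from training and the model loaded from `model_to_deploy.pkl`.

```python
from typing import List, Mapping, Sequence

# Hypothetical feature order; in practice this must match list(X.columns)
# from the training notebook.
FEATURE_ORDER = ["BMI", "PhysActivity", "Income"]


def build_feature_row(form: Mapping[str, str], feature_order: Sequence[str]) -> List[float]:
    """Convert raw form strings into a numeric row in training column order."""
    missing = [name for name in feature_order if name not in form]
    if missing:
        raise ValueError(f"Missing inputs: {missing}")
    return [float(form[name]) for name in feature_order]


def predict_diabetes_risk(model, form: Mapping[str, str],
                          feature_order: Sequence[str] = FEATURE_ORDER) -> float:
    """Return the model's estimated probability of the positive class."""
    row = build_feature_row(form, feature_order)
    # predict_proba expects a 2-D input: one row per person.
    return float(model.predict_proba([row])[0][1])


class StubModel:
    """Stand-in for the unpickled model so the sketch runs without the .pkl file."""

    def predict_proba(self, rows):
        return [[0.25, 0.75] for _ in rows]


# In the real app, the model would come from:
#     with open("model_to_deploy.pkl", "rb") as f: model = pickle.load(f)
example_form = {"Income": "4", "BMI": "31.5", "PhysActivity": "0"}
risk = predict_diabetes_risk(StubModel(), example_form)
print(risk)  # 0.75
```

Keeping the input handling in plain functions like these makes it possible to test the prediction path without starting a web server.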


#### Further Work
From my research I noticed that very little available data combines lifestyle choices with hospital data collected from patients. A dataset that includes both would be very useful for building better machine learning models. This work can be improved by combining more healthcare data on diabetes, especially data on the lifestyle choices of people with diabetes.


#### References
https://business.okstate.edu/site-files/archive/docs/analytics/3254-2015.pdf

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008
