# Diabetes Analysis
----
Diabetes mellitus is a leading cause of mortality and reduced life expectancy. 

Our aim is to assist medical professionals in predicting the disease early by analysing patients' medical health records.

To achieve this, machine learning will be used to find patterns to detect early signs of diabetes.

## What is Diabetes?
-----
Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.

## Import Required Packages
----

First, we need to install River.

River is an online machine learning library: its models can be trained continuously in production, one observation at a time. If the data changes over time, the model can adapt.
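
To make the idea of incremental training concrete, here is a minimal pure-Python sketch. The `RunningMean` class is a hypothetical stand-in and is not part of River; its `learn_one` method only mirrors River's one-observation-at-a-time style.

```python
class RunningMean:
    """Hypothetical stand-in for an online model: keeps a mean that is
    updated one observation at a time, without storing past data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, value):
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.n += 1
        self.mean += (value - self.mean) / self.n


model = RunningMean()
# First five Glucose readings from the data set used in this notebook
for glucose in [77, 79, 75, 97, 91]:
    model.learn_one(glucose)

print(round(model.mean, 1))
```

The state after each update depends only on the previous state and the new observation, which is what allows an online model to keep learning in production.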

In [None]:
!pip install river

Collecting river
  Downloading river-0.9.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: river
Successfully installed river-0.9.0


In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import train_test_split
from river.metrics import ClassificationReport
from river.ensemble import AdaptiveRandomForestClassifier
from river import stream
from river import evaluate
from river import metrics
import pickle

from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True) 
template = 'plotly_dark'

In [None]:
df = pd.read_excel('https://query.data.world/s/gyrfwi47zdpwzt2pyqjehlve2ujmtj')

In [None]:
df.head()

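| | Patient number | Cholesterol | Glucose | HDL Chol | Chol/HDL ratio | Age | Gender | Height | Weight | BMI | Systolic BP | Diastolic BP | waist | hip | Waist/hip ratio | Diabetes | Unnamed: 16 | Unnamed: 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 193 | 77 | 49 | 3.9 | 19 | female | 61 | 119 | 22.5 | 118 | 70 | 32 | 38 | 0.84 | No diabetes | 6.0 | 6.0 |
| 1 | 2 | 146 | 79 | 41 | 3.6 | 19 | female | 60 | 135 | 26.4 | 108 | 58 | 33 | 40 | 0.83 | No diabetes | | |
| 2 | 3 | 217 | 75 | 54 | 4.0 | 20 | female | 67 | 187 | 29.3 | 110 | 72 | 40 | 45 | 0.89 | No diabetes | | |
| 3 | 4 | 226 | 97 | 70 | 3.2 | 20 | female | 64 | 114 | 19.6 | 122 | 64 | 31 | 39 | 0.79 | No diabetes | | |
| 4 | 5 | 164 | 91 | 67 | 2.4 | 20 | female | 70 | 141 | 20.2 | 122 | 86 | 32 | 39 | 0.82 | No diabetes | | |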


First, we need to remove the patient number column and the unnamed columns.

In [None]:
df = df.drop(['Patient number', 'Unnamed: 16', 'Unnamed: 17'], axis=1)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Cholesterol      390 non-null    int64  
 1   Glucose          390 non-null    int64  
 2   HDL Chol         390 non-null    int64  
 3   Chol/HDL ratio   390 non-null    float64
 4   Age              390 non-null    int64  
 5   Gender           390 non-null    object 
 6   Height           390 non-null    int64  
 7   Weight           390 non-null    int64  
 8   BMI              390 non-null    float64
 9   Systolic BP      390 non-null    int64  
 10  Diastolic BP     390 non-null    int64  
 11  waist            390 non-null    int64  
 12  hip              390 non-null    int64  
 13  Waist/hip ratio  390 non-null    float64
 14  Diabetes         390 non-null    object 
dtypes: float64(3), int64(10), object(2)
memory usage: 45.8+ KB


The data set contains no null values.

The aim is to predict three classes (non-diabetic, prediabetic and diabetic), so the Diabetes column must be recoded to represent three classes.

To do this we use the Glucose column. Its readings come from a Fasting Plasma Glucose test, in which glucose levels are recorded after the patient has fasted for at least 8 hours. A reading of 99 or below is labelled non-diabetic, 100 to 125 prediabetic, and anything above 125 diabetic.

In [None]:
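def classify(glucose):
    # Map a fasting plasma glucose reading to one of three classes.
    # Using elif avoids a gap between 99 and 100 for non-integer readings.
    if glucose <= 99:
        return 'Non-Diabetic'
    elif glucose <= 125:
        return 'Prediabetic'
    else:
        return 'Diabetic'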

In [None]:
df['Diabetes'] = df['Glucose'].apply(classify)

In [None]:
df['Diabetes'].value_counts()

Non-Diabetic    260
Prediabetic      70
Diabetic         60
Name: Diabetes, dtype: int64

The classes are unevenly distributed, which can cause the model to favour the majority class, Non-Diabetic. Since the model we will be using can be trained continuously on new data, our goal for now is to make sure that it is able to predict all three classes. Accuracy can be improved later as new data becomes available.
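
To quantify the imbalance, the sketch below computes each class's share from the counts reported above, together with the accuracy a trivial majority-class predictor would reach. The inverse-frequency weights are an illustrative assumption; they are not used anywhere in this notebook.

```python
# Class counts as reported by value_counts() above
counts = {'Non-Diabetic': 260, 'Prediabetic': 70, 'Diabetic': 60}
total = sum(counts.values())

# Share of each class in the data set
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

# Inverse-frequency weights (illustrative only, not used in this notebook)
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:<13} share={shares[label]:.3f} weight={weights[label]:.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model should be judged against this baseline of roughly two thirds, not against 50%.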


In [None]:
df.columns

Index(['Cholesterol', 'Glucose', 'HDL Chol', 'Chol/HDL ratio', 'Age', 'Gender',
       'Height', 'Weight', 'BMI', 'Systolic BP', 'Diastolic BP', 'waist',
       'hip', 'Waist/hip ratio', 'Diabetes'],
      dtype='object')

After a brief look at the columns in the data set, some of them seem redundant. For example, BMI already combines height and weight, so Height and Weight can be removed. The same goes for waist and hip (covered by Waist/hip ratio) and for Cholesterol and HDL Chol (covered by Chol/HDL ratio).
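
The redundancy can be checked on the first five rows shown earlier. This sketch assumes that Height is recorded in inches and Weight in pounds (the source does not state the units) and uses the standard imperial BMI formula; `bmi_imperial` is a helper name introduced here for illustration.

```python
# First five rows as shown by df.head() earlier:
# (Height, Weight, BMI, waist, hip, Waist/hip ratio, Cholesterol, HDL Chol, Chol/HDL ratio)
rows = [
    (61, 119, 22.5, 32, 38, 0.84, 193, 49, 3.9),
    (60, 135, 26.4, 33, 40, 0.83, 146, 41, 3.6),
    (67, 187, 29.3, 40, 45, 0.89, 217, 54, 4.0),
    (64, 114, 19.6, 31, 39, 0.79, 226, 70, 3.2),
    (70, 141, 20.2, 32, 39, 0.82, 164, 67, 2.4),
]


def bmi_imperial(height_in, weight_lb):
    # Imperial BMI formula: 703 * weight (lb) / height (in) squared.
    # Assumption: the data set uses inches and pounds.
    return 703 * weight_lb / height_in ** 2


# Print each recomputed value next to the value stored in the data set
for height, weight, bmi, waist, hip, whr, chol, hdl, chol_hdl in rows:
    print(f"BMI {bmi_imperial(height, weight):.2f} vs {bmi} | "
          f"waist/hip {waist / hip:.3f} vs {whr} | "
          f"chol/HDL {chol / hdl:.2f} vs {chol_hdl}")
```

If the recomputed values agree with the stored ones up to rounding, the derived columns carry the same information as the raw measurements.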

In [None]:
df = df.drop(['Height', 'Weight', 'waist', 'hip', 'Cholesterol', 'HDL Chol'], axis=1)

In [None]:
df.head()

| | Glucose | Chol/HDL ratio | Age | Gender | BMI | Systolic BP | Diastolic BP | Waist/hip ratio | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 77 | 3.9 | 19 | female | 22.5 | 118 | 70 | 0.84 | Non-Diabetic |
| 1 | 79 | 3.6 | 19 | female | 26.4 | 108 | 58 | 0.83 | Non-Diabetic |
| 2 | 75 | 4.0 | 20 | female | 29.3 | 110 | 72 | 0.89 | Non-Diabetic |
| 3 | 97 | 3.2 | 20 | female | 19.6 | 122 | 64 | 0.79 | Non-Diabetic |
| 4 | 91 | 2.4 | 20 | female | 20.2 | 122 | 86 | 0.82 | Non-Diabetic |


Blood pressure is recorded in two columns, systolic and diastolic.

* **Systolic blood pressure:** measures the force your heart exerts on the walls of your arteries each time it beats.
* **Diastolic blood pressure:** measures the force your heart exerts on the walls of your arteries in between beats.

Patients may not know what these terms mean at first sight, so blood pressure should be presented as a single value. To do that, we create a new column holding the average of the two readings; patients will see it simply as blood pressure.

Once the average blood pressure is computed, the systolic and diastolic columns can be removed.

In [None]:
df['BloodPressure'] = (df['Systolic BP'] + df['Diastolic BP']) / 2
# Store the average as an integer (any fractional part is truncated)
df['BloodPressure'] = df['BloodPressure'].astype(int)

In [None]:
df = df.drop(['Systolic BP', 'Diastolic BP'], axis=1)

In [None]:
df.head()

| | Glucose | Chol/HDL ratio | Age | Gender | BMI | Waist/hip ratio | Diabetes | BloodPressure |
|---|---|---|---|---|---|---|---|---|
| 0 | 77 | 3.9 | 19 | female | 22.5 | 0.84 | Non-Diabetic | 94 |
| 1 | 79 | 3.6 | 19 | female | 26.4 | 0.83 | Non-Diabetic | 83 |
| 2 | 75 | 4.0 | 20 | female | 29.3 | 0.89 | Non-Diabetic | 91 |
| 3 | 97 | 3.2 | 20 | female | 19.6 | 0.79 | Non-Diabetic | 93 |
| 4 | 91 | 2.4 | 20 | female | 20.2 | 0.82 | Non-Diabetic | 104 |


Next, we create numeric versions of the gender and diabetes columns so that they can be included in the correlation matrix.

In [None]:
def num_format(gender):
    if gender == 'male':
        return 1
    else:
        return 0 

In [None]:
df['GenderBinary'] = df['Gender'].apply(num_format)

In [None]:
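def diabetes_binary(glucose):
    # Encode the three classes as 0 (non-diabetic), 1 (prediabetic), 2 (diabetic).
    # Using elif avoids a gap between 99 and 100 for non-integer readings.
    if glucose <= 99:
        return 0
    elif glucose <= 125:
        return 1
    else:
        return 2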

In [None]:
df['DiabetesBinary'] = df['Glucose'].apply(diabetes_binary)

In [None]:
df.corr()

| | Glucose | Chol/HDL ratio | Age | BMI | Waist/hip ratio | BloodPressure | GenderBinary | DiabetesBinary |
|---|---|---|---|---|---|---|---|---|
| Glucose | 1.0 | 0.28221 | 0.294392 | 0.129286 | 0.185117 | 0.122418 | 0.093372 | 0.820437 |
| Chol/HDL ratio | 0.28221 | 1.0 | 0.163201 | 0.228407 | 0.243329 | 0.095836 | 0.102938 | 0.281989 |
| Age | 0.294392 | 0.163201 | 1.0 | -0.009164 | 0.275188 | 0.343769 | 0.084177 | 0.301015 |
| BMI | 0.129286 | 0.228407 | -0.009164 | 1.0 | 0.100873 | 0.144787 | -0.254189 | 0.193634 |
| Waist/hip ratio | 0.185117 | 0.243329 | 0.275188 | 0.100873 | 1.0 | 0.127308 | 0.346253 | 0.178323 |
| BloodPressure | 0.122418 | 0.095836 | 0.343769 | 0.144787 | 0.127308 | 1.0 | 0.052933 | 0.151256 |
| GenderBinary | 0.093372 | 0.102938 | 0.084177 | -0.254189 | 0.346253 | 0.052933 | 1.0 | 0.049316 |
| DiabetesBinary | 0.820437 | 0.281989 | 0.301015 | 0.193634 | 0.178323 | 0.151256 | 0.049316 | 1.0 |


We will explore the pairs of features with a correlation of 0.2 or higher in the Exploratory Data Analysis section.
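
As a sketch of how those pairs can be picked out, the snippet below filters the correlations between the six measured features, taking 0.2 as the cut-off (a correlation coefficient cannot exceed 1). The values are copied by hand from the matrix above; the encoded columns GenderBinary and DiabetesBinary are left out.

```python
# Upper triangle of the correlation matrix above, measured features only
corr = {
    ('Glucose', 'Chol/HDL ratio'): 0.28221,
    ('Glucose', 'Age'): 0.294392,
    ('Glucose', 'BMI'): 0.129286,
    ('Glucose', 'Waist/hip ratio'): 0.185117,
    ('Glucose', 'BloodPressure'): 0.122418,
    ('Chol/HDL ratio', 'Age'): 0.163201,
    ('Chol/HDL ratio', 'BMI'): 0.228407,
    ('Chol/HDL ratio', 'Waist/hip ratio'): 0.243329,
    ('Chol/HDL ratio', 'BloodPressure'): 0.095836,
    ('Age', 'BMI'): -0.009164,
    ('Age', 'Waist/hip ratio'): 0.275188,
    ('Age', 'BloodPressure'): 0.343769,
    ('BMI', 'Waist/hip ratio'): 0.100873,
    ('BMI', 'BloodPressure'): 0.144787,
    ('Waist/hip ratio', 'BloodPressure'): 0.127308,
}

THRESHOLD = 0.2  # cut-off on the absolute correlation

# Keep pairs at or above the cut-off, strongest first
selected = sorted(
    (pair for pair, r in corr.items() if abs(r) >= THRESHOLD),
    key=lambda pair: -abs(corr[pair]),
)

for a, b in selected:
    print(f"{a} vs {b}: {corr[(a, b)]:.3f}")
```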

## Exploratory Data Analysis
---

Based on our research, this section discusses which features will be removed.

In [None]:
fig = px.scatter(df, x="Age", y="Glucose", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Age vs Glucose",
)
fig.show(renderer="colab")

There is a clear separation between patients when observing their glucose levels, which is expected because the classes were defined from glucose thresholds. In our data, age does not appear to have a strong effect on glucose levels. Research, however, indicates that aging decreases glucose tolerance, meaning that patients at an advanced age can experience an increase in glucose levels.

In [None]:
fig = px.scatter(df, x="Age", y="Waist/hip ratio", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Age vs Waist/hip ratio",

)
fig.show(renderer="colab")

Little to no correlation is observed above.

In [None]:
fig = px.scatter(df, x="Chol/HDL ratio", y="BMI", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Chol/HDL ratio vs BMI",
)
fig.show(renderer="colab")

Based on the plot, there seems to be little to no correlation between Chol/HDL ratio and BMI. Outliers could be the cause of the 0.228407 correlation coefficient, making the relationship look stronger than it is.
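
The effect of an outlier on the Pearson correlation can be illustrated with a small sketch. The numbers below are synthetic and chosen only for illustration; they are not taken from the data set, and `pearson` is a helper written here for that purpose.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic values with no linear relationship
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 1, 3]
r_clean = pearson(x, y)

# The same values plus a single extreme point
r_outlier = pearson(x + [20], y + [20])

print(f"without outlier: {r_clean:.3f}")
print(f"with outlier:    {r_outlier:.3f}")
```

A single extreme point is enough to turn an uncorrelated cloud into an apparently strong correlation, which is why the scatter plots are inspected alongside the coefficients.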

In [None]:
fig = px.scatter(df, x="Chol/HDL ratio", y="Waist/hip ratio", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Chol/HDL ratio vs Waist/hip ratio",
)
fig.show(renderer="colab")

The same occurred above: outliers affected the correlation.

In [None]:
fig = px.scatter(df, x="Age", y="BloodPressure", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Age vs Blood Pressure",

)
fig.show(renderer="colab")

The plot shows that age does have an effect on blood pressure.
As you age, the vascular system, which includes your heart and blood vessels, changes. The amount of elastic tissue in your arteries decreases, making them stiffer and less compliant. As a result, your blood pressure increases.

In [None]:
fig = px.scatter(df, x="Glucose", y="Chol/HDL ratio", color="Diabetes")
fig.update_layout(
    template = template, 
    title = "Glucose vs Chol/HDL ratio",
)
fig.show(renderer="colab")

This plot also shows a clear separation between the classes, but it does not indicate any correlation between the two features. Outliers have also affected the correlation.

Based on research beyond the data, we decided which features to use in the model building phase.

#### Features to be used:
* Glucose
* Chol/HDL ratio 
* Age
* BloodPressure (the average of systolic and diastolic BP computed earlier)
* Waist/hip ratio

## Model Building
----
As stated in the beginning, we will be using an online machine learning model.
Our aim is to make sure that the model is able to predict all three classes.

**Note:** The model was chosen based on an evaluation of previous batch machine learning models. Online learning models follow the same methodology as batch learning models, but they can be trained continuously and can detect and adapt to model drift, which helps them keep performing well as the data changes.

#### Model Drift
----
Model drift refers to the degradation of model performance due to changes in data and relationships between input and output variables.
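
One simple way to watch for such degradation is to compare the recent error rate with the error rate observed early on. The sketch below is a hypothetical helper written for illustration; it is not River's drift detector, and the window size, margin and outcome stream are assumptions.

```python
from collections import deque


class WindowDriftMonitor:
    """Hypothetical helper (not part of River): compares the error rate in a
    sliding window of recent predictions with the error rate observed on the
    first `window` predictions."""

    def __init__(self, window=20, margin=0.3):
        self.window = window
        self.margin = margin
        self.reference = []                 # first `window` outcomes
        self.recent = deque(maxlen=window)  # most recent outcomes

    def update(self, is_error):
        """Record one outcome (1 = wrong prediction, 0 = correct) and
        return True when drift is suspected."""
        if len(self.reference) < self.window:
            self.reference.append(is_error)
            return False
        self.recent.append(is_error)
        if len(self.recent) < self.window:
            return False
        reference_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - reference_rate > self.margin


# Synthetic outcome stream: 100 correct predictions, then 50 wrong ones
outcomes = [0] * 100 + [1] * 50
monitor = WindowDriftMonitor()
first_alarm = next(i for i, o in enumerate(outcomes) if monitor.update(o))
print(f"Drift suspected at observation {first_alarm}")
```

The alarm is raised a few observations after the errors begin, once enough of them have entered the recent window.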

In [None]:
X = df[[
    # 'Systolic BP' was dropped earlier, so the averaged 'BloodPressure' column is used
    'Glucose', 'Chol/HDL ratio', 'Age', 'BloodPressure', 'Waist/hip ratio'
]]

y = df['Diabetes']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Adaptive Random Forest
----
Adaptive Random Forest is an adaptation of the original Random Forest algorithm, which has been successfully applied to a multitude of machine learning tasks. 

In layman’s terms the original Random Forest algorithm is an ensemble of decision trees, which are trained using bagging and where the node splits are limited to a random subset of the original set of features. The "Adaptive" part of ARF comes from its mechanisms to adapt to different kinds of concept drifts, given the same hyper-parameters.
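
The two ingredients named above, bagging and random feature subsets, can be sketched in plain Python. In the streaming setting, bagging is commonly approximated by giving each incoming observation a Poisson-distributed weight per tree; the rate of 6 and the square-root subset size used here are assumptions for illustration, and none of this is River's internal code.

```python
import math
import random

rng = random.Random(42)

# Columns available in df at this point in the notebook
features = ['Glucose', 'Chol/HDL ratio', 'Age', 'BMI', 'Waist/hip ratio', 'BloodPressure']


def random_feature_subset(features, rng):
    # Each split only considers a random subset of the features;
    # sqrt(number of features) is a common choice (assumption)
    k = max(1, round(math.sqrt(len(features))))
    return rng.sample(features, k)


def poisson(lam, rng):
    # Knuth's algorithm for drawing a Poisson-distributed integer
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


subset = random_feature_subset(features, rng)

# Online bagging: each tree sees an observation `weight` times
draws = [poisson(6, rng) for _ in range(10000)]
mean_weight = sum(draws) / len(draws)

print(subset)
print(round(mean_weight, 2))
```

Because every tree receives different weights and different feature subsets, the trees in the ensemble stay diverse even though they all see the same stream.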

In [None]:
rf = AdaptiveRandomForestClassifier()

In [None]:
for x, y in stream.iter_pandas(x_train, y_train):
    # learn_one updates the model in place, so no reassignment is needed
    rf.learn_one(x, y)

The model trains on one observation at a time, which is how streaming data is processed. This makes it well suited to retraining the model on new observations in production.
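
In a streaming setting, a common evaluation pattern is test-then-train: predict each observation first, then learn from it. The sketch below shows the pattern with a hypothetical toy learner, `MajorityClass`, standing in for the forest; each observation is a (feature dict, label) pair, which is the shape `stream.iter_pandas` yields.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical toy learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No prediction is possible before the first observation
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


# First five rows of the data set, reduced to two features for brevity
observations = [
    ({'Glucose': 77, 'Age': 19}, 'Non-Diabetic'),
    ({'Glucose': 79, 'Age': 19}, 'Non-Diabetic'),
    ({'Glucose': 75, 'Age': 20}, 'Non-Diabetic'),
    ({'Glucose': 97, 'Age': 20}, 'Non-Diabetic'),
    ({'Glucose': 91, 'Age': 20}, 'Non-Diabetic'),
]

model = MajorityClass()
correct = 0
for x, y in observations:
    if model.predict_one(x) == y:  # test on the observation first
        correct += 1
    model.learn_one(x, y)          # then train on it

print(f"{correct} of {len(observations)} predictions correct")
```

Every observation is used for evaluation before the model has seen it, so no separate hold-out set is required.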

In [None]:
ypred = []
y_test_data = []
for x, y in stream.iter_pandas(x_test, y_test):
    ypred.append(rf.predict_one(x))
    y_test_data.append(y)

In [None]:
report = ClassificationReport()

In [None]:
for yt, yp in zip(y_test_data, ypred):
    # update modifies the report in place, so no reassignment is needed
    report.update(yt, yp)

In [None]:
report

               Precision   Recall   F1      Support  
                                                     
    Diabetic       1.000    0.952   0.976        21  
Non-Diabetic       0.914    1.000   0.955        64  
 Prediabetic       0.875    0.538   0.667        13  
                                                     
       Macro       0.930    0.830   0.866            
       Micro       0.929    0.929   0.929            
    Weighted       0.927    0.929   0.921            

                   92.9% accuracy                    

The model was able to predict all three classes and reached 92.9% accuracy on the 98 test observations, which is encouraging given the small amount of data provided. Recall for the Prediabetic class is the weakest at 0.538.

In [None]:
cm = metrics.ConfusionMatrix()

In [None]:
for yt, yp in zip(y_test_data, ypred):
    # update modifies the confusion matrix in place
    cm.update(yt, yp)

In [None]:
cm

                    Diabetic  Non-Diabetic   Prediabetic
      Diabetic            20             0             1
  Non-Diabetic             0            64             0
   Prediabetic             0             6             7

The model misclassified a few observations because they were so similar to those of a neighbouring class that it was difficult to distinguish them: six Prediabetic patients were predicted as Non-Diabetic and one Diabetic patient as Prediabetic. As the model is trained on new observations, it should learn to separate these cases better.
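
The figures in the classification report can be derived from the confusion matrix. The sketch below recomputes precision, recall and F1 per class from the values printed above, reading rows as true classes and columns as predicted classes; the variable names are introduced here for illustration.

```python
labels = ['Diabetic', 'Non-Diabetic', 'Prediabetic']

# Rows are true classes, columns are predicted classes (values from the output above)
matrix = [
    [20, 0, 1],
    [0, 64, 0],
    [0, 6, 7],
]

scores = {}
for i, label in enumerate(labels):
    true_positives = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)  # column total
    actually_label = sum(matrix[i])                     # row total (support)
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    scores[label] = (precision, recall, f1)
    print(f"{label:<13} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Accuracy is the share of observations on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))
print(f"accuracy={accuracy:.3f}")
```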

Now we pickle the model.

In [None]:
with open('model.pkl', 'wb') as f:
    pickle.dump(rf, f)
# The with block closes the file automatically, so no explicit f.close() is needed
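
To use the saved model later, for example in a web application, it has to be loaded back with `pickle.load`. The sketch below shows the round trip with a stand-in dictionary in place of the trained forest and a temporary directory in place of the real file location; both are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (assumption: any picklable object behaves the same way)
model = {'name': 'stand-in model', 'classes': ['Diabetic', 'Non-Diabetic', 'Prediabetic']}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.pkl')

    # Save the object to disk
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Load it back, as a deployed application would do at start-up
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.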

## Conclusion
----
Our aim was to assist medical professionals in detecting early signs of diabetes by using a machine learning model to predict the current state of a patient. This model can be deployed in a web application to help patients identify their current state based on their medical records.

Pre-diabetes is the entry point to developing diabetes, so patients who receive a prediction of pre-diabetes can take measures before complications occur that could leave them in a weakened state.