Random Forest Model To Identify Severe Hypoglycemic Risk Patients
This tool is developed for the use of type 1 diabetics or physicians and other medical professionals who treat type 1 diabetics to predict a patient's risk of hypoglycemia using machine learning models.
To predict a patient's likelihood of experiencing severe hypoglycemic events based on blood glucose levels and day-to-day activities.
Out Of Scope
Hypoglycemic unawareness, hyperglycemia and any other complications are beyond the scope of this project.
The dataset only has the records of 70 patients, which may not be sufficient to understand the complete picture.
Type 1 diabetes is a chronic disease in which the patient's body no longer produces enough insulin to regulate blood glucose levels. The condition is treated by injecting artificial insulin a few times a day to keep blood glucose within the desired range. This treatment has two side effects: hyperglycemia and hypoglycemia. Hyperglycemia occurs when too little insulin is injected, which leaves blood sugar levels above the desired range. Hypoglycemia occurs when too much insulin is injected, which causes blood sugar levels to drop suddenly to dangerous lows that may lead to coma or even death if not treated promptly.
While hyperglycemia is linked to long-term complications in diabetics, hypoglycemia is a daily threat. Hypoglycemic events result from day-to-day activities and are spontaneous, and therefore cannot be detected in advance. Medical professionals rely on daily blood glucose levels and a patient's day-to-day habits to treat the patient.
Please go through this article to understand the disease in depth.
This tool is developed primarily for physicians and medical professionals to analyze a patient's likelihood of experiencing hypoglycemic complications. The tool analyzes a patient's daily blood glucose levels and day-to-day activities to predict whether the patient is at risk of experiencing severe hypoglycemia.
Design Of The Study
The model uses a dataset obtained from the UCI Machine Learning Repository. A random forest model was created using this dataset to achieve the goal. Three-fold cross validation was performed on a 60/40 train-test split to test the accuracy of the model.
The implementation was done using Python libraries such as numpy, pandas, matplotlib, seaborn and scikit-learn.
The dataset is a multivariate time series dataset.
The dataset contains blood glucose logs, with the associated events, of seventy patients over at least three weeks. The log for each patient is stored as a separate file. The dataset contains the following features.
- Date - Date of the blood glucose level.
- Time - Time of the blood glucose level.
- Code - A number that represents the event or activity at the time of the blood glucose measurement. Example: before breakfast.
- Blood Glucose level
Detailed information on the dataset can be found here.
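As a rough illustration of how one patient log with these four features could be loaded, here is a minimal sketch. The tab-separated layout, the column names, the sample rows and the `load_patient_log` helper are all assumptions made for this example, not details confirmed by the project code.

```python
import io

import pandas as pd

# Hypothetical excerpt of one patient file. The tab-separated layout
# (date, time, code, value) is an assumption about the raw files.
SAMPLE_LOG = (
    "04-21-1991\t9:09\t58\t100\n"
    "04-21-1991\t9:09\t33\t9\n"
    "04-21-1991\t17:08\t62\t119\n"
)


def load_patient_log(source, patient_id):
    """Read one patient's log into a DataFrame with named columns."""
    log = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["date", "time", "code", "value"],
        dtype=str,  # keep raw strings so invalid entries survive loading
    )
    log["patient_id"] = patient_id
    return log


log = load_patient_log(io.StringIO(SAMPLE_LOG), patient_id=1)
print(log)
```

Reading every column as a string is a deliberate choice in this sketch: it postpones type conversion to the cleaning step, where invalid values can be handled explicitly.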
The following steps were done to clean the data. Code and detailed comments can be seen in this link.
- Delete rows that do not have a valid blood glucose value.
- Because the code feature is numerical, add a new feature code_description, a string representation of code, for better readability. The string representation is taken from the data description provided with the dataset.
- Correct invalid date and time values by comparing them with the previous rows.
97% of the observations have been preserved.
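The cleaning steps above could look roughly like the sketch below. It is not the project's actual code: the code-to-description mapping is a small hypothetical subset, the date format is assumed, and only the date column is repaired (time values could be handled the same way).

```python
import pandas as pd

# Hypothetical subset of the code-to-description mapping; the real mapping
# should be taken from the data description provided with the dataset.
CODE_DESCRIPTIONS = {
    58: "pre-breakfast blood glucose",
    60: "pre-lunch blood glucose",
    62: "pre-supper blood glucose",
}

# Made-up raw rows: one invalid glucose value and one impossible date.
raw = pd.DataFrame({
    "date": ["04-21-1991", "04-21-1991", "06-31-1991"],
    "time": ["9:09", "11:00", "17:08"],
    "code": ["58", "60", "62"],
    "value": ["100", "n/a", "119"],
})


def clean_log(log):
    """Apply the three cleaning steps to one patient's log."""
    cleaned = log.copy()

    # Step 1: drop rows without a valid (numeric) blood glucose value.
    cleaned["value"] = pd.to_numeric(cleaned["value"], errors="coerce")
    cleaned = cleaned.dropna(subset=["value"]).copy()

    # Step 2: add a readable description for the numeric code.
    cleaned["code"] = cleaned["code"].astype(int)
    cleaned["code_description"] = (
        cleaned["code"].map(CODE_DESCRIPTIONS).fillna("unknown")
    )

    # Step 3: replace invalid dates with the date of the previous row.
    parsed = pd.to_datetime(cleaned["date"], format="%m-%d-%Y", errors="coerce")
    cleaned["date"] = parsed.ffill()

    return cleaned.reset_index(drop=True)


cleaned = clean_log(raw)
print(cleaned)
```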
Preliminary Analysis - Exploratory Data Analysis
Exploratory Data Analysis was performed, and the findings were as follows. Code and detailed comments can be seen in this link.
- The median blood glucose level for the hypoglycemic population is slightly higher than for the non-hypoglycemic population.
- The standard deviation of blood glucose level is slightly higher for the hypoglycemic population
- The hypoglycemic population exercises more than the non-hypoglycemic population. This may imply that tight control may be one of the reasons for hypoglycemia.
- The hypoglycemic population was significantly more irregular in their day-to-day activities than their counterparts.
- The hypoglycemic population snacked more than their counterparts.
- The hypoglycemic population had significantly more blood glucose readings than their counterparts. This may suggest that strict control may be causing hypoglycemia in these patients.
- The hypoglycemic population takes more insulin shots per day. This may suggest irregularities.
- NPH insulin users have slightly higher chances of experiencing hypoglycemia than non users.
- Most hypoglycemic events happen in the afternoon, between 12 noon and 6 pm.
- The hypoglycemic population had significantly higher blood glucose levels post lunch on non hypoglycemic days than on hypoglycemic days
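Findings such as the ones above come from per-patient summaries of the event log. The sketch below shows one way such summaries could be computed. The event codes are assumptions based on one reading of the dataset description and should be verified against it; the sample events and the helper names are made up for illustration.

```python
import pandas as pd

# Assumed event codes; verify them against the dataset description.
HYPO_CODE = 65  # assumed: "hypoglycemic symptoms"
GLUCOSE_CODES = [48, 57, 58, 59, 60, 61, 62, 63, 64]  # assumed: glucose readings

# Hypothetical cleaned events for two patients.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "1991-04-21 08:00", "1991-04-21 14:30", "1991-04-21 18:00",
        "1991-04-21 08:00", "1991-04-21 12:00", "1991-04-21 18:00",
    ]),
    "code": [58, 65, 62, 58, 60, 62],
    "value": [100, 0, 180, 150, 160, 140],
})


def summarize_patients(events):
    """Median and standard deviation of glucose per patient, plus a hypoglycemia flag."""
    glucose = events[events["code"].isin(GLUCOSE_CODES)]
    summary = glucose.groupby("patient_id")["value"].agg(
        median_bg="median", std_bg="std"
    )
    # A patient is flagged if at least one hypoglycemic event was logged.
    has_hypo = events["code"].eq(HYPO_CODE).groupby(events["patient_id"]).any()
    summary["hypoglycemic"] = has_hypo
    return summary


def hypo_events_by_hour(events):
    """Count hypoglycemic events for each hour of the day."""
    hypo = events[events["code"] == HYPO_CODE]
    return hypo["timestamp"].dt.hour.value_counts().sort_index()


summary = summarize_patients(events)
by_hour = hypo_events_by_hour(events)
print(summary)
print(by_hour)
```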
Preliminary Analysis - Inferential Statistics
A correlation matrix was created using the most significant features obtained from EDA. Below are the most significant features according to the correlation matrix. These features will be used to create the machine learning models.
Code and detailed comments can be seen in this link.
- Irregular diet.
- NPH insulin
- Median blood glucose level.
- Irregular exercise.
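A correlation matrix of this kind can be built with pandas, as in the minimal sketch below. The per-patient table is made up, with one row per patient and a binary `hypoglycemic` label; the column names are assumptions and the values do not come from the study.

```python
import pandas as pd

# Made-up per-patient features; values are for illustration only.
features = pd.DataFrame({
    "irregular_diet": [1, 0, 1, 0, 1, 0],
    "nph_shots_per_day": [2, 1, 2, 0, 1, 0],
    "median_bg": [150, 140, 165, 130, 155, 145],
    "irregular_exercise": [1, 0, 0, 0, 1, 1],
    "hypoglycemic": [1, 0, 1, 0, 1, 0],
})

# Pairwise Pearson correlations between all columns.
correlations = features.corr()

# Rank the features by the strength of their correlation with the label.
ranked = (
    correlations["hypoglycemic"]
    .drop("hypoglycemic")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Ranking by absolute value treats strong negative and strong positive correlations alike, which is usually what is wanted when selecting features.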
Random forest seems to be the best algorithm for this problem.
Why Random Forest?
From EDA conclusions, we can infer that patients experience hypoglycemia for a number of independent reasons.
- Being irregular with diet, exercise and snacking.
- Strict control - People who are very conscious about their day-to-day activities also experience hypoglycemia, because their efforts to maintain optimal blood glucose levels sometimes lead to very low blood glucose levels and hence to hypoglycemic symptoms.
- NPH insulin - People on NPH insulin are more prone to hypoglycemia.
We can see that there are three different sub populations within the hypoglycemic population. There is only one categorical feature. This makes random forest a better fit for the problem than other classifiers like SVM or KNN.
A random forest model was created using the following features.
- Irregular diet
- Median blood glucose level.
- Number of NPH insulin shots per day
- Number of readings per day.
- Number of snack times per day
Code and detailed comments can be seen in this link.
The dataset was divided into a 60/40 train-test split. The model achieved a maximum AUROC of 82%. The ROC-AUC curve is displayed below.
| n=70 | Predicted Hypoglycemic | Predicted Non-Hypoglycemic |
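The modelling and evaluation steps can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data is synthetic, the column names and hyperparameters are invented, and the three-fold cross validation is assumed to run on the training portion of the 60/40 split.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the per-patient feature table (70 patients).
rng = np.random.default_rng(0)
n_patients = 70
X = pd.DataFrame({
    "irregular_diet": rng.integers(0, 2, n_patients),
    "median_bg": rng.normal(150, 30, n_patients),
    "nph_shots_per_day": rng.integers(0, 4, n_patients),
    "readings_per_day": rng.normal(4, 1, n_patients),
    "snacks_per_day": rng.integers(0, 3, n_patients),
})
# Synthetic label loosely tied to two features so the model has some signal.
noise = rng.normal(0, 0.5, n_patients)
y = ((X["readings_per_day"] + X["irregular_diet"] + noise) > 4.5).astype(int)

# 60/40 train-test split, stratified to keep the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Hyperparameters here are placeholders, not the study's settings.
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Three-fold cross validation on the training portion, scored by AUROC.
cv_auc = cross_val_score(model, X_train, y_train, cv=3, scoring="roc_auc")

# Fit on the full training portion and evaluate on the held-out 40%.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Rows are actual classes, columns are predicted classes, hypoglycemic first.
matrix = confusion_matrix(y_test, model.predict(X_test), labels=[1, 0])

print("Cross-validated AUROC per fold:", cv_auc)
print("Held-out AUROC:", test_auc)
print(matrix)
```

Because the data is synthetic, the printed scores say nothing about the real model; only the workflow carries over.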
Hypoglycemia has always been the toughest problem for both diabetic patients and doctors. Doctors are helpless when it comes to glucose control, as lab reports cannot convey the entire picture. With the latest advances in technology, lab reports can now give the blood glucose level at a certain point in time and the average blood glucose level for the past 3 months (HbA1c test). These reports cannot say anything about a patient's risk of hypoglycemia, diabetic neuropathy or other long-term complications such as kidney and eye damage.
The patient's only remaining option is to check glucose levels as frequently as possible, so a large amount of data is generated. But this data is left mostly unused and untouched. Medical science is yet to understand why a few diabetic patients are more prone to hypoglycemia and diabetic neuropathy while other patients are not. Billions of dollars are spent every year on research and control of the disease. Even with millions of diabetics around the world, medical science still does not know its causes or its cure. Can data science be the missing link to solve these riddles? I strongly believe so.
Our study focuses on helping doctors and patients better control the disease. In our study of seventy patients, we identified sub populations within the hypoglycemia-prone diabetics and a few day-to-day habits that appear to be major contributors. Below are our recommendations to diabetics from our study.
- Be regular in your day to day activities. Maintain a pattern with diet and exercise.
- Snack less. Snacks are known to contain a lot of carbs. The fewer the snacks, the greater the control.
- NPH insulin users can discuss alternative treatments with their physicians.
- Strict control may not be necessary. Blood glucose target levels can be slightly higher for diabetics than for the normal population.
This model, with an AUROC of 82%, should be helpful for physicians to identify patients at risk of hypoglycemia and treat them better. Further studies are required in this area to help patients and physicians.