DiaMeter is a graduation project submitted to College of Computer and Information Technology, Shaqra University in partial fulfillment of the requirements for the degree of Bachelor in Computer Science.
- Introduction
- DiaMeter's Dataset and Database
- DiaMeter's Machine Learning Model
- DiaMeter's User Interface
- Tools and Technologies Involved
- Design Overview (UI / UX)
- Installation Guide
Since diabetes has become one of the most common causes of death and the cause of many other serious diseases. And the significant increase in the number of diabetics in the world in general and in Saudi Arabia in particular. It become important that an advanced system be implemented and used to predict diabetes and monitor diabetes patients. In addition, this system should be accessible and usable for non-professionals.
Doctors diagnose diabetes by examining the symptoms that appear on patients and looking at the test results. In addition, specialists can determine a person's risk of developing diabetes by looking at the influencing factors. And since diabetes is one of the most diseases that need continuous routine follow-up, patients must visit the doctor continuously to see if any positive or negative complications of the disease occur and adjust their lifestyle accordingly.
There are two main objectives of the project. First, Diameter will allow everyone to predict the risk of developing diabetes by answering the related questions, such as age, weight, blood pressure. We used VotingClassifier model to combine conceptually different machine learning classifiers (LGBMClassifier, KNeighborsClassifier and GradientBoostingClassifier) and use the average predicted probabilities to balance out their individual weaknesses. And by training this model on The Pima Indians Diabetes dataset from The National Institute of Diabetes and Digestive and Kidney Diseases. DiaMeter will analyze these values and give an accurate prediction and the advice that must be followed to improve their lifestyle.
Additionally, DiaMeter will allow diabetics and doctors to follow-up: the blood sugar levels, change in weight, blood pressure level, and help them to adjust their lifestyle by calculating the appropriate calories for each person according to their age, weight and the goal they seek to reach.
Data Source: The Pima Indians Diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
The dataset has 768 samples of diabetic and healthy individuals.
The dataset consist of several medical predictor variables such as BMI, blood pressure level, age. In addition to one target variable which is the outcome.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
As we can see, the zero values in some of the variables is meaningless. It indicates missing values. Ignoring them can significantly affect the accuracy of the model we are trying to build.
There are two common ways of handling missing data, which are; entirely removing the observations from the data set or imputing a new value based on other observations.
We replaced the missing values with the median value for each feature after excluding the missing data. Taking into consideration the target value; to get the most accurate result possible.
Features | Records with 0 Value | % of Total Records | Median for Outcome 0 | Median for Outcome 1 |
---|---|---|---|---|
Glucose | 5 | 0.65% | 107.0 | 140.0 |
BloodPressure | 35 | 4.56% | 70.0 | 74.5 |
SkinThickness | 227 | 29.56% | 27.0 | 32.0 |
Insulin | 374 | 48.7% | 102.5 | 169.5 |
BMI | 11 | 1.43% | 30.1 | 34.3 |
Data transformation is one of the fundamental steps in processing. It can significantly improve the accuracy of the machine learning model.
We used data standardization to rescale the distribution of values and export the model to use it with the new input data.
Using K-Folds Cross Validation we split our data into k (5) different subsets (folds). We use k-1 folds to train our data and leave the last fold as test data. We then average the model against each of the folds and then finalize our model.
The database is a basic pillar as the second goal of the project is to follow up on diabetics, where their information and medical records, clinics and doctors' data, login information for all users will be stored. In order for doctors to view the reports when needed, and for DiaMeter to give appropriate instructions to each user.
In this project we used MySQL, which is the world’s most popular open source database.
We used VotingClassifier model to combine conceptually different machine learning classifiers (LGBMClassifier, KNeighborsClassifier and GradientBoostingClassifier) and use the average predicted probabilities to balance out their individual weaknesses.
Model’s Performance | |||
---|---|---|---|
Accuracy | 89.6% | Precision | 86.4% |
Recall | 83.6% | F1 score | 84.9% |
Confusion Matrix | |||
True Positives | 224 | False Positives | 36 |
True Negatives | 464 | False Negatives | 44 |
User interface is important to meet user expectations and support the effective functionality of DiaMeter.
That's why we used Flask which is a lightweight WSGI web application framework. Although it doesn’t include an ORM (Object Relational Manager) or such features. It does have many cool features like url routing, template engine. And it is considered as one of the most popular Python web application frameworks.
- macOS: Operating System
- PyCharm: Python IDE
- TablePlus: Database Client
- iA Writer, Pages: Documentation
- Keynote: Presentation
- StarUML, OmniPlan, MindNode, Pixelmator: Designing
- GitHub: Version Control
- MySQL: Project Database
- Python: Project Backend
- Pandas: for data analysis and manipulation
- NumPy: for scientific computing
- Scikit-learn: for machine learning, including data processing, metrics and modeling
- LightGBM: for machine learning modeling
- Plotly: for data analysis and visualization
- Flask: for web application development
- Install MySQL
- Import DiaMeter's Database by running the following command line:
mysql -u root -p DiaMeter < data/DiaMeter.dump
- Install Python
- Install the required libraries by running the following command line:
pip install -r requirements.txt
- (Optional) Install PyCharm and import the project and run
src/main.py
- Run the project using the following command line:
python src/main.py
- You can now access the flask server at
http://127.0.0.1:5000/