nadia-paz/cdc_diabetes

CDC Diabetes Health Indicators

This is a midterm project created by Nadia Paz as part of Machine Learning Zoomcamp.

Tech stack for the project:

Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, Docker, AWS

Icons for Technologies are from:

1. Project's description

Diabetes and prediabetes are national epidemics impacting more than 133 million Americans, and diabetes is one of the fastest-growing chronic diseases in the world (source). Type 2 diabetes is largely preventable by taking several simple steps: keeping weight under control, exercising more, eating a healthy diet, and not smoking (source).

This project uses CDC healthcare statistics gathered from lifestyle surveys. The data includes basic demographics and health questionnaire responses, in which participants answer questions about their activities, lifestyle, health indicators, and diabetes status (diabetes, prediabetes, or healthy). The data set used here has been adjusted for binary classification and has only two possible outcomes: diabetes or healthy. The objective of this project is to create a machine learning classification model that predicts the probability of having diabetes based on health and lifestyle data. The purpose is to raise awareness about diabetes and encourage people to adopt a healthier lifestyle.

2. Data Source and Acquisition

I downloaded the data for the project from the UC Irvine Machine Learning Repository. If you're interested in obtaining the same data, you can find it in the "data" folder of this project, or visit the UC Irvine website and follow the instructions outlined in the "Import in Python" section. After getting the X and y variables, I merged the data into a data frame and saved it as a csv file with the following code:

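import pandas as pd
# X (features) and y (target) come from the "Import in Python" instructions
# merge features and target into a single data frame
df = pd.concat([X, y], axis=1)
# save to csv
df.to_csv('diabetis_data.csv', index_label=False)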
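
A side effect of index_label=False is that the header row gets no field for the index, so each data row has one more field than the header. The sketch below, which uses made-up toy values in place of the real X and y and an in-memory buffer in place of the file, shows that pandas reads such a file back with the first column as the index:

```python
import io

import pandas as pd

# Toy stand-ins for the real X (features) and y (target); the values are made up.
X = pd.DataFrame({"HighBP": [1, 0, 1], "BMI": [40, 25, 28]})
y = pd.DataFrame({"Diabetes_binary": [0, 0, 1]})

# Same merge and save calls as in the project, written to a buffer instead of a file.
df = pd.concat([X, y], axis=1)
buffer = io.StringIO()
df.to_csv(buffer, index_label=False)

# The header has one field fewer than the data rows,
# so read_csv uses the first (unnamed) column as the index.
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded.columns.tolist())
print(loaded.shape)
```

With the real file, pd.read_csv('diabetis_data.csv') should therefore return the feature and target columns without an extra unnamed column.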

The same data is also available on Kaggle, but be aware that the column names and order differ from those used in the project.

The data description can be found on Kaggle or in the Behavioral Risk Factor Surveillance System 2015 Codebook Report.

Data dictionary

| Column name | Definition | Number of unique values | Data type |
|---|---|---|---|
| HighBP | High blood pressure | 2 | int |
| HighChol | High cholesterol | 2 | int |
| CholCheck | Cholesterol check within the last 5 years | 2 | int |
| BMI | Body mass index. Normal range is 18.5 - 24.9 | continuous variable | int |
| Smoker | The respondent has smoked at least 100 cigarettes in his/her life | 2 | int |
| Stroke | Ever had a stroke | 2 | int |
| HeartDiseaseorAttack | History of heart disease or heart attack | 2 | int |
| PhysActivity | Physical activity in the past 30 days, not including job | 2 | int |
| Fruits | Consumes fruit 1 or more times per day | 2 | int |
| Veggies | Consumes vegetables 1 or more times per day | 2 | int |
| HvyAlcoholConsump | Heavy alcohol consumption. Men: >= 14 drinks per week; women: >= 7 drinks per week | 2 | int |
| AnyHealthcare | Healthcare coverage (insurance, medical plans) | 2 | int |
| NoDocbcCost | Could not see a doctor because of cost within the past 12 months | 2 | int |
| GenHlth | Would you say that in general your health is (scale 1-5): 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor | 5 | int |
| MentHlth | Number of days of poor mental health within the past 30 days | 31 | int |
| PhysHlth | Number of days of physical illness or injury within the past 30 days | 31 | int |
| DiffWalk | Serious difficulty walking or climbing stairs | 2 | int |
| Sex | How many times per week do you have sex? (Kidding.) Gender: 0 = female, 1 = male | 2 | int |
| Age | Respondent's age by category: 1: 18-24; 2: 25-29; 3: 30-34; 4: 35-39; 5: 40-44; 6: 45-49; 7: 50-54; 8: 55-59; 9: 60-64; 10: 65-69; 11: 70-74; 12: 75-79; 13: 80 years and older | 13 | int |
| Education | Education level by category: 1, 2, 3: didn't graduate from high school; 4: graduated from high school; 5: attended college; 6: graduated from college | 6 | int |
| Income | Income level by category: 1: less than $10,000; 2: $10,000 to less than $15,000; 3: $15,000 to less than $20,000; 4: $20,000 to less than $25,000; 5: $25,000 to less than $35,000; 6: $35,000 to less than $50,000; 7: $50,000 to less than $75,000; 8: $75,000 or more | 8 | int |

Target variable

| Column name | Definition | Number of unique values | Data type |
|---|---|---|---|
| Diabetes_binary | The respondent has diabetes | 2 | int |

Data manipulations

The original data consists of 253,680 rows and 22 columns. I dropped 24,206 duplicated rows, which leaves 229,474 rows.
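
The text above does not say how the duplicates were removed. A minimal sketch, assuming pandas' drop_duplicates on fully identical rows and using a made-up four-row frame instead of the real data, could look like this:

```python
import pandas as pd

# Made-up rows; the first two are identical in every column.
df = pd.DataFrame({
    "HighBP": [1, 1, 0, 1],
    "BMI": [40, 40, 25, 40],
    "Diabetes_binary": [0, 0, 0, 1],
})

n_before = len(df)
# Keep the first occurrence of each fully identical row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
n_dropped = n_before - len(df)

print(f"dropped {n_dropped} duplicated rows, {len(df)} rows left")
```

Note that the last row is kept: it matches the first two in the features but differs in the target, so it is not a duplicate under this definition.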

3. Download the project

You can download the project from this GitHub repository by selecting Code -> Download ZIP, or clone it and move to the project's directory with the following commands:

git clone git@github.com:nadia-paz/cdc_diabetes.git
cd cdc_diabetes

4. Virtual environments

Anaconda

The project is made using Python 3.9.18 on Anaconda. To create the same virtual environment with Anaconda please refer to the file environment.yml. Install Anaconda or Mamba if you don't have it yet and run the following command in your terminal from the project's directory:

conda env create -f environment.yml

The name of the environment is already specified in the file. After installing the environment, activate it with the command:

conda activate diabetes_project

Then start Jupyter Notebook.

To deactivate the environment: conda deactivate

venv

If you don't have Anaconda or don't want to use it, you can install required dependencies using Python's venv. They are located in the venv_requirements.txt file.

Step 1: Install Python 3.9

Check your Python version in your terminal: python --version or python3 --version. If it is different from Python 3.9.*, install Python 3.9 on your computer according to your operating system's instructions. For Linux: sudo apt-get install python3.9; for Mac: brew install python@3.9; for Windows: manually download and install the required Python version.

Step 2: Locate the path of your Python3.9

Run which python3.9 in your terminal and copy the output. It is your path_to_python.

Step 3: Create a virtual environment

  1. In your terminal, move to the project's folder: cd <path_to_the_project>.
  2. Create the environment. In your terminal, run the command
<path_to_python> -m venv <env_name>

Step 4: Activate the virtual environment

In terminal run:

source <env_name>/bin/activate

Step 5: Install dependencies

Make sure that you are in the project's directory and you have the venv_requirements.txt file in it. Run the following command in the terminal:

python -m pip install -r venv_requirements.txt

Now you can use the project. To deactivate the virtual environment, simply run deactivate in the terminal.

Confirm the virtual environment from Jupyter Notebook

In the code cell run:

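import sys
# path of the Python interpreter behind this kernel; it should point inside your environment
print(sys.executable)
# Python version of the kernel; it should be 3.9.*
print(sys.version)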

You should see the name of your environment in the output. If you don't, register the environment as an IPython kernel. In the terminal window run:

ipython kernel install --user --name=<env_name>

5. Project's files

├── data
│   └── diabetis_data.csv
│   ├── icons8-aws-logo-96.png
│   ├── icons8-docker-logo-96.png
│   ├── icons8-numpy-logo-96.png
│   ├── icons8-pandas-logo-96.png
│   ├── icons8-python-logo-96.png
│   ├── matplotlib.png
│   ├── seaborn.svg
│   ├── sklearn.png
├── deployment
│   ├── Dockerfile
│   ├── Pipfile
│   ├── Pipfile.lock
│   ├── encoder.bin
│   ├── model.bin
│   ├── predict.py
│   ├── test.py
│   └── test_aws.py
├── src
│   ├── data_prep.py
│   ├── explore.py
│   ├── model.py
│   └── transform.py
├── environment.yml
├── notebook.ipynb
├── train.py
├── use_model.py
├── venv_requirements.txt
└── README.md

Directories:

  • data: contains the *.csv file with the data.
  • deployment: contains binary files with the model and the OneHotEncoder that was fit on the train set, the files needed to run the model in Docker, and test_aws.py, which can be copied anywhere and is used to send requests to the web application hosted on AWS.
  • src: contains files that assist with data preparation (data_prep.py), exploration (explore.py), model tuning (model.py), and transformation of a single patient from a dictionary into a NumPy array ready to use in the model (transform.py).

The project's main notebook, with step-by-step exploration, tuning, and saving of the model, is notebook.ipynb. The script needed to build the model is train.py. It contains the code from the notebook and from the files in the src directory that were used for model development.
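
To illustrate the idea behind the single-patient transformation, here is a minimal sketch. It is not the code from src/transform.py: the function name patient_to_array and the shortened feature list are hypothetical, and the sketch skips the one-hot encoding step that the project performs with its saved encoder.

```python
import numpy as np

# Hypothetical, shortened feature order; the real order is defined by the training data.
FEATURES = ["HighBP", "HighChol", "BMI", "GenHlth", "Age"]


def patient_to_array(patient, features=FEATURES):
    """Return a (1, n_features) array with the values in a fixed column order."""
    missing = [name for name in features if name not in patient]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # One row, columns ordered as in `features`, regardless of the dictionary order.
    return np.array([[patient[name] for name in features]], dtype=float)


# Made-up patient; the keys are deliberately in a different order than FEATURES.
patient = {"Age": 9, "BMI": 31, "HighBP": 1, "HighChol": 0, "GenHlth": 3}
row = patient_to_array(patient)
print(row.shape)
```

Fixing the column order matters because a model trained on a NumPy array identifies features by position, not by name.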

6. How to use the model

You can use the model in the project's virtual environment, in Docker, or by sending a request to the AWS web service (a temporary option that will be deprecated soon). Every file used for testing contains a dictionary named patient with the basic information. If you decide to change that information, please refer to the Data dictionary first (part 2, Data Source and Acquisition, of this README).
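
Before editing the patient dictionary, it can help to check the new values against the data dictionary. The sketch below is not part of the project; validate_patient is a hypothetical helper. The ranges follow the data dictionary in part 2, with two assumptions: MentHlth and PhysHlth are taken to be 0-30 (31 unique values), and BMI is only checked for being positive because no range is stated.

```python
# Binary (0/1) columns from the data dictionary.
BINARY = ["HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
          "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
          "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex"]

# Allowed (min, max) for every column except BMI.
RANGES = {name: (0, 1) for name in BINARY}
RANGES.update({"GenHlth": (1, 5), "MentHlth": (0, 30), "PhysHlth": (0, 30),
               "Age": (1, 13), "Education": (1, 6), "Income": (1, 8)})


def validate_patient(patient):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name not in patient:
            problems.append(f"{name}: missing")
        elif not low <= patient[name] <= high:
            problems.append(f"{name}: {patient[name]} is outside {low}-{high}")
    # Assumption: any positive BMI is accepted.
    if patient.get("BMI", 0) <= 0:
        problems.append("BMI: must be a positive number")
    return problems


# Made-up patient with every binary answer set to 0.
patient = {name: 0 for name in BINARY}
patient.update({"BMI": 31, "GenHlth": 3, "MentHlth": 0, "PhysHlth": 5,
                "Age": 9, "Education": 5, "Income": 6})

print(validate_patient(patient))
print(validate_patient(dict(patient, Age=14)))
```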

Virtual environment of the project

In your terminal, move to the project's directory, activate the virtual environment, and run the command python use_model.py or python3 use_model.py.

Docker

  1. Download and install Docker if you don't have it on your machine.

  2. The virtual environment and deployment files are located in the directory deployment.

  3. Build Docker image

    • In your terminal, move to the directory "deployment".

    • Run the following command:

      docker build -t diabetes-project .

      This will build the Docker image on your machine.

      • In case you run into a "Permission denied" error, re-run the command as a superuser:

      sudo docker build -t diabetes-project .

  4. Next, run Docker image.

    docker run --rm -p 2912:2912 diabetes-project

    This command launches the container with the diabetes prediction model, which listens for your requests on localhost port 2912. To send a request, open a new terminal window, move to the deployment directory, and run the script test.py.
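
For readers who want to send a request without test.py, the following is a rough sketch using only the standard library. It is not the content of test.py. The /predict route, the JSON request format, and the helper names build_request and send are assumptions; check deployment/predict.py for the actual route and payload.

```python
import json
import urllib.request

# Assumption: the service accepts a JSON POST at /predict on the port from the docker run command.
URL = "http://localhost:2912/predict"


def build_request(patient, url=URL):
    """Build (but do not send) a JSON POST request for one patient."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(patient, url=URL):
    """Send the request to a running container and decode the JSON reply."""
    with urllib.request.urlopen(build_request(patient, url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request works without a running container; the patient here is shortened and made up.
request = build_request({"HighBP": 1, "BMI": 31})
print(request.get_method(), request.full_url)
```

Calling send(patient) only works while the container from the docker run command above is running.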

Web application

This model is deployed as a web application on AWS Elastic Beanstalk. To use it, run the test_aws.py file located in the deployment folder.
