CDC Diabetes Health Indicators

This is a midterm project created by Nadia Paz as part of the Machine Learning Zoomcamp.

Tech stack for the project:

Icons for Technologies are from:


Seaborn
Matplotlib cleanpng.com
scikit-learn pngegg.com
Everything else is from Icons8
- Python
- Pandas
- NumPy
- Docker

1. Project's description


Diabetes and prediabetes are national epidemics impacting more than 133 million Americans, and diabetes is one of the fastest growing chronic diseases in the world (source). Type 2 diabetes is largely preventable by taking several simple steps: keeping weight under control, exercising more, eating a healthy diet, and not smoking (source).

This project uses CDC healthcare statistics gathered from lifestyle surveys. The data includes basic demographics and health questionnaire responses in which participants answer questions about their activities, lifestyle, health indicators, and diabetes status (diabetes, prediabetes, or healthy). The data has been adjusted for binary classification and has only two possible outcomes: diabetes or healthy.

The objective of this project is to create a machine learning classification model that predicts the probability that a person has diabetes based on health and lifestyle data. The purpose of the project is to raise awareness about diabetes and encourage people to adopt a healthier lifestyle.

2. Data Source and Acquisition


I downloaded the data for the project from the UC Irvine Machine Learning Repository. To obtain the same data, you can find it in the "data" folder of this project, or visit the UC Irvine website and follow the instructions in the "Import in Python" section. After getting the X and y variables, I merged them into a data frame and saved it as a csv file with the following code:

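import pandas as pd

# X (features) and y (target) come from the UC Irvine "Import in Python" snippet
# merge features and target into one data frame
df = pd.concat([X, y], axis=1)
# save to csv
df.to_csv('diabetis_data.csv', index_label=False)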

The same data is also available on Kaggle, but be aware that the column names and order differ from those used in the project.

The data description can be found on Kaggle or in the Behavioral Risk Factor Surveillance System 2015 Codebook Report.

Data dictionary

Column name	Definition	Number of Unique Values	Data Type
`HighBP`	High blood pressure	2	int
`HighChol`	High cholesterol	2	int
`CholCheck`	Cholesterol check within the last 5 years	2	int
`BMI`	Body mass index. Normal range is 18.5 - 24.9	continuous variable	int
`Smoker`	The respondent has smoked at least 100 cigarettes in his/her life	2	int
`Stroke`	Ever had a stroke	2	int
`HeartDiseaseorAttack`	History of heart disease or heart attack	2	int
`PhysActivity`	Physical activity in the past 30 days, not including job	2	int
`Fruits`	Consumes 1 or more fruits per day	2	int
`Veggies`	Consumes 1 or more vegetables per day	2	int
`HvyAlcoholConsump`	Heavy alcohol consumption. Men: more than 14 drinks per week; women: more than 7 drinks per week	2	int
`AnyHealthcare`	Healthcare coverage (insurance, medical plans)	2	int
`NoDocbcCost`	Could not see a doctor because of cost within the past 12 months	2	int
`GenHlth`	Would you say that in general your health is: scale 1-5 `1`: excellent `2`: very good `3`: good `4`: fair `5`: poor	5	int
`MentHlth`	Days of poor mental health within the past 30 days	31	int
`PhysHlth`	Days of physical illness or injury within the past 30 days	31	int
`DiffWalk`	Serious difficulty walking or climbing stairs	2	int
`Sex`	How many times per week do you have sex? (Kidding) Gender: `0`: female `1`: male	2	int
`Age`	Respondent's age by category `1`: 18-24 `2`: 25-29 `3`: 30-34 `4`: 35-39 `5`: 40-44 `6`: 45-49 `7`: 50-54 `8`: 55-59 `9`: 60-64 `10`: 65-69 `11`: 70-74 `12`: 75-79 `13`: 80 years and older	13	int
`Education`	Education level by category: `1`, `2`, `3`: didn't graduate from high school `4`: graduated from high school `5`: attended college `6`: graduated from college	6	int
`Income`	Income level by category: `1`: less than $10,000 `2`: $10,000 to less than $15,000 `3`: $15,000 to less than $20,000 `4`: $20,000 to less than $25,000 `5`: $25,000 to less than $35,000 `6`: $35,000 to less than $50,000 `7`: $50,000 to less than $75,000 `8`: $75,000 or more	8	int
Target Variable

`Diabetes_binary`	The respondent has diabetes	2	int
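
The coded columns in the data dictionary can be turned back into readable labels with a small lookup. The sketch below is only an illustration: `LABELS` and `decode` are hypothetical names that are not part of the project's code, and the labels are copied from the table above for three of the columns.

```python
# Hypothetical lookup tables; labels are copied from the data dictionary.
LABELS = {
    "GenHlth": {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"},
    "Sex": {0: "female", 1: "male"},
    "Education": {
        1: "didn't graduate from high school",
        2: "didn't graduate from high school",
        3: "didn't graduate from high school",
        4: "graduated from high school",
        5: "attended college",
        6: "graduated from college",
    },
}


def decode(record):
    """Return a copy of record with coded values replaced by labels where known."""
    return {
        column: LABELS.get(column, {}).get(value, value)
        for column, value in record.items()
    }


example = {"GenHlth": 2, "Sex": 0, "Education": 6, "BMI": 27}
print(decode(example))
```

Columns without a lookup table (such as `BMI`) and unknown codes are passed through unchanged.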

Data manipulations

The original data consists of 253,680 rows and 22 columns. I dropped 24,206 duplicated rows, which leaves 229,474 rows.
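
A minimal sketch of the deduplication step on a toy data frame. The project's own code may differ: using `drop_duplicates` is an assumption about how the duplicated rows were removed, and the toy values are made up for illustration.

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are illustrative only).
toy = pd.DataFrame(
    {
        "HighBP": [1, 0, 1, 0],
        "BMI": [28, 22, 28, 31],
        "Diabetes_binary": [1, 0, 1, 0],
    }
)

# Count rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

The same bookkeeping applies to the full data set: 253,680 - 24,206 = 229,474 rows.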

3. Download the project


You can download the project from this GitHub repository by selecting Code -> Download ZIP, or clone it with the command

git clone git@github.com:nadia-paz/cdc_diabetis.git

Then run cd cdc_diabetis to move to the project's directory.

4. Virtual environments


Anaconda

The project is made using Python 3.9.18 on Anaconda. To create the same virtual environment with Anaconda please refer to the file environment.yml. Install Anaconda or Mamba if you don't have it yet and run the following command in your terminal from the project's directory:

conda env create -f environment.yml

The name of the environment is already specified in the file. After installing the environment, activate it with the command:

conda activate diabetes_project

and start Jupyter with the command jupyter notebook

To deactivate the environment: conda deactivate

`venv`

If you don't have Anaconda or don't want to use it, you can install the required dependencies using Python's venv. They are listed in the venv_requirements.txt file.

Step 1: Install Python 3.9

Check your Python version in your terminal: python --version or python3 --version. If it is different from Python 3.9.*, install Python 3.9 on your computer following the instructions for your operating system: for Linux sudo apt-get install python3.9, for Mac brew install python@3.9, for Windows manually download and install the required Python version.

Step 2: Locate the path of your Python3.9

Run in your terminal which python3.9. Copy the output. It is your path_to_python

Step 3: Create a virtual environment

In your terminal move to the project's folder: cd <path_to_the_project>.
Create the environment. In your terminal run the command

<path_to_python> -m venv <env_name>

Step 4: Activate the virtual environment

In terminal run:

source <env_name>/bin/activate

Step 5: Install dependencies

Make sure that you are in the project's directory and you have the venv_requirements.txt file in it. Run the following command in the terminal:

python -m pip install -r venv_requirements.txt

Now you can use the project. To deactivate the virtual environment, simply run deactivate in the terminal.

Confirm virtual environment from `jupyter notebook`

In the code cell run:

import sys
# path of the interpreter running this kernel; it should point inside your environment
print(sys.executable)
# Python version used by the kernel
print(sys.version)

You should see the name of your environment in the output. If you don't, register the environment as an IPython kernel. In the terminal window run:

ipython kernel install --user --name=<env_name>
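
A slightly more detailed check can also be run in a notebook cell. This is a sketch rather than part of the project: `describe_environment` is a hypothetical helper, and the `CONDA_DEFAULT_ENV` value reflects the shell that started Jupyter, so it may be empty or differ from the kernel when the kernel was registered separately.

```python
import os
import sys


def describe_environment():
    """Collect details that show which Python environment the kernel runs in."""
    return {
        # Full path of the running interpreter.
        "executable": sys.executable,
        # Interpreter version, e.g. 3.9.18.
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # venv-style environments change sys.prefix but keep sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # conda sets this variable when an environment is activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

For a `venv` environment, `in_venv` should be `True`; for Anaconda, `conda_env` should show the environment name.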

5. Project's files


├── data
│   └── diabetis_data.csv
│   ├── icons8-aws-logo-96.png
│   ├── icons8-docker-logo-96.png
│   ├── icons8-numpy-logo-96.png
│   ├── icons8-pandas-logo-96.png
│   ├── icons8-python-logo-96.png
│   ├── matplotlib.png
│   ├── seaborn.svg
│   ├── sklearn.png
├── deployment
│   ├── Dockerfile
│   ├── Pipfile
│   ├── Pipfile.lock
│   ├── encoder.bin
│   ├── model.bin
│   ├── predict.py
│   ├── test.py
│   └── test_aws.py
├── src
│   ├── data_prep.py
│   ├── explore.py
│   ├── model.py
│   └── transform.py
├── environment.yml
├── notebook.ipynb
├── train.py
├── use_model.py
├── venv_requirements.txt
└── README.md

Directories:

data: contains the *.csv file with the data
deployment: contains binary files with the model and the OneHotEncoder that was fit on the train set, files to use the model on Docker, and test_aws.py, which can be copied anywhere and is used to send requests to the web application hosted on AWS.
src: contains files that assist with data preparation (data_prep.py), exploration (explore.py), model tuning (model.py), and transformation of a single patient from a dictionary into a numpy array ready to use in the model (transform.py).

The project's main notebook with step-by-step exploration, tuning and saving of the model is notebook.ipynb. The script needed to build the model is in the train.py file. It contains the code from the notebook and from the files located in the src directory that was used for model development.

6. How to use the model

You can use the model in the virtual environment of the project, on Docker, or by sending a request to the AWS web service (a temporary option that will be deprecated soon). Every file that is used for testing contains a dictionary patient with the basic information. If you decide to change that information, please refer to the Data dictionary in part 2 of this README first.
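
Before editing the patient dictionary, it can help to check the new values against the data dictionary. The sketch below is an illustration only: `validate_patient` is a hypothetical helper, it assumes the dictionary keys match the column names from the data dictionary, and the 0-30 range for `MentHlth` and `PhysHlth` is inferred from their 31 unique values.

```python
# Allowed values per column, taken from the data dictionary in this README.
BINARY_COLUMNS = [
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex",
]
RANGES = {column: (0, 1) for column in BINARY_COLUMNS}
RANGES.update({
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),   # assumption: days within the past 30 days
    "PhysHlth": (0, 30),   # assumption: days within the past 30 days
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
})


def validate_patient(patient):
    """Return a list of problems found in a patient dictionary (empty if none)."""
    problems = []
    for column, (low, high) in RANGES.items():
        if column not in patient:
            problems.append(f"{column}: missing")
        elif not low <= patient[column] <= high:
            problems.append(f"{column}: {patient[column]} is outside {low}-{high}")
    # BMI is continuous, so only check that it is present and positive.
    if patient.get("BMI", 0) <= 0:
        problems.append("BMI: missing or not positive")
    return problems


# Made-up example patient, for illustration only.
patient = {column: 0 for column in BINARY_COLUMNS}
patient.update({"BMI": 27, "GenHlth": 2, "MentHlth": 0, "PhysHlth": 3,
                "Age": 9, "Education": 6, "Income": 7})
print(validate_patient(patient))
```

An empty list means every value falls inside the documented range.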

Virtual environment of the project

In your terminal move to the project's directory, activate the virtual environment and run the command python use_model.py or python3 use_model.py.

Docker

Download and install Docker if you don't have it on your machine.
The virtual environment and deployment files are located in the directory deployment.
Build the Docker image:
- In your terminal move to the directory "deployment".
- Run the following command:
  
  docker build -t diabetes-project .
  
  This will build the Docker image on your machine.
- If you run into a "Permission denied" error, re-run the command as a superuser:
  
  sudo docker build -t diabetes-project .

Next, run the Docker image:

docker run --rm -p 2912:2912 diabetes-project

This command launches the container with the diabetes prediction model and listens for your requests on localhost port 2912. To send this request you have to open a new terminal window, move to the deployment directory and run the script test.py.
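
For readers who want to query the running container from their own code, a request could look roughly like the sketch below. It is not the content of test.py: the `/predict` path, the JSON request format, and the helper names `build_request` and `send` are assumptions, so check test.py and predict.py for the real endpoint and payload.

```python
import json
from urllib import request

# Assumption: the service accepts a JSON body at this path on port 2912.
URL = "http://localhost:2912/predict"


def build_request(patient, url=URL):
    """Create a POST request carrying the patient dictionary as JSON."""
    body = json.dumps(patient).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(patient, url=URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    with request.urlopen(build_request(patient, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the container running, something like this would query the model:
# print(send({"HighBP": 1, "BMI": 31}))
```

Only the standard library is used here, so the sketch runs without the project's dependencies installed.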

Web application

This model is deployed as a web application on AWS Elastic Beanstalk. To use it, run the test_aws.py file located in the deployment folder.
