This is a midterm project created by Nadia Paz as part of the Machine Learning Zoomcamp.
Tech stack for the project:
Icons for Technologies are from:
- Seaborn
- Matplotlib cleanpng.com
- Sci-Kit Learn pngegg.com
- Everything else is from Icons8
Diabetes and prediabetes are national epidemics impacting more than 133 million Americans, and diabetes is one of the fastest growing chronic diseases in the world (source). Type 2 diabetes is largely preventable by taking several simple steps: keeping weight under control, exercising more, eating a healthy diet, and not smoking (source). This project involves CDC healthcare statistics and information gathered from lifestyle surveys. The data I am working with includes basic demographics and health questionnaire responses from participants, where they answer questions about their activities, lifestyle, health indicators, and their diabetes condition (diabetes, pre-diabetes, or healthy). The information has been adjusted for binary classification and has only two possible outcomes: diabetes or healthy. The objective of this project is to create a machine learning classification model that is capable of predicting the probability of contracting diabetes based on health and lifestyle data. The purpose of this project is to raise awareness about diabetes and encourage people to adopt a healthier lifestyle.
I downloaded the data for the project from the UC Irvine Machine Learning Repository. If you're interested in obtaining the same data, you can access it in the "data" folder of this project or visit the UC Irvine website and follow the instructions outlined in the "Import in Python" section. After getting the `X` and `y` variables, I merged the data into a data frame and saved it as a `csv` file with the following code:
import pandas as pd
# merge the features (X) and the target (y) column-wise
df = pd.concat([X, y], axis=1)
# save to csv
df.to_csv('diabetis_data.csv', index_label=False)
The same data is also available on Kaggle, but be aware that the column names and order differ from those used in the project.
The data description can be found on Kaggle or in the Behavioral Risk Factor Surveillance System 2015 Codebook Report.
Column name | Definition | Number of Unique Values | Data Type |
---|---|---|---|
HighBP | High blood pressure | 2 | int |
HighChol | High cholesterol | 2 | int |
CholCheck | Cholesterol check within the last 5 years | 2 | int |
BMI | Body mass index. Normal range is 18.5 - 24.9 | continuous variable | int |
Smoker | The respondent smoked at least 100 cigarettes throughout his/her life | 2 | int |
Stroke | Ever had a stroke | 2 | int |
HeartDiseaseorAttack | History of heart disease or heart attack | 2 | int |
PhysActivity | Physical activity in the past 30 days, not including job | 2 | int |
Fruits | Consumes 1 or more fruits per day | 2 | int |
Veggies | Consumes 1 or more vegetables per day | 2 | int |
HvyAlcoholConsump | Heavy alcohol consumption. Men: >= 14 drinks per week; women: >= 7 drinks per week | 2 | int |
AnyHealthcare | Healthcare coverage (insurance, medical plans) | 2 | int |
NoDocbcCost | Could not see a doctor because of cost within the past 12 months | 2 | int |
GenHlth | Would you say that in general your health is (scale 1-5): 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor | 5 | int |
MentHlth | Days of poor mental health within the past 30 days | 31 | int |
PhysHlth | Days of physical illness or injury within the past 30 days | 31 | int |
DiffWalk | Serious difficulty walking or climbing stairs | 2 | int |
Sex | How many times per week do you have sex? (Kidding) Gender: 0 = female, 1 = male | 2 | int |
Age | Respondent's age by category: 1 = 18-24, 2 = 25-29, 3 = 30-34, 4 = 35-39, 5 = 40-44, 6 = 45-49, 7 = 50-54, 8 = 55-59, 9 = 60-64, 10 = 65-69, 11 = 70-74, 12 = 75-79, 13 = 80 years and older | 13 | int |
Education | Education level by category: 1, 2, 3 = didn't graduate from high school, 4 = graduated from high school, 5 = attended college, 6 = graduated from college | 6 | int |
Income | Income level by category: 1 = less than $10,000, 2 = $10,000 to less than $15,000, 3 = $15,000 to less than $20,000, 4 = $20,000 to less than $25,000, 5 = $25,000 to less than $35,000, 6 = $35,000 to less than $50,000, 7 = $50,000 to less than $75,000, 8 = $75,000 or more | 8 | int |
Target Variable | | | |
Diabetes_binary | The respondent has diabetes | 2 | int |
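
Because every column is stored as an integer code, it can help to translate codes back into labels when reading a single record. The snippet below is a minimal sketch of that idea: the label text is copied from the data dictionary above, while the names `LABELS` and `decode` are hypothetical helpers and are not part of the project's code.

```python
# Label lookups copied from the data dictionary; only three columns are shown.
LABELS = {
    "GenHlth": {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"},
    "Sex": {0: "female", 1: "male"},
    "Diabetes_binary": {0: "healthy", 1: "diabetes"},
}


def decode(column, code):
    """Return the label for a coded value (hypothetical helper).

    Columns without a lookup table, such as BMI, are returned unchanged.
    """
    return LABELS.get(column, {}).get(code, code)


record = {"GenHlth": 2, "Sex": 0, "BMI": 27, "Diabetes_binary": 0}
readable = {name: decode(name, value) for name, value in record.items()}
print(readable)
```

The values in `record` are made up for illustration and do not come from the dataset.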
The original data consists of 253,680 rows and 22 columns. I dropped 24,206 duplicated rows, which leaves 229,474 rows.
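
Dropping exact duplicate rows is commonly done in pandas with `DataFrame.drop_duplicates`. The sketch below shows that call on a tiny made-up frame; that the project uses exactly this call is an assumption, and only the column names come from the data dictionary.

```python
import pandas as pd

# Made-up rows standing in for the real data; rows 0 and 1 are identical.
df = pd.DataFrame(
    {
        "HighBP": [1, 1, 0, 1],
        "BMI": [28, 28, 22, 31],
        "Diabetes_binary": [1, 1, 0, 0],
    }
)

n_before = len(df)
# Keep the first occurrence of each fully identical row.
df_clean = df.drop_duplicates().reset_index(drop=True)
n_dropped = n_before - len(df_clean)

print(f"dropped {n_dropped} duplicated rows, {len(df_clean)} rows remain")
```

On the real data the same arithmetic gives 253,680 - 24,206 = 229,474 rows.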
You can download the project from this GitHub repository by selecting Code -> Download ZIP, or clone it by running the command `git clone git@github.com:nadia-paz/cdc_diabetis.git`. Then run `cd cdc_diabetis` to move to the project's directory.
The project was built with Python 3.9.18 on Anaconda. To create the same virtual environment with Anaconda, use the file `environment.yml`. Install Anaconda or Mamba if you don't have it yet and run the following command in your terminal from the project's directory:
conda env create -f environment.yml
The name of the environment is already specified in the file. After creating the environment, activate it with the command:
conda activate diabetes_project
and start jupyter notebook
To deactivate the environment: conda deactivate
If you don't have Anaconda or don't want to use it, you can install the required dependencies using Python's `venv`. They are listed in the `venv_requirements.txt` file.
Step 1: Install Python 3.9
Check your Python version in your terminal: `python --version` or `python3 --version`. If it is different from Python 3.9.*, install Python 3.9 on your computer following the instructions for your operating system. For Linux run `sudo apt-get install python3.9`, for Mac run `brew install python@3.9`, and for Windows manually download and install the required Python version.
Step 2: Locate the path of your Python 3.9
Run `which python3.9` in your terminal. Copy the output. It is your `path_to_python`.
Step 3: Create a virtual environment
- In your terminal, move to the project's folder: `cd <path_to_the_project>`.
- Create the environment. In your terminal run the command `<path_to_python> -m venv <env_name>`.
Step 4: Activate the virtual environment
In terminal run:
source <env_name>/bin/activate
Step 5: Install dependencies
Make sure that you are in the project's directory and that it contains the `venv_requirements.txt` file. Run the following command in the terminal:
python -m pip install -r venv_requirements.txt
Now you can use the project. To deactivate the virtual environment, simply run `deactivate` in the terminal.
To check that the notebook uses the right environment, run in a code cell:
import sys
# path of the Python interpreter that runs the notebook kernel
print(sys.executable)
# version of that interpreter
print(sys.version)
You should see the name of your environment in the printed path. If you don't, register the environment as an IPython kernel. In the terminal window run:
ipython kernel install --user --name=<env_name>
├── data
│ └── diabetis_data.csv
│ ├── icons8-aws-logo-96.png
│ ├── icons8-docker-logo-96.png
│ ├── icons8-numpy-logo-96.png
│ ├── icons8-pandas-logo-96.png
│ ├── icons8-python-logo-96.png
│ ├── matplotlib.png
│ ├── seaborn.svg
│ ├── sklearn.png
├── deployment
│ ├── Dockerfile
│ ├── Pipfile
│ ├── Pipfile.lock
│ ├── encoder.bin
│ ├── model.bin
│ ├── predict.py
│ ├── test.py
│ └── test_aws.py
├── src
│ ├── data_prep.py
│ ├── explore.py
│ ├── model.py
│ └── transform.py
├── environment.yml
├── notebook.ipynb
├── train.py
├── use_model.py
├── venv_requirements.txt
└── README.md
Directories:
- `data`: contains the `*.csv` file with the data.
- `deployment`: contains binary files with the model and the OneHotEncoder that was fit on the `train` set, files to use the model on Docker, and `test_aws.py`, which can be copied anywhere and is used to send requests to the web application hosted on AWS.
- `src`: contains files that assist with data preparation (`data_prep.py`), exploration (`explore.py`), model tuning (`model.py`), and transformation of a single patient from a dictionary into a `numpy` array ready to use in the model (`transform.py`).
The project's main notebook, with step-by-step exploration, tuning, and saving of the model, is `notebook.ipynb`. The script needed to build the model is in the `train.py` file. It contains the code from the notebook and from the files in the `src` directory that was used for the model development.
You can use the model in the virtual environment of the project, on Docker, or by sending a request to the AWS web service (a temporary option that will be deprecated soon). Every file that is used for testing contains a dictionary `patient` with the basic information. If you decide to change that information, please refer to the Data Dictionary before making changes (part 2, Data Source and Acquisition, of this README file).
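
Before editing the `patient` dictionary, a quick check against the data dictionary can catch values the model was never trained on. The snippet below is a minimal sketch of such a check. The function `validate_patient` and the sample values are hypothetical and not part of the project; the 0-30 range for `MentHlth` and `PhysHlth` is an assumption inferred from their 31 unique values.

```python
# Columns that take only 0 or 1, as listed in the data dictionary.
BINARY = [
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex",
]
# Inclusive integer ranges for the coded columns (0-30 is an assumption).
RANGES = {
    "GenHlth": (1, 5), "MentHlth": (0, 30), "PhysHlth": (0, 30),
    "Age": (1, 13), "Education": (1, 6), "Income": (1, 8),
}


def validate_patient(patient):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name in BINARY:
        if patient.get(name) not in (0, 1):
            problems.append(f"{name} must be 0 or 1")
    for name, (low, high) in RANGES.items():
        value = patient.get(name)
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{name} must be an integer between {low} and {high}")
    bmi = patient.get("BMI")
    if not isinstance(bmi, (int, float)) or bmi <= 0:
        problems.append("BMI must be a positive number")
    return problems


# Made-up example record: every binary column set to 0, then coded columns filled in.
patient = {name: 0 for name in BINARY}
patient.update({"BMI": 27, "GenHlth": 2, "MentHlth": 0, "PhysHlth": 3,
                "Age": 7, "Education": 5, "Income": 6})
print(validate_patient(patient))
```

Returning a list of problems instead of raising on the first one makes it easier to fix several fields in one pass.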
In your terminal, move to the project's directory, activate the virtual environment, and run the command `python use_model.py` or `python3 use_model.py`.
- Download and install Docker if you don't have it on your machine.
- The virtual environment and deployment files are located in the directory `deployment`.
- Build the Docker image:
  - In your terminal, move to the directory "deployment".
  - Run the following command: `docker build -t diabetes-project .` This will build the Docker image on your machine.
  - In case you run into a "Permission denied" error, re-run the command as a superuser: `sudo docker build -t diabetes-project .`
- Next, run the Docker image: `docker run --rm -p 2912:2912 diabetes-project`. This command launches the container with the diabetes prediction model and listens for your requests on localhost port `2912`. To send a request, open a new terminal window, move to the deployment directory, and run the script `test.py`.
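
For orientation, a request to the running container could look roughly like the sketch below, which uses only the Python standard library. It is not the content of `test.py`: the `/predict` path, the JSON request body, and the JSON response are all assumptions, and `build_request` and `send` are hypothetical helpers. Only the host and port come from the `docker run` command above.

```python
import json
import urllib.request

# Host and port match the docker run command; the /predict path is an assumption.
URL = "http://localhost:2912/predict"


def build_request(patient, url=URL):
    """Wrap a patient dictionary in a JSON POST request (hypothetical helper)."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def send(patient, url=URL):
    """Send the request and parse the reply, assuming the service returns JSON."""
    with urllib.request.urlopen(build_request(patient, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Made-up partial record; a real request needs every feature from the data dictionary.
example = {"HighBP": 1, "BMI": 30}
request = build_request(example)
print(request.get_method(), request.full_url)
# With the container running, call: send(full_patient_dictionary)
```

Building the request separately from sending it keeps the payload easy to inspect without a running container.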
This model is deployed as a web application on AWS Elastic Beanstalk. To use it, run the `test_aws.py` file located in the `deployment` folder.