<h1 style="color:#b01e3a;">Diabetes Risk Prediction</h1>

A machine learning project that predicts diabetes risk from healthcare indicators.

__Table of Contents__

1. [About the project](#about)
2. [Load data](https://github.com/nadia-paz/cdc_diabetes/blob/main/notebook.ipynb#loaddata)
3. [Exploratory Data Analysis](#eda)
4. Creating and Tuning ML models
    - Logistic Regression
    - Decision Tree
    - Random Forest
    - XGBoost
5. Training the Final model

<a name="about"></a> 
### About the project

> The project works with BRFSS data from the CDC. The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that collects data about US residents' health-related risk behaviors, chronic health conditions, and use of preventive services. To learn more about the data and its source, please refer to the `Readme` file of the project.

- The main objective of the project is to develop a __machine-learning model__ that can *predict the risk of developing diabetes based on health indicators*. 
- As a secondary objective, the project aims to make a positive impact on people's lives: by analyzing health indicators and risk behaviors, I hope to create a model that helps individuals take proactive steps toward their health and well-being.


In [1]:
# imports
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format='retina'

from scipy import stats 
from sklearn.metrics import mutual_info_score

# load data preparation module
import src.data_prep as dp 

<a class="anchor" id="loaddata"></a>
### Load data

The data preparation code is located in the `data_prep.py` file in the `src` directory. It loads data, drops duplicates, converts data types, and splits data into `train`, `validation`, and `test` sets.
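
The steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `data_prep.py`: the function names `prepare` and `split`, the list of columns cast to `category`, the toy data, and the 60/20/20 split ratio are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumption: a small subset of the coded columns, for illustration only.
CATEGORY_COLS = ["HighBP", "HighChol", "Smoker"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and cast coded columns to the category dtype."""
    df = df.drop_duplicates()
    return df.astype({col: "category" for col in CATEGORY_COLS})


def split(df: pd.DataFrame, seed: int = 42):
    """Return train, validation, and test frames (assumed 60/20/20 ratio)."""
    # First hold out 20% of the rows as the test set.
    train_val, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Then take 25% of the remaining 80% as validation (20% of the total).
    train, val = train_test_split(train_val, test_size=0.25, random_state=seed)
    return train, val, test


# Toy data standing in for the real survey file: 100 unique rows + 10 duplicates.
toy = pd.DataFrame({
    "HighBP": [0, 1] * 50,
    "HighChol": [1, 1, 0, 0] * 25,
    "Smoker": [0] * 100,
    "BMI": range(100),
    "Diabetes_binary": [0, 0, 0, 1] * 25,
})
toy = pd.concat([toy, toy.head(10)])

clean = prepare(toy)
train, val, test = split(clean)
print(len(clean), len(train), len(val), len(test))
```

Holding out the test set before any exploration or tuning keeps it unseen until the final evaluation.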

In [2]:
# load the data
df = dp.acquire()
df.head()

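|   | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ... | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 | 0 |
| 1 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 | 0 |
| 2 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 | 0 |
| 3 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 | 0 |
| 4 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 | 0 |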


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 229474 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype   
---  ------                --------------   -----   
 0   HighBP                229474 non-null  category
 1   HighChol              229474 non-null  category
 2   CholCheck             229474 non-null  category
 3   BMI                   229474 non-null  int64   
 4   Smoker                229474 non-null  category
 5   Stroke                229474 non-null  category
 6   HeartDiseaseorAttack  229474 non-null  category
 7   PhysActivity          229474 non-null  category
 8   Fruits                229474 non-null  category
 9   Veggies               229474 non-null  category
 10  HvyAlcoholConsump     229474 non-null  category
 11  AnyHealthcare         229474 non-null  category
 12  NoDocbcCost           229474 non-null  category
 13  GenHlth               229474 non-null  category
 14  MentHlth              229474 non-null  in

On initial inspection, the data consists mostly of numeric codes and is already prepared for machine learning. However, this encoding is not convenient for data exploration. Using the data dictionary provided in the `Readme` file, I replaced the numeric codes in categorical columns with human-readable labels, which makes exploration easier. These changes affect only the `df_explore` data frame, which excludes all data from the `test` set; the original data remains untouched.
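
A minimal sketch of this relabeling step is shown below. The helper name `make_explore_frame`, the toy frame, and the label maps are illustrative assumptions; the project's actual mappings come from the data dictionary in the `Readme` file.

```python
import pandas as pd

# Assumption: illustrative label maps for three columns only.
LABEL_MAPS = {
    "HighBP": {0: "No", 1: "Yes"},
    "Smoker": {0: "No", 1: "Yes"},
    "GenHlth": {1: "Excellent", 2: "Very good", 3: "Good", 4: "Fair", 5: "Poor"},
}


def make_explore_frame(source: pd.DataFrame, label_maps: dict) -> pd.DataFrame:
    """Return a relabeled copy of `source`; the input frame is not modified."""
    explore = source.copy()
    for col, mapping in label_maps.items():
        # Replace each numeric code with its readable label.
        explore[col] = explore[col].map(mapping).astype("category")
    return explore


# Toy rows standing in for the non-test portion of the data.
non_test = pd.DataFrame({
    "HighBP": [1, 0, 1],
    "Smoker": [1, 1, 0],
    "GenHlth": [5, 3, 2],
    "BMI": [40, 25, 28],
})

df_explore = make_explore_frame(non_test, LABEL_MAPS)
print(df_explore)
```

Working on a copy keeps the numeric encoding available for model training while the labeled frame is used only for plots and summaries.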

<a name="eda"></a> 
### Exploratory Data Analysis