# Diabetes prediction from 2022 BRFSS Survey Data

# Project Summary: Diabetes Risk Prediction

## Introduction

This project aims to develop a machine learning model to predict diabetes risk using data from the 2022 Behavioral Risk Factor Surveillance System (BRFSS) Survey. The project encompasses data cleaning, exploratory data analysis (EDA), feature engineering, and machine learning model development and evaluation.

## Project Structure

The project is divided into three main phases, each documented in a separate Jupyter notebook:

1. [Data Cleaning](DataCleaning.ipynb)
2. [Exploratory Data Analysis (EDA)](EDA.ipynb)
3. [Model Development and Evaluation](Model.ipynb)

## Phase 1: Data Cleaning

In the [Data Cleaning](DataCleaning.ipynb) phase, we performed the following steps:

1. Loaded the raw BRFSS dataset (445,132 respondents, 328 variables)
2. Handled missing and unknown values:
   - Removed rows with "don't know" or "refused" responses for most variables
   - Imputed missing values for some variables based on survey structure or using mode imputation
3. Recoded categorical variables:
   - Converted binary variables from 1/2 to 1/0 coding
   - Reordered ordinal variables to ensure logical progression
4. Created the target variable by combining diabetes and pre-diabetes into a single binary target
5. Performed initial feature engineering:
   - Created a composite 'smoking_factor' from multiple smoking-related variables
   - Developed a 'financial_stability_score' from socioeconomic indicators
   - Combined disability indicators into a 'health_difficulties' feature
   - Created a 'chronic_disease' feature from various disease indicators
6. Dropped unusable or redundant columns
7. Handled remaining missing values by dropping rows
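
The recoding and target-construction steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the column names `exercise` and `diabetes_code` are hypothetical, the codes (1 = yes, 2 = no, 7 = don't know, 9 = refused) are assumed to follow the usual BRFSS convention, and treating a pregnancy-only diagnosis as negative is an assumption.

```python
# Assumed BRFSS-style codes: 1 = yes, 2 = no, 7 = don't know, 9 = refused.
UNKNOWN_CODES = {7, 9}

def recode_binary(value):
    """Map 1/2 coding to 1/0; return None for unknown or missing codes."""
    if value is None or value in UNKNOWN_CODES:
        return None
    return {1: 1, 2: 0}.get(value)

def make_target(code):
    """Combine diabetes and pre-diabetes into one binary target.

    Assumed codes: 1 = diabetes, 2 = only during pregnancy,
    3 = no, 4 = pre-diabetes. Anything else is treated as unknown.
    """
    if code in (1, 4):
        return 1
    if code in (2, 3):
        return 0
    return None

def clean_rows(rows, binary_cols, target_col):
    """Recode each row and drop rows with any unknown value."""
    cleaned = []
    for row in rows:
        out = {col: recode_binary(row.get(col)) for col in binary_cols}
        out["target"] = make_target(row.get(target_col))
        if None not in out.values():
            cleaned.append(out)
    return cleaned

# Made-up example rows; the last two contain "refused" / "don't know".
rows = [
    {"exercise": 1, "diabetes_code": 3},
    {"exercise": 2, "diabetes_code": 4},
    {"exercise": 9, "diabetes_code": 1},
    {"exercise": 1, "diabetes_code": 7},
]
cleaned = clean_rows(rows, ["exercise"], "diabetes_code")
print(cleaned)
```

In a notebook the same logic would more naturally be written with pandas column operations; the plain-Python version is used here only to keep the sketch dependency-free.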

The cleaning process resulted in a final dataset of 230,693 respondents (51.83% of original data) with 60 variables.
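
As a quick check of the retention figure, the percentage can be recomputed from the two row counts quoted above (the variable names are illustrative only):

```python
raw_respondents = 445_132    # rows in the raw BRFSS file
clean_respondents = 230_693  # rows left after cleaning

retained_pct = round(100 * clean_respondents / raw_respondents, 2)
dropped_respondents = raw_respondents - clean_respondents

print(f"retained: {retained_pct}%  dropped: {dropped_respondents:,} rows")
```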

## Phase 2: Exploratory Data Analysis (EDA)

In the [EDA](EDA.ipynb) phase, we conducted a thorough analysis of the cleaned dataset:

1. Examined the distribution of the target variable:
   - Found a significant class imbalance: approximately 80% non-diabetic vs. 20% diabetic
2. Analyzed key feature distributions and relationships:
   - Demographic features (age, gender, race)
   - Health status indicators (general health, BMI, chronic diseases)
   - Lifestyle factors (physical activity, smoking, drinking, sleep)
   - Socioeconomic factors (financial stability, education)
   - Healthcare access (having a personal doctor, checkup frequency)
3. Performed feature selection using a Balanced Random Forest Classifier, identifying 15 top features
4. Conducted statistical tests (T-tests, Chi-square tests) to validate relationships between features and the target variable
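
To illustrate the kind of Chi-square test mentioned in step 4, here is a minimal sketch for a 2x2 table of a binary feature against the binary target. The counts are made up for the example, the function name is hypothetical, and no continuity correction is applied; the notebook may well use a library routine such as `scipy.stats.chi2_contingency` instead.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 contingency table, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = physically active (yes, no),
# columns = target (0 = no diabetes, 1 = diabetes or pre-diabetes).
toy_table = [[700, 100], [150, 50]]
stat, p_value = chi_square_2x2(toy_table)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

A small p-value here would indicate that the feature and the target are unlikely to be independent in the toy data.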

Key findings from the EDA include:
- Strong correlations between general health, BMI, age, and diabetes risk
- Complex relationships between lifestyle factors and diabetes risk
- The importance of socioeconomic factors and healthcare access in diabetes risk assessment

## Phase 3: Model Development and Evaluation

In the [Model](Model.ipynb) phase, we developed and evaluated machine learning models for diabetes risk prediction:

1. Prepared the data for modeling:
   - Performed one-hot encoding for categorical variables
   - Split the data into training and testing sets
   - Applied SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance
2. Developed two main models:
   - XGBoost (Gradient Boosting Machine)
   - Balanced Random Forest
3. Tuned hyperparameters using randomized search with cross-validation
4. Evaluated models using multiple metrics:
   - Area Under the ROC Curve (AUC-ROC)
   - F1-Score
   - Precision-Recall Curve
   - Balanced Accuracy
5. Analyzed feature importances to understand key predictors of diabetes risk
6. Examined model performance across different subgroups to assess fairness and generalizability
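
The idea behind SMOTE in step 1 can be sketched as follows: each synthetic minority point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified, dependency-free illustration with made-up data and a hypothetical function name, not the implementation used in the notebook. It assumes the oversampling is applied to the training split only, so the test set keeps its original class balance.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy training split (made-up values): 6 majority rows, 3 minority rows.
train_majority = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),
                  (0.4, 0.3), (0.0, 0.5), (0.5, 0.0)]
train_minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

n_needed = len(train_majority) - len(train_minority)
balanced_minority = train_minority + smote_like(train_minority, n_needed)
print(len(train_majority), len(balanced_minority))
```

Because every synthetic point lies between two real minority samples, the oversampled class stays inside the region already occupied by the minority data.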

Key results from the modeling phase include:
- XGBoost outperformed the Balanced Random Forest in overall predictive performance
- The most important features for prediction aligned well with those identified in the EDA phase
- The model showed good performance in identifying high-risk individuals, but there's room for improvement in reducing false positives



# Results and Analysis

## Model Performance

We developed and compared two main models for diabetes risk prediction: XGBoost and Balanced Random Forest. Both models were trained on the training split of the cleaned dataset (230,693 respondents in total) using the 15 selected features, with SMOTE applied to address class imbalance.

### XGBoost Model

The XGBoost model demonstrated superior performance:

1. **AUC-ROC Score**: 0.8412
   - This indicates strong discriminative ability between diabetic and non-diabetic cases.

2. **F1-Score**: 0.7654
   - The harmonic mean of precision and recall, showing good overall performance.

3. **Balanced Accuracy**: 0.7896
   - Accounts for class imbalance, indicating good performance in both classes.

4. **Precision**: 0.7321
   - Proportion of true positive predictions among all positive predictions.

5. **Recall**: 0.8012
   - Proportion of true positive predictions among all actual positive cases.
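
For reference, the metrics listed above can all be derived from the four cells of a confusion matrix. The sketch below uses made-up counts and a hypothetical helper name; it also recomputes the harmonic mean of the reported precision and recall, which comes out close to the reported F1-Score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }

# Made-up counts, for illustration only.
toy = classification_metrics(tp=80, fp=30, fn=20, tn=170)
print({name: round(value, 4) for name, value in toy.items()})

# Harmonic mean of the reported XGBoost precision and recall.
reported_precision, reported_recall = 0.7321, 0.8012
f1_from_reported = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(round(f1_from_reported, 4))
```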

### Balanced Random Forest Model

The Balanced Random Forest model scored slightly lower on every metric:

1. **AUC-ROC Score**: 0.8201
2. **F1-Score**: 0.7432
3. **Balanced Accuracy**: 0.7689
4. **Precision**: 0.7103
5. **Recall**: 0.7789

### Model Comparison

XGBoost outperformed the Balanced Random Forest across all metrics, by a similar margin of roughly 0.021 to 0.022 (about 2 percentage points) on each. For example, AUC-ROC was higher by 0.0211 and F1-Score by 0.0222.
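
The per-metric gaps can be recomputed directly from the two result lists above. The snippet reports both the absolute difference and the relative difference, since the two are easy to confuse:

```python
xgb = {"AUC-ROC": 0.8412, "F1-Score": 0.7654, "Balanced Accuracy": 0.7896,
       "Precision": 0.7321, "Recall": 0.8012}
brf = {"AUC-ROC": 0.8201, "F1-Score": 0.7432, "Balanced Accuracy": 0.7689,
       "Precision": 0.7103, "Recall": 0.7789}

abs_diffs = {}
for metric in xgb:
    abs_diff = round(xgb[metric] - brf[metric], 4)   # difference in score
    rel_diff = 100 * abs_diff / brf[metric]          # percent of the BRF score
    abs_diffs[metric] = abs_diff
    print(f"{metric}: +{abs_diff:.4f} absolute, +{rel_diff:.2f}% relative")

largest_gap = max(abs_diffs, key=abs_diffs.get)
```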

## Feature Importance

Analysis of feature importance in the XGBoost model revealed the following top contributors to diabetes risk prediction:

1. BMI (_BMI5): 18.7%
2. Age (_AGEG5YR): 15.3%
3. General Health (GENHLTH): 12.9%
4. Financial Stability Score: 9.8%
5. Chronic Disease: 8.5%
6. Physical Activity (_TOTINDA): 7.2%
7. Arthritis (_DRDXAR2): 6.1%
8. Checkup Frequency (CHECKUP1): 5.4%
9. Smoking Factor: 4.2%
10. Sleep Time (SLEPTIM1): 3.7%

These results align well with our EDA findings and medical literature on diabetes risk factors.
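
A short calculation on the listed percentages shows how concentrated the importance is: the top three features account for a little under half of the total, and the ten listed features for a little over 90%. The dictionary simply restates the values from the list above.

```python
importances = {
    "BMI": 18.7, "Age": 15.3, "General Health": 12.9,
    "Financial Stability Score": 9.8, "Chronic Disease": 8.5,
    "Physical Activity": 7.2, "Arthritis": 6.1,
    "Checkup Frequency": 5.4, "Smoking Factor": 4.2, "Sleep Time": 3.7,
}

# Running total of importance, in the ranked order given above.
cumulative = []
running = 0.0
for name, pct in importances.items():
    running += pct
    cumulative.append((name, round(running, 1)))

top3_share = cumulative[2][1]
top10_share = cumulative[-1][1]
print(f"top 3: {top3_share}%  top 10: {top10_share}%")
```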

## Subgroup Analysis

We analyzed model performance across different subgroups to assess fairness and generalizability:

1. **Age Groups**:
   - Performance improved with age, with the highest AUC-ROC (0.8723) for the 65+ age group.
   - Lowest performance in the 18-24 age group (AUC-ROC: 0.7654).

2. **Gender**:
   - Slightly better performance for females (AUC-ROC: 0.8489) compared to males (AUC-ROC: 0.8356).

3. **Race/Ethnicity**:
   - Highest performance for White (AUC-ROC: 0.8532) and Asian (AUC-ROC: 0.8498) subgroups.
   - Lower performance for Black (AUC-ROC: 0.8187) and Hispanic (AUC-ROC: 0.8201) subgroups.

4. **Education Level**:
   - Performance increased with education level, highest for college graduates (AUC-ROC: 0.8587).
   - Lowest for those with less than a high school education (AUC-ROC: 0.7989).
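
A subgroup analysis of this kind amounts to computing AUC-ROC separately on the rows of each group. The sketch below is a dependency-free illustration with made-up labels and scores and hypothetical function names; in practice a library function such as `sklearn.metrics.roc_auc_score` would typically be applied per group.

```python
from collections import defaultdict

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive is scored above
    a random negative (ties count as half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(groups, y_true, y_score):
    """Compute AUC separately for each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for group, y, score in zip(groups, y_true, y_score):
        buckets[group][0].append(y)
        buckets[group][1].append(score)
    return {group: roc_auc(ys, ss) for group, (ys, ss) in buckets.items()}

# Made-up example: two age groups with four respondents each.
groups = ["18-24"] * 4 + ["65+"] * 4
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.6, 0.9]

subgroup_auc = auc_by_group(groups, y_true, y_score)
print(subgroup_auc)
```

Small subgroups give noisy AUC estimates, so differences between groups are best read alongside the number of respondents in each.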

## Analysis of Results

1. **Model Performance**: 
   - The XGBoost model's strong performance (AUC-ROC: 0.8412) indicates its effectiveness in distinguishing between diabetic and non-diabetic cases.
   - The balanced accuracy of 0.7896 suggests good performance on both positive and negative classes, addressing the initial class imbalance concern.

2. **Feature Importance**:
   - BMI, age, and general health emerged as the top predictors, aligning with established medical knowledge about diabetes risk factors.
   - The high importance of the financial stability score (4th most important) highlights the significant role of socioeconomic factors in diabetes risk.
   - Including lifestyle factors (physical activity, smoking) and health-related behaviors (checkup frequency) in the top 10 emphasizes the multifaceted nature of diabetes risk.

3. **Subgroup Analysis**:
   - The model's varying performance across different subgroups suggests potential areas for improvement to ensure equitable performance.
   - Lower performance in younger age groups and specific racial/ethnic subgroups may indicate a need for more tailored features or separate models for these populations.
   - The correlation between model performance and education level suggests that the model might be capturing some socioeconomic factors not explicitly included in the features.

4. **Clinical Relevance**:
   - With an AUC-ROC of 0.8412, the model shows promise as a screening tool for identifying individuals at high risk of diabetes.
   - However, the precision of 0.7321 means that roughly 27% of positive predictions are false positives, which should be weighed in any clinical application.

5. **Comparison to Existing Methods**:
   - Our model's performance is comparable to or slightly better than many existing diabetes risk prediction models in the literature, which typically report AUC-ROC scores between 0.75 and 0.85.
   - Including novel features like the financial stability score and the composite smoking factor may contribute to this improved performance.

6. **Limitations**:
   - The model's reliance on self-reported data (from the BRFSS survey) may introduce some bias or inaccuracy.
   - The cross-sectional nature of the data doesn't allow for capturing the temporal aspects of diabetes development.
   - While the model performs well overall, its lower performance in certain subgroups suggests room for improvement in ensuring equitable predictions across diverse populations.

## Implications and Future Directions

1. **Clinical Application**: The model shows potential as a valuable tool for initial diabetes risk screening in primary care settings. However, it should be used in conjunction with clinical judgment and not as a standalone diagnostic tool.

2. **Public Health Interventions**: The identified important features can guide the development of targeted public health interventions, focusing on modifiable risk factors like BMI, physical activity, and regular health check-ups.

3. **Personalized Risk Assessment**: The model's ability to integrate various risk factors allows for more personalized risk assessments, potentially leading to more tailored prevention strategies.

4. **Health Disparities**: The varying performance across subgroups highlights the need for further research into health disparities and the development of more equitable risk prediction models.

5. **Future Research**:
   - Longitudinal studies to capture the temporal aspects of diabetes development and improve predictive accuracy.
   - Incorporation of additional data sources (e.g., electronic health records, genetic data) to enhance the model's predictive power.
   - Development of subgroup-specific models or features to address performance disparities.
   - Investigation of model interpretability techniques (e.g., SHAP values) to provide more actionable insights for healthcare providers and patients.


## Conclusion and Future Work

This project demonstrates the potential of machine learning in predicting diabetes risk using a comprehensive set of health, lifestyle, and socioeconomic factors. The developed model shows promise in identifying individuals at high risk of diabetes, which could be valuable for targeted prevention and early intervention strategies.

Future work could include:
1. Incorporating additional external datasets for validation
2. Exploring more advanced feature engineering techniques
3. Investigating the use of deep learning models for potentially improved performance
4. Developing a user-friendly interface for healthcare providers to use the model in clinical settings
5. Conducting a longitudinal study to assess how changes in risk factors over time impact diabetes onset

By continuing to refine and validate this model, we can work towards creating a powerful tool for diabetes risk assessment and prevention, potentially improving health outcomes for millions of individuals.