# Machine Learning - University of Waterloo
### Winter 2026
### Group 11


## Team Members
- Nishant Chacko
- Rajat Gusain
- Senay Hagos
- Harold Xue
- Rosy Zhou
- Miguel Morales Gonzalez


---


## 1. Introduction & Dataset Description

#### Project Overview
- **Title**: Predicting Diabetes Risk Using Health Indicators
- **Hypothesis**: Lifestyle and clinical factors (such as exercise frequency, BMI, blood pressure, smoking habits, and diet) significantly influence diabetes risk. Machine learning models can be used to accurately predict whether an individual is diabetic/pre-diabetic, or healthy.

#### Dataset Description
- **Name**: CDC Diabetes Health Indicators Dataset
- **Source**: UCI Machine Learning Repository
- **Link**: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

- **Description**: The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. In total there are over 250,000 instances. The dataset contains 21 features consisting of demographics, lab test results, and answers to health-related survey questions for each patient. The target variable for classification is whether a patient is diabetic/pre-diabetic, or healthy.

- **Variables**: The dataset includes 21 input features, a mixture of integer-valued and binary features. The target variable is binary: 1 = diabetic or pre-diabetic, 0 = healthy. The table below lists the details of every variable.


#### Other Considerations
- The way the target variable is mapped allows for binary classification: 0 = no diabetes, 1 = prediabetes or diabetes



| Variable Name        | Role    | Type    | Demographic       | Description                                                                                                                                                                                                                     | Units | Missing Values |
|----------------------|---------|---------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|---------------|
| ID                   | ID      | Integer |                   | Patient ID                                                                                                                                                                                                                      |       | no            |
| Diabetes_binary      | Target  | Binary  |                   | 0 = no diabetes, 1 = prediabetes or diabetes                                                                                                                                                                                   |       | no            |
| HighBP               | Feature | Binary  |                   | 0 = no high BP, 1 = high BP                                                                                                                                                                                                    |       | no            |
| HighChol             | Feature | Binary  |                   | 0 = no high cholesterol, 1 = high cholesterol                                                                                                                                                                                  |       | no            |
| CholCheck            | Feature | Binary  |                   | 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years                                                                                                                                                     |       | no            |
| BMI                  | Feature | Integer |                   | Body Mass Index                                                                                                                                                                                                                 |       | no            |
| Smoker               | Feature | Binary  |                   | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes                                                                                                                  |       | no            |
| Stroke               | Feature | Binary  |                   | (Ever told) you had a stroke. 0 = no, 1 = yes                                                                                                                                                                                  |       | no            |
| HeartDiseaseorAttack | Feature | Binary  |                   | Coronary heart disease (CHD) or myocardial infarction (MI) 0 = no, 1 = yes                                                                                                                                                     |       | no            |
| PhysActivity         | Feature | Binary  |                   | Physical activity in past 30 days - not including job 0 = no, 1 = yes                                                                                                                                                          |       | no            |
| Fruits               | Feature | Binary  |                   | Consume fruit 1 or more times per day 0 = no, 1 = yes                                                                                                                                                                          |       | no            |
| Veggies              | Feature | Binary  |                   | Consume vegetables 1 or more times per day 0 = no, 1 = yes                                                                                                                                                                     |       | no            |
| HvyAlcoholConsump    | Feature | Binary  |                   | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no, 1 = yes                                                                                              |       | no            |
| AnyHealthcare        | Feature | Binary  |                   | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes                                                                                                             |       | no            |
| NoDocbcCost          | Feature | Binary  |                   | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes                                                                                                          |       | no            |
| GenHlth              | Feature | Integer |                   | Would you say that in general your health is: scale 1-5 (1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor)                                                                                                           |       | no            |
| MentHlth             | Feature | Integer |                   | Mental health not good days in past 30 days: scale 1-30                                                                                                                                                                        |       | no            |
| PhysHlth             | Feature | Integer |                   | Physical health not good days in past 30 days: scale 1-30                                                                                                                                                                      |       | no            |
| DiffWalk             | Feature | Binary  |                   | Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes                                                                                                                                                     |       | no            |
| Sex                  | Feature | Binary  | Sex               | 0 = female, 1 = male                                                                                                                                                                                                            |       | no            |
| Age                  | Feature | Integer | Age               | 13-level age category (_AGEG5YR see codebook): 1 = 18-24, 9 = 60-64, 13 = 80 or older                                                                                                                                           |       | no            |
| Education            | Feature | Integer | Education Level   | Education level (EDUCA see codebook): scale 1-6 (1 = Never attended school or only kindergarten, 2 = Grades 1-8, 3 = Grades 9-11, 4 = Grade 12 or GED, 5 = College 1-3 years, 6 = College 4 years or more)                     |       | no            |
| Income               | Feature | Integer | Income            | Income scale (INCOME2 see codebook): scale 1-8 (1 = less than \$10,000, 5 = less than \$35,000, 8 = \$75,000 or more)                                                                                                          |       | no            |



## 2. Data Understanding & Initial Exploration

#### Loading the Data
- *Loading the Dataset*
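
A minimal loading sketch, assuming the dataset has been exported to a local CSV file. The file name `cdc_diabetes.csv` and the helper `load_diabetes_csv` are hypothetical, and the three rows below are made-up stand-ins so the block runs without the real file.

```python
import io

import pandas as pd


def load_diabetes_csv(source):
    """Read the dataset from a path or file-like object and drop the ID column if present."""
    df = pd.read_csv(source)
    return df.drop(columns=["ID"], errors="ignore")


# Three made-up rows standing in for the real file (hypothetical values).
sample_csv = io.StringIO(
    "ID,Diabetes_binary,HighBP,BMI,Age\n"
    "0,0,1,40,9\n"
    "1,0,0,25,7\n"
    "2,1,1,28,11\n"
)

# In practice this would be something like load_diabetes_csv("cdc_diabetes.csv").
df = load_diabetes_csv(sample_csv)
print(df.shape)
```

Dropping `ID` early keeps a pure row identifier from being treated as a predictive feature later on.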

#### Basic Data Inspection
- *Displaying the first few rows, data types, and basic statistics.*
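
The inspection step could look roughly like the sketch below. The DataFrame is synthetic: column names follow the variable table, but the values are randomly generated placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; the values are made up for illustration only.
df = pd.DataFrame({
    "Diabetes_binary": rng.integers(0, 2, size=100),
    "HighBP": rng.integers(0, 2, size=100),
    "BMI": rng.integers(15, 50, size=100),
    "GenHlth": rng.integers(1, 6, size=100),
})

print(df.head())          # first five rows
print(df.dtypes)          # data type of each column
summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
```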

#### Target Variable Distribution
- *Visualizing and describing the distribution of diabetic/pre-diabetic vs. healthy cases.*
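
One way to describe the target distribution numerically is with absolute counts and class shares, as sketched here. The labels are made up and do not reflect the real class balance.

```python
import pandas as pd

# Made-up labels: 0 = no diabetes, 1 = prediabetes or diabetes.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], name="Diabetes_binary")

counts = y.value_counts().sort_index()                 # rows per class
shares = y.value_counts(normalize=True).sort_index()   # fraction per class

distribution = pd.DataFrame({"count": counts, "share": shares})
print(distribution)
```

The same `counts` series can be passed to a bar plot if a visual is preferred.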

#### Feature Overview
- *Summarizing key features (continuous, categorical, binary).*



## 3. Data Cleaning & Preprocessing

#### Missing Values Analysis
- *Identifying and handling missing data and null values.*
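
The variable table reports no missing values, so this step mainly serves as a confirmation check. A minimal sketch on a made-up frame that deliberately contains one missing value and one duplicated row:

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing BMI and one exact duplicate row.
df = pd.DataFrame({
    "BMI": [27.0, np.nan, 31.0, 31.0],
    "HighBP": [0, 1, 1, 1],
})

missing_per_column = df.isna().sum()           # null count per column
n_duplicates = int(df.duplicated().sum())      # rows identical to an earlier row

# One possible handling strategy: drop incomplete rows, then drop duplicates.
cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
print(missing_per_column, n_duplicates, len(cleaned))
```

Whether duplicates should be dropped is a judgment call: in survey data, identical rows can be distinct respondents.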

#### Outlier Detection
- *Visualizing and addressing outliers in key features (if needed).*
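
A common heuristic for flagging outliers is the 1.5 x IQR rule. The sketch below applies it to made-up BMI values; `iqr_bounds` is a hypothetical helper, and the multiplier `k=1.5` is a convention rather than a requirement.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR outside the first and third quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up BMI values with one extreme entry.
bmi = pd.Series([22, 24, 25, 26, 27, 28, 30, 31, 33, 95], name="BMI")

low, high = iqr_bounds(bmi)
outliers = bmi[(bmi < low) | (bmi > high)]
print(outliers)
```

Flagged values are candidates for review, not automatic removals: an extreme BMI may be a genuine observation.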

#### Feature Scaling
- *Normalizing or standardizing non-binary numeric variables (e.g., BMI, MentHlth, PhysHlth).*
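
A sketch of standardization with scikit-learn's `StandardScaler`, using made-up BMI values. The scaler is fitted on the training portion only and then applied to the test portion, which avoids leaking test statistics into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up BMI values, already split into train and test parts.
X_train = np.array([[20.0], [25.0], [30.0], [35.0]])
X_test = np.array([[27.5], [40.0]])

scaler = StandardScaler().fit(X_train)       # learns mean and std from train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuses the training statistics

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Tree-based models are insensitive to feature scale, so this step matters mostly for logistic regression and SVMs.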

#### Categorical Encoding
- *Encoding categorical variables where needed (e.g., the ordinal features Age, Education, and Income; Sex and Smoker are already binary).*
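
Most columns in the variable table are already numeric, so encoding may not be needed at all. If an ordinal column such as `Education` is to be treated as nominal instead, one-hot encoding is one option. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; Education uses the 1-6 scale from the variable table.
df = pd.DataFrame({"Sex": [0, 1, 1], "Education": [4, 6, 5]})

# One indicator column per observed Education level; Sex is left untouched.
encoded = pd.get_dummies(df, columns=["Education"], prefix="Education", dtype=int)
print(encoded)
```

Keeping the ordinal integers as they are is an equally valid choice and preserves the ordering information.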




## 4. Exploratory Data Analysis (EDA)

#### Univariate Analysis
- *Visualizing distributions of individual features.*

#### Bivariate Analysis
- *Exploring relationships between features and the target variable.*
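
For a binary feature and a binary target, a simple bivariate summary is the share of positive cases within each feature group. A sketch with made-up values:

```python
import pandas as pd

# Made-up rows: four respondents without high BP, four with.
df = pd.DataFrame({
    "HighBP":          [0, 0, 0, 0, 1, 1, 1, 1],
    "Diabetes_binary": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Mean of a 0/1 target equals the fraction of positive cases in each group.
rate_by_bp = df.groupby("HighBP")["Diabetes_binary"].mean()
print(rate_by_bp)
```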

#### Correlation Analysis
- *Showing a correlation matrix and discussing key findings.*
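
A sketch of the correlation step on synthetic data. The `Noise` column and the way the target is generated from `BMI` are assumptions made purely so the example has a visible signal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the target is built to depend on BMI, Noise is unrelated.
bmi = rng.normal(28, 6, size=n)
df = pd.DataFrame({
    "BMI": bmi,
    "Noise": rng.normal(size=n),
    "Diabetes_binary": (bmi + rng.normal(0, 6, size=n) > 32).astype(int),
})

corr = df.corr()  # Pearson correlation matrix

# Correlation of each feature with the target, strongest first by magnitude.
target_corr = (
    corr["Diabetes_binary"]
    .drop("Diabetes_binary")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship.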



## 5. Train-Test Split

#### Data Split
- *Splitting the data into training and test sets (with stratification).*  
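
A stratified split can be sketched with scikit-learn's `train_test_split`. The 80/20 class ratio and the 20% test size below are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 20% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the rows
    stratify=y,         # keep the class ratio the same in both parts
    random_state=42,    # reproducible shuffle
)
print(y_train.mean(), y_test.mean())
```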



## 6. Class Imbalance Handling

#### Resampling / Weighting
- *Analyzing class balance and applying resampling or class weighting (if needed).*
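
Class weighting is one alternative to resampling. The sketch below computes "balanced" weights with scikit-learn on made-up training labels; the 80/20 ratio is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up training labels with a 4:1 imbalance.
y_train = np.array([0] * 80 + [1] * 20)

classes = np.unique(y_train)
# "balanced" assigns n_samples / (n_classes * class_count) to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same formula internally.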


## 7. Feature Engineering & Selection

#### Feature Creation
- *Creating new features if needed and justified (e.g., BMI categories). Optional.*
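
BMI categories could be derived with `pandas.cut`, as sketched below. The cutoffs (18.5, 25, 30) and the category labels are the commonly used adult BMI bands and are an assumption here, not something specified by the dataset.

```python
import pandas as pd

# Made-up BMI values.
bmi = pd.Series([17, 22, 27, 31, 45], name="BMI")

# Assumed bands: [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# right=False makes each bin include its left edge and exclude its right edge.
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_category)
```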

#### Feature Importance
- *Using statistical or model-based methods to assess feature relevance.*
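
As one example of a statistical relevance measure, mutual information between each feature and the target can be estimated with scikit-learn. The data and the column names `informative` and `noise` are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data: one feature tracks the label, the other is pure noise.
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.5, size=500),
    "noise": rng.normal(size=500),
})

# Higher scores indicate stronger (possibly nonlinear) dependence on the target.
scores = pd.Series(
    mutual_info_classif(X, y, random_state=0), index=X.columns
).sort_values(ascending=False)
print(scores)
```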

#### Dimensionality Reduction
- *Applying PCA or similar techniques if needed.*



## 8. Baseline Model Development

#### Logistic Regression (Baseline)
- *Training and evaluating a simple logistic regression model.*

#### Baseline Results
- *Presenting accuracy, precision, recall, F1-score, confusion matrix, ROC-AUC.*
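
The baseline and its metrics could be sketched as follows. The data comes from `make_classification` as a synthetic stand-in for the real features, and the 85/15 class ratio is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)                # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]    # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_prob),  # uses probabilities, not labels
}
cm = confusion_matrix(y_test, y_pred)          # rows: true class, columns: predicted
print(metrics)
print(cm)
```

With an imbalanced target, accuracy alone can look good for a model that rarely predicts the positive class, which is why recall, F1, and ROC-AUC are reported alongside it.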



## 9. Advanced Model Development

#### Random Forest / Gradient Boosting
- *Training and tuning ensemble models.*

#### Support Vector Machine (Optional)
- *Training and evaluating an SVM if dataset size allows.*

#### Hyperparameter Tuning
- *Using cross-validation and grid/random search for best parameters.*
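
A small grid search over a random forest, as a sketch. The parameter grid, the 3-fold setup, and the ROC-AUC scoring are illustrative assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data, kept small so the search runs quickly.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # 4 combinations
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)  # fits every combination on every fold, then refits the best one

print(search.best_params_, search.best_score_)
```

On a dataset with over 250,000 rows, `RandomizedSearchCV` or tuning on a subsample may be more practical than an exhaustive grid.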

#### Model Comparison
- *Comparing all models using consistent metrics.*
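
One way to keep the comparison consistent is to evaluate every model on the same split with the same metric functions and collect the results in one table. A sketch on synthetic data; the model set and default hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a single shared split.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]
    rows.append({
        "model": name,
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, prob),
    })

# One row per model, best ROC-AUC first.
results = pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)
print(results)
```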



## 10. Model Interpretation & Feature Importance

#### Feature Importance Visualization
- *Plotting and interpreting feature importances from ensemble models.*
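
Impurity-based importances from a random forest can be extracted as sketched below. The data is synthetic, the `feature_i` names are hypothetical, and `shuffle=False` is used so that the informative columns are known to come first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, columns 0 and 1 are informative, the rest are noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

The sorted series can be passed to a horizontal bar plot. Impurity-based importances can favor high-cardinality features, so permutation importance is a useful cross-check.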

#### Interpretation Tools
- *Using SHAP or LIME for deeper insights (optional).*



## 11. Model Validation & Robustness

#### Cross-Validation Results
- *Presenting k-fold cross-validation scores for top models.*
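
A sketch of stratified k-fold cross-validation with scikit-learn. The choice of 5 folds, F1 scoring, and logistic regression as the model are assumptions for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(scores)                       # one F1 score per fold
print(scores.mean(), scores.std())  # summary across folds
```

Reporting the standard deviation alongside the mean gives a sense of how stable the score is across folds.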

#### Overfitting Check
- *Comparing train vs. test performance.*
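
The train/test gap can be made concrete with a model that is known to overfit easily. The sketch below uses an unrestricted decision tree and a depth-limited one as hypothetical stand-ins for the project's models, on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gaps = {}
for name, depth in {"unrestricted": None, "max_depth_3": 3}.items():
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    # A large positive gap suggests the model memorized the training data.
    gaps[name] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}

print(gaps)
```

An unrestricted tree typically fits the training set perfectly, so its gap shows how much of that fit fails to generalize.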



## 12. Final Evaluation & Model Selection

#### Model Selection
- *Justifying the final model choice based on performance and interpretability.*

#### Summary Table
- *Summarizing all model results in a table.*



## 13. Conclusions & Recommendations

#### Key Findings
- *Summarizing what you learned about diabetes risk factors and model performance.*

#### Limitations
- *Discussing any limitations of your analysis or data.*

#### Future Work
- *Suggesting possible extensions or improvements.*



## 14. References

- *Data sources, libraries, and any external resources used.*
