# Machine Learning - University of Waterloo
### Winter 2026
### Group 11


## Team Members
- Nishant Chacko
- Rajat Gusain
- Senay Hagos
- Harold Xue
- Rosy Zhou
- Miguel Morales Gonzalez


---


## 1. Introduction & Dataset Description

#### Project Overview
- **Title**: Predicting Diabetes Risk Using Health Indicators
- **Hypothesis**: Lifestyle and clinical factors (such as exercise frequency, BMI, blood pressure, smoking habits, and diet) significantly influence diabetes risk. Machine learning models can be used to accurately predict whether an individual is diabetic/pre-diabetic, or healthy.

#### Dataset Description
- **Name**: CDC Diabetes Health Indicators Dataset
- **Source**: UCI Machine Learning Repository
- **Link**: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators
- **Note**: Of the three files available, this project uses `diabetes_binary_health_indicators_BRFSS2015.csv`, a clean dataset of 253,680 survey responses to the CDC's BRFSS 2015. The target variable `Diabetes_binary` has two classes: 0 for no diabetes and 1 for prediabetes or diabetes. This dataset has 21 feature variables and is not balanced.

- **Description**: The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey responses from the general population, along with each respondent's diabetes diagnosis. It has over 250,000 instances and 21 features covering demographics, lab test results, and answers to health-related survey questions. The classification target is whether a respondent is diabetic/pre-diabetic or healthy.

- **Variables**: The dataset includes 21 input features, a mixture of integer-valued and binary features. The target variable is 1 for diabetic/pre-diabetic and 0 for healthy. The table below lists every variable in detail.


#### Other Considerations
- The way the target variable is mapped allows for binary classification: 0 = no diabetes, 1 = prediabetes or diabetes



| Variable Name        | Role    | Type    | Demographic       | Description                                                                                                                                                                                                                     | Units | Missing Values |
|----------------------|---------|---------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|---------------|
| ID                   | ID      | Integer |                   | Patient ID                                                                                                                                                                                                                      |       | no            |
| Diabetes_binary      | Target  | Binary  |                   | 0 = no diabetes, 1 = prediabetes or diabetes                                                                                                                                                                                   |       | no            |
| HighBP               | Feature | Binary  |                   | 0 = no high BP, 1 = high BP                                                                                                                                                                                                    |       | no            |
| HighChol             | Feature | Binary  |                   | 0 = no high cholesterol, 1 = high cholesterol                                                                                                                                                                                  |       | no            |
| CholCheck            | Feature | Binary  |                   | 0 = no cholesterol check in 5 years, 1 = yes cholesterol check in 5 years                                                                                                                                                     |       | no            |
| BMI                  | Feature | Integer |                   | Body Mass Index                                                                                                                                                                                                                 |       | no            |
| Smoker               | Feature | Binary  |                   | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes                                                                                                                  |       | no            |
| Stroke               | Feature | Binary  |                   | (Ever told) you had a stroke. 0 = no, 1 = yes                                                                                                                                                                                  |       | no            |
| HeartDiseaseorAttack | Feature | Binary  |                   | Coronary heart disease (CHD) or myocardial infarction (MI) 0 = no, 1 = yes                                                                                                                                                     |       | no            |
| PhysActivity         | Feature | Binary  |                   | Physical activity in past 30 days - not including job 0 = no, 1 = yes                                                                                                                                                          |       | no            |
| Fruits               | Feature | Binary  |                   | Consume fruit 1 or more times per day 0 = no, 1 = yes                                                                                                                                                                          |       | no            |
| Veggies              | Feature | Binary  |                   | Consume vegetables 1 or more times per day 0 = no, 1 = yes                                                                                                                                                                     |       | no            |
| HvyAlcoholConsump    | Feature | Binary  |                   | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no, 1 = yes                                                                                              |       | no            |
| AnyHealthcare        | Feature | Binary  |                   | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no, 1 = yes                                                                                                             |       | no            |
| NoDocbcCost          | Feature | Binary  |                   | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no, 1 = yes                                                                                                          |       | no            |
| GenHlth              | Feature | Integer |                   | Would you say that in general your health is: scale 1-5 (1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor)                                                                                                           |       | no            |
| MentHlth             | Feature | Integer |                   | Number of days in the past 30 days when mental health was not good: 0-30                                                                                                                                                       |       | no            |
| PhysHlth             | Feature | Integer |                   | Number of days in the past 30 days when physical health was not good: 0-30                                                                                                                                                     |       | no            |
| DiffWalk             | Feature | Binary  |                   | Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes                                                                                                                                                     |       | no            |
| Sex                  | Feature | Binary  | Sex               | 0 = female, 1 = male                                                                                                                                                                                                            |       | no            |
| Age                  | Feature | Integer | Age               | 13-level age category (_AGEG5YR see codebook): 1 = 18-24, 9 = 60-64, 13 = 80 or older                                                                                                                                           |       | no            |
| Education            | Feature | Integer | Education Level   | Education level (EDUCA see codebook): scale 1-6 (1 = Never attended school or only kindergarten, 2 = Grades 1-8, 3 = Grades 9-11, 4 = Grade 12 or GED, 5 = College 1-3 years, 6 = College 4 years or more)                     |       | no            |
| Income               | Feature | Integer | Income            | Income scale (INCOME2 see codebook): scale 1-8 (1 = less than \$10,000, 5 = less than \$35,000, 8 = \$75,000 or more)                                                                                                          |       | no            |



## 2. Data Loading & Initial Exploration

#### Load Data
- Import the dataset into your environment.

#### Inspect Structure
- Review column names, data types, and overall schema.

#### Check Target Distribution
- Analyze the balance of classes (diabetes vs. healthy).

> **Purpose**: Understand the dataset’s format, feature types, and class balance to plan the workflow and spot potential issues early.


The dataset has been downloaded and copied to the `Data` folder. The file is named `diabetes_binary_health_indicators_BRFSS2015.csv`.
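
The three steps in this section could be sketched as follows, assuming pandas is available. The helper names `load_dataset` and `target_distribution` are hypothetical, the relative `Data/` path is an assumption based on the note above, and the small `toy` frame is made-up data so that the snippet runs without the CSV.

```python
import pandas as pd

# Assumed location, based on the note above; adjust to your own folder layout.
DATA_PATH = "Data/diabetes_binary_health_indicators_BRFSS2015.csv"


def load_dataset(path: str = DATA_PATH) -> pd.DataFrame:
    """Read the CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


def target_distribution(df: pd.DataFrame, target: str = "Diabetes_binary") -> pd.DataFrame:
    """Return per-class counts and proportions of the target column."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Made-up stand-in rows so the sketch runs without the real file.
toy = pd.DataFrame({
    "Diabetes_binary": [0, 0, 0, 1],
    "HighBP": [0, 1, 0, 1],
    "BMI": [22, 31, 27, 35],
})

print(toy.dtypes)                # column names and data types
dist = target_distribution(toy)  # class balance of the target
print(dist)
```

With the real file in place, `df = load_dataset()` followed by `target_distribution(df)` would give the class balance that later sections rely on.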


## 3. Pre-Split Data Cleaning

Perform only operations that do **not** learn from the dataset and do not use the target variable.

#### 3.1 Schema & Types
- Rename columns, set data types, standardize units (e.g., BMI), parse dates.

#### 3.2 Deterministic Cleaning
- Trim strings, fix coding errors (e.g., “Yes/No” → 1/0), normalize category labels, drop duplicate rows.

#### 3.3 Domain-Rule Outlier Caps (Optional)
- Apply clinical/domain limits (e.g., BMI ∈ [10, 80]) using fixed rules.

#### 3.4 Target Column Checkup
- Check if the Target Column format and contents are as expected.

> **Purpose**: Improve data quality without introducing leakage and ensure reproducibility.
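
A minimal sketch of such leakage-free cleaning, assuming pandas. The function `clean_pre_split` and its `yes_no_cols` parameter are hypothetical, and the `toy` frame is made-up data chosen to exercise each rule; on the real CSV, which is described as clean, some of these steps may turn out to be no-ops.

```python
import pandas as pd

# Fixed domain-rule limits taken from the text above; nothing is learned from the data.
BMI_MIN, BMI_MAX = 10, 80


def clean_pre_split(df: pd.DataFrame, yes_no_cols=()) -> pd.DataFrame:
    """Deterministic cleaning that uses neither the target nor data statistics."""
    out = df.copy()
    # 3.1 Schema: trim stray whitespace from column names.
    out.columns = [c.strip() for c in out.columns]
    # 3.2 Coding errors: map "Yes"/"No" strings to 1/0.
    # The int64 cast fails loudly if an unexpected label is present.
    for col in yes_no_cols:
        normalized = out[col].astype(str).str.strip().str.lower()
        out[col] = normalized.map({"yes": 1, "no": 0}).astype("int64")
    # 3.3 Domain-rule caps for BMI.
    out["BMI"] = out["BMI"].clip(lower=BMI_MIN, upper=BMI_MAX)
    # 3.2 Drop exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)


# Made-up rows: a padded column name, Yes/No strings, out-of-range BMI, one duplicate.
toy = pd.DataFrame({
    " BMI": [24, 95, 24, 8],
    "Smoker": ["Yes", "no ", "Yes", "No"],
    "Diabetes_binary": [0, 1, 0, 0],
})

cleaned = clean_pre_split(toy, yes_no_cols=["Smoker"])

# 3.4 Target checkup: only the two expected labels may appear.
target_ok = set(cleaned["Diabetes_binary"].unique()) <= {0, 1}
print(cleaned)
print("target ok:", target_ok)
```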



## 4. Pre-Split Exploratory Data Analysis (EDA)

#### Purpose
- Understand feature distributions, correlations, and potential issues before splitting.

#### Key Actions
- **Univariate Analysis**: Histograms, boxplots for numeric features.
- **Categorical Analysis**: Frequency tables and bar charts.
- **Correlation Check**: Heatmap for numeric features.
- **Missing Values**: Identify patterns and proportions.

> **Note**: Avoid using the target variable for any insights that could cause leakage.
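
As a rough, non-graphical sketch of these checks, assuming pandas and NumPy: the helpers `feature_summary` and `feature_correlations` are hypothetical names, both drop the target before computing anything, and `toy` is made-up data. Histograms, boxplots, and a heatmap would be drawn from the same frames.

```python
import numpy as np
import pandas as pd


def feature_summary(df: pd.DataFrame, target: str = "Diabetes_binary") -> pd.DataFrame:
    """Univariate statistics and missing-value share, excluding the target."""
    features = df.drop(columns=[target])
    summary = features.describe().T[["mean", "std", "min", "max"]]
    return summary.assign(missing_frac=features.isna().mean())


def feature_correlations(df: pd.DataFrame, target: str = "Diabetes_binary") -> pd.DataFrame:
    """Pairwise Pearson correlations between features only."""
    return df.drop(columns=[target]).corr()


# Made-up rows, including one missing Age value.
toy = pd.DataFrame({
    "BMI": [22, 31, 27, 35],
    "HighBP": [0, 1, 0, 1],
    "Age": [3, 9, 5, np.nan],
    "Diabetes_binary": [0, 1, 0, 1],
})

summary = feature_summary(toy)
corr = feature_correlations(toy)
print(summary)
print(corr.round(2))
```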



## 5. Train-Test Split

#### Strategy
- **Split Ratio**: Typically 80/20 or 70/30.
- **Stratification**: Stratify on the target so both sets preserve the original class proportions.
- **Random State**: Fix seed for reproducibility.

#### Output
- `X_train`, `X_test`, `y_train`, `y_test`

> **Purpose**: Create unbiased sets for training and evaluation.
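
One way to realize this strategy is scikit-learn's `train_test_split`. The sketch below assumes an 80/20 split and a seed of 42, both arbitrary choices, and runs on a synthetic stand-in frame with a made-up 15% positive rate rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with 150 positives (made-up numbers).
n_rows, n_pos = 1000, 150
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "BMI": rng.integers(15, 50, size=n_rows),
    "HighBP": rng.integers(0, 2, size=n_rows),
    "Diabetes_binary": np.r_[np.ones(n_pos, dtype=int), np.zeros(n_rows - n_pos, dtype=int)],
})

X = df.drop(columns=["Diabetes_binary"])
y = df["Diabetes_binary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # 80/20 split
    stratify=y,        # preserve class proportions in both sets
    random_state=42,   # fixed seed for reproducibility
)

# Both sets keep the positive rate of the full synthetic frame.
print(y_train.mean(), y_test.mean())
```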



## 6. Post-Split Preprocessing

#### 6.1 Missing Value Handling
- Impute using training set statistics only.

#### 6.2 Scaling & Normalization
- Apply transformations (e.g., StandardScaler) fitted on training data.

#### 6.3 Encoding
- One-hot or ordinal encoding for categorical features.

#### 6.4 Outlier Handling (Optional)
- Apply caps or transformations based on domain knowledge or robust methods (e.g., winsorization).

#### 6.5 Feature Selection (Optional)
- Drop low-variance or irrelevant features based on training data only.

> **Purpose**: Prepare data for modeling while preventing data leakage.
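
A hedged sketch of leakage-safe preprocessing with a scikit-learn `ColumnTransformer`. The grouping of columns into numeric, categorical, and binary is an assumption made for illustration (the ordinal features in this dataset are already integer-coded, so one-hot encoding is optional), and the frames are made-up data. The key point is that `fit` sees only the training set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column grouping, for illustration only.
numeric_cols = ["BMI"]
categorical_cols = ["GenHlth"]
binary_cols = ["HighBP"]

preprocessor = ColumnTransformer(
    transformers=[
        # 6.1 + 6.2: median imputation, then standardization.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # 6.3: one-hot encoding; unseen categories become all-zero rows.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("bin", "passthrough", binary_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Made-up training and test rows.
X_train = pd.DataFrame({"BMI": [20.0, 30.0, np.nan, 40.0],
                        "GenHlth": [1, 3, 3, 5],
                        "HighBP": [0, 1, 1, 0]})
X_test = pd.DataFrame({"BMI": [np.nan, 30.0],
                       "GenHlth": [3, 2],
                       "HighBP": [1, 0]})

Xt_train = preprocessor.fit_transform(X_train)  # statistics learned here only
Xt_test = preprocessor.transform(X_test)        # reused unchanged on the test set
print(Xt_test)
```

The missing test BMI is filled with the training median (30), which is also the training mean after imputation, so it scales to 0. The test value `GenHlth = 2` never appears in training and is encoded as all zeros.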



## 7. Model-Driven EDA & Diagnostics

Again, all the analysis in this section will be done on **training data only**.

#### Assess Missingness & Outliers
- Calculate missing and outlier statistics using only training data.

#### Correlation & Multicollinearity Analysis
- Examine correlations to understand relationships and decide if PCA or other dimensionality reduction is needed.

#### Feature Importance & Target Relations
- Use techniques like univariate correlation or mutual information on training data to inform feature pruning later (if needed).

> **Goal**: Validate assumptions and guide feature engineering without introducing leakage.
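
The two diagnostics could look roughly like this, assuming scikit-learn's `mutual_info_classif` for target relations and a simple correlation threshold for multicollinearity. The 0.9 cutoff, the column names, and the synthetic training data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic training data: one feature tracks the target with 10% label noise,
# one is an exact copy of it, and one is pure noise.
rng = np.random.default_rng(0)
n = 2000
y_train = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
informative = np.where(flip, 1 - y_train, y_train)
X_train = pd.DataFrame({
    "informative": informative,
    "informative_copy": informative,
    "noise": rng.integers(0, 2, size=n),
})

# Mutual information between each feature and the target (training data only).
mi = pd.Series(
    mutual_info_classif(X_train, y_train, discrete_features=True, random_state=0),
    index=X_train.columns,
).sort_values(ascending=False)

# Flag highly correlated feature pairs as multicollinearity candidates.
corr = X_train.corr().abs()
cols = list(corr.columns)
high_corr_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.9  # assumed threshold
]

print(mi)
print(high_corr_pairs)
```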



## 8. Class Imbalance Handling

#### Techniques
- **Resampling**: Oversample minority class or undersample majority.
- **Synthetic Data**: Use SMOTE or similar methods.
- **Class Weights**: Adjust model training to penalize imbalance.

#### When to Apply
- After the train-test split and before model training, on the training set only, so the test set keeps its original class distribution.

> **Purpose**: Ensure fair model performance across classes.
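
Two of these techniques can be sketched with scikit-learn alone: balanced class weights and random oversampling of the minority class. SMOTE would need the separate imbalanced-learn package and is not shown. The arrays below are made-up training data with an 8:2 class ratio.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Made-up training set: 8 negatives, 2 positives.
y_train = np.array([0] * 8 + [1] * 2)
X_train = np.arange(20).reshape(10, 2)

# Option 1: class weights, n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)
class_weight = dict(zip([0, 1], weights))

# Option 2: random oversampling of the minority class until both classes match.
minority = X_train[y_train == 1]
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_train, extra])
y_bal = np.concatenate([y_train, np.ones(n_extra, dtype=int)])

print(class_weight)
print(np.bincount(y_bal))
```

The `class_weight` dictionary can be passed to estimators that accept a `class_weight` argument, such as `LogisticRegression` or `RandomForestClassifier`, which avoids changing the data at all.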



## 9. Feature Engineering

#### Purpose
- Enhance predictive power by creating new features or transforming existing ones.

#### Key Actions
- **Domain Features**: Combine or derive clinically relevant indicators.
- **Polynomial / Interaction Terms**: Optional for non-linear relationships.
- **Dimensionality Reduction**: PCA or similar methods (optional).
- **Encoding Refinement**: Ensure categorical variables are properly represented.

> **Note**: Fit transformations on the training data only, then apply the fitted transformations to the test data.
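
A small sketch of two of these actions, assuming pandas and scikit-learn. The derived column `CardioRiskCount` is a hypothetical indicator invented for illustration, not a feature the project has committed to, and the frames are made-up data.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures


def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a hypothetical count of cardiovascular risk flags (row-wise, stateless)."""
    out = df.copy()
    out["CardioRiskCount"] = out[["HighBP", "HighChol", "HeartDiseaseorAttack"]].sum(axis=1)
    return out


# Made-up training and test rows.
X_train = pd.DataFrame({"BMI": [22, 35, 28],
                        "HighBP": [0, 1, 1],
                        "HighChol": [0, 1, 0],
                        "HeartDiseaseorAttack": [0, 1, 0]})
X_test = pd.DataFrame({"BMI": [30, 25],
                       "HighBP": [1, 0],
                       "HighChol": [1, 0],
                       "HeartDiseaseorAttack": [0, 0]})

X_train_fe = add_domain_features(X_train)
X_test_fe = add_domain_features(X_test)

# Interaction term BMI x HighBP: fitted on training data, then applied to test data.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
train_inter = poly.fit_transform(X_train_fe[["BMI", "HighBP"]])
test_inter = poly.transform(X_test_fe[["BMI", "HighBP"]])

print(X_test_fe["CardioRiskCount"].tolist())
print(test_inter)
```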



## 10. Baseline Model

#### Objective
- Establish a simple benchmark for performance comparison.

#### Steps
- **Model Choice**: Logistic Regression or Decision Tree.
- **Training**: Fit on preprocessed training data.
- **Evaluation**: Accuracy, Precision, Recall, F1-score on test set.
- **Interpretation**: Review coefficients or feature importance.

> **Purpose**: Provide a reference point before implementing complex models.
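
The baseline steps might be sketched as below with logistic regression. The data come from `make_classification` as a synthetic stand-in with an assumed 85/15 class ratio, so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the preprocessed data.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is fitted on training data only.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}

# Interpretation: one coefficient per (standardized) feature.
coefficients = baseline.named_steps["logisticregression"].coef_.ravel()

print(metrics)
print(coefficients)
```

On imbalanced data, accuracy alone can look strong while recall on the minority class stays low, which is why the four metrics are reported together.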



## 11. Advanced Models

#### Options
- **Random Forest / Gradient Boosting**: For improved accuracy.
- **Hyperparameter Tuning**: GridSearchCV or RandomizedSearchCV.
- **Cross-Validation**: Ensure robust performance estimates.
- **Model Comparison**: Evaluate against baseline using consistent metrics.

#### Output
- Best-performing model with tuned parameters.

> **Goal**: Achieve optimal predictive performance while maintaining interpretability.
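
A minimal tuning sketch using `GridSearchCV` with stratified folds around a random forest. The parameter grid, the F1 scoring choice, and the synthetic data are assumptions kept deliberately small so the example runs quickly; a real search would use a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the training set.
X_train, y_train = make_classification(n_samples=600, n_features=5, n_informative=3,
                                       n_redundant=0, weights=[0.85, 0.15],
                                       random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},  # assumed grid
    scoring="f1",  # assumed metric; more informative than accuracy under imbalance
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # refitted on the full training set
print(search.best_params_)
print(round(search.best_score_, 3))
```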



## 12. Model Interpretation & Validation

#### Interpretation
- **Feature Importance**: Use SHAP, permutation importance, or model-specific methods.
- **Partial Dependence**: Visualize feature effects.

#### Validation
- **Hold-Out Test**: Evaluate on test set.
- **Cross-Validation Metrics**: Confirm stability.
- **Error Analysis**: Identify patterns in misclassifications.

> **Purpose**: Ensure the model is both accurate and explainable.
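
Permutation importance and a confusion matrix cover the model-agnostic part of this section without extra dependencies; SHAP would need its own package and is not shown. In the synthetic data below, the first two columns are informative by construction and the feature names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative columns first, followed by two noise columns.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ["informative_0", "informative_1", "noise_0", "noise_1"]  # made-up names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Drop in held-out score when each feature is shuffled, averaged over repeats.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)

# Error analysis: confusion matrix plus the misclassified rows for inspection.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
misclassified = X_test[y_pred != y_test]

print(ranking)
print(cm)
```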



## 13. Final Model Selection

#### Criteria
- **Performance Metrics**: Accuracy, F1-score, ROC-AUC.
- **Interpretability**: Prefer models that balance complexity and clarity.
- **Generalization**: Confirm no overfitting via validation checks.

#### Deliverable
- Selected model with justification for choice.
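
One possible way to line candidates up against these criteria is `cross_validate` with several metrics on the training data. The candidate set, the choice of ROC-AUC as the deciding metric, and the synthetic data are assumptions; the train-minus-test AUC gap is a rough overfitting signal and not a formal test.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the training set.
X_train, y_train = make_classification(n_samples=600, n_features=5, n_informative=3,
                                       n_redundant=0, weights=[0.85, 0.15],
                                       random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scoring = ["accuracy", "f1", "roc_auc"]

rows = {}
for name, model in candidates.items():
    cv = cross_validate(model, X_train, y_train, cv=5,
                        scoring=scoring, return_train_score=True)
    row = {metric: cv[f"test_{metric}"].mean() for metric in scoring}
    # A large gap between training and validation AUC hints at overfitting.
    row["auc_gap"] = cv["train_roc_auc"].mean() - cv["test_roc_auc"].mean()
    rows[name] = row

results = pd.DataFrame(rows).T
best_name = results["roc_auc"].idxmax()

print(results.round(3))
print("selected:", best_name)
```

The table makes the justification explicit: a model with a slightly lower score but a much smaller gap, or one that is easier to explain, can still be the better choice.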



## 14. Conclusions & Recommendations

#### Key Findings
- *Summarizing what you learned about diabetes risk factors and model performance.*

#### Limitations
- *Discussing any limitations of your analysis or data.*

#### Future Work
- *Suggesting possible extensions or improvements.*



## 15. References

- *Data sources, libraries, and any external resources used.*
