Model Card: Diabetes Risk Classifier

**1. Model Details**
Model Name: Diabetes Risk Predictor v1.0

Version: 1.0

Model Type: Binary classification model using a k-nearest neighbors (KNN) classifier.

Creators: Juan Rivera Medina

Creation Date: August 05, 2025

Purpose: To assess an individual's risk of developing Type 2 Diabetes based on clinical and demographic features.

**2. Intended Use**
Primary Use Case: This model is intended as a decision-support tool for healthcare providers (e.g., general practitioners, endocrinologists, dietitians) to identify individuals who may be at high risk of developing Type 2 Diabetes.

Target Users: Clinicians and public health professionals.

Benefits:

Facilitate early intervention strategies (e.g., lifestyle modifications, closer monitoring).

Prioritize patient education and preventative care.

Support personalized risk assessment in routine check-ups.

Out-of-Scope Use Cases: This model is not intended for self-diagnosis by patients or as a sole diagnostic tool. It should not replace professional medical judgment or comprehensive clinical evaluation.

**3. Training Data**
Dataset Name: diabetes_012_health_indicators_BRFSS2015

Source: 

Data Size: 253,680 instances and 23 features

Features Used:

- ID: Patient ID
- Diabetes_binary: Has diabetes (0 = no diabetes, 1 = prediabetes or diabetes)
- HighBP: High blood pressure (0 = no high BP, 1 = high BP)
- HighChol: High cholesterol (0 = no high cholesterol, 1 = high cholesterol)
- CholCheck: Has had a cholesterol check in the last 5 years? (0 = no cholesterol check in 5 years, 1 = yes, cholesterol check in 5 years)
- BMI: Body Mass Index
- Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] (0 = no, 1 = yes)
- Stroke: (Ever told) you had a stroke. (0 = no, 1 = yes)
- HeartDiseaseorAttack: Coronary heart disease (CHD) or myocardial infarction (MI) (0 = no, 1 = yes)
- PhysActivity: Physical activity in past 30 days, not including job (0 = no, 1 = yes)
- Fruits: Consume fruit 1 or more times per day (0 = no, 1 = yes)
- Veggies: Consume vegetables 1 or more times per day (0 = no, 1 = yes)
- HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) (0 = no, 1 = yes)
- AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. (0 = no, 1 = yes)
- NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? (0 = no, 1 = yes)
- GenHlth: Would you say that in general your health is: scale 1-5 (1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor)
- MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 1-30 days
- PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 1-30 days
- DiffWalk: Do you have serious difficulty walking or climbing stairs? (0 = no, 1 = yes)
- Sex: Sex (0 = female, 1 = male)
- Age: 13-level age category (_AGEG5YR, see codebook) (1 = 18-24, 9 = 60-64, 13 = 80 or older)
- Education: Education level (EDUCA, see codebook), scale 1-6 (1 = Never attended school or only kindergarten, 2 = Grades 1 through 8 (Elementary), 3 = Grades 9 through 11 (Some high school), 4 = Grade 12 or GED (High school graduate), 5 = College 1 year to 3 years (Some college or technical school), 6 = College 4 years or more (College graduate))
- Income: Income scale (INCOME2, see codebook), scale 1-8 (1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more)

**Preprocessing:**

Missing Values: No missing values were found in this dataset.

Outliers: Outliers were identified in Age, BMI, and Income, and the rows containing them were removed.
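The card does not say which outlier rule was applied. As a minimal sketch, assuming the common 1.5 × IQR rule (an assumption, not the documented method), row removal could look like the following. The helper names and the sample rows are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outlier_rows(rows, columns, k=1.5):
    """Remove every row that has an out-of-bounds value in any listed column."""
    # Bounds are computed once, on the unfiltered data.
    bounds = {c: iqr_bounds([r[c] for r in rows], k) for c in columns}
    return [
        r for r in rows
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)
    ]

# Hypothetical rows, for illustration only.
sample_rows = [
    {"BMI": 22, "Age": 5}, {"BMI": 24, "Age": 6}, {"BMI": 25, "Age": 7},
    {"BMI": 27, "Age": 8}, {"BMI": 28, "Age": 9}, {"BMI": 30, "Age": 10},
    {"BMI": 31, "Age": 11}, {"BMI": 95, "Age": 9},
]
filtered_rows = drop_outlier_rows(sample_rows, ["BMI", "Age"])
```

In this toy sample only the row with a BMI of 95 falls outside the bounds, so it is the only row dropped.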

Feature Scaling: Features are either binary (0 or 1) or integer-valued, with ranges between 0 and 100. Min-max normalization was applied to normalize the variables.
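Because a k-nearest neighbors classifier compares records by distance, a feature with a wide range such as BMI would otherwise dominate the binary features. A minimal sketch of min-max normalization follows; the function name and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical BMI values, for illustration only.
bmi = [18.0, 25.0, 32.0, 46.0]
scaled_bmi = min_max_scale(bmi)
```

In practice the minimum and maximum would usually be taken from the training split only and then reused for the evaluation split, so that no information from held-out data leaks into preprocessing.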

**Limitations of Training Data:**

Demographic Specificity: The data were collected across the United States.

Temporal Relevance: Data collected in 2015, as indicated by the dataset name (BRFSS2015).

Feature Completeness: The dataset lacks potentially relevant features such as detailed dietary habits, detailed physical activity levels, and genetic markers. It also does not describe how education or income relates to physical condition.

**4. Evaluation Data**
Dataset Name: CDC Diabetes Health Indicators

Source: Same source as the training data

Data Size: 30% of the total data, held out for evaluation.

Evaluation Methodology: The model was evaluated on a held-out test set that was not used during training or hyperparameter tuning. Performance metrics were calculated after applying the optimal classification threshold.
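A hold-out split of this kind can be sketched as below. Stratifying by the target label is an assumption here (the card does not say whether the split was stratified), chosen because it keeps the diabetes prevalence similar in both partitions. The function name, seed, and toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy labels with 20% positives, for illustration only.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

The test indices are set aside before any hyperparameter tuning, so choices such as the number of neighbors are made using the training partition alone.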

**5. Performance Metrics**
Performance was evaluated using a combination of metrics crucial for medical classification, considering the potential imbalance in diabetes prevalence.

Accuracy: 

Sensitivity (Recall): 

Specificity:

Precision: 

F1-Score: 

ROC AUC (Receiver Operating Characteristic Area Under Curve): 

Optimal Threshold: 
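The metrics listed above can be computed from true labels and predicted scores as in the following sketch. The threshold of 0.5 is only a placeholder, since the card does not state the optimal threshold or how it was selected. The function names and the toy data are hypothetical.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Compute threshold-dependent metrics for a binary classifier."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Avoid ZeroDivisionError when a class or prediction is absent.
        return num / den if den else 0.0

    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }

def roc_auc(y_true, y_score):
    """ROC AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
metrics = classification_metrics(y_true, y_score, threshold=0.5)
auc = roc_auc(y_true, y_score)
```

The pairwise AUC computation is quadratic in the number of records and is meant only to make the definition concrete; a rank-based implementation would be more suitable for a dataset of this size.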

**6. Ethical Considerations & Bias**
Potential Biases:

Demographic Bias: As noted in training data limitations, the model's performance may degrade or exhibit bias when applied to populations outside the US population due to differences in genetic predisposition, environmental factors, or healthcare access.

Fairness: Continuous monitoring and re-evaluation of model performance across diverse demographic subgroups (if such data becomes available) are crucial to ensure equitable risk prediction.
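One simple way to carry out such a subgroup check is to compute sensitivity separately for each group, for example by the Sex feature. The sketch below is illustrative only; the function name and the toy data are hypothetical, and the card does not prescribe a specific fairness metric.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Return sensitivity (recall) for each subgroup that has positive cases."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            if p == 1:
                true_pos[g] += 1
    return {g: true_pos[g] / positives[g] for g in positives}

# Toy data, for illustration only (Sex: 0 = female, 1 = male).
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
sex = [0, 0, 0, 1, 1, 1]
per_group_recall = recall_by_group(y_true, y_pred, sex)
```

A large gap between groups would suggest that at-risk individuals in one subgroup are missed more often than in another and that the model warrants closer review before use with that subgroup.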

Transparency: The model's predictions should always be presented with confidence scores and, ideally, with explanations (e.g., feature importance) to aid clinician understanding.
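For a k-nearest neighbors classifier, a natural confidence score is the fraction of the k nearest training records that are labeled positive. The sketch below assumes Euclidean distance on already-scaled features and k = 5 by default; neither choice is stated in this card, and the function name and toy data are hypothetical.

```python
import math

def knn_positive_fraction(train_X, train_y, query, k=5):
    """Return the share of the k nearest training points with label 1."""
    # Sort training points by Euclidean distance to the query (assumed metric).
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy, already-scaled feature vectors, for illustration only.
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (0.9, 1.0)]
train_y = [0, 0, 0, 1, 1]
score = knn_positive_fraction(train_X, train_y, query=(0.8, 0.9), k=3)
```

With a small k the score can take only a few distinct values (multiples of 1/k), which is worth keeping in mind when presenting it to clinicians as a confidence level.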

Responsible Use: Emphasize that the model is a tool to assist, not replace, human medical professionals. All predictions must be reviewed and validated by a qualified clinician in the context of a patient's full medical history.

**7. Limitations**
Generalizability: The model's performance is optimized for the characteristics of the training data. Its applicability to diverse populations may be limited.

Feature Scope: The model relies solely on the provided 23 features. It does not account for other potentially significant risk factors not present in the dataset (e.g., detailed diet, exercise habits, genetic markers).

**8. Recommendations for Use**
Clinical Context: Always use the model's output in conjunction with full clinical assessment, patient history, and other diagnostic tests.

Population Specificity: Exercise caution and validate thoroughly if applying the model to patient populations significantly different from the training data.

Monitoring: Implement a robust system for monitoring the model's performance in real-world clinical settings to detect any performance degradation (model drift) over time.

Feedback Loop: Establish a feedback mechanism from clinicians to continuously improve the model based on real-world outcomes and insights.

Regular Updates: Plan for periodic model retraining and re-validation with new, diverse data to maintain its relevance and accuracy.

**9. Contact Information**
For questions, feedback, or to report issues regarding this model, please contact:

jcrm86@gmail.com