# 🏥 Diabetes Prediction System

**Objective**:
This notebook trains a **Voting Ensemble** (XGBoost + Random Forest + Gradient Boosting) to predict diabetes. It uses soft voting, which averages the models' predicted class probabilities, with the aim of minimizing False Negatives (missed diabetes cases).

**Workflow**:
1.  **Ingestion**: Loading the CDC BRFSS 2015 dataset from a processed Parquet file.
2.  **Preprocessing**: Advanced outlier mapping and feature interaction terms.
3.  **Training**: Training 3 distinct SOTA models and fusing them.
4.  **Export**: Serializing the ensemble for the FastAPI backend.
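
The export step could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is synthetic, the ensemble is a reduced two-model stand-in (no XGBoost, to keep the example light), and `model_path` is a hypothetical temporary location rather than the project's real output path.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Small synthetic stand-in for the real training data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Reduced stand-in ensemble; the notebook's real ensemble has three members.
model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=42)),
    ],
    voting="soft",
).fit(X, y)

# Hypothetical output location, used here only so the sketch is self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "ensemble.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload, as a backend might at startup, and confirm predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print("Round trip OK:", same)
```

A pickled model should be loaded with the same library versions it was saved with, and only from a trusted source, since unpickling can execute arbitrary code.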

In [None]:
# Core Libraries
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensemble Components
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

print("✅ Environment Loaded.")

In [None]:
# Load Data (CDC BRFSS 2015)
DATA_FILE = "../data/processed/diabetes.parquet"

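if not os.path.exists(DATA_FILE):
    # Fail fast: later cells need `df`, so do not continue without the data.
    raise FileNotFoundError(f"❌ Dataset missing: {DATA_FILE}")

df = pd.read_parquet(DATA_FILE)
# df.shape[1] counts every column, including the target.
print(f"✅ Data Ingested: {df.shape[0]} rows | {df.shape[1]} columns")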

### Model Architecture: The "Triad" Ensemble
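We use a **Soft Voting Ensemble**, which averages the predicted probabilities of three models (equally weighted unless `weights` is passed to `VotingClassifier`):
1.  **XGBoost**: Captures complex non-linear patterns.
2.  **Random Forest**: Averages many decorrelated trees, which reduces variance and adds stability.
3.  **Gradient Boosting**: Fits each new tree to the errors of the previous ones, focusing on hard-to-classify cases.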
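
To make the voting rule concrete, here is a minimal sketch of soft voting written with NumPy only. The helper `soft_vote` and the probability values are hypothetical and exist purely for illustration; scikit-learn's `VotingClassifier` performs the equivalent averaging internally when `voting='soft'`.

```python
import numpy as np

def soft_vote(probas, weights=None):
    """Average per-model class probabilities and pick the most likely class.

    probas: list of arrays, each of shape (n_samples, n_classes).
    weights: optional per-model weights; None means equal weights.
    """
    stacked = np.stack(probas)  # shape: (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg, avg.argmax(axis=1)

# Hypothetical [P(no diabetes), P(diabetes)] for two patients from each model.
p_xgb = np.array([[0.60, 0.40], [0.20, 0.80]])
p_rf = np.array([[0.70, 0.30], [0.40, 0.60]])
p_gb = np.array([[0.35, 0.65], [0.30, 0.70]])

# Equal weights: a plain mean of the three probability tables.
avg, labels = soft_vote([p_xgb, p_rf, p_gb])
print(avg, labels)

# Giving the third model more say can change the decision for a borderline patient.
w_avg, w_labels = soft_vote([p_xgb, p_rf, p_gb], weights=[1, 1, 3])
print(w_avg, w_labels)
```

With equal weights the first patient averages to 0.55 / 0.45 and is classified as negative; with weights `[1, 1, 3]` the same patient averages to 0.47 / 0.53 and flips to positive, which shows how weighting shifts borderline decisions.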

In [None]:
target = 'diabetes'
X = df.drop(columns=[target])
y = df[target]

# Stratify so the train and test splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# --- ENSEMBLE DEFINITION ---

# 1. Extreme Gradient Boosting
clf1 = XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=5, eval_metric='logloss', random_state=42)

# 2. Random Forest (parallelized across all CPU cores via n_jobs=-1)
clf2 = RandomForestClassifier(n_estimators=200, max_depth=10, n_jobs=-1, random_state=42)

# 3. Gradient Boosting (sklearn)
clf3 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Voting Mechanism (Soft Voting for Probability Averaging)
ensemble = VotingClassifier(
    estimators=[('xgb', clf1), ('rf', clf2), ('gb', clf3)],
    voting='soft'
)

print("⏳ Training Ensemble...")
ensemble.fit(X_train, y_train)
print("✅ Training Complete.")

In [None]:
# Evaluation
preds = ensemble.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"🎯 Ensemble Accuracy: {acc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, preds))
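
Accuracy alone does not show how many diabetic patients the model misses. The sketch below counts false negatives and computes recall at a chosen decision threshold. The helper `false_negative_report` and the sample values are hypothetical; with the fitted model above, the positive-class probabilities would come from `ensemble.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def false_negative_report(y_true, proba_pos, threshold=0.5):
    """Count missed positives and compute recall at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(proba_pos) >= threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"threshold": threshold, "false_negatives": fn, "recall": recall}

# Hypothetical labels and positive-class probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba_pos = [0.90, 0.55, 0.42, 0.35, 0.45, 0.20, 0.10, 0.05]

# Compare the default threshold with a lower, more sensitive one.
for t in (0.5, 0.3):
    print(false_negative_report(y_true, proba_pos, threshold=t))
```

In this toy data, lowering the threshold from 0.5 to 0.3 removes both false negatives, at the cost of one extra false positive (the negative case scored 0.45). Any threshold used in practice should be chosen on a validation split, not on the test set.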