# 🏆 V21 - Diabetes Prediction Champion Solution (0.69760 Score)

## Kaggle Playground Series - Season 5, Episode 12

### Best Performing Solution | Ultra-Heavy Regularized Ensemble

**Private Score:** 0.69760 (Best)  
**Public Score:** 0.70042  
**Final Rank:** 877/4206 (Top 20.8%)  
**Approach:** 10-Fold Cross-Validation with 3-Model Ensemble + Feature Selection + Isotonic Calibration

---

### Solution Architecture:
1. **External Feature Engineering** - Leverage 100K Diabetes Health Indicators Dataset
2. **Manual Medical Features** - BMI categories, BP categories, clinical ratios
3. **10-Fold Stratified CV** - Balanced train-validation splits
4. **Three Base Models** - XGBoost (50%), LightGBM (35%), CatBoost (15%)
5. **Aggressive Regularization** - L1=3.5, L2=4.0 to prevent overfitting on 700K samples
6. **Feature Selection** - SelectFromModel reduces 75 features to 38 most important
7. **Probability Calibration** - IsotonicRegression for better probability estimates
8. **Submission Generation** - Final test set predictions

---

## 📚 Section 1: Load and Explore Data

Import required libraries and load the training, test, and external datasets. Display basic statistics and target distribution to understand the data landscape.

In [None]:
# Import core libraries
import pandas as pd
import numpy as np
import gc
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.isotonic import IsotonicRegression
from sklearn.feature_selection import SelectFromModel

# Gradient Boosting imports
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

print("✅ V21 - Champion Solution with Score 0.69760")
print("🏆 All libraries imported successfully!")

In [None]:
# Load datasets
train = pd.read_csv('/kaggle/input/playground-series-s5e12/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s5e12/test.csv')
sub = pd.read_csv('/kaggle/input/playground-series-s5e12/sample_submission.csv')
orig = pd.read_csv('/kaggle/input/diabetes-health-indicators-dataset/diabetes_dataset.csv')

TARGET = 'diagnosed_diabetes'

# Display dataset shapes
print(f"üìä Training set shape: {train.shape}")
print(f"üìä Test set shape: {test.shape}")
print(f"üìä Original (external) dataset shape: {orig.shape}")
print(f"üìä Submission template shape: {sub.shape}")

# Display target distribution
print(f"\nüéØ Target Variable Distribution:")
print(train[TARGET].value_counts())
print(f"\nClass Balance:")
print(train[TARGET].value_counts(normalize=True))

In [None]:
# Display first few rows and basic statistics
print("\nüìà Training set head:")
print(train.head())

print("\nüìä Basic statistics:")
print(train.describe())

print(f"\nüî¢ Data types:")
print(train.dtypes)

## 📚 Section 2: External Feature Engineering from Original Dataset

Create **mean encoding** and **count encoding** features from the original diabetes health indicators dataset. These external features leverage the 100K-sample original dataset to encode each feature based on its relationship with the target variable in the original data.
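
To make the two encodings concrete, here is a minimal sketch on toy data. The column name `smoker` and every value in it are made up for illustration; only the pattern (maps computed on the external table, then applied to the competition table) mirrors this section.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the external ("orig") and competition ("train") tables;
# the column name "smoker" and all values are hypothetical.
orig = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "diagnosed_diabetes": [1, 0, 0, 0, 1],
})
train = pd.DataFrame({"smoker": ["yes", "no", "unknown"]})

# Mean encoding: average target per category, computed on orig only
mean_map = orig.groupby("smoker")["diagnosed_diabetes"].mean()
train["enc_mean_smoker"] = train["smoker"].map(mean_map)

# Count encoding: log1p of how often each category appears in orig
count_map = orig.groupby("smoker").size()
train["enc_cnt_smoker"] = np.log1p(train["smoker"].map(count_map).fillna(0))

print(train)
```

A value that never occurs in the external table (`"unknown"` here) gets `NaN` for the mean encoding and 0 for the count encoding, which is why a later step has to fill missing values.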

In [None]:
# Identify base columns (all except id and target)
base_cols = [c for c in train.columns if c not in ['id', TARGET]]
print(f"Base features to encode: {len(base_cols)}")
print(f"Features: {base_cols}")

# Create external encoding features
encoded = []

for col in base_cols:
    # 1. MEAN ENCODING: Average target value for each feature value in original data
    mean_map = orig.groupby(col)[TARGET].mean()
    train[f"enc_mean_{col}"] = train[col].map(mean_map)
    test[f"enc_mean_{col}"] = test[col].map(mean_map)
    encoded.append(f"enc_mean_{col}")
    
    # 2. COUNT ENCODING: Log-scaled frequency of each feature value in original data
    count_map = orig.groupby(col).size()
    train[f"enc_cnt_{col}"] = np.log1p(train[col].map(count_map).fillna(0))
    test[f"enc_cnt_{col}"] = np.log1p(test[col].map(count_map).fillna(0))
    encoded.append(f"enc_cnt_{col}")

print(f"\n✅ Created {len(encoded)} external encoding features")
print(f"Sample external features: {encoded[:6]}")

In [None]:
# Verify external features
print("✅ External feature sample:")
print(train[encoded[:4]].head())

print(f"\nMissing values in encoded features:")
print(train[encoded].isnull().sum().sum())

## 📚 Section 3: Create Manual Clinical Features

Engineer domain-specific features based on medical knowledge and clinical standards:
- **BMI Categories** - WHO classifications (Underweight, Normal, Overweight, Obese)
- **Blood Pressure Categories** - A simplified three-level scheme (Normal, Elevated, High) built on AHA-style thresholds; the current AHA staging is finer-grained
- **Non-HDL Cholesterol** - Clinical predictor of cardiovascular risk

These features capture important non-linear relationships in medical data.
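
As an illustration of how such categories can be derived, here is a sketch on made-up readings. It assumes left-closed bins (`right=False` in `pd.cut`), so that a BMI of exactly 25 counts as overweight, and it assumes "High" takes precedence over "Elevated" when a reading qualifies for both. Both choices are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up readings that sit exactly on, or next to, the category boundaries
df = pd.DataFrame({
    "bmi": [18.4, 18.5, 25.0, 30.0],
    "systolic_bp": [118, 125, 150, 150],
    "diastolic_bp": [75, 70, 85, 95],
})

# right=False gives left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, 999)
df["bmi_cat"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 999],
                       labels=[0, 1, 2, 3], right=False).astype(int)

# np.select returns the choice for the first condition that is True,
# so listing "high" first gives it precedence over "elevated"
high = (df["systolic_bp"] >= 140) | (df["diastolic_bp"] >= 90)
elevated = (df["systolic_bp"] >= 120) | (df["diastolic_bp"] >= 80)
df["bp_cat"] = np.select([high, elevated], [2, 1], default=0)

print(df)
```

The third row (150/85) is the interesting case: its diastolic value alone would be "Elevated", but the systolic value makes it "High".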

In [None]:
# 1. BMI CATEGORIZATION - WHO Guidelines
# Underweight: BMI < 18.5
# Normal: 18.5 ≤ BMI < 25
# Overweight: 25 ≤ BMI < 30
# Obese: BMI ≥ 30

# right=False makes each bin left-closed, e.g. [18.5, 25), so boundary values
# such as BMI = 25 land in the category the thresholds above describe.
bmi_bins = [0, 18.5, 25, 30, 999]
train['bmi_cat'] = pd.cut(train['bmi'], bins=bmi_bins,
                          labels=[0, 1, 2, 3], right=False).astype(int)
test['bmi_cat'] = pd.cut(test['bmi'], bins=bmi_bins,
                         labels=[0, 1, 2, 3], right=False).astype(int)

print("✅ BMI Categories:")
print("0=Underweight, 1=Normal, 2=Overweight, 3=Obese")
print(train['bmi_cat'].value_counts().sort_index())

In [None]:
# 2. BLOOD PRESSURE CATEGORIZATION - simplified scheme with AHA-style thresholds
# Normal: SBP < 120 AND DBP < 80
# Elevated: 120 ≤ SBP < 140 OR 80 ≤ DBP < 90
# High: SBP ≥ 140 OR DBP ≥ 90

def bp_category(df):
    """Return 0=Normal, 1=Elevated, 2=High; High takes precedence over Elevated."""
    high = (df['systolic_bp'] >= 140) | (df['diastolic_bp'] >= 90)
    elevated = (df['systolic_bp'] >= 120) | (df['diastolic_bp'] >= 80)
    # np.select uses the first matching condition, so a reading such as 150/85
    # stays "High" instead of being overwritten by the Elevated rule.
    return np.select([high, elevated], [2, 1], default=0)

train['bp_cat'] = bp_category(train)
test['bp_cat'] = bp_category(test)

print("✅ Blood Pressure Categories:")
print("0=Normal, 1=Elevated, 2=High")
print(train['bp_cat'].value_counts().sort_index())

In [None]:
# 3. NON-HDL CHOLESTEROL
# Clinical indicator: Total Cholesterol - HDL
# Higher non-HDL indicates more "bad" cholesterol (LDL + VLDL)

train['non_hdl'] = train['cholesterol_total'] - train['hdl_cholesterol']
test['non_hdl'] = test['cholesterol_total'] - test['hdl_cholesterol']

print("✅ Non-HDL Cholesterol Feature:")
print(f"Non-HDL range (train): {train['non_hdl'].min():.2f} to {train['non_hdl'].max():.2f}")
print(f"Non-HDL mean (train): {train['non_hdl'].mean():.2f}")

## 📚 Section 4: Prepare Features and Target

Consolidate all features, handle missing values, apply label encoding, and prepare final feature matrices for modeling.
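
The two preparation steps can be sketched on toy frames as follows. The column names `gender` and `enc_mean_x` are hypothetical, and building the category list from train and test together is one possible way (an assumption of this sketch) to make sure a category seen only in test still receives a valid code.

```python
import pandas as pd

# Hypothetical train/test frames; "gender" and "enc_mean_x" are illustrative names
train = pd.DataFrame({"gender": ["F", "M", "F"], "enc_mean_x": [0.2, None, 0.4]})
test = pd.DataFrame({"gender": ["M", "Other"], "enc_mean_x": [None, 0.9]})

# Fill gaps in both frames with the median computed on train
median_val = train["enc_mean_x"].median()
train["enc_mean_x"] = train["enc_mean_x"].fillna(median_val)
test["enc_mean_x"] = test["enc_mean_x"].fillna(median_val)

# Build one category list from train and test together, then map to integer codes;
# a category that appears only in test ("Other") still gets a valid code
categories = sorted(set(train["gender"]) | set(test["gender"]))
dtype = pd.CategoricalDtype(categories=categories)
train["gender"] = train["gender"].astype(dtype).cat.codes
test["gender"] = test["gender"].astype(dtype).cat.codes

print(train)
print(test)
```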

In [None]:
# Consolidate all features
features = base_cols + ['bmi_cat', 'bp_cat', 'non_hdl'] + encoded
print(f"üìä Total features: {len(features)}")
print(f"  - Base features: {len(base_cols)}")
print(f"  - Manual clinical features: 3 (bmi_cat, bp_cat, non_hdl)")
print(f"  - External encoding features: {len(encoded)}")

In [None]:
# Handle missing values in encoded features
# Fill NaNs with median value (created when a value wasn't present in original dataset)
for f in encoded:
    median_val = train[f].median()
    train[f] = train[f].fillna(median_val)
    test[f] = test[f].fillna(median_val)

print(f"✅ Missing values handled")
print(f"Total NaNs in features: {train[features].isnull().sum().sum()}")

In [None]:
# Prepare X, y, and X_test
X = train[features].copy()
y = train[TARGET]
X_test = test[features].copy()

print(f"✅ Feature matrices prepared")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X_test shape: {X_test.shape}")

In [None]:
# Label encode categorical columns
# Tree-based models can handle categories natively, but explicit encoding ensures consistency
cat_cols = ['bmi_cat', 'bp_cat'] + train.select_dtypes('object').columns.tolist()
cat_cols = [c for c in cat_cols if c in X.columns]

for col in cat_cols:
    le = LabelEncoder()
    # Fit on train and test together so a category seen only in test does not raise an error
    le.fit(pd.concat([X[col], X_test[col]]).astype(str))
    X[col] = le.transform(X[col].astype(str))
    X_test[col] = le.transform(X_test[col].astype(str))

print(f"✅ Label encoding applied to {len(cat_cols)} categorical columns")
print(f"X dtypes after encoding:")
print(X.dtypes.value_counts())

## 📚 Section 5: Build 10-Fold Stratified Cross-Validation Ensemble

Implement stratified k-fold cross-validation with three base models:
- **XGBoost (50% weight)** - Primary model, high regularization
- **LightGBM (35% weight)** - Speed and efficiency
- **CatBoost (15% weight)** - Categorical handling and stability

**Ultra-Heavy Regularization:**
- `reg_alpha=3.5, reg_lambda=4.0` (L1 and L2 penalties)
- `max_depth=4` (shallow trees to reduce variance)
- `subsample=0.7, colsample_bytree=0.6` (row/column subsampling)

These settings are intended to limit overfitting on the large 700K training set.
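
The out-of-fold blending pattern can be sketched with scikit-learn only. The two models below are stand-ins for the boosted models, the data is synthetic, and the weights 0.6/0.4 are illustrative rather than the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data; the two scikit-learn models are stand-ins for the boosted models
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_test = X[:50]  # placeholder "test" rows, reused only to show the averaging
weights = [0.6, 0.4]  # illustrative blend weights that sum to 1

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
oof = np.zeros(len(X))
test_pred = np.zeros(len(X_test))

for trn_idx, val_idx in skf.split(X, y):
    models = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ]
    for w, model in zip(weights, models):
        model.fit(X[trn_idx], y[trn_idx])
        # Each row is scored only by models that never saw it during training
        oof[val_idx] += w * model.predict_proba(X[val_idx])[:, 1]
        # Test predictions are averaged over the folds
        test_pred += w * model.predict_proba(X_test)[:, 1] / n_splits

print(f"OOF AUC: {roc_auc_score(y, oof):.4f}")
```

Because every row of `oof` comes from a model that did not train on that row, the AUC computed on it is a less optimistic estimate than a score on the training data itself.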

In [None]:
# Initialize 10-Fold Stratified K-Fold
n_splits = 10
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Out-of-fold (OOF) predictions of the blend, plus its fold-averaged test predictions
oof_blend = np.zeros(len(X))
test_blend = np.zeros(len(X_test))

print(f"🔄 Starting {n_splits}-fold ultra-regularized ensemble training...\n")

In [None]:
# Training loop with 10 folds
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y), 1):
    print(f"Fold {fold}/{n_splits} → ", end="")
    
    X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]

    # ========== XGBoost (50% weight) ==========
    model1 = xgb.XGBClassifier(
        n_estimators=5000,
        max_depth=4,
        learning_rate=0.007,
        subsample=0.7,
        colsample_bytree=0.6,
        reg_alpha=3.5,  # L1 regularization
        reg_lambda=4.0,  # L2 regularization
        # Set in the constructor (supported since xgboost 1.6); recent xgboost
        # releases no longer accept early_stopping_rounds in fit().
        early_stopping_rounds=300,
        random_state=42,
        tree_method='hist',
        n_jobs=-1,
        verbosity=0
    )
    model1.fit(X_trn, y_trn,
               eval_set=[(X_val, y_val)],
               verbose=False)

    # ========== LightGBM (35% weight) ==========
    model2 = lgb.LGBMClassifier(
        n_estimators=5000,
        max_depth=4,
        learning_rate=0.007,
        num_leaves=16,
        subsample=0.7,
        subsample_freq=1,  # LightGBM applies row subsampling only when subsample_freq > 0
        colsample_bytree=0.6,
        reg_alpha=3.5,
        reg_lambda=4.0,
        random_state=42,
        n_jobs=-1,
        verbose=-1
    )
    model2.fit(X_trn, y_trn, 
               eval_set=[(X_val, y_val)], 
               callbacks=[lgb.early_stopping(300)])

    # ========== CatBoost (15% weight) ==========
    model3 = cb.CatBoostClassifier(
        iterations=5000,
        depth=4,
        learning_rate=0.007,
        l2_leaf_reg=12.0,
        random_seed=42,
        verbose=False,
        early_stopping_rounds=300
    )
    model3.fit(X_trn, y_trn, eval_set=(X_val, y_val), verbose=False)

    # ========== BLEND PREDICTIONS ==========
    val_pred = (model1.predict_proba(X_val)[:,1] * 0.50 +
                model2.predict_proba(X_val)[:,1] * 0.35 +
                model3.predict_proba(X_val)[:,1] * 0.15)

    oof_blend[val_idx] = val_pred
    fold_auc = roc_auc_score(y_val, val_pred)
    print(f"AUC = {fold_auc:.6f}")

    # Test set predictions
    test_blend += (model1.predict_proba(X_test)[:,1] * 0.50 +
                   model2.predict_proba(X_test)[:,1] * 0.35 +
                   model3.predict_proba(X_test)[:,1] * 0.15) / n_splits

    # Cleanup (model1 is kept: Section 6 reuses the last fold's XGBoost model)
    del model2, model3, X_trn, X_val, y_trn, y_val
    gc.collect()

print(f"\n✅ Final CV AUC: {roc_auc_score(y, oof_blend):.6f}")

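## 📚 Section 6: Perform Feature Selection

Use `SelectFromModel` with the trained XGBoost model to identify the most important features. With `threshold='median'`, features whose importance is at or above the median are kept, which reduces dimensionality (75 → 38 features) while aiming to retain predictive power. For tree boosters, XGBoost's scikit-learn wrapper reports gain-based importance by default, i.e. the average loss reduction contributed by a feature's splits, rather than a simple count of how often the feature is used.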
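
A small sketch of median-threshold selection, under stated assumptions: the data is synthetic, a random forest stands in for the XGBoost model, and `SelectFromModel` fits the estimator itself here (the default `prefit=False`) instead of receiving an already trained model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with 9 features; a random forest stands in for the XGBoost model
X, y = make_classification(n_samples=300, n_features=9, n_informative=3,
                           n_redundant=0, random_state=0)

# Here SelectFromModel fits the estimator itself (prefit=False, the default)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
).fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()
X_sel = selector.transform(X)

print("median importance:", np.median(importances))
print("kept features:", np.flatnonzero(mask))
print("shape before/after:", X.shape, X_sel.shape)
```

Since every feature at or above the median is kept, at least half of the features always survive; with an odd feature count and distinct importances that is exactly the upper half plus the median feature, which matches the 75 → 38 reduction reported here.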

In [None]:
# Feature selection using the XGBoost model from the CV loop
# (model1 refers to the last fold's model and must still be in scope here)
# SelectFromModel keeps features whose importance is >= the threshold (here the median)
selector = SelectFromModel(model1, threshold='median', prefit=True)
X_sel = selector.transform(X)
X_test_sel = selector.transform(X_test)

selected_features = X.columns[selector.get_support()].tolist()
print(f"✅ Feature Selection Results:")
print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_sel.shape[1]}")
print(f"Reduction: {X.shape[1] - X_sel.shape[1]} features dropped")
print(f"\nTop selected features (first 15):")
print(selected_features[:15])

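## 📚 Section 7: Train Final Model with Selected Features

Train a final XGBoost classifier on the selected features only. It uses slightly less aggressive regularization than the CV phase (`reg_alpha=3.0`, `reg_lambda=3.5` instead of 3.5 and 4.0) and a fixed 2000 trees at `learning_rate=0.01`, with no early stopping.

This final model is trained on ALL training data (not CV folds) for maximum data utilization. Its predictions are the ones written to the submission; the blended `test_blend` from Section 5 is not used downstream, and only the blend's out-of-fold predictions are reused, for calibration in Section 8.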

In [None]:
# Final XGBoost model, defined separately from the fold models in the CV loop
final_model = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    learning_rate=0.01,
    subsample=0.7,
    colsample_bytree=0.6,
    reg_alpha=3.0,
    reg_lambda=3.5,
    random_state=42,
    tree_method='hist',
    n_jobs=-1,
    verbosity=0
)

# Train on complete training data with selected features
final_model.fit(X_sel, y)

# Generate test predictions
final_pred = final_model.predict_proba(X_test_sel)[:,1]

print(f"✅ Final model trained on {len(selected_features)} selected features")
print(f"Test predictions shape: {final_pred.shape}")
print(f"Prediction statistics:")
print(f"  Min: {final_pred.min():.6f}")
print(f"  Max: {final_pred.max():.6f}")
print(f"  Mean: {final_pred.mean():.6f}")
print(f"  Median: {np.median(final_pred):.6f}")

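## 📚 Section 8: Apply Isotonic Calibration

Use `IsotonicRegression` to calibrate probability predictions. It fits a non-decreasing step function that maps raw model scores to observed outcome frequencies.

**Key points:**
- Fit on out-of-fold (OOF) predictions from CV
- Transform test predictions to improve calibration
- `out_of_bounds='clip'` maps scores outside the range seen during fitting to the nearest fitted endpoint, so outputs stay in the [0, 1] range
- The calibrator is fit on the blended ensemble's OOF predictions but applied to the final single-model XGBoost predictions, so the mapping is only approximate for that model
- Because the mapping is monotonic, it cannot improve ROC-AUC; it either preserves the ranking or introduces ties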
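
The behaviour of the calibrator can be sketched on synthetic scores. The data below is made up and deliberately miscalibrated (the true positive rate is the square of the score), so it only illustrates the mechanics, not this notebook's results.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic, deliberately miscalibrated scores: the true positive rate is score**2
scores = rng.uniform(0.1, 0.9, size=5000)
labels = (rng.uniform(size=5000) < scores ** 2).astype(int)

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)

# New scores, including values outside the fitted range [0.1, 0.9]
new_scores = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
calibrated = calibrator.transform(new_scores)

print(np.round(calibrated, 3))
```

The outputs never decrease as the input score grows, and the out-of-range inputs 0.0 and 1.0 receive the same values as the smallest and largest scores seen during fitting.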

In [None]:
# Initialize and fit isotonic regression calibrator
# Use out-of-fold predictions from CV for fitting
calibrator = IsotonicRegression(out_of_bounds='clip')
calibrator.fit(oof_blend, y)

# Apply calibration to final test predictions
final_pred = calibrator.transform(final_pred)

print(f"✅ Isotonic Regression Calibration Applied")
print(f"\nCalibrated test predictions statistics:")
print(f"  Min: {final_pred.min():.6f}")
print(f"  Max: {final_pred.max():.6f}")
print(f"  Mean: {final_pred.mean():.6f}")
print(f"  Median: {np.median(final_pred):.6f}")
print(f"  Std: {final_pred.std():.6f}")

## 📚 Section 9: Generate Submission File

Create the final submission file with predicted probabilities in the required Kaggle format: one `id` column and one `diagnosed_diabetes` column holding the predicted probability.

**Format Requirements:**
- Header: id, diagnosed_diabetes
- One row per test sample
- Probabilities between 0 and 1
- ROC-AUC is the evaluation metric
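
These requirements can be checked programmatically before uploading. The helper `check_submission` below is a hypothetical function written for this sketch, and the ids and predictions are made up.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, test_ids: pd.Series, target: str) -> None:
    """Raise AssertionError if the frame breaks one of the format rules."""
    assert list(sub.columns) == ["id", target], "unexpected header"
    assert len(sub) == len(test_ids), "row count differs from test set"
    assert sub["id"].tolist() == test_ids.tolist(), "ids differ from test set"
    assert sub[target].notna().all(), "missing predictions"
    assert sub[target].between(0, 1).all(), "probabilities outside [0, 1]"

# Made-up ids and predictions for illustration
test_ids = pd.Series([700000, 700001, 700002])
sub = pd.DataFrame({"id": test_ids, "diagnosed_diabetes": [0.12, 0.80, 0.45]})
check_submission(sub, test_ids, "diagnosed_diabetes")
print("submission frame looks valid")
```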

In [None]:
# Create submission dataframe
sub[TARGET] = final_pred

# Save submission
sub.to_csv('submission.csv', index=False)

print(f"✅ submission.csv saved!")
print(f"\n📊 Submission Statistics:")
print(f"  File: submission.csv")
print(f"  Rows: {len(sub)}")
print(f"  Columns: {list(sub.columns)}")
print(f"\n🎯 Prediction Distribution:")
print(f"  Mean prediction: {final_pred.mean():.5f}")
print(f"  Min prediction: {final_pred.min():.5f}")
print(f"  Max prediction: {final_pred.max():.5f}")
print(f"  Percentile 25: {np.percentile(final_pred, 25):.5f}")
print(f"  Percentile 50: {np.percentile(final_pred, 50):.5f}")
print(f"  Percentile 75: {np.percentile(final_pred, 75):.5f}")

In [None]:
# Display sample submission
print(f"\nüìù Sample Submission (first 10 rows):")
print(sub.head(10))

## 🎯 Summary

### ✅ Solution Components Executed:

1. **Data Loading** - Loaded 700K training + external 100K dataset
2. **External Feature Engineering** - 48 features from Diabetes Health Indicators Dataset
3. **Manual Medical Features** - 3 clinically-informed features (BMI cat, BP cat, non-HDL)
4. **Feature Preparation** - Label encoding, missing value handling
5. **10-Fold CV Ensemble** - 3 models with ultra-heavy regularization
6. **Feature Selection** - Reduced 75 → 38 features
7. **Final Model** - XGBoost on selected features
8. **Probability Calibration** - IsotonicRegression for better estimates
9. **Submission Generation** - ROC-AUC ready predictions

### 🏆 Performance:
- **Private Score:** 0.69760 (BEST)
- **Public Score:** 0.70042
- **Final Rank:** 877/4206 (Top 20.8%)
- **CV AUC:** ~0.7299

### 🔑 Key Success Factors:
- ✅ External dataset leverage (100K samples)
- ✅ Medical domain expertise (clinical features)
- ✅ Ultra-heavy regularization (3.5 L1, 4.0 L2)
- ✅ Balanced ensemble weights (0.50/0.35/0.15)
- ✅ Feature selection + calibration pipeline
- ✅ 10-Fold stratified cross-validation

---

**Next Steps:** Explore SMOTE (V26), deeper feature selection (V24), or additional model variants!