# V1: Target-Encoded Categorical Ensemble

An early iteration that combines the competition data with the original diabetes dataset, target-encodes the categorical features, and validates with 10-fold stratified cross-validation. The only feature engineering in this version is the categorical handling: each categorical column gets a smoothed target-encoded copy alongside its label-encoded form.

**Key Features:**
- Combined 700K competition + 100K original reference data
- 800K total samples for training
- Target encoding for categorical variables
- Label encoding for tree models
- 10-Fold Stratified Cross-Validation
- XGBoost, LightGBM, CatBoost ensemble
- Up to 5000 estimators per model, with early stopping after 150 rounds
- Weighted averaging (45/35/20)

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

print("All libraries imported successfully!")

## 2. Load the Data

In [None]:
# Competition data
train_comp = pd.read_csv("/kaggle/input/playground-series-s5e12/train.csv")
test_comp  = pd.read_csv("/kaggle/input/playground-series-s5e12/test.csv")
submission = pd.read_csv("/kaggle/input/playground-series-s5e12/sample_submission.csv")

original = pd.read_csv("/kaggle/input/diabetes-health-indicators-dataset/diabetes_dataset.csv")

print(f"Competition train shape : {train_comp.shape}")
print(f"Original real data shape : {original.shape}")
print(f"Test shape              : {test_comp.shape}")

## 3. Combine Competition + Original Data

In [None]:
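# Keep only the feature columns shared by both datasets.
# The target is excluded here because it is appended explicitly below;
# leaving it in would select "diagnosed_diabetes" twice and create duplicate columns.
common_cols = [col for col in train_comp.columns
               if col in original.columns and col not in ("id", "diagnosed_diabetes")]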

train = train_comp[common_cols + ["diagnosed_diabetes"]].copy()
orig  = original[common_cols + ["diagnosed_diabetes"]].copy()

train = pd.concat([train, orig], ignore_index=True)
print(f"\nCombined training data shape → {train.shape} (competition + original)")

## 4. Categorical Encoding

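Add a smoothed target-encoded copy (`<column>_te`) of each categorical feature: the category's mean target rate, shrunk toward the global mean with a prior weight of 20 samples. The original columns are then label-encoded to integer codes. Note that the encoding is computed on the full training set, so validation rows contribute to their own encoded values and the CV score may be slightly optimistic.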
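As a minimal sketch of the smoothing, assuming a toy DataFrame (the rows and the helper name `smoothed_target_means` are illustrative and not part of the notebook), each category's encoded value is `(mean * n + global_mean * m) / (n + m)`, where `n` is the category count and `m = 20` is the prior weight:

```python
import pandas as pd

def smoothed_target_means(df, col, target, m=20):
    """Per-category target mean, shrunk toward the global mean with prior weight m."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    return (stats["mean"] * stats["count"] + global_mean * m) / (stats["count"] + m)

# Hypothetical toy data: category "a" has 3 rows, category "b" has 1 row.
toy = pd.DataFrame({
    "smoking_status": ["a", "a", "a", "b"],
    "diagnosed_diabetes": [1, 1, 0, 0],
})

encoding = smoothed_target_means(toy, "smoking_status", "diagnosed_diabetes", m=20)
toy["smoking_status_te"] = toy["smoking_status"].map(encoding)
print(encoding)
```

With only a few rows per category, the encoded value stays close to the global mean (0.5 here). When a category has many thousands of rows, a prior weight of 20 changes the raw category mean very little.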

In [None]:
cat_features = ["gender", "ethnicity", "education_level", "income_level",
                "employment_status", "smoking_status"]

# Target encoding (smoothed with regularization)
global_mean = train["diagnosed_diabetes"].mean()

for col in cat_features:
    target_mean = train.groupby(col)["diagnosed_diabetes"].mean()
    count = train.groupby(col)["diagnosed_diabetes"].count()
    smooth = (target_mean * count + global_mean * 20) / (count + 20)
    
    train[col + "_te"] = train[col].map(smooth)
    test_comp[col + "_te"] = test_comp[col].map(smooth).fillna(global_mean)

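# Label encoding (integer codes) for the tree models.
# Fit on the union of train and test values so that a category
# present only in the test set does not raise in transform().
for col in cat_features:
    le = LabelEncoder()
    le.fit(pd.concat([train[col], test_comp[col]]).astype(str))
    train[col] = le.transform(train[col].astype(str))
    test_comp[col] = le.transform(test_comp[col].astype(str))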

print("Categorical encoding complete!")
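The target encoding in this section is fit on every training row at once. One common alternative is out-of-fold encoding, where each row is encoded using statistics from the other folds only. The sketch below is hedged: the helper name `oof_target_encode`, the 5 inner folds, and the random toy data are assumptions made for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, m=20, n_splits=5, seed=42):
    """Smoothed target encoding where each row is encoded from the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["mean", "count"])
        smooth = (stats["mean"] * stats["count"] + prior * m) / (stats["count"] + m)
        # Categories unseen in the fitting folds fall back to the prior.
        values = df[col].iloc[enc_idx].map(smooth).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Hypothetical toy data with random labels.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income_level": rng.choice(["low", "mid", "high"], size=200),
    "diagnosed_diabetes": rng.integers(0, 2, size=200),
})
toy["income_level_te"] = oof_target_encode(toy, "income_level", "diagnosed_diabetes")
print(toy.groupby("income_level")["income_level_te"].describe())
```

In this scheme the test set would still be encoded with a mapping fit on all training rows, as the notebook already does.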

## 5. Prepare Final Data

In [None]:
X = train.drop(["id", "diagnosed_diabetes"], axis=1, errors="ignore")
y = train["diagnosed_diabetes"]
# Select exactly the training columns, in the same order, so that any
# test-only column (including "id") is left out.
X_test = test_comp[X.columns]

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Total features: {X.shape[1]}")

## 6. 10-Fold Ensemble Training

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
oof_preds = np.zeros(len(X))
test_preds = np.zeros(len(X_test))

print("\nStarting 10-Fold Training...\n")

for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"  → Fold {fold+1}/10", end=" ")
    
    X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]
    
    # XGBoost (early_stopping_rounds is set on the estimator;
    # XGBoost >= 2.0 no longer accepts it as a fit() argument)
    model1 = xgb.XGBClassifier(
        n_estimators=5000,
        max_depth=8,
        learning_rate=0.02,
        subsample=0.8,
        colsample_bytree=0.7,
        random_state=42,
        tree_method="hist",
        n_jobs=-1,
        verbosity=0,
        early_stopping_rounds=150
    )
    model1.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], verbose=False)
    
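    # LightGBM (the fit() `verbose` argument was removed in LightGBM 4.0;
    # early-stopping messages are silenced through the callback instead)
    model2 = lgb.LGBMClassifier(
        n_estimators=5000,
        max_depth=9,
        learning_rate=0.02,
        num_leaves=256,
        subsample=0.8,
        subsample_freq=1,  # row subsampling is only applied when subsample_freq > 0
        colsample_bytree=0.7,
        random_state=42,
        n_jobs=-1,
        verbose=-1
    )
    model2.fit(X_trn, y_trn, eval_set=[(X_val, y_val)],
               callbacks=[lgb.early_stopping(150, verbose=False)])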
    
    # CatBoost
    model3 = cb.CatBoostClassifier(
        iterations=5000,
        depth=9,
        learning_rate=0.03,
        random_seed=42,
        verbose=False,
        early_stopping_rounds=150
    )
    model3.fit(X_trn, y_trn, eval_set=(X_val, y_val), verbose=False)
    
    # Blend per fold
    val_blend = (model1.predict_proba(X_val)[:,1] * 0.45 +
                 model2.predict_proba(X_val)[:,1] * 0.35 +
                 model3.predict_proba(X_val)[:,1] * 0.20)
    
    test_blend = (model1.predict_proba(X_test)[:,1] * 0.45 +
                  model2.predict_proba(X_test)[:,1] * 0.35 +
                  model3.predict_proba(X_test)[:,1] * 0.20) / skf.n_splits
    
    oof_preds[val_idx] = val_blend
    test_preds += test_blend
    
    fold_auc = roc_auc_score(y_val, val_blend)
    print(f"| AUC = {fold_auc:.6f}")

print(f"\nFinal CV AUC: {roc_auc_score(y, oof_preds):.6f}")
print("Training complete!")
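The blending and fold averaging in the loop above can be checked in isolation. This is a small sketch with made-up probabilities (the `blend` helper, `fold_preds`, and all values are hypothetical). It shows that adding each fold's weighted blend divided by the number of folds gives the plain mean of the per-fold blends.

```python
import numpy as np

WEIGHTS = {"xgb": 0.45, "lgbm": 0.35, "cb": 0.20}  # same weights as the notebook

def blend(preds, weights=WEIGHTS):
    """Weighted average of per-model probability arrays, keyed by model name."""
    total = sum(weights.values())
    return sum(weights[name] * np.asarray(p, dtype=float)
               for name, p in preds.items()) / total

# Hypothetical test-set probabilities from two folds, three rows each.
fold_preds = [
    {"xgb": [0.2, 0.6, 0.9], "lgbm": [0.3, 0.5, 0.8], "cb": [0.1, 0.7, 1.0]},
    {"xgb": [0.4, 0.4, 0.7], "lgbm": [0.1, 0.5, 0.6], "cb": [0.3, 0.5, 0.8]},
]

n_splits = len(fold_preds)
avg_test_preds = np.zeros(3)
for preds in fold_preds:
    avg_test_preds += blend(preds) / n_splits  # same accumulation as the training loop

# Reference value: the plain mean of the per-fold blends.
mean_of_blends = np.mean([blend(p) for p in fold_preds], axis=0)
print(avg_test_preds, mean_of_blends)
```

Dividing by the weight total inside `blend` keeps the output a valid probability even if the weights are later changed and no longer sum to 1.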

## 7. Create Submission

In [None]:
submission["diagnosed_diabetes"] = test_preds
submission.to_csv("submission.csv", index=False)

print("\nsubmission.csv saved!")
print(f"Mean predicted probability: {test_preds.mean():.5f}")
print(f"Min predicted probability: {test_preds.min():.5f}")
print(f"Max predicted probability: {test_preds.max():.5f}")

print("\nFirst few predictions:")
submission.head(10)

## Summary

**V1: Target-Encoded Categorical Ensemble**

**Architecture:**
- **Data Combination**: 700K competition + 100K original = 800K training samples
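- **Categorical Encoding**:
  - Target encoding (smoothed): adds each category's smoothed mean target rate as a `_te` feature
  - Label encoding: converts categories to integer codes, which the tree models treat as ordinary numeric features
  - Regularization: a smoothing weight of 20 shrinks rare categories toward the global mean; the encoding is fit on the full training set rather than out-of-fold, so the CV AUC may be slightly optimistic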
- **Features**: All shared columns between datasets
- **Target Encoding Variables**: gender, ethnicity, education_level, income_level, employment_status, smoking_status
- **Model Configuration**:
  - 10-Fold Stratified CV
  - Up to 5000 estimators per model
  - Learning rate: 0.02 (XGBoost, LightGBM), 0.03 (CatBoost)
  - Early stopping: 150 rounds, monitored on the validation fold
- **Ensemble**: XGB (45%) + LGBM (35%) + CB (20%)

V1 explores data combination strategies and categorical feature encoding, using a large training set (800K samples) to improve model robustness.