# V15: Stacking with Meta-Learner

Advanced ensemble approach using stacking with Logistic Regression as a meta-learner. Collects Out-of-Fold (OOF) predictions from three base models (XGB, LGBM, CB) and learns optimal ensemble weights.

**Key Features:**
- 3-model ensemble base: XGBoost, LightGBM, CatBoost
- Out-of-Fold (OOF) collection for stacking
- Logistic Regression meta-learner for optimal blending
- External feature encoding from 100K dataset
- Medical domain features (BMI, BP, non-HDL)
- 10-Fold Stratified Cross-Validation
- Probability clipping to [0.01, 0.99] on the final predictions

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import gc
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

print("V15")

## 2. Load the Data

In [None]:
train = pd.read_csv('/kaggle/input/playground-series-s5e12/train.csv')
test  = pd.read_csv('/kaggle/input/playground-series-s5e12/test.csv')
sub   = pd.read_csv('/kaggle/input/playground-series-s5e12/sample_submission.csv')
orig  = pd.read_csv('/kaggle/input/diabetes-health-indicators-dataset/diabetes_dataset.csv')

TARGET = 'diagnosed_diabetes'

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"External dataset shape: {orig.shape}")

## 3. External Encoding

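Mean and count encodings computed on the 100K diabetes health indicators dataset capture how each feature value relates to the target variable in external data. Counts are log-transformed with `log1p`, and values that never occur in the external dataset get a count of 1; their mean encoding is left as NaN and imputed later.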

In [None]:
base_cols = [c for c in train.columns if c not in ['id', TARGET]]
encoded = []

for col in base_cols:
    mean_map = orig.groupby(col)[TARGET].mean()
    train[f"enc_mean_{col}"] = train[col].map(mean_map)
    test[f"enc_mean_{col}"]  = test[col].map(mean_map)
    encoded.append(f"enc_mean_{col}")
    
    count_map = orig.groupby(col).size()
    train[f"enc_cnt_{col}"] = train[col].map(count_map).fillna(1)
    test[f"enc_cnt_{col}"]  = test[col].map(count_map).fillna(1)
    train[f"enc_cnt_{col}"] = np.log1p(train[f"enc_cnt_{col}"])
    test[f"enc_cnt_{col}"]  = np.log1p(test[f"enc_cnt_{col}"])
    encoded.append(f"enc_cnt_{col}")

print(f"Generated {len(encoded)} external features")
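To make the encoding concrete, the sketch below applies the same mean and log-count mapping to a toy external table. The frames `ext` and `df` and the column `smoker` are hypothetical stand-ins, not columns from the competition data. It also shows that a value absent from the external table maps to NaN in the mean encoding, which is why the later imputation step is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical external table with a binary target
ext = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "target": [1, 0, 0, 0, 1],
})
# Hypothetical competition rows; "former" never appears in ext
df = pd.DataFrame({"smoker": ["yes", "no", "former"]})

mean_map = ext.groupby("smoker")["target"].mean()   # yes -> 0.5, no -> 1/3
count_map = ext.groupby("smoker").size()            # yes -> 2,   no -> 3

# Mean encoding: unseen values stay NaN until they are imputed
df["enc_mean_smoker"] = df["smoker"].map(mean_map)
# Count encoding: unseen values get a count of 1, then log1p is applied
df["enc_cnt_smoker"] = np.log1p(df["smoker"].map(count_map).fillna(1))

print(df)
```

Mapping a lookup Series with `.map` keeps the row order of `df`, so no merge or re-indexing is needed.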

## 4. Safe Feature Engineering

In [None]:
train['bmi_cat'] = pd.cut(train['bmi'], bins=[0,18.5,25,30,999], labels=[0,1,2,3]).astype('int')
test['bmi_cat']  = pd.cut(test['bmi'],  bins=[0,18.5,25,30,999], labels=[0,1,2,3]).astype('int')

# Assign "elevated" (1) first and "high" (2) last, so that a high reading in
# either measurement is not overwritten by the elevated rule
train['bp_cat'] = 0
train.loc[((train['systolic_bp']>=120)&(train['systolic_bp']<140))|
          ((train['diastolic_bp']>=80)&(train['diastolic_bp']<90)), 'bp_cat'] = 1
train.loc[(train['systolic_bp']>=140)|(train['diastolic_bp']>=90), 'bp_cat'] = 2

test['bp_cat'] = 0
test.loc[((test['systolic_bp']>=120)&(test['systolic_bp']<140))|
         ((test['diastolic_bp']>=80)&(test['diastolic_bp']<90)), 'bp_cat'] = 1
test.loc[(test['systolic_bp']>=140)|(test['diastolic_bp']>=90), 'bp_cat'] = 2

train['non_hdl'] = train['cholesterol_total'] - train['hdl_cholesterol']
test['non_hdl']  = test['cholesterol_total'] - test['hdl_cholesterol']

print("Medical features created")
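Because the blood-pressure categories are assigned with successive `.loc` writes, the last matching write wins, so the order of the statements decides how a reading such as 150/85 is labelled. A sketch of an order-explicit alternative uses `np.select`, which takes the first matching condition. The helper name `bp_category` is hypothetical; the thresholds are the ones used in the cell above.

```python
import numpy as np

def bp_category(systolic, diastolic):
    """Return 2 (high), 1 (elevated) or 0 (normal); the first matching rule wins."""
    systolic = np.asarray(systolic)
    diastolic = np.asarray(diastolic)
    high = (systolic >= 140) | (diastolic >= 90)
    # "high" is checked first, so upper bounds are not needed here
    elevated = (systolic >= 120) | (diastolic >= 80)
    return np.select([high, elevated], [2, 1], default=0)

# Hypothetical readings: normal, elevated, high systolic, high diastolic, both high
systolic_demo = [110, 125, 150, 118, 150]
diastolic_demo = [70, 75, 85, 95, 95]

bp_demo = bp_category(systolic_demo, diastolic_demo)
print(bp_demo.tolist())
```

With this formulation the 150/85 reading is labelled 2 regardless of how the conditions are written elsewhere, since precedence is given by the position in the condition list.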

## 5. Final Features Preparation

In [None]:
features = base_cols + ['bmi_cat', 'bp_cat', 'non_hdl'] + encoded

# Fill NaNs
for f in encoded:
    train[f] = train[f].fillna(train[f].median())
    test[f]  = test[f].fillna(train[f].median())

X      = train[features].copy()
y      = train[TARGET]
X_test = test[features].copy()

# Label encode categoricals
cat_cols = ['bmi_cat', 'bp_cat'] + train.select_dtypes('object').columns.tolist()
for col in cat_cols:
    if col in X.columns:
        le = LabelEncoder()
        X[col]      = le.fit_transform(X[col].astype(str))
        X_test[col] = le.transform(X_test[col].astype(str))

print(f"Total features: {X.shape[1]}")
print(f"Training set shape: {X.shape}")
print(f"Test set shape: {X_test.shape}")

## 6. 10-Fold Ensemble + OOF for Stacking

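Collect Out-of-Fold predictions from three base models; the OOF predictions are later used to train the meta-learner. Each model uses a learning rate of 0.01 with up to 5000 boosting rounds and early stopping after 250 rounds without improvement on the validation fold.
- **XGBoost**: max_depth=6, learning_rate=0.01
- **LightGBM**: max_depth=7, num_leaves=48, learning_rate=0.01
- **CatBoost**: depth=7, learning_rate=0.01

A fixed blend of 50% XGB + 35% LGBM + 15% CB is used only to report the per-fold and overall OOF AUC. The final predictions come from the meta-learner trained in the next section.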

In [None]:
n_splits = 10
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

oof_xgb = np.zeros(len(X))
oof_lgb = np.zeros(len(X))
oof_cb  = np.zeros(len(X))

test_xgb = np.zeros(len(X_test))
test_lgb = np.zeros(len(X_test))
test_cb  = np.zeros(len(X_test))

print(f"\nTraining {n_splits}-fold ensemble + collecting OOF...\n")

for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y), 1):
    print(f"Fold {fold}/{n_splits}", end=" → ")
    
    X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
    y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]

    # XGBoost
    # early_stopping_rounds is set on the estimator (supported since XGBoost 1.6);
    # passing it to fit() no longer works in XGBoost 2.0 and later
    model1 = xgb.XGBClassifier(
        n_estimators=5000, max_depth=6, learning_rate=0.01,
        subsample=0.8, colsample_bytree=0.6,
        reg_alpha=1.5, reg_lambda=2.0, early_stopping_rounds=250,
        random_state=42, tree_method='hist', n_jobs=-1, verbosity=0
    )
    model1.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], verbose=False)

    # LightGBM
    model2 = lgb.LGBMClassifier(
        n_estimators=5000, max_depth=7, learning_rate=0.01,
        num_leaves=48, subsample=0.8, colsample_bytree=0.6,
        reg_alpha=1.5, reg_lambda=2.2, random_state=42, n_jobs=-1, verbose=-1
    )
    model2.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], callbacks=[lgb.early_stopping(250)])

    # CatBoost
    model3 = cb.CatBoostClassifier(
        iterations=5000, depth=7, learning_rate=0.01,
        l2_leaf_reg=6.0, random_seed=42, verbose=False,
        early_stopping_rounds=250
    )
    model3.fit(X_trn, y_trn, eval_set=(X_val, y_val), verbose=False)

    # OOF predictions
    oof_xgb[val_idx] = model1.predict_proba(X_val)[:,1]
    oof_lgb[val_idx] = model2.predict_proba(X_val)[:,1]
    oof_cb[val_idx]  = model3.predict_proba(X_val)[:,1]

    # Test predictions
    test_xgb += model1.predict_proba(X_test)[:,1] / n_splits
    test_lgb += model2.predict_proba(X_test)[:,1] / n_splits
    test_cb  += model3.predict_proba(X_test)[:,1] / n_splits

    # Evaluate blend
    val_blend = oof_xgb[val_idx]*0.50 + oof_lgb[val_idx]*0.35 + oof_cb[val_idx]*0.15
    print(f"AUC = {roc_auc_score(y_val, val_blend):.6f}")

print(f"\nFinal OOF CV AUC: {roc_auc_score(y, oof_xgb*0.50 + oof_lgb*0.35 + oof_cb*0.15):.6f}")
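The OOF bookkeeping in the loop above can be sketched on synthetic data. This is a minimal illustration, assuming a Logistic Regression stand-in for the gradient-boosted base models and hypothetical names (`X_demo`, `y_demo`, `filled`). The point is that every training row receives exactly one prediction, made by a model that did not see that row during fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(X_demo))
filled = np.zeros(len(X_demo), dtype=int)  # how many times each row was predicted

for trn_idx, val_idx in skf_demo.split(X_demo, y_demo):
    # Stand-in base model; the notebook trains boosted trees at this step
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[trn_idx], y_demo[trn_idx])
    # Predict only the held-out rows and store them at their original positions
    oof[val_idx] = model.predict_proba(X_demo[val_idx])[:, 1]
    filled[val_idx] += 1

print("rows predicted exactly once:", bool((filled == 1).all()))
```

Because the validation folds partition the training set, the `oof` vector has the same length and row order as the training labels and can be used directly as a meta-learner input column.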

## 7. Stacking Meta-Learner

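Logistic Regression is trained on the base models' OOF predictions and learns a linear combination of them: one coefficient per base model plus an intercept. Fitting the combination on OOF data can outperform hand-picked blend weights, because the coefficients are estimated from predictions that each base model made on rows it did not train on.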

In [None]:
stack_train = np.column_stack([oof_xgb, oof_lgb, oof_cb])
stack_test  = np.column_stack([test_xgb, test_lgb, test_cb])

meta_model = LogisticRegression(random_state=42, max_iter=1000)
meta_model.fit(stack_train, y)

final_pred = meta_model.predict_proba(stack_test)[:,1]

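# Softmax-normalised coefficients, shown only as a rough indication of each
# model's relative contribution (the raw coefficients are unconstrained and
# need not be positive or sum to 1)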
weights = np.exp(meta_model.coef_[0]) / np.sum(np.exp(meta_model.coef_[0]))
print(f"\nLearned ensemble weights:")
print(f"XGBoost: {weights[0]:.4f}")
print(f"LightGBM: {weights[1]:.4f}")
print(f"CatBoost: {weights[2]:.4f}")
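What the meta-learner computes can be written out by hand: the stacked probability is a sigmoid applied to the intercept plus a weighted sum of the base model predictions. The sketch below checks this against `predict_proba` on synthetic inputs. The arrays `stack_demo` and `y_demo` are hypothetical stand-ins for the OOF matrix and the target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)

# Three hypothetical base-model outputs: noisy functions of the label, clipped to [0, 1]
noise = rng.normal(0.0, 0.2, size=(300, 3))
stack_demo = np.clip(0.3 + 0.4 * y_demo[:, None] + noise, 0.0, 1.0)

meta = LogisticRegression(max_iter=1000).fit(stack_demo, y_demo)

# Manual reconstruction: sigmoid(intercept + sum_k coef_k * p_k)
z = meta.intercept_[0] + stack_demo @ meta.coef_[0]
manual = 1.0 / (1.0 + np.exp(-z))

print("coefficients:", meta.coef_[0])
print("intercept:", meta.intercept_[0])
print("matches predict_proba:", np.allclose(manual, meta.predict_proba(stack_demo)[:, 1]))
```

The coefficients act on the raw probabilities and are not constrained to sum to 1, so any normalised view of them is an interpretation aid rather than a set of true blend weights.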

## 8. Final Calibration

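Clip probability predictions to [0.01, 0.99] for numerical stability. Clipping bounds extreme probabilities but does not recalibrate them, and because AUC depends only on the ranking of predictions, clipping changes the AUC only if it creates ties at the bounds.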

In [None]:
final_pred = np.clip(final_pred, 0.01, 0.99)

print(f"Predictions clipped to [0.01, 0.99] range")
print(f"Mean prediction: {final_pred.mean():.5f}")
print(f"Min prediction: {final_pred.min():.5f}")
print(f"Max prediction: {final_pred.max():.5f}")
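The effect of clipping on a rank-based metric can be checked with a small pure-Python sketch. The helpers `pairwise_auc` and `clip_scores` are hypothetical and written for illustration; the labels and scores are made up. Scores that already lie inside [0.01, 0.99] keep their AUC, while scores outside the range become tied after clipping and the AUC can change.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def clip_scores(scores, lo=0.01, hi=0.99):
    """Limit every score to the closed interval [lo, hi]."""
    return [min(max(s, lo), hi) for s in scores]

y_true = [0, 1, 0, 1]
inside = [0.20, 0.40, 0.60, 0.80]        # every score already within [0.01, 0.99]
outside = [0.001, 0.005, 0.50, 0.999]    # two scores fall below 0.01

auc_inside_raw = pairwise_auc(y_true, inside)
auc_inside_clipped = pairwise_auc(y_true, clip_scores(inside))
auc_outside_raw = pairwise_auc(y_true, outside)
auc_outside_clipped = pairwise_auc(y_true, clip_scores(outside))

print(auc_inside_raw, auc_inside_clipped)     # 0.75 0.75
print(auc_outside_raw, auc_outside_clipped)   # 0.75 0.625
```

In the second case the scores 0.001 and 0.005 both become 0.01, so a correctly ordered pair turns into a tie. The opposite can also happen: a wrongly ordered pair that becomes tied would raise the AUC slightly.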

## 9. Submission

In [None]:
sub[TARGET] = final_pred
sub.to_csv('submission.csv', index=False)

print("\nsubmission.csv saved!")
print(f"Mean prediction: {final_pred.mean():.5f}")

print("\nFirst few predictions:")
sub.head()

## Summary

**V15 Stacking Architecture:**
- **Base Models**: XGB (d=6), LGBM (d=7), CatBoost (d=7), each with up to 5000 boosting rounds and early stopping after 250 rounds
- **Feature Set**: 75 features (24 base + 3 medical + 48 external)
- **OOF Collection**: 10-fold stratified CV for stack training
- **Meta-Learner**: Logistic Regression learns optimal base model weights
- **Calibration**: Probability clipping to [0.01, 0.99]
- **Expected Performance**: ~0.731 CV AUC

V15 is a stacking approach that learns the ensemble weights from OOF predictions rather than relying on manually chosen blend weights. The Logistic Regression meta-learner fits a linear combination of the base model predictions; it does not model interactions between them.