# Playground Series S5E9 — Fast, Efficient Tri‑Blend (LGBM + XGB + CatBoost)

This notebook implements a **fast, competitive, end‑to‑end baseline** for the Kaggle Playground Series S5E9 (Predicting the Beats‑per‑Minute of Songs).  
It follows a proven recipe that’s been consistently strong on tabular regression: **5‑fold CV**, **LightGBM / XGBoost / CatBoost**, and a small **weighted blend** tuned on OOF predictions.

**Highlights**
- Minimal preprocessing (label‑encode object cols, median impute numeric).
- Strong defaults with early stopping (fast) + 5‑fold CV (reliable).
- A small grid search over blend weights, which usually beats any single model.
- Produces **OOF CV metrics**, **feature importances**, and **submission.csv**.

> Tip: For a possible extra ~0.005–0.01 RMSE improvement, train each model with 2–3 different seeds (add them to `SEEDS`) and average each model's predictions across seeds before blending. This notebook uses a single seed for speed.
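
The seed-averaging idea in the tip can be sketched as follows. This is a minimal illustration, not part of the notebook's pipeline: `average_over_seeds` and `toy_fit_predict` are hypothetical names, and the toy "model" is just the true target plus seed-dependent noise, standing in for a full CV run of one model.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def average_over_seeds(fit_predict, seeds):
    """Average the (oof, test_pred) arrays returned by fit_predict(seed) across seeds."""
    oof_runs, test_runs = [], []
    for seed in seeds:
        oof, test_pred = fit_predict(seed)
        oof_runs.append(np.asarray(oof, dtype=float))
        test_runs.append(np.asarray(test_pred, dtype=float))
    return np.mean(oof_runs, axis=0), np.mean(test_runs, axis=0)

# Toy stand-in for one model's CV run (assumption): truth plus seed-dependent noise.
y_toy = np.linspace(60.0, 180.0, 1000)
test_truth = np.linspace(60.0, 180.0, 200)

def toy_fit_predict(seed):
    rng = np.random.default_rng(seed)
    oof = y_toy + rng.normal(0.0, 1.0, y_toy.size)
    test_pred = test_truth + rng.normal(0.0, 1.0, test_truth.size)
    return oof, test_pred

seeds = [42, 1337, 2025]
single_rmses = [rmse(y_toy, toy_fit_predict(s)[0]) for s in seeds]
oof_avg, test_avg = average_over_seeds(toy_fit_predict, seeds)
avg_rmse = rmse(y_toy, oof_avg)
print("single-seed RMSEs:", single_rmses)
print("seed-averaged RMSE:", avg_rmse)
```

In a real run, `fit_predict(seed)` would wrap the 5‑fold loop for one model with that seed; the averaged arrays then replace the single-seed OOF and test predictions before blending. The toy noise is independent across seeds, so the gain here is much larger than what correlated real models would show.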

## 1. Setup & Configuration

In [6]:
# =========================
# Configuration & Imports
# =========================
import os, math, gc, warnings, sys
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Tree models
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor, Pool

# Plotting
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

# Kaggle input path
INPUT_DIR = Path("/kaggle/input/playground-series-s5e9")
assert INPUT_DIR.exists(), "Expected Kaggle dataset path to exist: /kaggle/input/playground-series-s5e9"

OUTPUT_DIR = Path("/kaggle/working")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Repro
SEED = 42
np.random.seed(SEED)

# Cross-validation
N_SPLITS = 5
SHUFFLE = True

# Early stopping patience
EARLY_STOPPING = 400

# If you want to try seed ensembling later, add more seeds here and loop.
SEEDS = [SEED]

## 2. Load Data

In [5]:
# =========================
# Load train / test / sample
# =========================
train_path = INPUT_DIR / "train.csv"
test_path  = INPUT_DIR / "test.csv"
sub_path   = INPUT_DIR / "sample_submission.csv"

train = pd.read_csv(train_path)
test  = pd.read_csv(test_path)
sample_submission = pd.read_csv(sub_path)

print("Train shape:", train.shape)
print("Test  shape:", test.shape)
display(train.head(3))
display(test.head(3))
display(sample_submission.head(3))

# Identify ID and TARGET
# Most likely: id / bpm. We'll infer TARGET from sample_submission (2nd column).
ID_COL = "id" if "id" in train.columns else train.columns[0]
TARGET = sample_submission.columns[1] if len(sample_submission.columns) >= 2 else "bpm"
assert TARGET in train.columns, f"Could not find target '{TARGET}' in train columns {train.columns.tolist()}"
print(f"Using ID_COL='{ID_COL}', TARGET='{TARGET}'")

## 3. Minimal Preprocessing

In [None]:
# =========================
# Minimal preprocessing
# - Label-encode object cols (fit on combined train+test)
# - Fill numeric NAs with median; object NAs to 'NA'
# =========================
from sklearn.preprocessing import LabelEncoder

# Separate feature columns
features = [c for c in train.columns if c not in [ID_COL, TARGET]]
X = train[features].copy()
X_test = test[features].copy()

# Detect dtypes
obj_cols = X.select_dtypes(include=["object"]).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# Label-encode object columns using combined train+test categories.
# Fill NAs before casting to str: astype(str) can turn NaN into the literal "nan",
# after which fillna("NA") has nothing left to fill.
for col in obj_cols:
    le = LabelEncoder()
    tr_col = X[col].fillna("NA").astype(str)
    te_col = X_test[col].fillna("NA").astype(str)
    le.fit(pd.concat([tr_col, te_col], axis=0))
    X[col] = le.transform(tr_col)
    X_test[col] = le.transform(te_col)

# Fill numeric NAs with median
for col in num_cols:
    med = X[col].median()
    X[col] = X[col].fillna(med)
    X_test[col] = X_test[col].fillna(med)

y = train[TARGET].values

print(f"Features: {len(features)} | Numeric: {len(num_cols)} | Object-encoded: {len(obj_cols)}")
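
The two preprocessing rules (integer codes fitted on train+test categories, train-median imputation) can be checked on a toy frame. This is a sketch under stated assumptions: the column names `genre` and `energy` are made up, and a plain sorted-category mapping stands in for `LabelEncoder`, which also assigns codes in sorted order.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"genre": ["rock", None, "jazz"], "energy": [0.2, np.nan, 0.8]})
toy_test = pd.DataFrame({"genre": ["pop", "rock"], "energy": [np.nan, 0.5]})

# Fill missing categories first, then cast to str
train_genre = toy_train["genre"].fillna("NA").astype(str)
test_genre = toy_test["genre"].fillna("NA").astype(str)

# Sorted union of train+test categories -> integer codes
categories = sorted(set(train_genre) | set(test_genre))
code_of = {c: i for i, c in enumerate(categories)}
toy_train["genre"] = train_genre.map(code_of)
toy_test["genre"] = test_genre.map(code_of)

# Median is computed on train only and reused for test
med = toy_train["energy"].median()
toy_train["energy"] = toy_train["energy"].fillna(med)
toy_test["energy"] = toy_test["energy"].fillna(med)

print(code_of)
print(toy_train)
print(toy_test)
```

Fitting the codes on the combined categories means a category that appears only in test (`"pop"` here) still receives a code instead of raising at transform time.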

## 4. CV Helper & Metric

In [None]:
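def rmse(y_true, y_pred):
    # `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases,
    # so take the square root of the MSE explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))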

kf = KFold(n_splits=N_SPLITS, shuffle=SHUFFLE, random_state=SEED)

## 5. Model Configurations (Regularized & Fast)

In [None]:
# =========================
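# Base parameters chosen for speed and stability.
# n_estimators / iterations is only an upper bound here: early stopping picks the tree count.
# If you disable early stopping, tune it explicitly instead.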
# =========================
lgb_params = dict(
    objective="regression", metric="rmse",
    learning_rate=0.03, num_leaves=64, max_depth=-1,
    min_data_in_leaf=40, feature_fraction=0.9, bagging_fraction=0.9,
    bagging_freq=1, reg_lambda=2.0, reg_alpha=0.2, n_estimators=4000,
    random_state=SEED
)

xgb_params = dict(
    objective="reg:squarederror", tree_method="hist",
    learning_rate=0.03, max_depth=7, min_child_weight=8,
    subsample=0.9, colsample_bytree=0.9, reg_lambda=2.0, reg_alpha=0.2,
    n_estimators=4000, random_state=SEED
)

cat_params = dict(
    loss_function="RMSE", depth=8, learning_rate=0.03,
    l2_leaf_reg=6.0, iterations=4000, random_seed=SEED,
    od_type="Iter", od_wait=EARLY_STOPPING, verbose=0
)
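
All three configurations rely on an early-stopping patience (`EARLY_STOPPING`). The general idea can be sketched in plain Python: training stops once the validation metric has not improved for `patience` consecutive rounds, and the best round is kept. The helper below is hypothetical and only illustrates the rule; it is not the libraries' internal implementation.

```python
def best_iteration_with_patience(val_scores, patience):
    """Return (best_index, stop_index) for a lower-is-better validation metric."""
    best_idx, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            return best_idx, i  # no improvement for `patience` rounds: stop here
    return best_idx, len(val_scores) - 1  # ran through every round

scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
print(best_iteration_with_patience(scores, patience=3))   # (2, 5)
print(best_iteration_with_patience(scores, patience=10))  # (6, 6)
```

The second call shows the trade-off: a larger patience trains longer but can find a later, better round that a short patience would miss.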

## 6. Train with 5‑Fold CV & Collect OOF / Predictions

In [None]:
# Containers
oof_lgb = np.zeros(len(X), dtype=float)
oof_xgb = np.zeros(len(X), dtype=float)
oof_cat = np.zeros(len(X), dtype=float)

pred_lgb = np.zeros(len(X_test), dtype=float)
pred_xgb = np.zeros(len(X_test), dtype=float)
pred_cat = np.zeros(len(X_test), dtype=float)

# For importances
fi_lgb = np.zeros(len(features), dtype=float)
fi_xgb = np.zeros(len(features), dtype=float)
fi_cat = np.zeros(len(features), dtype=float)

for fold, (tr_idx, va_idx) in enumerate(kf.split(X, y), 1):
    X_tr, y_tr = X.iloc[tr_idx], y[tr_idx]
    X_va, y_va = X.iloc[va_idx], y[va_idx]

    print(f"\n=== Fold {fold}/{N_SPLITS} ===")

    # LightGBM
    lgbm = lgb.LGBMRegressor(**lgb_params)
    lgbm.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        callbacks=[lgb.early_stopping(EARLY_STOPPING), lgb.log_evaluation(0)]
    )
    oof_lgb[va_idx] = lgbm.predict(X_va, num_iteration=lgbm.best_iteration_)
    pred_lgb += lgbm.predict(X_test, num_iteration=lgbm.best_iteration_) / N_SPLITS
    try:
        fi_lgb += lgbm.feature_importances_ / N_SPLITS
    except Exception:
        pass

    # XGBoost
    # early_stopping_rounds goes to the constructor (supported since xgboost 1.6);
    # xgboost >= 2.0 no longer accepts it in fit().
    xgbm = xgb.XGBRegressor(**xgb_params, early_stopping_rounds=EARLY_STOPPING)
    xgbm.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        verbose=False
    )
    # best_iteration is 0-based, so the best model uses trees [0, best_iteration + 1)
    best_it = xgbm.best_iteration + 1
    oof_xgb[va_idx] = xgbm.predict(X_va, iteration_range=(0, best_it))
    pred_xgb += xgbm.predict(X_test, iteration_range=(0, best_it)) / N_SPLITS
    try:
        fi_xgb += xgbm.feature_importances_ / N_SPLITS
    except Exception:
        pass

    # CatBoost
    cat = CatBoostRegressor(**cat_params)
    cat.fit(Pool(X_tr, y_tr), eval_set=Pool(X_va, y_va), use_best_model=True)
    oof_cat[va_idx] = cat.predict(X_va)
    pred_cat += cat.predict(X_test) / N_SPLITS
    try:
        fi_cat += cat.get_feature_importance(Pool(X_va, y_va)) / N_SPLITS
    except Exception:
        pass

# CV metrics
rmse_lgb = rmse(y, oof_lgb)
rmse_xgb = rmse(y, oof_xgb)
rmse_cat = rmse(y, oof_cat)

print("\nOOF RMSEs:")
print(f"  LightGBM : {rmse_lgb:.6f}")
print(f"  XGBoost  : {rmse_xgb:.6f}")
print(f"  CatBoost : {rmse_cat:.6f}")
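
How the OOF arrays get filled can be seen in isolation with a NumPy-only sketch. The assumptions are loud here: the data is synthetic, the folds are built by hand rather than with `KFold`, and ordinary least squares stands in for the boosted models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_splits = 100, 5
X_toy = rng.normal(size=(n_rows, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n_rows)

# Shuffle row indices once, then cut them into n_splits validation folds
folds = np.array_split(rng.permutation(n_rows), n_splits)

oof = np.full(n_rows, np.nan)
times_predicted = np.zeros(n_rows, dtype=int)

for va_idx in folds:
    tr_idx = np.setdiff1d(np.arange(n_rows), va_idx)
    # Stand-in model: least squares fitted on the training rows only
    coef, *_ = np.linalg.lstsq(X_toy[tr_idx], y_toy[tr_idx], rcond=None)
    oof[va_idx] = X_toy[va_idx] @ coef
    times_predicted[va_idx] += 1

oof_rmse = float(np.sqrt(np.mean((y_toy - oof) ** 2)))
print("rows predicted exactly once:", bool((times_predicted == 1).all()))
print("OOF RMSE:", oof_rmse)
```

Each row is predicted exactly once, by a model that never saw it, which is why the RMSE over the full OOF array is a fair estimate to compare models and tune blend weights on.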

## 7. Simple Weighted Blend (Grid Search on OOF)

In [None]:
# Grid search weights (w1, w2, w3) for (LGB, XGB, CAT) s.t. w1 + w2 + w3 = 1
grid = np.linspace(0.0, 1.0, 21)  # 0.05 steps
best_rmse, best_w = 1e9, (1/3, 1/3, 1/3)

for w1 in grid:
    for w2 in grid:
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:  # tolerance: float round-off can make an exact 0 slightly negative
            continue
        w3 = max(w3, 0.0)
        oof_blend = w1 * oof_lgb + w2 * oof_xgb + w3 * oof_cat
        score = rmse(y, oof_blend)
        if score < best_rmse:
            best_rmse, best_w = score, (w1, w2, w3)

print(f"Best blend OOF RMSE: {best_rmse:.6f}  |  weights (LGB, XGB, CAT) = {best_w}")
w1, w2, w3 = best_w

# Final predictions
pred_blend = w1 * pred_lgb + w2 * pred_xgb + w3 * pred_cat
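
Once the coarse search has settled, the weights can be refined on a finer grid around the best point. The sketch below assumes a hypothetical helper `refine_blend_weights` and synthetic OOF arrays; the `radius` and `step` defaults are illustrative choices, not values from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def refine_blend_weights(y_true, oofs, center, radius=0.05, step=0.01):
    """Search 3-model blend weights within +/- radius of `center` on a finer grid."""
    oof_a, oof_b, oof_c = oofs
    offsets = np.arange(-radius, radius + step / 2, step)
    best_score, best_w = float("inf"), tuple(center)
    for d1 in offsets:
        for d2 in offsets:
            # Round to suppress float round-off before the non-negativity check
            w1 = round(float(center[0] + d1), 6)
            w2 = round(float(center[1] + d2), 6)
            w3 = round(1.0 - w1 - w2, 6)
            if min(w1, w2, w3) < 0.0:
                continue
            score = rmse(y_true, w1 * oof_a + w2 * oof_b + w3 * oof_c)
            if score < best_score:
                best_score, best_w = score, (w1, w2, w3)
    return best_score, best_w

# Synthetic stand-ins for the target and three models' OOF predictions (assumption)
rng = np.random.default_rng(0)
y_toy = rng.normal(120.0, 25.0, 500)
toy_oofs = [y_toy + rng.normal(0.0, s, y_toy.size) for s in (10.0, 12.0, 11.0)]

coarse_w = (0.35, 0.30, 0.35)  # pretend this came from the 0.05-step search
coarse_score = rmse(y_toy, sum(w * o for w, o in zip(coarse_w, toy_oofs)))
fine_score, fine_w = refine_blend_weights(y_toy, toy_oofs, coarse_w)
print(coarse_score, fine_score, fine_w)
```

Because the coarse optimum is itself on the fine grid, the refined score can only match or improve it on OOF. Very fine steps fit the OOF noise more closely, so small gains from refinement are best confirmed with a different seed or fold split.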

## 8. Create Submission

In [None]:
submission = sample_submission.copy()
submission[submission.columns[1]] = pred_blend  # set target column values
submission_path = OUTPUT_DIR / "submission.csv"
submission.to_csv(submission_path, index=False)
submission.head()

## 9. Feature Importances (Average Across Folds)

In [None]:
# Average normalized importances from the three models (when available)
importances = pd.DataFrame({
    "feature": features,
    "lgb": fi_lgb,
    "xgb": fi_xgb,
    "cat": fi_cat
})

# Normalize each column (avoid divide by zero)
for col in ["lgb", "xgb", "cat"]:
    s = importances[col].sum()
    importances[col] = importances[col] / s if s > 0 else importances[col]

importances["avg"] = importances[["lgb", "xgb", "cat"]].mean(axis=1)
imp_top = importances.sort_values("avg", ascending=False).head(30)

plt.figure(figsize=(8, 10))
plt.barh(imp_top["feature"][::-1], imp_top["avg"][::-1])
plt.title("Top 30 Features (avg importance)")
plt.tight_layout()
plt.show()

## 10. OOF Diagnostics

In [None]:
oof_df = pd.DataFrame({
    ID_COL: train[ID_COL],
    TARGET: y,
    "oof_lgb": oof_lgb,
    "oof_xgb": oof_xgb,
    "oof_cat": oof_cat,
    "oof_blend": w1*oof_lgb + w2*oof_xgb + w3*oof_cat,
})
oof_df["resid"] = oof_df["oof_blend"] - oof_df[TARGET]
display(oof_df.head())

plt.figure(figsize=(6,4))
plt.hist(oof_df["resid"], bins=50)
plt.title("OOF Residuals (Blend)")
plt.tight_layout()
plt.show()

## 11. Next Steps (Optional)

- **Seed diversity**: set `SEEDS = [42, 1337, 2025]` and average per‑model across seeds before blending.
- **Light feature engineering**: try a few pairwise interactions among the top features (add `<f_i> * <f_j>` columns), but keep the number of added features under ~30.
- **Tune grid granularity**: tighten the blend search around the best weights with finer steps (e.g., 0.01) once CV stabilizes.
- **Submission strategy**: keep submissions to a few strong variants validated by CV; avoid leaderboard overfitting.
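
The pairwise-interaction step from the list above can be sketched like this. `add_pairwise_products` is a hypothetical helper, and the column names `f1`, `f2`, `f3` are placeholders; in practice `cols` would be the top features from the importance table.

```python
from itertools import combinations

import pandas as pd

def add_pairwise_products(df, cols, max_new=30):
    """Return a copy of df with products of column pairs appended (at most max_new)."""
    out = df.copy()
    new_cols = []
    for a, b in combinations(cols, 2):
        if len(new_cols) >= max_new:
            break
        name = f"{a}_x_{b}"
        out[name] = out[a] * out[b]
        new_cols.append(name)
    return out, new_cols

toy = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0], "f3": [5.0, 6.0]})
toy_fe, added = add_pairwise_products(toy, ["f1", "f2", "f3"])
print(added)
print(toy_fe)
```

The same call, with the same `cols`, has to be applied to both `X` and `X_test` so the two frames keep identical columns, and `features` should be extended with the returned names before retraining.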