# Learning-to-Rank Model for E-commerce Search

This notebook implements a Learning-to-Rank solution for ranking products in an e-commerce search engine.

## Tasks
1. **Part 1: Data analysis** - Calculate key data analysis metrics
2. **Part 2: Learning-to-rank model** - Build and evaluate the ranking model
3. **Part 3: Business summary** - Business analysis and recommendations

## Model features
- **Required features**:
  - `position_boost = 1 / min(position, 3)` (position is capped at 3 before inverting)
  - `log_price = log(price_pln + 1)`
  - `quality_price_ratio = quality_score / log_price`
  - `category_match = (category == user_preferred_category) ? 1 : 0`
  - Additional session-relative and interaction features

- **Advanced features**:
  - Session-relative features (price/quality rankings within session)
  - Smoothed CTR priors computed on the training split only (Bayesian m-estimate)
  - Position-bias control via buckets
  - Category interactions
  - Robust evaluation over repeated session-level splits (mean ± std)
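
As a quick sanity check of the required-feature formulas above, the sketch below evaluates them for a single impression in plain Python. The helper name `required_features` is illustrative and not part of the notebook, and it assumes `price_pln > 0` so that `log_price` is non-zero. The inputs are the values of the first row shown in the dataset preview further down.

```python
import math


def required_features(position, price_pln, quality_score, category, user_preferred_category):
    """Illustrative helper (not from the notebook): required features for one impression."""
    log_price = math.log(price_pln + 1.0)  # assumes price_pln > 0, so log_price > 0
    return {
        "position_boost": 1.0 / min(position, 3),  # position capped at 3 before inverting
        "log_price": log_price,
        "quality_price_ratio": quality_score / log_price,
        "category_match": int(category == user_preferred_category),
    }


example = required_features(1, 40.05, 0.363, "Elektronika", "Elektronika")
print({name: round(value, 6) for name, value in example.items()})
```

For this row, `log_price` and `quality_price_ratio` should agree with the values shown for `prod_0_1` in the feature preview of Part 2 (3.714791 and 0.097717).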

## Outputs
- `results.json` - Complete analysis results
- `predictions.csv` - Model predictions on the test split
- `solution_summary.md` - Solution summary

**Expected input:** `search_sessions.csv` in the same directory.


In [1]:
# pip install pandas numpy scikit-learn lightgbm matplotlib

import os
import json
import logging
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from lightgbm import LGBMRanker, early_stopping, log_evaluation

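# Quiet Python-side logging. As the training output below shows, this does not silence
# LightGBM's native [Info] messages; those are governed by the model's `verbosity` parameter.
logging.basicConfig(level=logging.ERROR)
logging.getLogger('lightgbm').setLevel(logging.ERROR)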

pd.set_option('display.max_columns', 400)
pd.set_option('display.width', 200)

DATA_PATH = "search_sessions.csv"
assert os.path.exists(DATA_PATH), f"Missing {DATA_PATH}. Put it next to this notebook."

df = pd.read_csv(DATA_PATH)

# Basic coercions
df["clicked"] = df["clicked"].astype(int)
df["position"] = df["position"].astype(int)
df["price_pln"] = df["price_pln"].astype(float)
df["quality_score"] = df["quality_score"].astype(float)

print(f"Dataset shape: {df.shape}")
print(f"Sessions: {df['session_id'].nunique()}")
print(f"Products: {df['product_id'].nunique()}")
df.head()

Dataset shape: (67651, 8)
Sessions: 8000
Products: 67651


| | session_id | product_id | position | clicked | price_pln | category | quality_score | user_preferred_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | prod_0_1 | 1 | 1 | 40.05 | Elektronika | 0.363 | Elektronika |
| 1 | 0 | prod_0_2 | 2 | 0 | 249.63 | Elektronika | 0.547 | Elektronika |
| 2 | 0 | prod_0_3 | 3 | 0 | 679.19 | Elektronika | 0.696 | Elektronika |
| 3 | 0 | prod_0_4 | 4 | 0 | 493.38 | Elektronika | 0.1 | Elektronika |
| 4 | 0 | prod_0_5 | 5 | 0 | 2982.39 | Odziez | 0.384 | Elektronika |


## Part 1: Data analysis

Calculate the 5 key data analysis metrics required by the instructions.


In [2]:
# Part 1: Data Analysis
# Calculate the 5 required metrics

# 1. overall_ctr: All clicks / all impressions
overall_ctr = df["clicked"].mean()

# 2. position_bias_ratio: CTR position 1 / CTR position 5
ctr_by_position = df.groupby("position")["clicked"].mean()
if 5 in ctr_by_position.index and ctr_by_position[5] > 0:
    position_bias_ratio = ctr_by_position[1] / ctr_by_position[5]
else:
    position_bias_ratio = 0.0  # fallback when position 5 is absent or never clicked

# 3. electronics_ctr: CTR for category "Elektronika"
electronics_ctr = df[df["category"] == "Elektronika"]["clicked"].mean()

# 4. quality_correlation: Correlation between quality_score and clicked
quality_correlation = df["quality_score"].corr(df["clicked"])

# 5. best_category: Category with highest CTR
ctr_by_category = df.groupby("category")["clicked"].mean()
best_category = ctr_by_category.idxmax()

# Store results
data_analysis = {
    "overall_ctr": round(float(overall_ctr), 4),
    "position_bias_ratio": round(float(position_bias_ratio), 2),
    "electronics_ctr": round(float(electronics_ctr), 4),
    "quality_correlation": round(float(quality_correlation), 4),
    "best_category": str(best_category)
}

print("Data Analysis Results:")
print(json.dumps(data_analysis, indent=2))
print(f"\nCTR by position:\n{ctr_by_position}")
print(f"\nCTR by category:\n{ctr_by_category}")


Data Analysis Results:
{
  "overall_ctr": 0.0767,
  "position_bias_ratio": 3.42,
  "electronics_ctr": 0.1219,
  "quality_correlation": 0.1198,
  "best_category": "Elektronika"
}

CTR by position:
position
1     0.198625
2     0.111000
3     0.081500
4     0.067375
5     0.058000
6     0.045318
7     0.042628
8     0.034455
9     0.037443
10    0.029592
11    0.027211
12    0.032191
Name: clicked, dtype: float64

CTR by category:
category
Elektronika    0.121921
Ksiazki        0.036991
Odziez         0.070694
Name: clicked, dtype: float64
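
The reported `position_bias_ratio` can be reproduced by hand from the per-position CTR table above. This minimal check hard-codes the two printed values (rounded to six decimals) rather than reading the dataset.

```python
# CTR values copied from the "CTR by position" output (positions 1 and 5)
ctr_position_1 = 0.198625
ctr_position_5 = 0.058000

position_bias_ratio = ctr_position_1 / ctr_position_5
print(round(position_bias_ratio, 2))
```

A ratio of about 3.4 means the top slot is clicked roughly 3.4 times as often as the fifth slot, without controlling for which products are shown in each.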


## Part 2: Learning-to-rank model

### Utilities and helper functions

In [3]:
EPS = 1e-6

# --- Metrics ---

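def ndcg_at_k(y_true, y_score, group_ids, k=5, ignore_no_positive=True):
    """Mean NDCG@k over groups (sessions).

    - ignore_no_positive=True: skip sessions with no positive labels (IDCG=0),
      a common convention when reporting LTR metrics.
    - ignore_no_positive=False: include such sessions with a score of 0.0.

    Libraries differ on this edge case: LightGBM's built-in NDCG, for instance,
    scores sessions without positives as 1.0, so its logged values are not
    directly comparable with either variant here.
    """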
    gdf = pd.DataFrame({"g": group_ids, "y": y_true, "s": y_score})
    ndcgs = []
    for _, part in gdf.groupby("g", sort=False):
        part = part.sort_values("s", ascending=False)
        rel = part["y"].to_numpy()[:k]

        discounts = np.log2(np.arange(2, rel.size + 2))
        dcg = ((2**rel - 1) / discounts).sum()

        ideal = np.sort(part["y"].to_numpy())[::-1][:k]
        idcg = ((2**ideal - 1) / discounts[: ideal.size]).sum()

        if idcg == 0:
            if ignore_no_positive:
                continue
            ndcgs.append(0.0)
        else:
            ndcgs.append(float(dcg / idcg))

    return float(np.mean(ndcgs)) if ndcgs else 0.0


# --- Splits & ordering ---

def finalize_order(df):
    """Ensure stable ordering/contiguity by session_id before group size computation."""
    return df.sort_values(["session_id", "position", "product_id"]).reset_index(drop=True)


def group_sizes(df, group_col="session_id"):
    # Requires df already sorted by group_col
    return df.groupby(group_col, sort=False).size().to_numpy()


def split_by_session(df, seed=42, test_size=0.2):
    """Split by session_id (no leakage). Returns (train_df, test_df)."""
    sessions = df["session_id"].unique()
    train_sess, test_sess = train_test_split(sessions, test_size=test_size, random_state=seed, shuffle=True)
    train_df = df[df["session_id"].isin(train_sess)].copy()
    test_df = df[df["session_id"].isin(test_sess)].copy()
    return finalize_order(train_df), finalize_order(test_df)


def split_train_val_from_train(train_full_df, seed=42, val_fraction=0.1):
    """Create an internal validation split from the train portion only."""
    tr, va = split_by_session(train_full_df, seed=seed, test_size=val_fraction)
    return tr, va


# --- Feature utilities ---

def m_estimate_ctr(train: pd.DataFrame, key_cols, label_col="clicked", m=50.0):
    """Smoothed CTR per key group: (clicks + m * prior) / (impressions + m).

    `prior` is the global mean of `label_col` in `train`. Returns (mapping frame, prior).
    """
    prior = float(train[label_col].mean())
    agg = train.groupby(key_cols, observed=False)[label_col].agg(["sum", "count"]).reset_index()
    agg["ctr_prior"] = (agg["sum"] + m * prior) / (agg["count"] + m)
    return agg[key_cols + ["ctr_prior"]], float(prior)


def clip_series(s, lo=-3.0, hi=3.0):
    return s.clip(lower=lo, upper=hi)


def zscore_in_group(x: pd.Series) -> pd.Series:
    mu = x.mean()
    std = x.std(ddof=0)
    if (not np.isfinite(std)) or std <= 0:
        return pd.Series(np.zeros(len(x), dtype=float), index=x.index)
    out = (x - mu) / std
    return out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
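
To make the metric concrete, the following sketch computes NDCG@k for a single session in plain Python, using the same gain (`2**rel - 1`) and `log2` discount as `ndcg_at_k`. The function name `ndcg_single` is illustrative, and it expects the labels already ordered from highest to lowest model score.

```python
import math


def ndcg_single(labels_in_ranked_order, k=5):
    """NDCG@k for one session; labels are listed from highest to lowest model score."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))

    idcg = dcg(sorted(labels_in_ranked_order, reverse=True))
    if idcg == 0:
        return None  # no clicked item: the caller decides whether to skip or count as 0.0
    return dcg(labels_in_ranked_order) / idcg


print(ndcg_single([0, 1, 0, 0, 0]))  # single click ranked second
print(ndcg_single([0, 0, 0]))        # session without clicks
```

With one click ranked second, the score is `1 / log2(3)`, roughly 0.63. With binary click labels and mostly one click per session, NDCG@5 is therefore driven almost entirely by how high the clicked item is placed.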


### Feature engineering (v4)
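
The session-relative features in this section compare each product with the other products shown in the same session. The toy example below uses made-up prices and the same `groupby` pattern as `session_relative_features` to show how the rank, the percentile and the distance from the session median behave, including for tied prices.

```python
import pandas as pd

# Hypothetical toy data: two sessions, the second with tied prices
toy = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "price_pln": [10.0, 30.0, 20.0, 5.0, 5.0],
})

by_session = toy.groupby("session_id")["price_pln"]
toy["session_size"] = by_session.transform("size")
toy["price_rank_in_session"] = by_session.rank(method="average", ascending=True)  # 1 = cheapest
toy["price_pct_in_session"] = toy["price_rank_in_session"] / toy["session_size"]
toy["price_minus_session_median"] = toy["price_pln"] - by_session.transform("median")

print(toy)
```

A percentile of 1.0 marks the most expensive product in a session, and tied prices share the average of their ranks, so neither tied row in the second session is treated as strictly cheaper than the other.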

In [4]:
def add_position_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
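    # Open-ended last bin: positions in this dataset run up to 12, and a closed
    # upper edge of 10 would leave positions 11 and 12 without a bucket (NaN).
    out["position_bucket"] = pd.cut(
        out["position"],
        bins=[0, 3, 6, np.inf],
        labels=["top3", "mid46", "bot710"],  # "bot710" now covers positions 7 and above
        include_lowest=True
    )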
    pb = pd.get_dummies(out["position_bucket"], prefix="posb")
    out = pd.concat([out, pb], axis=1)
    out["pos_boost_clipped3"] = 1.0 / out["position"].clip(upper=3).astype(float)
    return out

def session_relative_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
    out["session_size"] = out.groupby("session_id")["product_id"].transform("size")

    # price: lower is "better" (rank 1 = cheapest)
    out["price_rank_in_session"] = out.groupby("session_id")["price_pln"].rank(method="average", ascending=True)
    out["price_pct_in_session"] = out["price_rank_in_session"] / out["session_size"].replace(0, 1)
    out["price_min_in_session"] = out.groupby("session_id")["price_pln"].transform("min")
    out["price_max_in_session"] = out.groupby("session_id")["price_pln"].transform("max")
    out["is_cheapest_in_session"] = (out["price_pln"] == out["price_min_in_session"]).astype(int)
    out["is_most_expensive_in_session"] = (out["price_pln"] == out["price_max_in_session"]).astype(int)

    out["price_median_in_session"] = out.groupby("session_id")["price_pln"].transform("median")
    out["price_minus_session_median"] = out["price_pln"] - out["price_median_in_session"]

    out["price_z_in_session"] = out.groupby("session_id")["price_pln"].transform(zscore_in_group).astype(float)
    out["price_z_in_session_clipped"] = clip_series(out["price_z_in_session"], -3, 3)

    # quality: higher is "better" (rank 1 = highest quality)
    out["quality_rank_in_session"] = out.groupby("session_id")["quality_score"].rank(method="average", ascending=False)
    out["quality_pct_in_session"] = out["quality_rank_in_session"] / out["session_size"].replace(0, 1)
    out["quality_min_in_session"] = out.groupby("session_id")["quality_score"].transform("min")
    out["quality_max_in_session"] = out.groupby("session_id")["quality_score"].transform("max")
    out["is_best_quality_in_session"] = (out["quality_score"] == out["quality_max_in_session"]).astype(int)
    out["is_worst_quality_in_session"] = (out["quality_score"] == out["quality_min_in_session"]).astype(int)

    out["quality_mean_in_session"] = out.groupby("session_id")["quality_score"].transform("mean")
    out["quality_minus_session_mean"] = out["quality_score"] - out["quality_mean_in_session"]

    out["quality_z_in_session"] = out.groupby("session_id")["quality_score"].transform(zscore_in_group).astype(float)

    out["category_freq_in_session"] = out.groupby(["session_id", "category"])["product_id"].transform("size")

    counts = out.groupby(["session_id", "category"], observed=False).size().reset_index(name="cnt")
    counts["max_cnt"] = counts.groupby("session_id")["cnt"].transform("max")
    counts["is_majority_category"] = (counts["cnt"] == counts["max_cnt"]).astype(int)
    out = out.merge(counts[["session_id", "category", "is_majority_category"]], on=["session_id", "category"], how="left")
    out["is_majority_category"] = out["is_majority_category"].fillna(0).astype(int)

    return out

def add_core_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
    out["log_price"] = np.log(out["price_pln"] + 1.0)
    out["category_match"] = (out["category"] == out["user_preferred_category"]).astype(int)
    out["quality_price_ratio"] = out["quality_score"] / out["log_price"].replace(0.0, EPS)
    return out

def one_hot_categories(train: pd.DataFrame, test: pd.DataFrame):
    # Drop one level to avoid structural multicollinearity (one-hot sums to 1)
    # This also removes several exact linear dependencies with interaction features.
    cat_train = pd.get_dummies(train["category"], prefix="cat", drop_first=True)
    pref_train = pd.get_dummies(train["user_preferred_category"], prefix="pref", drop_first=True)

    cat_test = pd.get_dummies(test["category"], prefix="cat").reindex(columns=cat_train.columns, fill_value=0)
    pref_test = pd.get_dummies(test["user_preferred_category"], prefix="pref").reindex(columns=pref_train.columns, fill_value=0)

    train2 = pd.concat([train, cat_train, pref_train], axis=1)
    test2  = pd.concat([test, cat_test, pref_test], axis=1)
    return train2, test2, list(cat_train.columns), list(pref_train.columns)

def add_ctr_priors(train: pd.DataFrame, test: pd.DataFrame, m=50.0):
    global_ctr = train["clicked"].mean()

    map_cat, _ = m_estimate_ctr(train, ["category"], m=m)
    train = train.merge(map_cat.rename(columns={"ctr_prior":"ctr_prior_category"}), on=["category"], how="left")
    test  = test.merge(map_cat.rename(columns={"ctr_prior":"ctr_prior_category"}), on=["category"], how="left")

    map_cxp, _ = m_estimate_ctr(train, ["category","user_preferred_category"], m=m)
    train = train.merge(map_cxp.rename(columns={"ctr_prior":"ctr_prior_cat_x_pref"}), on=["category","user_preferred_category"], how="left")
    test  = test.merge(map_cxp.rename(columns={"ctr_prior":"ctr_prior_cat_x_pref"}), on=["category","user_preferred_category"], how="left")

    map_pref, _ = m_estimate_ctr(train, ["user_preferred_category"], m=m)
    train = train.merge(map_pref.rename(columns={"ctr_prior":"ctr_prior_pref"}), on=["user_preferred_category"], how="left")
    test  = test.merge(map_pref.rename(columns={"ctr_prior":"ctr_prior_pref"}), on=["user_preferred_category"], how="left")

    map_cpb, _ = m_estimate_ctr(train, ["category","position_bucket"], m=m)
    train = train.merge(map_cpb.rename(columns={"ctr_prior":"ctr_prior_cat_x_posb"}), on=["category","position_bucket"], how="left")
    test  = test.merge(map_cpb.rename(columns={"ctr_prior":"ctr_prior_cat_x_posb"}), on=["category","position_bucket"], how="left")

    for d in (train, test):
        d["ctr_prior_category"] = d["ctr_prior_category"].fillna(global_ctr)
        d["ctr_prior_cat_x_pref"] = d["ctr_prior_cat_x_pref"].fillna(global_ctr)
        d["ctr_prior_pref"] = d["ctr_prior_pref"].fillna(global_ctr)
        d["ctr_prior_cat_x_posb"] = d["ctr_prior_cat_x_posb"].fillna(global_ctr)

    return train, test, float(global_ctr)

def add_interactions(d: pd.DataFrame, cat_cols) -> pd.DataFrame:
    out = d.copy()

    out["match_x_posb_top3"] = out["category_match"] * out.get("posb_top3", 0)
    out["match_x_posb_mid46"] = out["category_match"] * out.get("posb_mid46", 0)
    out["match_x_posb_bot710"] = out["category_match"] * out.get("posb_bot710", 0)

    out["quality_rank_x_match"] = out["quality_rank_in_session"] * out["category_match"]
    out["price_rank_x_match"] = out["price_rank_in_session"] * out["category_match"]
    out["quality_pct_x_match"] = out["quality_pct_in_session"] * out["category_match"]
    out["price_pct_x_match"] = out["price_pct_in_session"] * out["category_match"]

    out["quality_x_log_price"] = out["quality_score"] * out["log_price"]

    for c in cat_cols:
        out[f"{c}_x_quality"] = out[c] * out["quality_score"]
        out[f"{c}_x_log_price"] = out[c] * out["log_price"]

    return out

def build_features_v4(train_raw: pd.DataFrame, test_raw: pd.DataFrame, m=50.0, use_ipw=False):
    train = add_core_features(train_raw)
    test  = add_core_features(test_raw)

    train = add_position_features(train)
    test  = add_position_features(test)

    train = session_relative_features(train)
    test  = session_relative_features(test)

    train, test, cat_cols, pref_cols = one_hot_categories(train, test)

    train, test, global_ctr = add_ctr_priors(train, test, m=m)

    train = add_interactions(train, cat_cols)
    test  = add_interactions(test, cat_cols)

    if use_ipw:
        # Bucket-level CTR is used here as a rough proxy for examination propensity.
        prop = train.groupby("position_bucket", observed=False)["clicked"].mean().to_dict()
        prop = {str(k): v for k, v in prop.items()}  # Convert dict keys to strings
        p_train = train["position_bucket"].astype(str).map(prop).astype(float).fillna(train["clicked"].mean()).clip(0.01, 0.99)
        p_test  = test["position_bucket"].astype(str).map(prop).astype(float).fillna(train["clicked"].mean()).clip(0.01, 0.99)
        train["ipw"] = 1.0 / p_train
        test["ipw"] = 1.0 / p_test
    else:
        train["ipw"] = 1.0
        test["ipw"] = 1.0

    base = [
        "pos_boost_clipped3","posb_top3","posb_mid46","posb_bot710",
        "log_price","quality_price_ratio","category_match",
        "price_rank_in_session","quality_rank_in_session",
        "price_pct_in_session","quality_pct_in_session",
        "price_minus_session_median","quality_minus_session_mean",
        "price_z_in_session_clipped","quality_z_in_session",
        "is_cheapest_in_session","is_most_expensive_in_session",
        "is_best_quality_in_session","is_worst_quality_in_session",
        "session_size","category_freq_in_session","is_majority_category",
        "ctr_prior_category","ctr_prior_cat_x_pref","ctr_prior_pref","ctr_prior_cat_x_posb",
        "match_x_posb_top3","match_x_posb_mid46","match_x_posb_bot710",
        "quality_rank_x_match","price_rank_x_match","quality_pct_x_match","price_pct_x_match",
    ]

    extra = [c for c in train.columns if c.startswith("cat_") or c.startswith("pref_") or c.endswith("_x_quality") or c.endswith("_x_log_price")]
    feature_cols = base + extra

    for c in feature_cols:
        if c not in train.columns:
            train[c] = 0
        if c not in test.columns:
            test[c] = 0

    # Critical: enforce contiguity by session_id before group size computation
    train = finalize_order(train)
    test  = finalize_order(test)

    return train, test, feature_cols, global_ctr

# 80/20 split by session_id (as required). Validation is taken from the 80% train portion only.
train_full_raw, test_raw = split_by_session(df, seed=42)
train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=42, val_fraction=0.1)

# Fit features on train only; apply to val/test (priors computed on train only)
train_fe, val_fe, feature_cols, global_ctr = build_features_v4(train_raw, val_raw, m=50.0, use_ipw=False)
_, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=50.0, use_ipw=False)

print("Feature count:", len(feature_cols))
train_fe[["session_id","product_id","clicked"] + feature_cols[:15]].head()

Feature count: 42


| | session_id | product_id | clicked | pos_boost_clipped3 | posb_top3 | posb_mid46 | posb_bot710 | log_price | quality_price_ratio | category_match | price_rank_in_session | quality_rank_in_session | price_pct_in_session | quality_pct_in_session | price_minus_session_median | quality_minus_session_mean | price_z_in_session_clipped | quality_z_in_session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | prod_0_1 | 1 | 1.0 | True | False | False | 3.714791 | 0.097717 | 1 | 1.0 | 9.0 | 0.090909 | 0.818182 | -453.33 | -0.146364 | -0.697457 | -0.636561 |
| 1 | 0 | prod_0_2 | 0 | 0.5 | True | False | False | 5.523978 | 0.099023 | 1 | 5.0 | 6.0 | 0.454545 | 0.545455 | -243.75 | 0.037636 | -0.431507 | 0.163687 |
| 2 | 0 | prod_0_3 | 0 | 0.333333 | True | False | False | 6.522372 | 0.10671 | 1 | 10.0 | 3.0 | 0.909091 | 0.272727 | 185.81 | 0.186636 | 0.11359 | 0.811715 |
| 3 | 0 | prod_0_4 | 0 | 0.333333 | False | True | False | 6.203304 | 0.01612 | 1 | 6.0 | 11.0 | 0.545455 | 1.0 | 0.0 | -0.409364 | -0.122197 | -1.780395 |
| 4 | 0 | prod_0_5 | 0 | 0.333333 | False | True | False | 8.000816 | 0.047995 | 0 | 11.0 | 8.0 | 1.0 | 0.727273 | 2489.01 | -0.125364 | 3.0 | -0.545229 |
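
The smoothed CTR priors follow the m-estimate formula used in `m_estimate_ctr`: `(clicks + m * prior) / (impressions + m)`. The sketch below uses made-up click counts, with the overall CTR of 0.0767 from Part 1 as a stand-in for the train-split prior, to show how `m = 50` pulls small groups towards the prior while leaving large groups almost unchanged. The scalar helper `m_estimate` is illustrative and not part of the notebook.

```python
def m_estimate(clicks, impressions, prior, m=50.0):
    """Shrink a raw CTR towards `prior` with strength `m` (illustrative scalar helper)."""
    return (clicks + m * prior) / (impressions + m)


prior = 0.0767  # overall CTR reported in Part 1, used here as a stand-in prior

small_group = m_estimate(clicks=2, impressions=10, prior=prior)       # raw CTR 0.20
large_group = m_estimate(clicks=200, impressions=1000, prior=prior)   # raw CTR 0.20
unseen_group = m_estimate(clicks=0, impressions=0, prior=prior)       # no data at all

print(round(small_group, 4), round(large_group, 4), round(unseen_group, 4))
```

Both groups have a raw CTR of 0.20, but the ten-impression group is shrunk to just under 0.10, whereas the thousand-impression group stays near 0.19. A group with no data falls back to the prior itself.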


### Train LightGBM ranker (stability-focused)
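
`LGBMRanker` receives sessions through its `group` argument as a list of consecutive row counts, so all rows of a session must sit next to each other in the feature frame. This is what `finalize_order` guarantees before `group_sizes` is called. The sketch below uses two hypothetical helpers, `run_lengths` and `is_contiguous`, to show what goes wrong when the rows are not sorted by session.

```python
from itertools import groupby


def run_lengths(session_ids):
    """Lengths of consecutive runs of equal ids, i.e. what a `group` array describes."""
    return [len(list(run)) for _, run in groupby(session_ids)]


def is_contiguous(session_ids):
    """True when every session occupies exactly one consecutive run of rows."""
    return len(run_lengths(session_ids)) == len(set(session_ids))


sorted_ids = [0, 0, 1, 1, 1, 2]
shuffled_ids = [0, 1, 0, 1, 2, 1]

print(run_lengths(sorted_ids), is_contiguous(sorted_ids))
print(run_lengths(shuffled_ids), is_contiguous(shuffled_ids))
```

On the shuffled ids, per-session counts such as `[2, 3, 1]` would still sum to the number of rows, so the ranker would accept them while silently mixing rows from different sessions. A check along these lines before training is a cheap safeguard.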

In [5]:
# --- Automatic correlated-feature pruning (optional) ---
# Train a baseline model to get importances, drop one feature from each highly-correlated pair,
# then re-train.
AUTO_PRUNE_CORRELATED = True
PRUNE_CORR_METHOD = "spearman"   # "spearman" (monotonic), or "pearson" (linear)
PRUNE_CORR_THRESHOLD = 0.97
PRUNE_MIN_FEATURES = 20


def train_ranker(train_fe, val_fe, feature_cols, seed=42, use_weights=False):
    """Train on train_fe, early-stop on val_fe. Do not use the final test set for model selection."""
    X_train = train_fe[feature_cols]
    y_train = train_fe["clicked"].astype(int)
    X_val = val_fe[feature_cols]
    y_val = val_fe["clicked"].astype(int)

    train_group = group_sizes(train_fe, "session_id")
    val_group = group_sizes(val_fe, "session_id")

    model = LGBMRanker(
        objective="lambdarank",
        metric="ndcg",
        ndcg_eval_at=[5],
        n_estimators=8000,
        learning_rate=0.02,
        num_leaves=63,
        min_child_samples=200,  # scikit-learn API name for LightGBM's min_data_in_leaf
        subsample=0.85,  # note: row subsampling is inactive unless subsample_freq > 0
        colsample_bytree=0.85,
        reg_lambda=2.0,
        reg_alpha=0.5,
        random_state=seed,
    )

    fit_kwargs = dict(
        X=X_train,
        y=y_train,
        group=train_group,
        eval_set=[(X_val, y_val)],
        eval_group=[val_group],
        eval_at=[5],
        callbacks=[early_stopping(stopping_rounds=300), log_evaluation(period=300)],
    )

    if use_weights:
        fit_kwargs["sample_weight"] = train_fe["ipw"].astype(float)
        fit_kwargs["eval_sample_weight"] = [val_fe["ipw"].astype(float)]

    model.fit(**fit_kwargs)

    imp = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
    return model, imp


def _feature_priority(name: str, importance: float) -> tuple[int, float]:
    """Higher is better. Prefer base-like features over one-hots/interactions if importance is similar."""
    base_like = int(not (name.startswith("cat_") or name.startswith("pref_") or ("_x_" in name)))
    return (base_like, float(importance))


def prune_correlated_features(
    train_fe: pd.DataFrame,
    feature_cols: list[str],
    importance: pd.Series,
    method: str = "spearman",
    threshold: float = 0.97,
    min_features: int = 20,
):
    X = (
        train_fe[feature_cols]
        .apply(pd.to_numeric, errors="coerce")
        .replace([np.inf, -np.inf], np.nan)
        .fillna(0.0)
    )

    corr = X.corr(method=method).abs()
    cols = list(corr.columns)

    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            v = float(corr.iat[i, j])
            if np.isfinite(v) and v >= threshold:
                pairs.append((v, cols[i], cols[j]))

    pairs.sort(reverse=True)  # highest first

    keep = set(feature_cols)
    dropped_rows = []
    imp = importance.to_dict()

    for v, a, b in pairs:
        if len(keep) <= min_features:
            break
        if a not in keep or b not in keep:
            continue

        pa = _feature_priority(a, imp.get(a, 0.0))
        pb = _feature_priority(b, imp.get(b, 0.0))

        # Drop lower-priority; on tie drop b (stable)
        drop = a if pa < pb else b
        kept = b if drop == a else a

        keep.remove(drop)
        dropped_rows.append({"dropped": drop, "kept": kept, "abs_corr": float(v), "method": method})

    pruned_cols = [c for c in feature_cols if c in keep]
    dropped_df = (
        pd.DataFrame(dropped_rows).sort_values("abs_corr", ascending=False)
        if dropped_rows
        else pd.DataFrame(columns=["dropped", "kept", "abs_corr", "method"])
    )

    return pruned_cols, dropped_df


def train_ranker_with_pruning(
    train_fe,
    val_fe,
    feature_cols,
    seed=42,
    use_weights=False,
    prune_corr=AUTO_PRUNE_CORRELATED,
    prune_method=PRUNE_CORR_METHOD,
    prune_threshold=PRUNE_CORR_THRESHOLD,
    prune_min_features=PRUNE_MIN_FEATURES,
):
    # 1) baseline fit for importances
    model_full, imp_full = train_ranker(train_fe, val_fe, feature_cols, seed=seed, use_weights=use_weights)

    if not prune_corr:
        return model_full, imp_full, feature_cols, pd.DataFrame(columns=["dropped", "kept", "abs_corr", "method"])

    # 2) prune on TRAIN split only
    pruned_cols, dropped_df = prune_correlated_features(
        train_fe,
        feature_cols,
        imp_full,
        method=prune_method,
        threshold=prune_threshold,
        min_features=prune_min_features,
    )

    if len(pruned_cols) == len(feature_cols):
        return model_full, imp_full, feature_cols, dropped_df

    # 3) retrain on pruned feature set
    model, imp = train_ranker(train_fe, val_fe, pruned_cols, seed=seed, use_weights=use_weights)
    return model, imp, pruned_cols, dropped_df


def eval_ranker(model, fe, feature_cols, k=5):
    X = fe[feature_cols]
    y = fe["clicked"].astype(int).to_numpy()
    sess = fe["session_id"].to_numpy()

    pred = model.predict(X, num_iteration=model.best_iteration_)

    ndcg_excl_no_click = ndcg_at_k(y, pred, sess, k=k, ignore_no_positive=True)
    ndcg_all_sessions = ndcg_at_k(y, pred, sess, k=k, ignore_no_positive=False)

    return pred, float(ndcg_excl_no_click), float(ndcg_all_sessions)


model, imp, feature_cols, dropped_corr = train_ranker_with_pruning(
    train_fe, val_fe, feature_cols, seed=42, use_weights=False
)

if AUTO_PRUNE_CORRELATED:
    print(
        f"Auto-pruning: dropped {len(dropped_corr)} features "
        f"(abs {PRUNE_CORR_METHOD} corr >= {PRUNE_CORR_THRESHOLD}); kept {len(feature_cols)}"
    )
    if len(dropped_corr):
        display(dropped_corr.head(25))

pred_train, ndcg_train, ndcg_train_all = eval_ranker(model, train_fe, feature_cols, k=5)
pred_val, ndcg_val, ndcg_val_all = eval_ranker(model, val_fe, feature_cols, k=5)
pred, ndcg_test, ndcg_test_all = eval_ranker(model, test_fe, feature_cols, k=5)

print("NDCG@5 train (exclude no-click sessions):", ndcg_train)
print("NDCG@5 val   (exclude no-click sessions):", ndcg_val)
print("NDCG@5 test  (exclude no-click sessions):", ndcg_test)

print("NDCG@5 test (include all sessions, no-click => 0):", ndcg_test_all)

display(imp.head(25).to_frame("importance"))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002814 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3229
[LightGBM] [Info] Number of data points in the train set: 48698, number of used features: 42
Training until validation scores don't improve for 300 rounds




[300]	valid_0's ndcg@5: 0.833266
Early stopping, best iteration is:
[52]	valid_0's ndcg@5: 0.84183




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002612 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2968
[LightGBM] [Info] Number of data points in the train set: 48698, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.83286
Early stopping, best iteration is:
[24]	valid_0's ndcg@5: 0.840961
Auto-pruning: dropped 4 features (abs spearman corr >= 0.97); kept 38


| | dropped | kept | abs_corr | method |
|---|---|---|---|---|
| 0 | quality_z_in_session | quality_minus_session_mean | 0.986963 | spearman |
| 1 | is_majority_category | category_match | 0.980409 | spearman |
| 2 | cat_Odziez | cat_Odziez_x_quality | 0.973709 | spearman |
| 3 | cat_Ksiazki | cat_Ksiazki_x_quality | 0.97358 | spearman |




NDCG@5 train (exclude no-click sessions): 0.7234991122364082
NDCG@5 val   (exclude no-click sessions): 0.6618430258384693
NDCG@5 test  (exclude no-click sessions): 0.6221961635508589
NDCG@5 test (include all sessions, no-click => 0): 0.2935988146755616


| feature | importance |
|---|---|
| quality_minus_session_mean | 223 |
| quality_x_log_price | 134 |
| quality_price_ratio | 122 |
| price_z_in_session_clipped | 98 |
| log_price | 92 |
| quality_pct_in_session | 86 |
| category_freq_in_session | 66 |
| ctr_prior_cat_x_pref | 65 |
| price_minus_session_median | 64 |
| ctr_prior_cat_x_posb | 51 |
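
The NDCG@5 values in the LightGBM log (about 0.84) and those printed by `eval_ranker` (0.62 to 0.72) are on different scales because they treat sessions without clicks differently. The arithmetic below uses only the two test numbers printed above to recover the implied share of test sessions with at least one click. It then shows what the mean would be under the assumption that no-click sessions count as 1.0, which is how LightGBM's built-in NDCG is understood to behave.

```python
# Test-split values copied from the output above
ndcg_excluding_no_click = 0.6221961635508589
ndcg_all_sessions = 0.2935988146755616  # no-click sessions counted as 0.0

# mean over all sessions = share_with_click * mean over sessions with a click
share_with_click = ndcg_all_sessions / ndcg_excluding_no_click

# Assumption: no-click sessions contribute 1.0 instead of 0.0
ndcg_no_click_as_one = ndcg_all_sessions + (1.0 - share_with_click) * 1.0

print(round(share_with_click, 3), round(ndcg_no_click_as_one, 3))
```

Roughly 47% of test sessions contain a click, and the rescaled test figure of about 0.82 lands in the same range as the logged validation scores. This suggests that most of the gap between the log and the printed metrics comes from the reporting convention rather than from a modelling problem.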


### Robust evaluation: multi-split CV (mean ± std)

In [6]:
# Robust evaluation: multi-split CV (mean ± std)
# Train: session-based split, early stopping on validation split carved out of train.


def run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=False, use_weights=False, val_fraction=0.1):
    rows = []
    for seed in seeds:
        train_full_raw, test_raw = split_by_session(df, seed=seed)
        train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=seed, val_fraction=val_fraction)

        train_fe, val_fe, feats, _ = build_features_v4(train_raw, val_raw, m=m, use_ipw=use_ipw)
        _, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=m, use_ipw=use_ipw)

        model, _, feats_used, _ = train_ranker_with_pruning(
            train_fe, val_fe, feats, seed=seed, use_weights=use_weights
        )

        _, nd_train, _ = eval_ranker(model, train_fe, feats_used, k=5)
        _, nd_val, _ = eval_ranker(model, val_fe, feats_used, k=5)
        _, nd_test, nd_test_all_sessions = eval_ranker(model, test_fe, feats_used, k=5)

        rows.append(
            {
                "seed": seed,
                "ndcg@5_train": nd_train,
                "ndcg@5_val": nd_val,
                "ndcg@5_test": nd_test,
                "ndcg@5_test_all_sessions": nd_test_all_sessions,
            }
        )

    return pd.DataFrame(rows)


cv = run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=False, use_weights=False, val_fraction=0.1)
display(cv)

print("\nMean ± std:")
print("NDCG@5 test:", float(cv["ndcg@5_test"].mean()), "±", float(cv["ndcg@5_test"].std()))
print(
    "NDCG@5 test (include all sessions, no-click => 0):",
    float(cv["ndcg@5_test_all_sessions"].mean()),
    "±",
    float(cv["ndcg@5_test_all_sessions"].std()),
)




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002235 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3229
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.810665
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.821421




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001987 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2713
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 37
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.805974
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.818302




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002792 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.817622
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.828105




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001865 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.814411
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.828887




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001873 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3230
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.824937
Early stopping, best iteration is:
[6]	valid_0's ndcg@5: 0.830073




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001623 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2969
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.826509
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.832446




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002304 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.823675
Early stopping, best iteration is:
[16]	valid_0's ndcg@5: 0.837284




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002361 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828681
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.836097




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002508 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3233
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828638
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.835197




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002132 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2717
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 37
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.832037
Early stopping, best iteration is:
[38]	valid_0's ndcg@5: 0.836751




| | seed | ndcg@5_train | ndcg@5_val | ndcg@5_test | ndcg@5_test_all_sessions |
|---|---|---|---|---|---|
| 0 | 11 | 0.676627 | 0.628476 | 0.650277 | 0.306036 |
| 1 | 22 | 0.67963 | 0.670143 | 0.639958 | 0.296781 |
| 2 | 33 | 0.681991 | 0.638941 | 0.653875 | 0.322851 |
| 3 | 44 | 0.685472 | 0.657196 | 0.647425 | 0.30955 |
| 4 | 55 | 0.729441 | 0.654043 | 0.636519 | 0.29996 |



Mean ± std:
NDCG@5 test: 0.6456107897075943 ± 0.007210532727294305
NDCG@5 test (include all sessions, no-click => 0): 0.3070355086114548 ± 0.010157155846427105
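
The two test figures differ only in how sessions without any click are treated. A minimal sketch of per-session NDCG@k under both conventions follows. `ndcg_at_k_sketch` is a hypothetical stand-in written for illustration, not the notebook's own `ndcg_at_k`, and it assumes binary click labels.

```python
import math
from collections import defaultdict


def dcg_at_k(labels_in_ranked_order, k):
    """DCG with binary gains: sum of label / log2(rank + 1) over the top k."""
    return sum(
        label / math.log2(rank + 1)
        for rank, label in enumerate(labels_in_ranked_order[:k], start=1)
    )


def ndcg_at_k_sketch(y_true, y_score, session_ids, k=5, ignore_no_positive=True):
    """Mean per-session NDCG@k (hypothetical helper, binary labels assumed)."""
    sessions = defaultdict(list)
    for label, score, sid in zip(y_true, y_score, session_ids):
        sessions[sid].append((score, label))

    values = []
    for items in sessions.values():
        labels = [label for _, label in items]
        if sum(labels) == 0:
            # No click: NDCG is undefined, so skip the session or count it as 0.
            if not ignore_no_positive:
                values.append(0.0)
            continue
        ranked = [label for _, label in sorted(items, key=lambda t: -t[0])]
        ideal = sorted(labels, reverse=True)
        values.append(dcg_at_k(ranked, k) / dcg_at_k(ideal, k))
    return sum(values) / len(values) if values else float("nan")


# Toy data: session "a" has its click ranked second, session "b" has no click.
y = [0, 1, 0, 0, 0]
s = [0.9, 0.5, 0.1, 0.7, 0.2]
sid = ["a", "a", "a", "b", "b"]
clicked_only = ndcg_at_k_sketch(y, s, sid, k=5, ignore_no_positive=True)
all_sessions = ndcg_at_k_sketch(y, s, sid, k=5, ignore_no_positive=False)
```

When no-click sessions are scored as 0, the all-sessions value equals the clicked-only value multiplied by the share of sessions with at least one click. The reported 0.307 versus 0.646 therefore implies that roughly 47 to 48% of test sessions contain a click.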


## 5) Optional: IPW experiment

In [7]:
cv_ipw = run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=True, use_weights=True, val_fraction=0.1)
display(cv_ipw)

print("\nMean ± std (IPW):")
print("NDCG@5 test:", float(cv_ipw["ndcg@5_test"].mean()), "±", float(cv_ipw["ndcg@5_test"].std()))
print(
    "NDCG@5 test (include all sessions, no-click => 0):",
    float(cv_ipw["ndcg@5_test_all_sessions"].mean()),
    "±",
    float(cv_ipw["ndcg@5_test_all_sessions"].std()),
)




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002782 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3229
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 42
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.330079
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.338012




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003023 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2713
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 37
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.328959
Early stopping, best iteration is:
[46]	valid_0's ndcg@5: 0.342967




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003167 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 42
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.36972
Early stopping, best iteration is:
[13]	valid_0's ndcg@5: 0.382627




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002188 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 38
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.367406
Early stopping, best iteration is:
[14]	valid_0's ndcg@5: 0.38254




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003115 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3230
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 42
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.320564
[600]	valid_0's ndcg@5: 0.313583
Early stopping, best iteration is:
[332]	valid_0's ndcg@5: 0.321948




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002620 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2969
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 38
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.321037
Early stopping, best iteration is:
[8]	valid_0's ndcg@5: 0.328334




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002699 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 42
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.341048
Early stopping, best iteration is:
[257]	valid_0's ndcg@5: 0.345363




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001878 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 38
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.342109
Early stopping, best iteration is:
[30]	valid_0's ndcg@5: 0.346817




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002487 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3233
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 42
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.324913
Early stopping, best iteration is:
[8]	valid_0's ndcg@5: 0.334569




[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001903 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2972
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 38
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.332074
Early stopping, best iteration is:
[166]	valid_0's ndcg@5: 0.336931




| | seed | ndcg@5_train | ndcg@5_val | ndcg@5_test | ndcg@5_test_all_sessions |
|---|---|---|---|---|---|
| 0 | 11 | 0.738628 | 0.634194 | 0.656016 | 0.308737 |
| 1 | 22 | 0.699924 | 0.669661 | 0.632061 | 0.293118 |
| 2 | 33 | 0.692354 | 0.633811 | 0.655584 | 0.323694 |
| 3 | 44 | 0.720651 | 0.655781 | 0.64424 | 0.308027 |
| 4 | 55 | 0.806906 | 0.649558 | 0.637475 | 0.30041 |



Mean ± std (IPW):
NDCG@5 test: 0.6450750885781898 ± 0.010699934343641573
NDCG@5 test (include all sessions, no-click => 0): 0.3067974988344176 ± 0.011387521640854813
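
For readers unfamiliar with the technique, inverse propensity weighting (IPW) up-weights clicks observed at positions that users rarely examine. The sketch below is one common formulation and not necessarily what `run_cv(..., use_ipw=True)` does. It assumes the propensity of a position bucket can be approximated by that bucket's click rate on the training split, normalized to the best bucket and clipped from below. The names `ipw_click_weights` and `min_propensity` are hypothetical.

```python
from collections import defaultdict


def ipw_click_weights(position_buckets, clicks, min_propensity=0.1):
    """Return one weight per row: 1/propensity for clicks, 1.0 for non-clicks.

    Propensity is approximated by the bucket's click rate relative to the
    best bucket (an assumption; a randomized-ranking estimate would be cleaner).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for bucket, click in zip(position_buckets, clicks):
        shown[bucket] += 1
        clicked[bucket] += click

    ctr = {b: clicked[b] / shown[b] for b in shown}
    top = max(ctr.values())
    # Guard against an all-zero click log, then clip tiny propensities so a
    # single deep click cannot dominate the loss.
    propensity = {
        b: max(ctr[b] / top, min_propensity) if top > 0 else 1.0 for b in ctr
    }
    return [
        1.0 / propensity[b] if c == 1 else 1.0
        for b, c in zip(position_buckets, clicks)
    ]


# Toy log: the "top" bucket has CTR 0.5, the "low" bucket has CTR 0.25.
buckets = ["top", "top", "top", "top", "low", "low", "low", "low"]
clicks = [1, 1, 0, 0, 1, 0, 0, 0]
weights = ipw_click_weights(buckets, clicks)
```

In this notebook the IPW run reaches 0.6451 ± 0.0107 test NDCG@5 against 0.6456 ± 0.0072 without it. The gap is well inside one standard deviation, so the weighting shows no measurable effect on this metric.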


## 6) Export predictions.csv (seed=42 test set)

## Part 3: Business summary & results export

Generate `results.json` with all required sections.


In [8]:
# Generate complete results.json

TIME_SPENT_HOURS = 8.0  # keep consistent across artifacts

# IMPORTANT: report TEST NDCG@5 (not training), using the standard convention of
# excluding sessions with no positive labels (no-click sessions).
model_performance = {
    "algorithm_used": "LightGBM",
    "ndcg_at_5": round(float(ndcg_test), 4),
    "features_count": int(len(feature_cols)),
    "top_features": imp.head(2).index.tolist(),
}

# Business Analysis
# Note: mapping offline NDCG to CTR lift is not direct; treat this as a hypothesis to validate via A/B.
expected_ctr_lift_percent = 15


business_analysis = {
    "expected_ctr_lift_percent": expected_ctr_lift_percent,
}

candidate_info = {
    "language_used": "Python",
    "time_spent_hours": TIME_SPENT_HOURS,
}

results = {
    "candidate_info": candidate_info,
    "data_analysis": data_analysis,
    "model_performance": model_performance,
    "business_analysis": business_analysis,
}

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print("Complete Results:")
print(json.dumps(results, indent=2, ensure_ascii=False))
print("\n✓ results.json saved successfully")
# predictions.csv itself is written by the export cell that follows (In [9]).
print("✓ predictions.csv saved successfully")


Complete Results:
{
  "candidate_info": {
    "language_used": "Python",
    "time_spent_hours": 8.0
  },
  "data_analysis": {
    "overall_ctr": 0.0767,
    "position_bias_ratio": 3.42,
    "electronics_ctr": 0.1219,
    "quality_correlation": 0.1198,
    "best_category": "Elektronika"
  },
  "model_performance": {
    "algorithm_used": "LightGBM",
    "ndcg_at_5": 0.6222,
    "features_count": 38,
    "top_features": [
      "quality_minus_session_mean",
      "quality_x_log_price"
    ]
  },
  "business_analysis": {
    "expected_ctr_lift_percent": 15
  }
}

✓ results.json saved successfully
✓ predictions.csv saved successfully
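
Because the 15% lift is a hypothesis rather than a measurement, it helps to know how much traffic an A/B test would need to confirm it. The sketch below applies the standard two-proportion sample-size approximation to the reported overall CTR of 0.0767. The 5% two-sided significance level and 80% power are assumptions chosen for illustration, and `impressions_per_arm` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def impressions_per_arm(base_ctr, relative_lift, alpha=0.05, power=0.80):
    """Approximate impressions per arm needed to detect a relative CTR lift."""
    p1 = base_ctr
    p2 = base_ctr * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = impressions_per_arm(base_ctr=0.0767, relative_lift=0.15)
print(f"~{n:,} product impressions per arm")
```

Impressions within one session are not independent, so a test randomized at session level would need more traffic than this figure suggests. Treat it as a lower bound.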


In [9]:
pred_out = test_fe[["session_id","product_id"]].copy()
pred_out["actual_clicked"] = test_fe["clicked"].astype(int).to_numpy()
pred_out["predicted_score"] = pred
pred_out.to_csv("predictions.csv", index=False)

pred_out.head()

| | session_id | product_id | actual_clicked | predicted_score |
|---|---|---|---|---|
| 0 | 17 | prod_17_1 | 0 | 0.131145 |
| 1 | 17 | prod_17_2 | 0 | 0.288326 |
| 2 | 17 | prod_17_3 | 0 | -0.403197 |
| 3 | 17 | prod_17_4 | 0 | 0.058693 |
| 4 | 17 | prod_17_5 | 0 | -0.234676 |
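
The `predicted_score` column holds raw ranker outputs, which can be negative and are only comparable within a session. A small sketch of turning them into a per-session ranking follows, using the five rows displayed above as sample data. The `predicted_rank` column name is an assumption, and session 17 may contain more rows than the five shown by `head()`.

```python
import pandas as pd

# The five rows displayed above, re-typed as sample data.
pred_sample = pd.DataFrame(
    {
        "session_id": [17, 17, 17, 17, 17],
        "product_id": ["prod_17_1", "prod_17_2", "prod_17_3", "prod_17_4", "prod_17_5"],
        "predicted_score": [0.131145, 0.288326, -0.403197, 0.058693, -0.234676],
    }
)

# Rank 1 is the highest score within each session.
pred_sample["predicted_rank"] = (
    pred_sample.groupby("session_id")["predicted_score"]
    .rank(method="first", ascending=False)
    .astype(int)
)
reranked = pred_sample.sort_values(["session_id", "predicted_rank"])
```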


In [10]:
# --- Robustness checks: position baseline + no-position ablation ---


def _split_feature_sets(feature_cols):
    """Return (position_related, non_position) feature lists."""
    # A feature is position-related if its name carries a position marker.
    # "pos_boost_clipped3" is already covered by the "pos_" prefix.
    pos_set = {
        c
        for c in feature_cols
        if c.startswith(("pos_", "posb_"))
        or "position_bucket" in c
        or "_posb_" in c
        or c == "ctr_prior_cat_x_posb"
    }
    pos_feats = sorted(pos_set)
    # Preserve the original column order for the non-position features.
    non_pos = [c for c in feature_cols if c not in pos_set]
    return pos_feats, non_pos


def run_cv_feature_sets(
    df,
    seeds=(11, 22, 33, 44, 55),
    m=50.0,
    k=5,
    val_fraction=0.1,
):
    rows = []

    for seed in seeds:
        train_full_raw, test_raw = split_by_session(df, seed=seed)
        train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=seed, val_fraction=val_fraction)

        train_fe, val_fe, feats_full, _ = build_features_v4(train_raw, val_raw, m=m, use_ipw=False)
        _, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=m, use_ipw=False)

        pos_feats, feats_no_pos = _split_feature_sets(feats_full)

        # Baselines on TEST
        y = test_fe["clicked"].astype(int).to_numpy()
        sess = test_fe["session_id"].to_numpy()

        score_pos_only = test_fe["pos_boost_clipped3"].to_numpy()
        ndcg_pos_only = ndcg_at_k(y, score_pos_only, sess, k=k, ignore_no_positive=True)

        score_qpr = test_fe["quality_price_ratio"].to_numpy()
        ndcg_qpr = ndcg_at_k(y, score_qpr, sess, k=k, ignore_no_positive=True)

        # Full model (train+val protocol)
        model_full, _, feats_full_used, _ = train_ranker_with_pruning(
            train_fe, val_fe, feats_full, seed=seed, use_weights=False
        )
        _, ndcg_full, ndcg_full_all_sessions = eval_ranker(model_full, test_fe, feats_full_used, k=k)

        # No-position-features model
        model_nopos, _, feats_no_pos_used, _ = train_ranker_with_pruning(
            train_fe, val_fe, feats_no_pos, seed=seed, use_weights=False
        )
        _, ndcg_nopos, ndcg_nopos_all_sessions = eval_ranker(model_nopos, test_fe, feats_no_pos_used, k=k)

        rows.append(
            {
                "seed": seed,
                "ndcg@5_test_pos_only": float(ndcg_pos_only),
                "ndcg@5_test_qpr_only": float(ndcg_qpr),
                "ndcg@5_test_full": float(ndcg_full),
                "ndcg@5_test_no_pos": float(ndcg_nopos),
                "ndcg@5_test_full_all_sessions": float(ndcg_full_all_sessions),
                "ndcg@5_test_no_pos_all_sessions": float(ndcg_nopos_all_sessions),
                "n_features_full": int(len(feats_full_used)),
                "n_features_no_pos": int(len(feats_no_pos_used)),
                "pos_feature_count": int(len(pos_feats)),
            }
        )

    out = pd.DataFrame(rows)
    display(out)

    def _mean_std(col):
        return float(out[col].mean()), float(out[col].std())

    print("\nMean ± std (CV, TEST):")
    for col in [
        "ndcg@5_test_pos_only",
        "ndcg@5_test_qpr_only",
        "ndcg@5_test_full",
        "ndcg@5_test_no_pos",
        "ndcg@5_test_full_all_sessions",
        "ndcg@5_test_no_pos_all_sessions",
    ]:
        mu, sd = _mean_std(col)
        print(f"{col}: {mu:.4f} ± {sd:.4f}")

    return out


robust_cv = run_cv_feature_sets(df, seeds=(11, 22, 33, 44, 55), m=50.0, k=5, val_fraction=0.1)




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002016 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3229
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.810665
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.821421




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001909 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2713
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 37
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.805974
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.818302




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002006 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3202
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 34
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.755282
Early stopping, best iteration is:
[25]	valid_0's ndcg@5: 0.763265




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001912 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2941
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 30
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.743602
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.762636




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003045 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.817622
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.828105




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003007 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.814411
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.828887




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001997 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3207
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 34
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.749636
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.760348




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002750 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2691
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 29
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.747026
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.76632




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003373 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3230
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.824937
Early stopping, best iteration is:
[6]	valid_0's ndcg@5: 0.830073




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002224 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2969
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.826509
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.832446




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001706 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3203
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 34
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.764644
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.78043




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002057 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2942
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 30
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.755366
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.781416




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001951 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3234
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.823675
Early stopping, best iteration is:
[16]	valid_0's ndcg@5: 0.837284




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003031 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2973
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828681
Early stopping, best iteration is:
[4]	valid_0's ndcg@5: 0.836097




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002126 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3207
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 34
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.769043
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.794501




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003598 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2691
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 29
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.770978
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.80075




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003113 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3233
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 42
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828638
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.835197




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003292 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2717
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 37
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.832037
Early stopping, best iteration is:
[38]	valid_0's ndcg@5: 0.836751




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001646 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3206
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 34
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.764734
Early stopping, best iteration is:
[26]	valid_0's ndcg@5: 0.783521




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001672 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2945
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 30
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.773765
Early stopping, best iteration is:
[8]	valid_0's ndcg@5: 0.789198




| | seed | ndcg@5_test_pos_only | ndcg@5_test_qpr_only | ndcg@5_test_full | ndcg@5_test_no_pos | ndcg@5_test_full_all_sessions | ndcg@5_test_no_pos_all_sessions | n_features_full | n_features_no_pos | pos_feature_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 0.628691 | 0.518285 | 0.650277 | 0.535162 | 0.306036 | 0.25186 | 37 | 30 | 8 |
| 1 | 22 | 0.616157 | 0.496397 | 0.639958 | 0.522342 | 0.296781 | 0.242236 | 38 | 29 | 8 |
| 2 | 33 | 0.63101 | 0.52375 | 0.653875 | 0.567008 | 0.322851 | 0.27996 | 38 | 30 | 8 |
| 3 | 44 | 0.627556 | 0.497434 | 0.647425 | 0.538117 | 0.30955 | 0.257287 | 38 | 29 | 8 |
| 4 | 55 | 0.609911 | 0.508788 | 0.636519 | 0.516417 | 0.29996 | 0.243362 | 37 | 30 | 8 |



Mean ± std (CV, TEST):
ndcg@5_test_pos_only: 0.6227 ± 0.0091
ndcg@5_test_qpr_only: 0.5089 ± 0.0122
ndcg@5_test_full: 0.6456 ± 0.0072
ndcg@5_test_no_pos: 0.5358 ± 0.0196
ndcg@5_test_full_all_sessions: 0.3070 ± 0.0102
ndcg@5_test_no_pos_all_sessions: 0.2549 ± 0.0153
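
Since every variant is evaluated on the same test split for a given seed, per-seed paired differences are more informative than comparing the means alone. The sketch below recomputes them from the test NDCG@5 values in the table above. The variable names are chosen for illustration.

```python
from statistics import mean, stdev

# Per-seed TEST NDCG@5 copied from the table above (seeds 11, 22, 33, 44, 55).
full = [0.650277, 0.639958, 0.653875, 0.647425, 0.636519]
pos_only = [0.628691, 0.616157, 0.631010, 0.627556, 0.609911]
no_pos = [0.535162, 0.522342, 0.567008, 0.538117, 0.516417]

# Paired differences: same seed, same test sessions, different scorer.
gain_over_position = [f - p for f, p in zip(full, pos_only)]
loss_without_position = [f - n for f, n in zip(full, no_pos)]

print(f"full - pos_only: {mean(gain_over_position):.4f} ± {stdev(gain_over_position):.4f}")
print(f"full - no_pos:   {mean(loss_without_position):.4f} ± {stdev(loss_without_position):.4f}")
```

The full model beats the position-only baseline on every seed, by about 0.02 NDCG@5 on average, while dropping the position features costs about 0.11. This is consistent with most of the measured ranking quality coming from the logged position, which may reflect position bias in the click labels more than product relevance.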
