# Learning-to-Rank Model for E-commerce Search

This notebook implements a Learning-to-Rank solution for ranking products in an e-commerce search engine.

## Tasks
1. **Part 1: Data analysis** - Calculate key data analysis metrics
2. **Part 2: Learning-to-rank model** - Build and evaluate the ranking model
3. **Part 3: Business summary** - Business analysis and recommendations

## Model features
- **Required features**:
  - `position_boost = 1 / min(position, 3)` (position clipped at 3; stored as `pos_boost_clipped3`)
  - `log_price = log(price_pln + 1)`
  - `quality_price_ratio = quality_score / log_price`
  - `category_match = (category == user_preferred_category) ? 1 : 0`
  - Additional session-relative and interaction features

- **Advanced features**:
  - Session-relative features (price/quality rankings within session)
  - Smoothed CTR priors computed on the training split only (Bayesian m-estimate)
  - Position-bias control via buckets
  - Category interactions
  - Robust cross-validation evaluation
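
As a quick illustration of the required features listed above, the sketch below computes them on a three-row toy frame. The values are copied from the first rows of `search_sessions.csv` shown later in this notebook; the frame name `toy` is only illustrative.

```python
import numpy as np
import pandas as pd

# Three rows taken from session 0 (positions 1, 2 and 5)
toy = pd.DataFrame({
    "position": [1, 2, 5],
    "price_pln": [40.05, 249.63, 2982.39],
    "quality_score": [0.363, 0.547, 0.384],
    "category": ["Elektronika", "Elektronika", "Odziez"],
    "user_preferred_category": ["Elektronika"] * 3,
})

# 1 / position, with position clipped at 3 (positions 3+ all get 1/3)
toy["position_boost"] = 1.0 / toy["position"].clip(upper=3)
# log(price + 1) compresses the long right tail of prices
toy["log_price"] = np.log(toy["price_pln"] + 1.0)
# quality per unit of log-price
toy["quality_price_ratio"] = toy["quality_score"] / toy["log_price"]
# 1 when the product category equals the user's preferred category
toy["category_match"] = (toy["category"] == toy["user_preferred_category"]).astype(int)

print(toy[["position_boost", "log_price", "quality_price_ratio", "category_match"]])
```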

## Outputs
- `results.json` - Complete analysis results
- `predictions.csv` - Model predictions on the test split
- `solution_summary.md` - Solution summary

**Expected input:** `search_sessions.csv` in the same directory.


In [22]:
# pip install pandas numpy scikit-learn lightgbm matplotlib

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from lightgbm import LGBMRanker, early_stopping, log_evaluation

pd.set_option('display.max_columns', 400)
pd.set_option('display.width', 200)

DATA_PATH = "search_sessions.csv"
assert os.path.exists(DATA_PATH), f"Missing {DATA_PATH}. Put it next to this notebook."

df = pd.read_csv(DATA_PATH)

# Basic coercions
df["clicked"] = df["clicked"].astype(int)
df["position"] = df["position"].astype(int)
df["price_pln"] = df["price_pln"].astype(float)
df["quality_score"] = df["quality_score"].astype(float)

print(f"Dataset shape: {df.shape}")
print(f"Sessions: {df['session_id'].nunique()}")
print(f"Products: {df['product_id'].nunique()}")
df.head()

Dataset shape: (67651, 8)
Sessions: 8000
Products: 67651


| | session_id | product_id | position | clicked | price_pln | category | quality_score | user_preferred_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | prod_0_1 | 1 | 1 | 40.05 | Elektronika | 0.363 | Elektronika |
| 1 | 0 | prod_0_2 | 2 | 0 | 249.63 | Elektronika | 0.547 | Elektronika |
| 2 | 0 | prod_0_3 | 3 | 0 | 679.19 | Elektronika | 0.696 | Elektronika |
| 3 | 0 | prod_0_4 | 4 | 0 | 493.38 | Elektronika | 0.1 | Elektronika |
| 4 | 0 | prod_0_5 | 5 | 0 | 2982.39 | Odziez | 0.384 | Elektronika |


## Part 1: Data analysis

Calculate the 5 key data analysis metrics required by the instructions.


In [23]:
# Part 1: Data Analysis
# Calculate the 5 required metrics

# 1. overall_ctr: All clicks / all impressions
overall_ctr = df["clicked"].mean()

# 2. position_bias_ratio: CTR position 1 / CTR position 5
ctr_by_position = df.groupby("position")["clicked"].mean()
position_bias_ratio = ctr_by_position[1] / ctr_by_position[5] if 5 in ctr_by_position.index and ctr_by_position[5] > 0 else 0.0

# 3. electronics_ctr: CTR for category "Elektronika"
electronics_ctr = df[df["category"] == "Elektronika"]["clicked"].mean()

# 4. quality_correlation: Correlation between quality_score and clicked
quality_correlation = df["quality_score"].corr(df["clicked"])

# 5. best_category: Category with highest CTR
ctr_by_category = df.groupby("category")["clicked"].mean()
best_category = ctr_by_category.idxmax()

# Store results
data_analysis = {
    "overall_ctr": round(float(overall_ctr), 4),
    "position_bias_ratio": round(float(position_bias_ratio), 2),
    "electronics_ctr": round(float(electronics_ctr), 4),
    "quality_correlation": round(float(quality_correlation), 4),
    "best_category": str(best_category)
}

print("Data Analysis Results:")
print(json.dumps(data_analysis, indent=2))
print(f"\nCTR by position:\n{ctr_by_position}")
print(f"\nCTR by category:\n{ctr_by_category}")


Data Analysis Results:
{
  "overall_ctr": 0.0767,
  "position_bias_ratio": 3.42,
  "electronics_ctr": 0.1219,
  "quality_correlation": 0.1198,
  "best_category": "Elektronika"
}

CTR by position:
position
1     0.198625
2     0.111000
3     0.081500
4     0.067375
5     0.058000
6     0.045318
7     0.042628
8     0.034455
9     0.037443
10    0.029592
11    0.027211
12    0.032191
Name: clicked, dtype: float64

CTR by category:
category
Elektronika    0.121921
Ksiazki        0.036991
Odziez         0.070694
Name: clicked, dtype: float64
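
The reported `position_bias_ratio` can be re-derived by hand from the per-position CTRs printed above. The sketch below copies the first five values from that output and also expresses each position's CTR relative to position 1; the dictionary name `relative_ctr` is only illustrative.

```python
# CTRs copied from the "CTR by position" output above (positions 1-5)
ctr_by_position = {1: 0.198625, 2: 0.111000, 3: 0.081500, 4: 0.067375, 5: 0.058000}

# Ratio of CTR at position 1 to CTR at position 5
position_bias_ratio = ctr_by_position[1] / ctr_by_position[5]

# CTR of each position as a fraction of the position-1 CTR
relative_ctr = {pos: ctr / ctr_by_position[1] for pos, ctr in ctr_by_position.items()}

print(round(position_bias_ratio, 2))
print({pos: round(v, 3) for pos, v in relative_ctr.items()})
```

Position 5 receives roughly 29% of the click rate of position 1, which is the same information as the ratio of 3.42 read the other way round.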


## Part 2: Learning-to-rank model

### Utilities and helper functions
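
The helpers in this section split the data by `session_id` so that no session contributes rows to more than one split. The sketch below shows the same idea on toy data. It uses a NumPy permutation instead of `train_test_split` to stay dependency-light, and the function name `toy_split_by_session` is hypothetical.

```python
import numpy as np
import pandas as pd

def toy_split_by_session(df, test_size=0.2, seed=42):
    """Assign whole sessions to train or test; rows of one session never straddle both."""
    sessions = df["session_id"].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(sessions)
    n_test = max(1, int(round(test_size * len(shuffled))))
    test_ids = set(shuffled[:n_test].tolist())
    is_test = df["session_id"].isin(test_ids)
    return df[~is_test].copy(), df[is_test].copy()

# 10 toy sessions with 3 impressions each
toy = pd.DataFrame({
    "session_id": np.repeat(np.arange(10), 3),
    "position": np.tile([1, 2, 3], 10),
})

train_toy, test_toy = toy_split_by_session(toy)
shared = set(train_toy["session_id"]) & set(test_toy["session_id"])
print("shared sessions:", shared)
```

A row-level random split would instead scatter impressions of one session across splits, letting the model see part of a ranking list it is later evaluated on.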

In [24]:
EPS = 1e-6

# --- Metrics ---

def ndcg_at_k(y_true, y_score, group_ids, k=5, ignore_no_positive=True):
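    """Mean NDCG@k over groups.

    - ignore_no_positive=True: skip sessions with no positive labels (IDCG=0),
      so the mean is taken only over sessions with at least one click.
    - ignore_no_positive=False: include such sessions as 0.0.

    Libraries differ on this convention, so values from this helper are not directly
    comparable to LightGBM's logged ndcg@5 (which appears to score such sessions as 1.0).
    """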
    gdf = pd.DataFrame({"g": group_ids, "y": y_true, "s": y_score})
    ndcgs = []
    for _, part in gdf.groupby("g", sort=False):
        part = part.sort_values("s", ascending=False)
        rel = part["y"].to_numpy()[:k]

        discounts = np.log2(np.arange(2, rel.size + 2))
        dcg = ((2**rel - 1) / discounts).sum()

        ideal = np.sort(part["y"].to_numpy())[::-1][:k]
        idcg = ((2**ideal - 1) / discounts[: ideal.size]).sum()

        if idcg == 0:
            if ignore_no_positive:
                continue
            ndcgs.append(0.0)
        else:
            ndcgs.append(float(dcg / idcg))

    return float(np.mean(ndcgs)) if ndcgs else 0.0


# --- Splits & ordering ---

def finalize_order(df):
    """Ensure stable ordering/contiguity by session_id before group size computation."""
    return df.sort_values(["session_id", "position", "product_id"]).reset_index(drop=True)


def group_sizes(df, group_col="session_id"):
    # Requires df already sorted by group_col
    return df.groupby(group_col, sort=False).size().to_numpy()


def split_by_session(df, seed=42, test_size=0.2):
    """Split by session_id (no leakage). Returns (train_df, test_df)."""
    sessions = df["session_id"].unique()
    train_sess, test_sess = train_test_split(sessions, test_size=test_size, random_state=seed, shuffle=True)
    train_df = df[df["session_id"].isin(train_sess)].copy()
    test_df = df[df["session_id"].isin(test_sess)].copy()
    return finalize_order(train_df), finalize_order(test_df)


def split_train_val_from_train(train_full_df, seed=42, val_fraction=0.1):
    """Create an internal validation split from the train portion only."""
    tr, va = split_by_session(train_full_df, seed=seed, test_size=val_fraction)
    return tr, va


# --- Feature utilities ---

def m_estimate_ctr(train: pd.DataFrame, key_cols, label_col="clicked", m=50.0):
    prior = float(train[label_col].mean())
    agg = train.groupby(key_cols, observed=False)[label_col].agg(["sum", "count"]).reset_index()
    agg["ctr_prior"] = (agg["sum"] + m * prior) / (agg["count"] + m)
    return agg[key_cols + ["ctr_prior"]], float(prior)


def clip_series(s, lo=-3.0, hi=3.0):
    return s.clip(lower=lo, upper=hi)


def zscore_in_group(x: pd.Series) -> pd.Series:
    mu = x.mean()
    std = x.std(ddof=0)
    if (not np.isfinite(std)) or std <= 0:
        return pd.Series(np.zeros(len(x), dtype=float), index=x.index)
    out = (x - mu) / std
    return out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
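
To make the NDCG@k definition used by `ndcg_at_k` concrete, here is a single-session worked example under the same gain (`2^rel - 1`) and discount (`log2(rank + 1)`) conventions. The helper names `dcg_at_k` and `ndcg_one_session` are illustrative and not part of the notebook.

```python
import numpy as np

def dcg_at_k(relevance, k):
    """DCG@k with gains 2^rel - 1 and log2(rank + 1) discounts."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(((2.0 ** rel - 1.0) / discounts).sum())

def ndcg_one_session(labels, scores, k=5):
    """NDCG@k for one session; None when the session has no click (IDCG = 0)."""
    labels = np.asarray(labels)
    # Sort items by descending score; stable sort keeps ties in input order
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    idcg = dcg_at_k(np.sort(labels)[::-1], k)
    if idcg == 0:
        return None
    return dcg_at_k(labels[order], k) / idcg

scores = [0.9, 0.8, 0.1, 0.2]
clicked_first = ndcg_one_session(labels=[1, 0, 0, 0], scores=scores)
clicked_second = ndcg_one_session(labels=[0, 1, 0, 0], scores=scores)
no_click = ndcg_one_session(labels=[0, 0, 0, 0], scores=scores)

print(clicked_first, clicked_second, no_click)
```

With a single click per session, NDCG@5 reduces to `1 / log2(rank + 1)` of the clicked item, so ranking the click second yields about 0.631.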


### Feature engineering (v4)
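
The smoothed CTR priors in this section use the m-estimate from `m_estimate_ctr`: `(clicks + m * prior) / (impressions + m)`. The sketch below shows how the smoothing behaves for a rare and a frequent key, using the overall CTR from Part 1 as the prior. The click and impression counts are made up for illustration.

```python
def m_estimate(clicks, impressions, prior, m=50.0):
    """Shrink a raw CTR toward the prior; m acts like m pseudo-impressions at the prior rate."""
    return (clicks + m * prior) / (impressions + m)

prior = 0.0767  # overall CTR reported in Part 1

# Rare key: raw CTR 2/4 = 0.5, but only 4 impressions of evidence
rare = m_estimate(clicks=2, impressions=4, prior=prior)
# Frequent key: raw CTR 500/4000 = 0.125, plenty of evidence
common = m_estimate(clicks=500, impressions=4000, prior=prior)

print(round(rare, 4), round(common, 4))
```

The rare key is pulled most of the way back to the prior, while the frequent key stays close to its raw CTR.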

In [25]:
def add_position_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
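    # NOTE: the data contains positions 11-12 (see "CTR by position" above), which fall
    # outside the last bin edge (10): they get a NaN bucket and all-zero posb_* dummies.
    # Use bins=[0, 3, 6, np.inf] to fold them into the last bucket.
    out["position_bucket"] = pd.cut(
        out["position"],
        bins=[0,3,6,10],
        labels=["top3","mid46","bot710"],
        include_lowest=True
    )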
    pb = pd.get_dummies(out["position_bucket"], prefix="posb")
    out = pd.concat([out, pb], axis=1)
    out["pos_boost_clipped3"] = 1.0 / out["position"].clip(upper=3).astype(float)
    return out

def session_relative_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
    out["session_size"] = out.groupby("session_id")["product_id"].transform("size")

    # price: lower is "better" (rank 1 = cheapest)
    out["price_rank_in_session"] = out.groupby("session_id")["price_pln"].rank(method="average", ascending=True)
    out["price_pct_in_session"] = out["price_rank_in_session"] / out["session_size"].replace(0, 1)
    out["price_min_in_session"] = out.groupby("session_id")["price_pln"].transform("min")
    out["price_max_in_session"] = out.groupby("session_id")["price_pln"].transform("max")
    out["is_cheapest_in_session"] = (out["price_pln"] == out["price_min_in_session"]).astype(int)
    out["is_most_expensive_in_session"] = (out["price_pln"] == out["price_max_in_session"]).astype(int)

    out["price_median_in_session"] = out.groupby("session_id")["price_pln"].transform("median")
    out["price_minus_session_median"] = out["price_pln"] - out["price_median_in_session"]

    out["price_z_in_session"] = out.groupby("session_id")["price_pln"].transform(zscore_in_group).astype(float)
    out["price_z_in_session_clipped"] = clip_series(out["price_z_in_session"], -3, 3)

    # quality: higher is "better" (rank 1 = highest quality)
    out["quality_rank_in_session"] = out.groupby("session_id")["quality_score"].rank(method="average", ascending=False)
    out["quality_pct_in_session"] = out["quality_rank_in_session"] / out["session_size"].replace(0, 1)
    out["quality_min_in_session"] = out.groupby("session_id")["quality_score"].transform("min")
    out["quality_max_in_session"] = out.groupby("session_id")["quality_score"].transform("max")
    out["is_best_quality_in_session"] = (out["quality_score"] == out["quality_max_in_session"]).astype(int)
    out["is_worst_quality_in_session"] = (out["quality_score"] == out["quality_min_in_session"]).astype(int)

    out["quality_mean_in_session"] = out.groupby("session_id")["quality_score"].transform("mean")
    out["quality_minus_session_mean"] = out["quality_score"] - out["quality_mean_in_session"]

    out["quality_z_in_session"] = out.groupby("session_id")["quality_score"].transform(zscore_in_group).astype(float)

    out["category_freq_in_session"] = out.groupby(["session_id", "category"])["product_id"].transform("size")

    counts = out.groupby(["session_id", "category"], observed=False).size().reset_index(name="cnt")
    counts["max_cnt"] = counts.groupby("session_id")["cnt"].transform("max")
    counts["is_majority_category"] = (counts["cnt"] == counts["max_cnt"]).astype(int)
    out = out.merge(counts[["session_id", "category", "is_majority_category"]], on=["session_id", "category"], how="left")
    out["is_majority_category"] = out["is_majority_category"].fillna(0).astype(int)

    return out

def add_core_features(d: pd.DataFrame) -> pd.DataFrame:
    out = d.copy()
    out["log_price"] = np.log(out["price_pln"] + 1.0)
    out["category_match"] = (out["category"] == out["user_preferred_category"]).astype(int)
    out["quality_price_ratio"] = out["quality_score"] / out["log_price"].replace(0.0, EPS)
    return out

def one_hot_categories(train: pd.DataFrame, test: pd.DataFrame):
    cat_train = pd.get_dummies(train["category"], prefix="cat")
    pref_train = pd.get_dummies(train["user_preferred_category"], prefix="pref")

    cat_test = pd.get_dummies(test["category"], prefix="cat").reindex(columns=cat_train.columns, fill_value=0)
    pref_test = pd.get_dummies(test["user_preferred_category"], prefix="pref").reindex(columns=pref_train.columns, fill_value=0)

    train2 = pd.concat([train, cat_train, pref_train], axis=1)
    test2  = pd.concat([test, cat_test, pref_test], axis=1)
    return train2, test2, list(cat_train.columns), list(pref_train.columns)

def add_ctr_priors(train: pd.DataFrame, test: pd.DataFrame, m=50.0):
    global_ctr = train["clicked"].mean()

    map_cat, _ = m_estimate_ctr(train, ["category"], m=m)
    train = train.merge(map_cat.rename(columns={"ctr_prior":"ctr_prior_category"}), on=["category"], how="left")
    test  = test.merge(map_cat.rename(columns={"ctr_prior":"ctr_prior_category"}), on=["category"], how="left")

    map_cxp, _ = m_estimate_ctr(train, ["category","user_preferred_category"], m=m)
    train = train.merge(map_cxp.rename(columns={"ctr_prior":"ctr_prior_cat_x_pref"}), on=["category","user_preferred_category"], how="left")
    test  = test.merge(map_cxp.rename(columns={"ctr_prior":"ctr_prior_cat_x_pref"}), on=["category","user_preferred_category"], how="left")

    map_pref, _ = m_estimate_ctr(train, ["user_preferred_category"], m=m)
    train = train.merge(map_pref.rename(columns={"ctr_prior":"ctr_prior_pref"}), on=["user_preferred_category"], how="left")
    test  = test.merge(map_pref.rename(columns={"ctr_prior":"ctr_prior_pref"}), on=["user_preferred_category"], how="left")

    map_cpb, _ = m_estimate_ctr(train, ["category","position_bucket"], m=m)
    train = train.merge(map_cpb.rename(columns={"ctr_prior":"ctr_prior_cat_x_posb"}), on=["category","position_bucket"], how="left")
    test  = test.merge(map_cpb.rename(columns={"ctr_prior":"ctr_prior_cat_x_posb"}), on=["category","position_bucket"], how="left")

    for d in (train, test):
        d["ctr_prior_category"] = d["ctr_prior_category"].fillna(global_ctr)
        d["ctr_prior_cat_x_pref"] = d["ctr_prior_cat_x_pref"].fillna(global_ctr)
        d["ctr_prior_pref"] = d["ctr_prior_pref"].fillna(global_ctr)
        d["ctr_prior_cat_x_posb"] = d["ctr_prior_cat_x_posb"].fillna(global_ctr)

    return train, test, float(global_ctr)

def add_interactions(d: pd.DataFrame, cat_cols) -> pd.DataFrame:
    out = d.copy()

    out["match_x_posb_top3"] = out["category_match"] * out.get("posb_top3", 0)
    out["match_x_posb_mid46"] = out["category_match"] * out.get("posb_mid46", 0)
    out["match_x_posb_bot710"] = out["category_match"] * out.get("posb_bot710", 0)

    out["quality_rank_x_match"] = out["quality_rank_in_session"] * out["category_match"]
    out["price_rank_x_match"] = out["price_rank_in_session"] * out["category_match"]
    out["quality_pct_x_match"] = out["quality_pct_in_session"] * out["category_match"]
    out["price_pct_x_match"] = out["price_pct_in_session"] * out["category_match"]

    out["quality_x_log_price"] = out["quality_score"] * out["log_price"]

    for c in cat_cols:
        out[f"{c}_x_quality"] = out[c] * out["quality_score"]
        out[f"{c}_x_log_price"] = out[c] * out["log_price"]

    return out

def build_features_v4(train_raw: pd.DataFrame, test_raw: pd.DataFrame, m=50.0, use_ipw=False):
    train = add_core_features(train_raw)
    test  = add_core_features(test_raw)

    train = add_position_features(train)
    test  = add_position_features(test)

    train = session_relative_features(train)
    test  = session_relative_features(test)

    train, test, cat_cols, pref_cols = one_hot_categories(train, test)

    train, test, global_ctr = add_ctr_priors(train, test, m=m)

    train = add_interactions(train, cat_cols)
    test  = add_interactions(test, cat_cols)

    if use_ipw:
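        # observed=False is passed explicitly to avoid the pandas FutureWarning for categorical keys
        prop = train.groupby("position_bucket", observed=False)["clicked"].mean().to_dict()
        prop = {str(k): v for k, v in prop.items()}  # Convert dict keys to strings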
        p_train = train["position_bucket"].astype(str).map(prop).astype(float).fillna(train["clicked"].mean()).clip(0.01, 0.99)
        p_test  = test["position_bucket"].astype(str).map(prop).astype(float).fillna(train["clicked"].mean()).clip(0.01, 0.99)
        train["ipw"] = 1.0 / p_train
        test["ipw"] = 1.0 / p_test
    else:
        train["ipw"] = 1.0
        test["ipw"] = 1.0

    base = [
        "pos_boost_clipped3","posb_top3","posb_mid46","posb_bot710",
        "log_price","quality_price_ratio","category_match",
        "price_rank_in_session","quality_rank_in_session",
        "price_pct_in_session","quality_pct_in_session",
        "price_minus_session_median","quality_minus_session_mean",
        "price_z_in_session_clipped","quality_z_in_session",
        "is_cheapest_in_session","is_most_expensive_in_session",
        "is_best_quality_in_session","is_worst_quality_in_session",
        "session_size","category_freq_in_session","is_majority_category",
        "ctr_prior_category","ctr_prior_cat_x_pref","ctr_prior_pref","ctr_prior_cat_x_posb",
        "match_x_posb_top3","match_x_posb_mid46","match_x_posb_bot710",
        "quality_rank_x_match","price_rank_x_match","quality_pct_x_match","price_pct_x_match",
    ]

    extra = [c for c in train.columns if c.startswith("cat_") or c.startswith("pref_") or c.endswith("_x_quality") or c.endswith("_x_log_price")]
    feature_cols = base + extra

    for c in feature_cols:
        if c not in train.columns: train[c] = 0
        if c not in test.columns: test[c] = 0

    # Critical: enforce contiguity by session_id before group size computation
    train = finalize_order(train)
    test  = finalize_order(test)

    return train, test, feature_cols, global_ctr

# 80/20 split by session_id (as required). Validation is taken from the 80% train portion only.
train_full_raw, test_raw = split_by_session(df, seed=42)
train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=42, val_fraction=0.1)

# Fit features on train only; apply to val/test (priors computed on train only)
train_fe, val_fe, feature_cols, global_ctr = build_features_v4(train_raw, val_raw, m=50.0, use_ipw=False)
_, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=50.0, use_ipw=False)

print("Feature count:", len(feature_cols))
train_fe[["session_id","product_id","clicked"] + feature_cols[:15]].head()

Feature count: 46


| | session_id | product_id | clicked | pos_boost_clipped3 | posb_top3 | posb_mid46 | posb_bot710 | log_price | quality_price_ratio | category_match | price_rank_in_session | quality_rank_in_session | price_pct_in_session | quality_pct_in_session | price_minus_session_median | quality_minus_session_mean | price_z_in_session_clipped | quality_z_in_session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | prod_0_1 | 1 | 1.0 | True | False | False | 3.714791 | 0.097717 | 1 | 1.0 | 9.0 | 0.090909 | 0.818182 | -453.33 | -0.146364 | -0.697457 | -0.636561 |
| 1 | 0 | prod_0_2 | 0 | 0.5 | True | False | False | 5.523978 | 0.099023 | 1 | 5.0 | 6.0 | 0.454545 | 0.545455 | -243.75 | 0.037636 | -0.431507 | 0.163687 |
| 2 | 0 | prod_0_3 | 0 | 0.333333 | True | False | False | 6.522372 | 0.10671 | 1 | 10.0 | 3.0 | 0.909091 | 0.272727 | 185.81 | 0.186636 | 0.11359 | 0.811715 |
| 3 | 0 | prod_0_4 | 0 | 0.333333 | False | True | False | 6.203304 | 0.01612 | 1 | 6.0 | 11.0 | 0.545455 | 1.0 | 0.0 | -0.409364 | -0.122197 | -1.780395 |
| 4 | 0 | prod_0_5 | 0 | 0.333333 | False | True | False | 8.000816 | 0.047995 | 0 | 11.0 | 8.0 | 1.0 | 0.727273 | 2489.01 | -0.125364 | 3.0 | -0.545229 |
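
The session-relative columns shown above compare each product only with the other products in its own session. A minimal sketch of the rank, percentile and z-score logic on toy prices follows. It mirrors the conventions of `session_relative_features` (average ranks for ties, population standard deviation, zero z-score for constant sessions), but the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": [0, 0, 0, 1, 1],
    "price_pln": [10.0, 30.0, 20.0, 5.0, 5.0],
})

g = toy.groupby("session_id")["price_pln"]

# Rank 1 = cheapest in the session; ties share the average rank
toy["price_rank_in_session"] = g.rank(method="average", ascending=True)
toy["session_size"] = g.transform("size")
toy["price_pct_in_session"] = toy["price_rank_in_session"] / toy["session_size"]

# z-score within the session; sessions with zero spread get 0 instead of NaN/inf
mu = g.transform("mean")
sd = g.transform(lambda s: s.std(ddof=0))
toy["price_z_in_session"] = ((toy["price_pln"] - mu) / sd.where(sd > 0)).fillna(0.0)

print(toy)
```

Session 1 contains two identical prices, so both rows share rank 1.5 and receive a z-score of 0.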


### Train the LightGBM ranker (stability-focused)
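
`LGBMRanker.fit` receives `group` as an array of consecutive query sizes, so the rows of each session must be contiguous and in the same order as the feature matrix. The sketch below illustrates why the notebook sorts by `session_id` before computing group sizes; the toy frame is invented.

```python
import pandas as pd

# Rows arrive interleaved across two sessions
rows = pd.DataFrame({
    "session_id": [2, 1, 2, 1, 1],
    "position":   [1, 1, 2, 2, 3],
})

# Sort so that every session occupies one contiguous block of rows
ordered = rows.sort_values(["session_id", "position"]).reset_index(drop=True)

# Sizes of the consecutive blocks, in row order
group = ordered.groupby("session_id", sort=False).size().to_numpy()

# Number of contiguous blocks actually present in the ordered frame
n_blocks = int((ordered["session_id"] != ordered["session_id"].shift()).sum())

print(group, n_blocks)
```

On the unsorted frame the same session would appear in several separate blocks, and the group sizes would no longer describe the row layout.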

In [26]:
def train_ranker(train_fe, val_fe, feature_cols, seed=42, use_weights=False):
    """Train on train_fe, early-stop on val_fe. Do not use the final test set for model selection."""
    X_train = train_fe[feature_cols]
    y_train = train_fe["clicked"].astype(int)
    X_val = val_fe[feature_cols]
    y_val = val_fe["clicked"].astype(int)

    train_group = group_sizes(train_fe, "session_id")
    val_group = group_sizes(val_fe, "session_id")

    model = LGBMRanker(
        objective="lambdarank",
        metric="ndcg",
        ndcg_eval_at=[5],
        n_estimators=8000,
        learning_rate=0.02,
        num_leaves=63,
        min_data_in_leaf=200,
        min_child_samples=None,  # Explicitly disable to avoid warning when min_data_in_leaf is set
        subsample=0.85,  # NOTE: row subsampling is inactive unless subsample_freq > 0 (LightGBM default is 0)
        colsample_bytree=0.85,
        reg_lambda=2.0,
        reg_alpha=0.5,
        random_state=seed,
    )

    fit_kwargs = dict(
        X=X_train,
        y=y_train,
        group=train_group,
        eval_set=[(X_val, y_val)],
        eval_group=[val_group],
        eval_at=[5],
        callbacks=[early_stopping(stopping_rounds=300), log_evaluation(period=300)],
    )

    if use_weights:
        fit_kwargs["sample_weight"] = train_fe["ipw"].astype(float)
        fit_kwargs["eval_sample_weight"] = [val_fe["ipw"].astype(float)]

    model.fit(**fit_kwargs)

    imp = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
    return model, imp


def eval_ranker(model, fe, feature_cols, k=5):
    X = fe[feature_cols]
    y = fe["clicked"].astype(int).to_numpy()
    sess = fe["session_id"].to_numpy()

    pred = model.predict(X, num_iteration=model.best_iteration_)

    ndcg_excl_no_click = ndcg_at_k(y, pred, sess, k=k, ignore_no_positive=True)
    ndcg_all_sessions = ndcg_at_k(y, pred, sess, k=k, ignore_no_positive=False)

    return pred, float(ndcg_excl_no_click), float(ndcg_all_sessions)


model, imp = train_ranker(train_fe, val_fe, feature_cols, seed=42, use_weights=False)

pred_train, ndcg_train, ndcg_train_all = eval_ranker(model, train_fe, feature_cols, k=5)
pred_val, ndcg_val, ndcg_val_all = eval_ranker(model, val_fe, feature_cols, k=5)
pred, ndcg_test, ndcg_test_all = eval_ranker(model, test_fe, feature_cols, k=5)

print("NDCG@5 train (exclude no-click sessions):", ndcg_train)
print("NDCG@5 val   (exclude no-click sessions):", ndcg_val)
print("NDCG@5 test  (exclude no-click sessions):", ndcg_test)

print("NDCG@5 test (include all sessions, no-click => 0):", ndcg_test_all)

display(imp.head(25).to_frame("importance"))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007895 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3743
[LightGBM] [Info] Number of data points in the train set: 48698, number of used features: 46
Training until validation scores don't improve for 300 rounds




[300]	valid_0's ndcg@5: 0.830477
Early stopping, best iteration is:
[16]	valid_0's ndcg@5: 0.843408




NDCG@5 train (exclude no-click sessions): 0.7154509526699573
NDCG@5 val   (exclude no-click sessions): 0.667047105041002
NDCG@5 test  (exclude no-click sessions): 0.6189542697351208
NDCG@5 test (include all sessions, no-click => 0): 0.29206904603126005


| | importance |
|---|---|
| quality_minus_session_mean | 125 |
| quality_x_log_price | 73 |
| quality_z_in_session | 68 |
| quality_price_ratio | 64 |
| log_price | 52 |
| price_z_in_session_clipped | 51 |
| category_freq_in_session | 44 |
| cat_Elektronika_x_quality | 43 |
| pos_boost_clipped3 | 41 |
| ctr_prior_cat_x_pref | 41 |
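
The validation `ndcg@5` logged by LightGBM (0.843) is much higher than the value `ndcg_at_k` reports for the same split (0.667). One plausible explanation, stated here as an assumption about LightGBM's built-in metric, is that it scores sessions without any click as 1.0, whereas the helper either skips them or counts them as 0.0. The sketch below shows how the three conventions change the mean on made-up per-session values; `mean_ndcg` is a hypothetical helper.

```python
def mean_ndcg(per_session, no_click_value=None):
    """Average per-session NDCG; None entries mark sessions without clicks.

    no_click_value=None skips those sessions, otherwise they count as that value.
    """
    values = []
    for v in per_session:
        if v is None:
            if no_click_value is None:
                continue
            values.append(no_click_value)
        else:
            values.append(v)
    return sum(values) / len(values) if values else 0.0

# Two sessions with clicks, two without
sessions = [1.0, 0.5, None, None]

skip_no_click = mean_ndcg(sessions)          # mean over clicked sessions only
count_as_zero = mean_ndcg(sessions, 0.0)     # no-click sessions contribute 0
count_as_one = mean_ndcg(sessions, 1.0)      # no-click sessions contribute 1

print(skip_no_click, count_as_zero, count_as_one)
```

On the test split, roughly 47% of sessions contain a click (0.292 / 0.619). If the validation split has a similar share, then 0.47 × 0.667 + 0.53 ≈ 0.84, which is close to the logged 0.843.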


### Robust evaluation: multi-split CV (mean ± std)

In [27]:
# Robust evaluation: multi-split CV (mean ± std)
# Train: session-based split, early stopping on validation split carved out of train.


def run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=False, use_weights=False, val_fraction=0.1):
    rows = []
    for seed in seeds:
        train_full_raw, test_raw = split_by_session(df, seed=seed)
        train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=seed, val_fraction=val_fraction)

        train_fe, val_fe, feats, _ = build_features_v4(train_raw, val_raw, m=m, use_ipw=use_ipw)
        _, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=m, use_ipw=use_ipw)

        model, _ = train_ranker(train_fe, val_fe, feats, seed=seed, use_weights=use_weights)

        _, nd_train, _ = eval_ranker(model, train_fe, feats, k=5)
        _, nd_val, _ = eval_ranker(model, val_fe, feats, k=5)
        _, nd_test, nd_test_all_sessions = eval_ranker(model, test_fe, feats, k=5)

        rows.append(
            {
                "seed": seed,
                "ndcg@5_train": nd_train,
                "ndcg@5_val": nd_val,
                "ndcg@5_test": nd_test,
                "ndcg@5_test_all_sessions": nd_test_all_sessions,
            }
        )

    return pd.DataFrame(rows)


cv = run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=False, use_weights=False, val_fraction=0.1)
display(cv)

print("\nMean ± std:")
print("NDCG@5 test:", float(cv["ndcg@5_test"].mean()), "±", float(cv["ndcg@5_test"].std()))
print(
    "NDCG@5 test (include all sessions, no-click => 0):",
    float(cv["ndcg@5_test_all_sessions"].mean()),
    "±",
    float(cv["ndcg@5_test_all_sessions"].std()),
)




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003048 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3743
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.8132
Early stopping, best iteration is:
[18]	valid_0's ndcg@5: 0.822269




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002350 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.813272
Early stopping, best iteration is:
[46]	valid_0's ndcg@5: 0.833424




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002517 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3744
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.825149
Early stopping, best iteration is:
[232]	valid_0's ndcg@5: 0.829503




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002086 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.825377
Early stopping, best iteration is:
[27]	valid_0's ndcg@5: 0.83452




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002821 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3747
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828888
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.836103




| | seed | ndcg@5_train | ndcg@5_val | ndcg@5_test | ndcg@5_test_all_sessions |
|---|---|---|---|---|---|
| 0 | 11 | 0.716305 | 0.636589 | 0.653811 | 0.3077 |
| 1 | 22 | 0.738752 | 0.678889 | 0.640609 | 0.297082 |
| 2 | 33 | 0.829972 | 0.632598 | 0.649975 | 0.320925 |
| 3 | 44 | 0.723324 | 0.653897 | 0.658665 | 0.314924 |
| 4 | 55 | 0.684299 | 0.652669 | 0.630915 | 0.297319 |



Mean ± std:
NDCG@5 test: 0.6467950133987932 ± 0.011075770056308545
NDCG@5 test (include all sessions, no-click => 0): 0.3075900633891254 ± 0.01057763461928927


### Optional: IPW experiment

In [28]:
cv_ipw = run_cv(df, seeds=(11, 22, 33, 44, 55), m=50.0, use_ipw=True, use_weights=True, val_fraction=0.1)
display(cv_ipw)

print("\nMean ± std (IPW):")
print("NDCG@5 test:", float(cv_ipw["ndcg@5_test"].mean()), "±", float(cv_ipw["ndcg@5_test"].std()))
print(
    "NDCG@5 test (include all sessions, no-click => 0):",
    float(cv_ipw["ndcg@5_test_all_sessions"].mean()),
    "±",
    float(cv_ipw["ndcg@5_test_all_sessions"].std()),
)



[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001968 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3743
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 46
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.335299
Early stopping, best iteration is:
[76]	valid_0's ndcg@5: 0.34277



[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002048 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 46
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.377162
Early stopping, best iteration is:
[32]	valid_0's ndcg@5: 0.384331



[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002175 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3744
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 46
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.318428
Early stopping, best iteration is:
[201]	valid_0's ndcg@5: 0.3221



[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002639 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 46
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.336977
Early stopping, best iteration is:
[33]	valid_0's ndcg@5: 0.34448



[LightGBM] [Info] Calculating query weights...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001822 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3747
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 46
[LightGBM] [Info] Calculating query weights...
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.33015
Early stopping, best iteration is:
[2]	valid_0's ndcg@5: 0.338273




| | seed | ndcg@5_train | ndcg@5_val | ndcg@5_test | ndcg@5_test_all_sessions |
|---|---|---|---|---|---|
| 0 | 11 | 0.764005 | 0.634068 | 0.657799 | 0.309577 |
| 1 | 22 | 0.725196 | 0.673884 | 0.639176 | 0.296418 |
| 2 | 33 | 0.827802 | 0.621303 | 0.649767 | 0.320823 |
| 3 | 44 | 0.726547 | 0.651622 | 0.651883 | 0.311682 |
| 4 | 55 | 0.673399 | 0.653583 | 0.635847 | 0.299643 |



Mean ± std (IPW):
NDCG@5 test: 0.6468944992685823 ± 0.009133700829567545
NDCG@5 test (include all sessions, no-click => 0): 0.3076283373258502 ± 0.009794965101263674
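
Because the baseline and IPW runs use the same seeds, their test scores can be compared pairwise. The sketch below copies the `ndcg@5_test` columns from the two CV tables above and computes per-seed differences; it is a descriptive check, not a formal significance test.

```python
from statistics import mean, stdev

# ndcg@5_test per seed (11, 22, 33, 44, 55), copied from the CV tables above
baseline = [0.653811, 0.640609, 0.649975, 0.658665, 0.630915]
ipw = [0.657799, 0.639176, 0.649767, 0.651883, 0.635847]

# Positive difference = IPW better on that seed
diffs = [w - b for b, w in zip(baseline, ipw)]
mean_diff = mean(diffs)
sd_diff = stdev(diffs)  # sample standard deviation (n - 1)

print("per-seed differences:", [round(d, 6) for d in diffs])
print("mean difference:", round(mean_diff, 6), "sd:", round(sd_diff, 6))
```

The mean difference is about 0.0001 with a per-seed spread near 0.005, so on these five splits the IPW variant is indistinguishable from the baseline.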


### Export predictions.csv (seed=42 test set)
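
A minimal sketch of the export step, assuming the file should contain one row per test impression with its model score and within-session rank. The helper name `export_predictions` and the column names `predicted_score` and `predicted_rank` are hypothetical; adjust them to the required submission format. In this notebook the call would be `export_predictions(test_fe, pred)`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def export_predictions(fe, scores, path="predictions.csv"):
    """Write session_id, product_id, score and within-session rank to CSV."""
    out = fe[["session_id", "product_id"]].copy()
    out["predicted_score"] = np.asarray(scores, dtype=float)
    # Rank 1 = highest score within the session; ties broken by row order
    out["predicted_rank"] = (
        out.groupby("session_id")["predicted_score"]
        .rank(method="first", ascending=False)
        .astype(int)
    )
    out.to_csv(path, index=False)
    return out

# Toy usage with a temporary file so the demo does not overwrite real output
toy_fe = pd.DataFrame({
    "session_id": [0, 0, 1],
    "product_id": ["prod_0_1", "prod_0_2", "prod_1_1"],
})
demo_path = os.path.join(tempfile.gettempdir(), "predictions_demo.csv")
toy_out = export_predictions(toy_fe, [0.2, 0.7, 0.1], path=demo_path)
print(toy_out)
```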

## Part 3: Business summary & results export

Generate `results.json` with all required sections.


In [29]:
# Generate complete results.json

TIME_SPENT_HOURS = 8.0  # keep consistent across artifacts

# IMPORTANT: report TEST NDCG@5 (not training), using the standard convention of
# excluding sessions with no positive labels (no-click sessions).
model_performance = {
    "algorithm_used": "LightGBM",
    "ndcg_at_5": round(float(ndcg_test), 4),
    "features_count": int(len(feature_cols)),
    "top_features": imp.head(2).index.tolist(),
}

# Business Analysis
# Note: mapping offline NDCG to CTR lift is not direct; treat this as a hypothesis to validate via A/B.
expected_ctr_lift_percent = 15

main_risk = (
    "Position bias amplification (features include position; offline clicks are position-biased). "
    "Validate via A/B test and consider debiasing/counterfactual training."
)
recommendation = "test"

business_analysis = {
    "expected_ctr_lift_percent": expected_ctr_lift_percent,
    "main_risk": main_risk,
    "recommendation": recommendation,
}

candidate_info = {
    "language_used": "Python",
    "time_spent_hours": TIME_SPENT_HOURS,
}

results = {
    "candidate_info": candidate_info,
    "data_analysis": data_analysis,
    "model_performance": model_performance,
    "business_analysis": business_analysis,
}

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print("Complete Results:")
print(json.dumps(results, indent=2, ensure_ascii=False))
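print("\n✓ results.json saved successfully")
# predictions.csv is written by the export cell below; confirm it only if the file exists.
if os.path.exists("predictions.csv"):
    print("✓ predictions.csv saved successfully")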


Complete Results:
{
  "candidate_info": {
    "language_used": "Python",
    "time_spent_hours": 8.0
  },
  "data_analysis": {
    "overall_ctr": 0.0767,
    "position_bias_ratio": 3.42,
    "electronics_ctr": 0.1219,
    "quality_correlation": 0.1198,
    "best_category": "Elektronika"
  },
  "model_performance": {
    "algorithm_used": "LightGBM",
    "ndcg_at_5": 0.619,
    "features_count": 46,
    "top_features": [
      "quality_minus_session_mean",
      "quality_x_log_price"
    ]
  },
  "business_analysis": {
    "expected_ctr_lift_percent": 15,
    "main_risk": "Position bias amplification (features include position; offline clicks are position-biased). Validate via A/B test and consider debiasing/counterfactual training.",
    "recommendation": "test"
  }
}

✓ results.json saved successfully
✓ predictions.csv saved successfully
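
Since the 15% CTR lift is a hypothesis and the recommendation is to test, a rough sizing of that A/B test can be sketched as follows. This assumes a two-sided two-proportion z-test at α = 0.05 with 80% power, the baseline `overall_ctr` of 0.0767 from `data_analysis`, and independent impressions. Impressions within a session are not independent, so the result is best read as a lower bound. The function name `impressions_per_arm` is hypothetical.

```python
import math
from statistics import NormalDist


def impressions_per_arm(baseline_ctr, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion z-test (sketch)."""
    p1 = baseline_ctr
    p2 = baseline_ctr * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_15 = impressions_per_arm(baseline_ctr=0.0767, relative_lift=0.15)
n_05 = impressions_per_arm(baseline_ctr=0.0767, relative_lift=0.05)
print(f"Impressions per arm for a 15% relative lift: {n_15}")
print(f"Impressions per arm for a 5% relative lift:  {n_05}")
```

The second call illustrates how quickly the required traffic grows if the true lift turns out smaller than hoped.
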


In [30]:
pred_out = test_fe[["session_id","product_id"]].copy()
pred_out["actual_clicked"] = test_fe["clicked"].astype(int).to_numpy()
pred_out["predicted_score"] = pred
pred_out.to_csv("predictions.csv", index=False)

pred_out.head()

Unnamed: 0,session_id,product_id,actual_clicked,predicted_score
0,17,prod_17_1,0,0.088004
1,17,prod_17_2,0,0.182742
2,17,prod_17_3,0,-0.29371
3,17,prod_17_4,0,-0.000457
4,17,prod_17_5,0,-0.231431
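
The `predicted_score` column holds raw ranker outputs, which can be negative and are not click probabilities; only their order within a session is meaningful. A small sketch of converting scores to per-session ranks, using the five rows shown above (the `predicted_rank` column is an illustrative addition, not part of the exported file):

```python
import pandas as pd

# Rows copied from the head of predictions.csv shown above (session 17).
preds = pd.DataFrame(
    {
        "session_id": [17, 17, 17, 17, 17],
        "product_id": ["prod_17_1", "prod_17_2", "prod_17_3", "prod_17_4", "prod_17_5"],
        "actual_clicked": [0, 0, 0, 0, 0],
        "predicted_score": [0.088004, 0.182742, -0.293710, -0.000457, -0.231431],
    }
)

# Rank 1 = highest score within each session; ties broken by row order.
preds["predicted_rank"] = (
    preds.groupby("session_id")["predicted_score"]
    .rank(method="first", ascending=False)
    .astype(int)
)

ranked = preds.sort_values(["session_id", "predicted_rank"])
print(ranked[["product_id", "predicted_score", "predicted_rank"]])
```
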


In [31]:
# --- Robustness checks: position baseline + no-position ablation ---


def _split_feature_sets(feature_cols):
    """Return (position_related, non_position) feature lists."""
    pos_feats = []
    for c in feature_cols:
        if (
            c.startswith("pos_")
            or c.startswith("posb_")
            or ("position_bucket" in c)
            or ("_posb_" in c)
            or (c in {"pos_boost_clipped3", "ctr_prior_cat_x_posb"})
        ):
            pos_feats.append(c)

    pos_feats = sorted(set(pos_feats))
    pos_set = set(pos_feats)  # build once for O(1) membership tests
    non_pos = [c for c in feature_cols if c not in pos_set]
    return pos_feats, non_pos


def run_cv_feature_sets(
    df,
    seeds=(11, 22, 33, 44, 55),
    m=50.0,
    k=5,
    val_fraction=0.1,
):
    rows = []

    for seed in seeds:
        train_full_raw, test_raw = split_by_session(df, seed=seed)
        train_raw, val_raw = split_train_val_from_train(train_full_raw, seed=seed, val_fraction=val_fraction)

        train_fe, val_fe, feats_full, _ = build_features_v4(train_raw, val_raw, m=m, use_ipw=False)
        _, test_fe, _, _ = build_features_v4(train_raw, test_raw, m=m, use_ipw=False)

        pos_feats, feats_no_pos = _split_feature_sets(feats_full)

        # Baselines on TEST
        y = test_fe["clicked"].astype(int).to_numpy()
        sess = test_fe["session_id"].to_numpy()

        score_pos_only = test_fe["pos_boost_clipped3"].to_numpy()
        ndcg_pos_only = ndcg_at_k(y, score_pos_only, sess, k=k, ignore_no_positive=True)

        score_qpr = test_fe["quality_price_ratio"].to_numpy()
        ndcg_qpr = ndcg_at_k(y, score_qpr, sess, k=k, ignore_no_positive=True)

        # Full model (train+val protocol)
        model_full, _ = train_ranker(train_fe, val_fe, feats_full, seed=seed, use_weights=False)
        _, ndcg_full, ndcg_full_all_sessions = eval_ranker(model_full, test_fe, feats_full, k=k)

        # No-position-features model
        model_nopos, _ = train_ranker(train_fe, val_fe, feats_no_pos, seed=seed, use_weights=False)
        _, ndcg_nopos, ndcg_nopos_all_sessions = eval_ranker(model_nopos, test_fe, feats_no_pos, k=k)

        rows.append(
            {
                "seed": seed,
                "ndcg@5_test_pos_only": float(ndcg_pos_only),
                "ndcg@5_test_qpr_only": float(ndcg_qpr),
                "ndcg@5_test_full": float(ndcg_full),
                "ndcg@5_test_no_pos": float(ndcg_nopos),
                "ndcg@5_test_full_all_sessions": float(ndcg_full_all_sessions),
                "ndcg@5_test_no_pos_all_sessions": float(ndcg_nopos_all_sessions),
                "n_features_full": int(len(feats_full)),
                "n_features_no_pos": int(len(feats_no_pos)),
                "pos_feature_count": int(len(pos_feats)),
            }
        )

    out = pd.DataFrame(rows)
    display(out)

    def _mean_std(col):
        # pandas .std() is the sample standard deviation (ddof=1) across seeds.
        return float(out[col].mean()), float(out[col].std())

    print("\nMean ± std (CV, TEST):")
    for col in [
        "ndcg@5_test_pos_only",
        "ndcg@5_test_qpr_only",
        "ndcg@5_test_full",
        "ndcg@5_test_no_pos",
        "ndcg@5_test_full_all_sessions",
        "ndcg@5_test_no_pos_all_sessions",
    ]:
        mu, sd = _mean_std(col)
        print(f"{col}: {mu:.4f} ± {sd:.4f}")

    return out


robust_cv = run_cv_feature_sets(df, seeds=(11, 22, 33, 44, 55), m=50.0, k=5, val_fraction=0.1)




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002151 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3743
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.8132
Early stopping, best iteration is:
[18]	valid_0's ndcg@5: 0.822269




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002420 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3716
[LightGBM] [Info] Number of data points in the train set: 48690, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.748127
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.764754




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002834 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.813272
Early stopping, best iteration is:
[46]	valid_0's ndcg@5: 0.833424




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001733 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3721
[LightGBM] [Info] Number of data points in the train set: 48703, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.753424
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.768621




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002234 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3744
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.825149
Early stopping, best iteration is:
[232]	valid_0's ndcg@5: 0.829503




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002345 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3717
[LightGBM] [Info] Number of data points in the train set: 48658, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.757561
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.779958




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003068 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3748
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.825377
Early stopping, best iteration is:
[27]	valid_0's ndcg@5: 0.83452




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002443 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3721
[LightGBM] [Info] Number of data points in the train set: 48623, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.767273
Early stopping, best iteration is:
[9]	valid_0's ndcg@5: 0.787738




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001753 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3747
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 46
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.828888
Early stopping, best iteration is:
[3]	valid_0's ndcg@5: 0.836103




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002136 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3720
[LightGBM] [Info] Number of data points in the train set: 48694, number of used features: 38
Training until validation scores don't improve for 300 rounds
[300]	valid_0's ndcg@5: 0.764799
Early stopping, best iteration is:
[1]	valid_0's ndcg@5: 0.780051




Unnamed: 0,seed,ndcg@5_test_pos_only,ndcg@5_test_qpr_only,ndcg@5_test_full,ndcg@5_test_no_pos,ndcg@5_test_full_all_sessions,ndcg@5_test_no_pos_all_sessions,n_features_full,n_features_no_pos,pos_feature_count
0,11,0.628691,0.518285,0.653811,0.539937,0.3077,0.254108,46,38,8
1,22,0.616157,0.496397,0.640609,0.521636,0.297082,0.241909,46,38,8
2,33,0.63101,0.52375,0.649975,0.565795,0.320925,0.279361,46,38,8
3,44,0.627556,0.497434,0.658665,0.523201,0.314924,0.250156,46,38,8
4,55,0.609911,0.508788,0.630915,0.547167,0.297319,0.257852,46,38,8



Mean ± std (CV, TEST):
ndcg@5_test_pos_only: 0.6227 ± 0.0091
ndcg@5_test_qpr_only: 0.5089 ± 0.0122
ndcg@5_test_full: 0.6468 ± 0.0111
ndcg@5_test_no_pos: 0.5395 ± 0.0183
ndcg@5_test_full_all_sessions: 0.3076 ± 0.0106
ndcg@5_test_no_pos_all_sessions: 0.2567 ± 0.0140
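
Because every variant is evaluated on the same split for a given seed, the per-seed differences are more informative than the unpaired means. The sketch below recomputes them from the table above; the values are copied from that output and the variable names are illustrative.

```python
from statistics import mean, stdev

# NDCG@5 on TEST per seed (11, 22, 33, 44, 55), copied from the table above.
pos_only = [0.628691, 0.616157, 0.631010, 0.627556, 0.609911]
full = [0.653811, 0.640609, 0.649975, 0.658665, 0.630915]
no_pos = [0.539937, 0.521636, 0.565795, 0.523201, 0.547167]

# Paired differences: same split, different scoring function.
gain_over_position = [f - p for f, p in zip(full, pos_only)]
cost_of_dropping_position = [f - n for f, n in zip(full, no_pos)]

print(f"full - pos_only: {mean(gain_over_position):.4f} ± {stdev(gain_over_position):.4f}")
print(f"full - no_pos:   {mean(cost_of_dropping_position):.4f} ± {stdev(cost_of_dropping_position):.4f}")
```

On these five seeds the full model beats the position-only baseline every time, but by only about 0.024 NDCG@5, while removing the position features costs about 0.107. This is consistent with the position-bias risk stated in `main_risk`: much of the offline score appears to come from position itself rather than from product relevance.
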
