# 07 — Error analysis & audit (season-wise ranking)

This notebook focuses on **interpreting** model behavior:
- Inspect winner ranks by season
- Identify failure seasons and common error modes
- Compare awards and model variants (baseline vs tree models)
- Produce auditable tables suitable for a report/paper

Inputs:
- Result artifacts exported by Notebook 05 (baseline) and Notebook 06 (tree models):
  - `metrics.json`
  - `val_winner_ranks.parquet`
  - `test_winner_ranks.parquet`


In [31]:
# =============================
# Setup: paths + run discovery
# =============================
from pathlib import Path
import json
import pandas as pd

# Notebook-safe project root detection
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent

# We support BOTH historical output locations:
# - data/processed/modeling* (older)
# - data/experiments/*       (newer, preferred)
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
EXPERIMENTS_DIR = PROJECT_ROOT / "data" / "experiments"

# Baseline (logistic regression) results candidates
BASELINE_DIR_CANDIDATES = [
    EXPERIMENTS_DIR / "logreg_baseline",
    PROCESSED_DIR / "modeling",
]

# Tree models results candidates:
# new convention is: data/experiments/tree_models/<model_name>/{award}/{timestamp}
TREE_MODEL_NAME = "xgb"  # change to "lgb" or "cat" if you ran those
TREE_DIR_CANDIDATES = [
    EXPERIMENTS_DIR / "tree_models" / TREE_MODEL_NAME,
    PROCESSED_DIR / "modeling_tree",
]

def _first_existing(paths):
    for p in paths:
        if p.exists():
            return p
    return None

BASELINE_DIR = _first_existing(BASELINE_DIR_CANDIDATES)
TREE_DIR = _first_existing(TREE_DIR_CANDIDATES)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("BASELINE_DIR:", BASELINE_DIR)
print("TREE_DIR    :", TREE_DIR)

AWARDS = ["mvp", "dpoy", "smoy", "roy", "mip"]

def latest_run_dir(award_dir: Path) -> Path | None:
    """Return the latest timestamped subdir (YYYYMMDD_HHMMSS) if any."""
    if not award_dir.exists():
        return None
    subdirs = [p for p in award_dir.iterdir() if p.is_dir()]
    if not subdirs:
        return None
    # timestamps sort lexicographically
    subdirs = sorted(subdirs, key=lambda p: p.name)
    return subdirs[-1]

def load_run(run_dir: Path):
    """Load metrics + winner-rank tables from one run dir."""
    metrics_path = run_dir / "metrics.json"
    val_path = run_dir / "val_winner_ranks.parquet"
    test_path = run_dir / "test_winner_ranks.parquet"

    metrics = json.loads(metrics_path.read_text(encoding="utf-8")) if metrics_path.exists() else {}
    val_wr = pd.read_parquet(val_path) if val_path.exists() else pd.DataFrame()
    test_wr = pd.read_parquet(test_path) if test_path.exists() else pd.DataFrame()
    return metrics, val_wr, test_wr

def load_latest_runs(base_dir: Path | None, awards: list[str]) -> dict:
    """Return dict[award] = (run_dir, metrics, val_wr, test_wr)."""
    runs = {}
    if base_dir is None or not base_dir.exists():
        return runs
    for a in awards:
        a_dir = base_dir / a
        rdir = latest_run_dir(a_dir)
        if rdir is None:
            continue
        metrics, val_wr, test_wr = load_run(rdir)
        runs[a] = (rdir, metrics, val_wr, test_wr)
    return runs


PROJECT_ROOT: c:\Users\Luc\Documents\projets-data\nba-awards-predictor
BASELINE_DIR: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\experiments\logreg_baseline
TREE_DIR    : c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\experiments\tree_models\xgb


## Load the latest run per award

We assume each award folder contains timestamp subfolders.  
This helper loads the **most recent** run per award.
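
The helper relies on run folder names sorting chronologically. A minimal sketch of why that holds, assuming folder names follow the `YYYYMMDD_HHMMSS` pattern mentioned in the setup cell (the example names and the `is_timestamp` helper are hypothetical):

```python
from datetime import datetime

# Hypothetical run folder names; only the YYYYMMDD_HHMMSS pattern comes from the notebook.
run_names = ["20240105_093000", "20231231_235959", "20240105_101500"]

def is_timestamp(name: str) -> bool:
    """True if the name parses as YYYYMMDD_HHMMSS."""
    try:
        datetime.strptime(name, "%Y%m%d_%H%M%S")
        return True
    except ValueError:
        return False

# Zero-padded timestamps sort identically as strings and as datetimes.
by_name = sorted(run_names)
by_time = sorted(run_names, key=lambda n: datetime.strptime(n, "%Y%m%d_%H%M%S"))
assert by_name == by_time

# Keep only names that parse as timestamps before picking the latest one.
latest = max(n for n in run_names if is_timestamp(n))
print(latest)
```

Filtering on `is_timestamp` is one way to avoid picking up stray folders (for example `.ipynb_checkpoints`) that would otherwise take part in the sort.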


In [32]:
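def latest_run_dir(base_dir: Path, award: str) -> Path:
    """Return the most recent run dir for one award.

    Overrides the helper of the same name defined in the setup cell.
    Raises FileNotFoundError if the award folder is missing or has no runs.
    """
    award_dir = base_dir / award
    if not award_dir.exists():
        raise FileNotFoundError(f"Missing: {award_dir}")
    runs = [p for p in award_dir.iterdir() if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"No runs found in {award_dir}")
    # Timestamped names (YYYYMMDD_HHMMSS) sort lexicographically
    return sorted(runs, key=lambda p: p.name)[-1]


def load_latest_runs(base_dir: Path | None, awards: list[str]) -> dict:
    """Return dict[award] = {"run_dir", "metrics", "val_wr", "test_wr"} for the latest run."""
    out = {}
    if base_dir is None:
        # No results directory was discovered in the setup cell
        return out

    for a in awards:
        try:
            run_dir = latest_run_dir(base_dir, a)
        except FileNotFoundError as e:
            print(f"[WARN] {e}")
            continue

        metrics = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
        val_wr = pd.read_parquet(run_dir / "val_winner_ranks.parquet")
        test_wr = pd.read_parquet(run_dir / "test_winner_ranks.parquet")

        out[a] = {
            "run_dir": run_dir,
            "metrics": metrics,
            "val_wr": val_wr,
            "test_wr": test_wr,
        }

    return out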





In [33]:
# =============================
# Load latest runs
# =============================
baseline = load_latest_runs(BASELINE_DIR, AWARDS)
tree = load_latest_runs(TREE_DIR, AWARDS)

print(f"Baseline runs found: {len(baseline)}/{len(AWARDS)}")
print(f"Tree runs found    : {len(tree)}/{len(AWARDS)}")

# Helpful debug if something is missing
missing_b = [a for a in AWARDS if a not in baseline]
missing_t = [a for a in AWARDS if a not in tree]
if missing_b:
    print("[WARN] missing baseline awards:", missing_b)
if missing_t:
    print("[WARN] missing tree awards:", missing_t)


Baseline runs found: 5/5
Tree runs found    : 5/5


## Compare metrics across awards (baseline)

Use this as a quick health-check and for report tables.
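
As a complement to the table, a health-check can also be run directly on each `metrics.json` payload. The sketch below is an illustration rather than part of the pipeline: the helper name `check_metrics` is hypothetical, and it assumes the metric keys used elsewhere in this notebook (`val_mrr`, `test_top1`, `val_aucpr`, ...). It checks that bounded metrics lie in [0, 1] and that MRR is never below Top-1, which must hold because 1/rank is at least the Top-1 indicator in every season.

```python
import math

BOUNDED = ("mrr", "top1", "aucpr")  # metric suffixes expected to lie in [0, 1]

def check_metrics(metrics: dict) -> list[str]:
    """Return a list of problems found; an empty list means nothing suspicious."""
    problems = []
    for split in ("val", "test"):
        for name in BOUNDED:
            key = f"{split}_{name}"
            value = metrics.get(key)
            if value is None:
                continue  # metric not exported for this run
            if not isinstance(value, (int, float)) or math.isnan(value) or not 0.0 <= value <= 1.0:
                problems.append(f"{key} out of range: {value!r}")
        top1, mrr = metrics.get(f"{split}_top1"), metrics.get(f"{split}_mrr")
        # 1/rank >= (1 if rank == 1 else 0) for every season, hence MRR >= Top-1
        if isinstance(top1, (int, float)) and isinstance(mrr, (int, float)) and top1 > mrr + 1e-9:
            problems.append(f"{split}_top1 ({top1}) exceeds {split}_mrr ({mrr})")
    return problems

# Baseline MVP values as reported in this notebook
mvp_baseline = {"val_mrr": 0.412698, "val_top1": 0.333333, "test_mrr": 0.675, "test_top1": 0.5}
print(check_metrics(mvp_baseline))
```

An empty list only means that no inconsistency was detected; it says nothing about whether the metrics are good.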


In [34]:
def _unpack_run(v):
    """
    Accepts either:
      - tuple: (run_dir, metrics, val_wr, test_wr)
      - dict : {"run_dir":..., "metrics":..., "val_wr":..., "test_wr":...}
    Returns: (run_dir, metrics_dict, val_wr_df, test_wr_df)
    """
    if isinstance(v, tuple) and len(v) == 4:
        return v
    if isinstance(v, dict):
        return v.get("run_dir"), v.get("metrics"), v.get("val_wr"), v.get("test_wr")
    raise TypeError(f"Unexpected run format: {type(v)} -> {v}")



def metrics_table(runs: dict) -> pd.DataFrame:
    rows = []

    for a, v in runs.items():
        run_dir, metrics, val_wr, test_wr = _unpack_run(v)

        # metrics must be a dict
        if not isinstance(metrics, dict):
            print(f"[WARN] award={a}: metrics is not a dict (type={type(metrics)}). Value={metrics}")
            continue

        row = {"award": a, "run_dir": str(run_dir)}
        for k, val in metrics.items():
            if isinstance(val, (int, float, str)):
                row[k] = val
        rows.append(row)

    if not rows:
        print("[WARN] No runs found -> empty table.")
        return pd.DataFrame()

    dfm = pd.DataFrame(rows)

    sort_cols = [c for c in ["val_mrr", "test_mrr", "val_top1", "test_top1", "val_aucpr", "test_aucpr"] if c in dfm.columns]
    if sort_cols:
        dfm = dfm.sort_values(sort_cols, ascending=False)
    else:
        print("[WARN] No known metric columns to sort on. Available:", list(dfm.columns))

    return dfm


## Winner rank distribution (diagnostic)

We look at the rank of the true winner season-by-season.


In [35]:
def summarize_winner_ranks(winner_ranks: pd.DataFrame, split_name: str):
    if winner_ranks is None or winner_ranks.empty:
        return {}
    ranks = winner_ranks["rank"].astype(int)
    return {
        f"{split_name}_seasons": int(winner_ranks["season"].nunique()),
        f"{split_name}_top1": float((ranks == 1).mean()),
        f"{split_name}_top3": float((ranks <= 3).mean()),
        f"{split_name}_top5": float((ranks <= 5).mean()),
        f"{split_name}_top10": float((ranks <= 10).mean()),
        f"{split_name}_mrr": float((1.0 / ranks).mean()),
        f"{split_name}_rank_median": float(ranks.median()),
        f"{split_name}_rank_max": int(ranks.max()),
    }

rows = []
for a, v in baseline.items():
    run_dir, metrics, val_wr, test_wr = _unpack_run(v)

    # Guardrails: skip if val/test are not DataFrames
    if val_wr is not None and not isinstance(val_wr, pd.DataFrame):
        print(f"[WARN] award={a}: val_wr is not a DataFrame (type={type(val_wr)}). Value={val_wr}")
        val_wr = None
    if test_wr is not None and not isinstance(test_wr, pd.DataFrame):
        print(f"[WARN] award={a}: test_wr is not a DataFrame (type={type(test_wr)}). Value={test_wr}")
        test_wr = None

    row = {"award": a, "run_dir": str(run_dir)}
    row.update(summarize_winner_ranks(val_wr, "val"))
    row.update(summarize_winner_ranks(test_wr, "test"))
    rows.append(row)

baseline_rank_tbl = pd.DataFrame(rows) if rows else pd.DataFrame()

if baseline_rank_tbl.empty:
    print("[WARN] No baseline winner-rank tables found (empty baseline or all val/test missing).")
else:
    sort_cols = [c for c in ["val_mrr", "test_mrr", "val_top1", "test_top1"] if c in baseline_rank_tbl.columns]
    if sort_cols:
        baseline_rank_tbl = baseline_rank_tbl.sort_values(sort_cols, ascending=False)
    display(baseline_rank_tbl)



| | award | run_dir | val_seasons | val_top1 | val_top3 | val_top5 | val_top10 | val_mrr | val_rank_median | val_rank_max | test_seasons | test_top1 | test_top3 | test_top5 | test_top10 | test_mrr | test_rank_median | test_rank_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | roy | c:\Users\Luc\Documents\projets-data\nba-awards... | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1 | 4 | 0.5 | 1.0 | 1.0 | 1.0 | 0.75 | 1.5 | 2 |
| 2 | smoy | c:\Users\Luc\Documents\projets-data\nba-awards... | 3 | 0.666667 | 1.0 | 1.0 | 1.0 | 0.833333 | 1.0 | 2 | 4 | 0.5 | 0.75 | 1.0 | 1.0 | 0.675 | 1.5 | 5 |
| 1 | dpoy | c:\Users\Luc\Documents\projets-data\nba-awards... | 3 | 0.333333 | 0.666667 | 0.666667 | 0.666667 | 0.513333 | 2.0 | 25 | 4 | 0.0 | 0.25 | 0.25 | 0.25 | 0.158645 | 19.5 | 33 |
| 0 | mvp | c:\Users\Luc\Documents\projets-data\nba-awards... | 3 | 0.333333 | 0.333333 | 0.333333 | 0.666667 | 0.412698 | 6.0 | 14 | 4 | 0.5 | 0.75 | 1.0 | 1.0 | 0.675 | 1.5 | 5 |
| 4 | mip | c:\Users\Luc\Documents\projets-data\nba-awards... | 3 | 0.0 | 0.333333 | 0.333333 | 0.666667 | 0.189033 | 7.0 | 11 | 4 | 0.5 | 0.5 | 0.5 | 0.75 | 0.5625 | 3.5 | 12 |


## Winner ranking analysis (baseline model)

This table reports ranking-based evaluation metrics for each NBA award.
Rather than measuring binary classification accuracy, we evaluate how well
the model ranks the true award winner among all eligible players for a given season.

### Metrics interpretation
- **Top-K**: proportion of seasons where the true winner appears in the Top-K ranked candidates.
- **MRR (Mean Reciprocal Rank)**: average inverse rank of the true winner (1.0 = always ranked first).
- **Rank median**: median rank of the true winner across seasons.
- **Rank max**: worst observed rank (failure case indicator).
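
To make these definitions concrete, here is a small worked example using the four baseline MIP test ranks listed in the drill-down section of this notebook (12, 6, 1, 1). The helper names are illustrative and are not the functions used by the pipeline.

```python
from statistics import median

def top_k_rate(ranks, k):
    """Share of seasons in which the true winner is ranked k or better."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over seasons (1.0 means the winner is always first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Winner ranks for the four MIP test seasons (baseline model)
mip_test_ranks = [12, 6, 1, 1]

summary = {
    "top1": top_k_rate(mip_test_ranks, 1),
    "top10": top_k_rate(mip_test_ranks, 10),
    "mrr": mean_reciprocal_rank(mip_test_ranks),
    "rank_median": median(mip_test_ranks),
    "rank_max": max(mip_test_ranks),
}
print(summary)
```

The MRR works out as (1/12 + 1/6 + 1 + 1) / 4 = 0.5625, which matches the `test_mrr` reported for MIP in the table above, alongside Top-1 = 0.5, Top-10 = 0.75, median rank 3.5 and worst rank 12.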

### Key observations
- **ROY** and **SMOY** are the best-captured awards: the winner is in the Top-3 in every validation season, and in 4 of 4 (ROY) and 3 of 4 (SMOY) test seasons.
- **DPOY** is the weakest award on the test split: the winner falls outside the Top-10 in 3 of 4 seasons (median rank 19.5, worst rank 33), which is consistent with the limited observability of defense.
- **MVP** winners always land in the Top-5 on the test split, but only in 1 of 3 validation seasons (worst rank 14). The final winner is ranked first in at most half of the seasons,
  highlighting the importance of narrative and contextual factors beyond pure statistics.
- **MIP** is unstable across splits: never Top-1 on validation (MRR 0.19) but Top-1 in 2 of 4 test seasons (MRR 0.56), with a worst test rank of 12.

Overall, with only 3 validation and 4 test seasons per award, these results suggest that purely statistical models are effective at identifying
strong candidates, but struggle to fully replicate awards driven by subjective or narrative components.


## Drill-down: seasons where the winner is badly ranked

This helps you understand whether failures come from:
- low minutes / sample size issues,
- missing defensive signal (DPOY),
- narrative components not captured by features,
- injuries / shortened seasons, etc.
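
Before inspecting causes, it helps to list the failure seasons explicitly. The following is a minimal sketch, assuming winner-rank rows with `season`, `score` and `rank` fields (for instance obtained with `wr.to_dict("records")`). The helper name `failure_seasons` and the `max_rank=5` threshold are arbitrary choices for illustration; the rows are the baseline MIP test results shown in this notebook.

```python
def failure_seasons(records, max_rank=5):
    """Return rows whose true winner is ranked worse than max_rank, worst first."""
    bad = [r for r in records if r["rank"] > max_rank]
    return sorted(bad, key=lambda r: r["rank"], reverse=True)

# Baseline MIP test rows (season, score, rank)
mip_test = [
    {"season": 2022, "score": 0.027606, "rank": 12},
    {"season": 2024, "score": 0.040011, "rank": 6},
    {"season": 2023, "score": 0.533106, "rank": 1},
    {"season": 2025, "score": 0.357478, "rank": 1},
]

for row in failure_seasons(mip_test):
    print(row["season"], row["rank"])
```

With this threshold, 2022 and 2024 are flagged, and those are the seasons to cross-check against minutes played, injuries or narrative context.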


In [36]:
AWARD = "mip"   # pick one
SPLIT = "test"   # "val" or "test"

run_dir, metrics, val_wr, test_wr = _unpack_run(baseline[AWARD])

wr = test_wr if SPLIT == "test" else val_wr

if not isinstance(wr, pd.DataFrame) or wr.empty:
    raise ValueError(f"No winner-rank table for award={AWARD}, split={SPLIT}")

display(wr.sort_values("rank", ascending=False))

worst_season = int(wr.sort_values("rank", ascending=False).iloc[0]["season"])
print("Worst season:", worst_season)


| | season | score | rank |
|---|---|---|---|
| 12362 | 2022 | 0.027606 | 12 |
| 13819 | 2024 | 0.040011 | 6 |
| 13062 | 2023 | 0.533106 | 1 |
| 14002 | 2025 | 0.357478 | 1 |


Worst season: 2022


### Worst-case season analysis

To better understand the limitations of the model, we inspect the seasons
where the true award winner receives the worst ranking.

These failure cases often correspond to:
- awards driven by non-boxscore contributions (e.g. defensive impact),
- narrative or contextual factors not captured by the features,
- or players whose value is poorly summarized by aggregated statistics.

This inspection suggests that ranking errors are not random but linked to the
nature of the award itself, although a handful of seasons per award is not enough to establish it.


## Compare baseline vs tree models (if available)

This gives you a clear “did boosting help?” story, award by award.


In [37]:
def _as_metrics_dict(x, award, model_name):
    if isinstance(x, dict):
        return x
    print(f"[WARN] award={award} ({model_name}): metrics is not a dict (type={type(x)}). Value={x}")
    return {}

def _delta(tree_val, base_val):
    """Return tree - baseline, or None if either value is missing or non-numeric."""
    if isinstance(tree_val, (int, float)) and isinstance(base_val, (int, float)):
        return tree_val - base_val
    return None

# Metrics compared between baseline and tree runs (column order follows this list)
COMPARED_METRICS = ["val_mrr", "test_mrr", "val_top1", "test_top1"]

if tree:
    comp_rows = []
    for a in AWARDS:
        if a not in baseline or a not in tree:
            continue

        b_dir, b_metrics, b_val, b_test = _unpack_run(baseline[a])
        t_dir, t_metrics, t_val, t_test = _unpack_run(tree[a])

        b_metrics = _as_metrics_dict(b_metrics, a, "baseline")
        t_metrics = _as_metrics_dict(t_metrics, a, "tree")

        row = {"award": a, "baseline_run": str(b_dir), "tree_run": str(t_dir)}
        for m in COMPARED_METRICS:
            row[f"baseline_{m}"] = b_metrics.get(m)
            row[f"tree_{m}"] = t_metrics.get(m)
            row[f"delta_{m}"] = _delta(t_metrics.get(m), b_metrics.get(m))
        comp_rows.append(row)

    if not comp_rows:
        print("[WARN] No overlapping awards between baseline and tree runs.")
    else:
        comp = pd.DataFrame(comp_rows)

        # Sort by best tree improvement on val MRR if available
        sort_cols = [c for c in ["delta_val_mrr", "tree_val_mrr", "delta_test_mrr", "tree_test_mrr"] if c in comp.columns]
        if sort_cols:
            comp = comp.sort_values(sort_cols[0], ascending=False)

        display(comp)
else:
    print("No tree runs found yet. Run Notebook 06 first.")


| | award | baseline_run | tree_run | baseline_val_mrr | tree_val_mrr | delta_val_mrr | baseline_test_mrr | tree_test_mrr | delta_test_mrr | baseline_val_top1 | tree_val_top1 | delta_val_top1 | baseline_test_top1 | tree_test_top1 | delta_test_top1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mvp | c:\Users\Luc\Documents\projets-data\nba-awards... | c:\Users\Luc\Documents\projets-data\nba-awards... | 0.412698 | 1.0 | 0.587302 | 0.675 | 1.0 | 0.325 | 0.333333 | 1.0 | 0.666667 | 0.5 | 1.0 | 0.5 |
| 1 | dpoy | c:\Users\Luc\Documents\projets-data\nba-awards... | c:\Users\Luc\Documents\projets-data\nba-awards... | 0.513333 | 0.833333 | 0.32 | 0.158645 | 0.343333 | 0.184688 | 0.333333 | 0.666667 | 0.333333 | 0.0 | 0.25 | 0.25 |
| 2 | smoy | c:\Users\Luc\Documents\projets-data\nba-awards... | c:\Users\Luc\Documents\projets-data\nba-awards... | 0.833333 | 0.833333 | 0.0 | 0.675 | 0.483333 | -0.191667 | 0.666667 | 0.666667 | 0.0 | 0.5 | 0.25 | -0.25 |
| 4 | mip | c:\Users\Luc\Documents\projets-data\nba-awards... | c:\Users\Luc\Documents\projets-data\nba-awards... | 0.189033 | 0.14127 | -0.047763 | 0.5625 | 0.408333 | -0.154167 | 0.0 | 0.0 | 0.0 | 0.5 | 0.25 | -0.25 |
| 3 | roy | c:\Users\Luc\Documents\projets-data\nba-awards... | c:\Users\Luc\Documents\projets-data\nba-awards... | 1.0 | 0.583333 | -0.416667 | 0.75 | 1.0 | 0.25 | 1.0 | 0.333333 | -0.666667 | 0.5 | 1.0 | 0.5 |


## Baseline vs Tree models — comparative analysis

This table compares the baseline (logistic regression) and tree-based models
(GBDT) across all awards, using ranking-oriented evaluation metrics.
Reported values focus on the ability of each model to correctly rank the true
award winner among all eligible players for a given season.

Delta metrics are computed as:
> **delta = tree − baseline**

Positive values indicate an improvement brought by the tree-based model.
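
Because each split contains only a few seasons, a Top-1 delta is easier to read as a number of seasons. The sketch below assumes the tree runs use the same 3 validation and 4 test seasons as the baseline; the helper name `seasons_changed` is hypothetical.

```python
def seasons_changed(delta_top1: float, n_seasons: int) -> int:
    """Convert a Top-1 delta (tree - baseline) into a signed number of seasons."""
    return round(delta_top1 * n_seasons)

N_VAL, N_TEST = 3, 4  # season counts taken from the baseline winner-rank summary

# (delta_val_top1, delta_test_top1) as reported in the comparison table
top1_deltas = {
    "mvp": (0.666667, 0.5),
    "dpoy": (0.333333, 0.25),
    "smoy": (0.0, -0.25),
    "mip": (0.0, -0.25),
    "roy": (-0.666667, 0.5),
}

for award, (d_val, d_test) in top1_deltas.items():
    print(award, seasons_changed(d_val, N_VAL), seasons_changed(d_test, N_TEST))
```

No Top-1 delta in the table corresponds to more than two seasons, which is worth keeping in mind when reading the observations that follow.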

---

### Key observations

#### MVP — strong apparent gains, but caution required
Tree-based models achieve a large improvement on MVP:
- Validation MRR increases from ~0.41 to 1.00
- Test MRR increases from ~0.68 to 1.00
- Top-1 accuracy also improves substantially on both splits

While these results suggest that tree models capture strong non-linear
interactions for MVP, such near-perfect performance warrants caution.
Given the known narrative and contextual components of MVP voting,
these gains should be interpreted carefully and cross-checked against
qualitative failure cases (see worst-season analysis).

---

#### DPOY — consistent improvement, but still imperfect
For DPOY, tree-based models provide:
- Clear gains in both validation and test MRR (0.51 → 0.83 and 0.16 → 0.34)
- Improved Top-1 accuracy on both splits (0.33 → 0.67 on validation, 0.00 → 0.25 on test)

However, absolute performance remains moderate and occasional extreme
mis-rankings persist, reflecting the limited observability of defensive
impact through available features.

---

#### SMOY — no clear benefit from tree models
On SMOY:
- Validation metrics are identical between baseline and tree
- Test performance degrades with tree-based models (MRR 0.675 → 0.483, Top-1 0.50 → 0.25)

This suggests that SMOY is largely driven by simple, well-captured statistical
signals (bench role, volume, efficiency), for which a linear model is sufficient.

---

#### MIP — tree models degrade performance
For MIP, tree-based models underperform on both splits:
- Decrease in both validation and test MRR (0.19 → 0.14 and 0.56 → 0.41)
- Top-1 accuracy unchanged on validation (0.00) and lower on test (0.50 → 0.25)

This is consistent with MIP being highly unstable and narrative-driven:
increased model complexity does not help when the underlying signal
is weak or poorly captured by quantitative features.

---

#### ROY — mixed behavior across splits
For ROY:
- Validation performance drops significantly with tree-based models
- Test performance improves, reaching perfect Top-1 accuracy

This split-dependent behavior suggests sensitivity to small sample sizes
(3 validation and 4 test seasons) and cohort effects typical of rookie populations.
The differences may reflect noise rather than a real difference between models.
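
To illustrate how sensitive MRR is with so few seasons, the sketch below recomputes it after dropping one season at a time. Per-season ROY ranks are not shown in this notebook, so the example uses the four baseline MIP test ranks from the drill-down section (12, 6, 1, 1); the helper names are illustrative.

```python
def mrr(ranks):
    """Mean reciprocal rank over seasons."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def leave_one_out_mrr(ranks):
    """MRR recomputed with each season removed in turn."""
    return [mrr(ranks[:i] + ranks[i + 1:]) for i in range(len(ranks))]

mip_test_ranks = [12, 6, 1, 1]

full = mrr(mip_test_ranks)
loo = leave_one_out_mrr(mip_test_ranks)
print(f"full MRR: {full:.4f}")
print(f"leave-one-out range: {min(loo):.4f} to {max(loo):.4f}")
```

Removing a single season moves the MRR from 0.5625 to anywhere between roughly 0.42 and 0.72, so differences of that size between splits or models should not be over-interpreted.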

---

### Overall conclusion

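Tree-based models are not uniformly superior to the baseline: they improve MVP and DPOY
on both splits, match the baseline on SMOY validation but degrade on test, underperform on MIP,
and give split-dependent results on ROY.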


## Optional: export report-ready tables

This writes CSVs you can include in a report/paper.


In [38]:
OUT = PROJECT_ROOT / "data" / "processed" / "reports"
OUT.mkdir(parents=True, exist_ok=True)

# Rebuild tables defensively
baseline_tbl = metrics_table(baseline) if 'baseline' in globals() else pd.DataFrame()

if 'baseline_rank_tbl' not in globals():
    rows = []
    for a, v in baseline.items():
        run_dir, metrics, val_wr, test_wr = _unpack_run(v)
        row = {"award": a, "run_dir": str(run_dir)}
        row.update(summarize_winner_ranks(val_wr, "val"))
        row.update(summarize_winner_ranks(test_wr, "test"))
        rows.append(row)
    baseline_rank_tbl = pd.DataFrame(rows) if rows else pd.DataFrame()

# Export
if not baseline_tbl.empty:
    baseline_tbl.to_csv(OUT / "baseline_metrics_summary.csv", index=False)

if not baseline_rank_tbl.empty:
    baseline_rank_tbl.to_csv(OUT / "baseline_winner_rank_summary.csv", index=False)

print("[OK] exported to:", OUT)



[OK] exported to: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\processed\reports


## Final remarks

This notebook completes the quantitative audit of the NBA awards prediction
pipeline. Results are consistent with data engineering and modeling components
that are robust and reproducible. Freedom from label leakage is not tested here,
and the perfect MVP scores of the tree model warrant a dedicated check.

Remaining performance gaps, particularly for narrative-driven awards, appear to reflect
structural limitations of purely statistical modeling rather than technical flaws.
