# Modeling and Evaluation Pipeline

This notebook trains and evaluates predictive models for NBA awards
using the datasets prepared in earlier steps.

It consumes the following artifacts:
- `df_clean.parquet`
- `X_df_era.parquet` and `X_df_modern.parquet`
- award-specific datasets built in Notebook 04

The focus is on **modeling and evaluation**, not feature construction.

We will implement:
- baseline and advanced models (e.g. Logistic Regression, Gradient Boosting),
- metrics adapted to extreme class imbalance (AUC-PR, log loss),
- season-aware ranking evaluation
  (e.g. does the true winner rank #1 or within the top-k each season?).
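
As a rough illustration of the metrics listed above, the following sketch combines the global metrics with the season-wise top-k check. It assumes scikit-learn is available and that predictions sit in a DataFrame with hypothetical columns `season`, `y_true` (1 for the award winner, 0 otherwise) and `y_score` (predicted probability); none of these names come from the project itself. AUC-PR is estimated here with average precision.

```python
import pandas as pd
from sklearn.metrics import average_precision_score, log_loss


def evaluate_predictions(df, season_col="season", label_col="y_true",
                         score_col="y_score", ks=(1, 3, 5)):
    """Return global AUC-PR and log loss plus season-wise top-k hit rates.

    Column names are hypothetical defaults, not taken from the project.
    """
    results = {
        "auc_pr": average_precision_score(df[label_col], df[score_col]),
        "log_loss": log_loss(df[label_col], df[score_col], labels=[0, 1]),
    }
    # Rank players within each season: rank 1 = highest predicted probability.
    ranked = df.assign(
        _rank=df.groupby(season_col)[score_col].rank(ascending=False, method="min")
    )
    # Best rank reached by a true winner in each season
    # (seasons without a labeled winner are left out of the hit rates).
    winner_rank = (
        ranked.loc[ranked[label_col] == 1].groupby(season_col)["_rank"].min()
    )
    for k in ks:
        results[f"top{k}_hit_rate"] = float((winner_rank <= k).mean())
    return results


# Toy predictions for two seasons (made-up numbers, for illustration only).
toy = pd.DataFrame({
    "season": [2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "y_true": [1, 0, 0, 0, 0, 1, 0],
    "y_score": [0.9, 0.5, 0.1, 0.2, 0.7, 0.6, 0.1],
})
scores = evaluate_predictions(toy)
print(scores)
```

In the toy data the winner is ranked first in one season and second in the other, so the top-1 hit rate is 0.5 and the top-3 hit rate is 1.0. Ties share the best rank (`method="min"`), which is an optimistic choice; a stricter variant could use `method="max"`.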

This notebook serves as the experimental backbone of the project,
enabling fair comparison across modeling choices and feature regimes.


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

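# -----------------------------
# Project root + paths
# -----------------------------
# Walk up from the working directory until a pyproject.toml is found.
PROJECT_ROOT = Path().resolve()
while not (PROJECT_ROOT / "pyproject.toml").exists() and PROJECT_ROOT != PROJECT_ROOT.parent:
    PROJECT_ROOT = PROJECT_ROOT.parent
# Fail early instead of silently using the filesystem root as the project root.
if not (PROJECT_ROOT / "pyproject.toml").exists():
    raise FileNotFoundError("pyproject.toml not found in any parent directory")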

DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
DATA_RAW = PROJECT_ROOT / "data" / "raw"
OUTPUT_DIR = PROJECT_ROOT / "data" / "interim"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)


In [None]:
# Load artifacts produced by the earlier notebooks (read from data/interim)
df_clean = pd.read_parquet(OUTPUT_DIR / "df_clean.parquet")
X_df_era = pd.read_parquet(OUTPUT_DIR / "X_df_era.parquet")
X_df_modern = pd.read_parquet(OUTPUT_DIR / "X_df_modern.parquet")


## TODO

- Define common preprocessing for `Pos` (one-hot) + numeric imputation.
- Implement a common evaluation function:
  - AUC-PR (global)
  - season-wise: rank players by predicted probability and check if winner is in top-1 / top-3 / top-5.
- For modern-era models (≥2014): use `X_df_modern` and drop pre-2014 seasons.
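
One possible shape for the preprocessing and modern-era items above is sketched below, assuming scikit-learn. The `season` column name, the `exclude` parameter, the helper names, and the choice of median imputation are assumptions made for illustration; only `Pos` and the 2014 cutoff come from the TODO list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def make_preprocessor(X, cat_cols=("Pos",), exclude=()):
    """One-hot encode categorical columns and median-impute numeric ones.

    Columns listed in `exclude` (and any other unlisted column) are dropped.
    """
    cat_cols = [c for c in cat_cols if c in X.columns]
    skip = set(cat_cols) | set(exclude)
    num_cols = [c for c in X.select_dtypes(include="number").columns
                if c not in skip]
    return ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", SimpleImputer(strategy="median"), num_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array
    )


def filter_modern(X, season_col="season", start=2014):
    """Keep only rows from `start` onward (hypothetical season column)."""
    return X.loc[X[season_col] >= start].copy()


# Toy frame standing in for X_df_modern (made-up values).
X_toy = pd.DataFrame({
    "season": [2012, 2014, 2015],
    "Pos": ["PG", "C", "PG"],
    "PTS": [10.0, np.nan, 30.0],
})
X_modern = filter_modern(X_toy)
preprocessor = make_preprocessor(X_modern, exclude=("season",))
X_out = preprocessor.fit_transform(X_modern)
print(X_out.shape)
```

In practice the preprocessor would be fitted on training seasons only and then wrapped in a `Pipeline` together with the model, so the imputation statistics do not leak information from held-out seasons.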
