# Building Model-Ready Feature Matrices (`X_df`)

This notebook transforms `df_clean` into **numerical feature matrices**
suitable for machine learning models.

Two feature matrices are constructed to handle differences in metric
availability across NBA eras:

- **Era-wide matrix (1996–2024)**  
  Uses only internally computed, season-relative percentiles to ensure
  full historical coverage and temporal consistency.

- **Modern-era matrix (≥2014)**  
  Extends the base feature set with external advanced metrics
  (RAPTOR, LEBRON, MAMBA) when available.

All operations are purely feature-oriented:
no award-specific filtering, labeling, or modeling assumptions
are introduced at this stage.

**Outputs** (written to `data/interim/`):
- `X_df_era.parquet`
- `X_df_modern.parquet`
- `features_era_pct_only.csv` and `features_modern_pct_plus_externals.csv`, which list the columns of each matrix


In [16]:
from pathlib import Path
import pandas as pd
import numpy as np

# ------------------------------------------------------------------
# Project root (robust, no pyproject.toml needed)
# Looks for common "project markers": .git, data/, notebooks/, README.md
# ------------------------------------------------------------------

NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR

MARKERS = [
    ".git",
    "data",
    "notebooks",
    "README.md",
]

def is_project_root(p: Path) -> bool:
    return any((p / m).exists() for m in MARKERS)

while not is_project_root(PROJECT_ROOT):
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError(
            "Project root not found. Run this notebook from inside the repo "
            "(a folder containing one of: .git, data/, notebooks/, README.md)."
        )
    PROJECT_ROOT = PROJECT_ROOT.parent

DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
DATA_PROCESSED_FINAL = DATA_PROCESSED / "players" / "final"
DATA_INTERIM = PROJECT_ROOT / "data" / "interim"
DATA_INTERIM.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_PROCESSED:", DATA_PROCESSED)


PROJECT_ROOT: c:\Users\Luc\Documents\projets-data\nba-awards-predictor
DATA_PROCESSED: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\processed


In [17]:
# -----------------------------
# Load df_clean produced by notebook 02
# -----------------------------
DF_CLEAN_PATH = DATA_INTERIM / "df_clean.parquet"
df_clean = pd.read_parquet(DF_CLEAN_PATH)
print("df_clean shape:", df_clean.shape)


df_clean shape: (13842, 427)


## 1) Define feature families

We use **percentiles** as the primary modeling representation because they:
- are normalized within each season,
- reduce era drift,
- keep feature scales consistent.
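
The percentile columns themselves come from an earlier notebook, and its exact method is not shown here. As a minimal sketch of what "season-normalized" means, assuming a per-season rank-based percentile (the `season` and `pts` column names are illustrative, not taken from `df_clean`):

```python
import pandas as pd

# Toy data; `season` and `pts` are hypothetical column names.
toy = pd.DataFrame({
    "season": [1998, 1998, 1998, 2020, 2020, 2020],
    "pts":    [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
})

# Percentile rank computed independently inside each season.
toy["pct_pts"] = toy.groupby("season")["pts"].rank(pct=True)

print(toy)
```

The lowest scorer of 1998 and the lowest scorer of 2020 receive the same percentile even though their raw values differ, which is the property that keeps features comparable across eras.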


In [18]:
import re

def cols_starting_with(df, prefix: str):
    return [c for c in df.columns if c.startswith(prefix)]

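# Base percentiles (from Basketball-Reference derived tables).
# The "pct_" prefix also matches external-metric percentiles
# (pct_raptor__*, pct_lebron__*, pct_mamba__*), so exclude those here
# to keep the era-wide feature set limited to internal percentiles.
EXTERNAL_PCT_PREFIXES = ("pct_raptor__", "pct_lebron__", "pct_mamba__")
BASE_PCT_FEATURES = [
    c for c in cols_starting_with(df_clean, "pct_")
    if not c.startswith(EXTERNAL_PCT_PREFIXES)
]

# Remove outcome-derived percentiles
BASE_PCT_FEATURES = [c for c in BASE_PCT_FEATURES if not re.match(r"^pct_is_.*winner$", c)]
print("Base pct features:", len(BASE_PCT_FEATURES))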


Base pct features: 144


## 2) External metric percentiles (modern coverage)

We add only **percentiles** for external metrics (neither raw values nor ranks).


In [19]:
EXTERNAL_PCT_COLS = [
    # RAPTOR
    "pct_raptor__raptor_total",
    "pct_raptor__raptor_offense",
    "pct_raptor__raptor_defense",
    "pct_raptor__war_total",
    # LEBRON
    "pct_lebron__LEBRON",
    "pct_lebron__O-LEBRON",
    "pct_lebron__D-LEBRON",
    # MAMBA
    "pct_mamba__MAMBA",
    "pct_mamba__O-MAMBA",
    "pct_mamba__D-MAMBA",
]
EXTERNAL_PCT_COLS = [c for c in EXTERNAL_PCT_COLS if c in df_clean.columns]
print("External pct features (existing):", len(EXTERNAL_PCT_COLS))
print("External pct features list:", EXTERNAL_PCT_COLS)


External pct features (existing): 0
External pct features list: []
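
The output above shows that none of the expected external columns were found in `df_clean`, and the silent filter does not say which names were missing. A small sketch of a more explicit check follows; `split_present_missing` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def split_present_missing(df: pd.DataFrame, expected: list) -> tuple:
    """Partition expected column names into those present in df and those absent."""
    present = [c for c in expected if c in df.columns]
    missing = [c for c in expected if c not in df.columns]
    return present, missing

# Toy frame in which only one of the two expected columns exists.
toy = pd.DataFrame({"pct_raptor__raptor_total": [0.5, 0.9]})
expected = ["pct_raptor__raptor_total", "pct_lebron__LEBRON"]

present, missing = split_present_missing(toy, expected)
print("present:", present)
print("missing:", missing)
```

Printing the missing names makes it easier to tell a naming mismatch apart from metrics that were never merged upstream.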


## 3) Build do-not-use-in-X list (leakage + identifiers)


In [20]:
LEAK_PATTERNS = [
    r".*_rank$",
    r"^is_.*",
    r"^all_nba_.*", r"^all_def_.*", r"^all_rookie_.*",
    r"^has_.*consideration$",
    r"^pct_is_.*winner$",  # also covers names ending in "_winner"
]
def is_leak_col(c: str) -> bool:
    return any(re.match(p, c) for p in LEAK_PATTERNS)

IDENTIFIER_COLS = {
    "Player", "player_key", "player_name_raw", "PLAYER_ID", "PLAYER_NAME",
    "raptor__nba_id", "raptor__player_id", "raptor__player_name",
    "lebron__nba_id", "lebron__player_name", "lebron__Season",
    "mamba__nba_id", "mamba__player_name",
}
NON_TABULAR_COLS = {"teams_list", "minutes_list"}
EXCLUDED_CATEGORICAL = {"Team"}   # we keep it in df_clean, but don't model it (for now)

DROP_FROM_X = {
    c for c in df_clean.columns
    if is_leak_col(c) or c in IDENTIFIER_COLS or c in NON_TABULAR_COLS or c in EXCLUDED_CATEGORICAL
}
print("Columns excluded from X (any reason):", len(DROP_FROM_X))


Columns excluded from X (any reason): 54
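
In the cells shown here, `DROP_FROM_X` is computed but the feature lists are selected by prefix, so nothing enforces that a selected feature is outside the exclusion set. One possible guard, sketched on toy names (the feature names and the shortened pattern list below are illustrative assumptions):

```python
import re

# Shortened pattern list for the example; same style as LEAK_PATTERNS.
LEAK_PATTERNS_DEMO = [r".*_rank$", r"^is_.*", r"^pct_is_.*winner$"]

def find_leaks(features, patterns=LEAK_PATTERNS_DEMO):
    """Return the feature names that match at least one leakage pattern."""
    return [c for c in features if any(re.match(p, c) for p in patterns)]

# Hypothetical candidate features, two of which should be rejected.
candidate = ["pct_pts", "pct_ws_rank", "pct_is_mvp_winner", "pct_ast"]

leaks = find_leaks(candidate)
safe = [c for c in candidate if c not in leaks]

print("leaks:", leaks)
print("safe:", safe)
```

In the notebook itself, the equivalent check would be asserting that the intersection of the selected feature columns and `DROP_FROM_X` is empty before saving.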


## 4) Construct X_df_era and X_df_modern

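- `X_df_era`: base percentiles + `Pos` (categorical)
- `X_df_modern`: the same columns plus the external metric percentiles. No rows are dropped here; pre-2014 rows hold NaN in the external columns, and the modeling notebook is expected to use those features **only for modern-era samples** (≥2014).

In this run, no external percentile columns were found in `df_clean` (see section 2), so both matrices end up with the same columns.

Each variant is kept as a single table to keep the modeling notebook simple.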
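
As a sketch of how the modeling step might restrict the modern variant to ≥2014 rows: this assumes a season-year column is available alongside the features. The name `season` is hypothetical, and the matrices built in this notebook do not carry such a column.

```python
import numpy as np
import pandas as pd

MODERN_START = 2014  # threshold taken from the text above

toy = pd.DataFrame({
    "season": [2005, 2013, 2014, 2021],  # hypothetical column name
    "pct_pts": [0.40, 0.55, 0.70, 0.95],
    "pct_raptor__raptor_total": [np.nan, np.nan, 0.65, 0.90],
})

# Modern variant: keep only rows where external metrics can exist.
is_modern = toy["season"] >= MODERN_START
X_modern_rows = toy.loc[is_modern].drop(columns="season")

# Era-wide variant: all rows, external columns left out.
X_era_rows = toy.drop(columns=["season", "pct_raptor__raptor_total"])

print(X_modern_rows)
print(X_era_rows)
```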


In [21]:
# Keep only the columns we intend to use (+ Pos for one-hot)
CORE_COLS = ["Pos"] if "Pos" in df_clean.columns else []
X_base = df_clean[CORE_COLS + BASE_PCT_FEATURES].copy()

# Modern X has extra columns (even if many NaN pre-2014)
X_modern = df_clean[CORE_COLS + BASE_PCT_FEATURES + EXTERNAL_PCT_COLS].copy()

X_df_era = X_base
X_df_modern = X_modern

print("X_df_era shape:", X_df_era.shape)
print("X_df_modern shape:", X_df_modern.shape)


X_df_era shape: (13842, 145)
X_df_modern shape: (13842, 145)
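
`Pos` is kept as a raw categorical column here and is meant to be one-hot encoded later. A minimal sketch of that downstream step using `pandas.get_dummies`, on toy values (the encoder actually used in the modeling notebook may differ):

```python
import pandas as pd

toy = pd.DataFrame({
    "Pos": ["PG", "C", "PG", "SF"],
    "pct_pts": [0.80, 0.35, 0.60, 0.50],
})

# Replace the categorical column with one float indicator column per position.
X_encoded = pd.get_dummies(toy, columns=["Pos"], prefix="Pos", dtype=float)

print(X_encoded.columns.tolist())
```

Fitting the encoder on training data only (for example with scikit-learn's `OneHotEncoder`) would be the safer choice if rare positions can appear only at prediction time.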


## 5) Quick audit (types, missingness)


In [22]:
def audit_X(X: pd.DataFrame, name: str):
    """Print object-typed columns, array-like object columns, and all-NaN numeric columns."""
    obj_cols = [c for c in X.columns if X[c].dtype == "object"]
    array_like = []
    for c in obj_cols:
        sample = X[c].dropna().head(50)
        if any(isinstance(v, (list, tuple, dict, set, np.ndarray)) for v in sample):
            array_like.append(c)

    numeric_cols = [c for c in X.columns if c not in obj_cols]
    all_nan_numeric = [c for c in numeric_cols if X[c].isna().all()]

    print(f"== {name} ==")
    print("Object cols:", len(obj_cols))
    print("Array-like object cols (should be 0):", array_like)
    print("All-NaN numeric cols:", len(all_nan_numeric))
    if all_nan_numeric:
        print("Example:", all_nan_numeric[:10])

audit_X(X_df_era, "X_df_era")
audit_X(X_df_modern, "X_df_modern")


== X_df_era ==
Object cols: 1
Array-like object cols (should be 0): []
All-NaN numeric cols: 0
== X_df_modern ==
Object cols: 1
Array-like object cols (should be 0): []
All-NaN numeric cols: 0
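
The audit above only flags columns that are entirely NaN. To see partial missingness as well, a per-column NaN rate can be computed; `missing_rate` below is a hypothetical helper shown on toy data:

```python
import numpy as np
import pandas as pd

def missing_rate(X: pd.DataFrame) -> pd.Series:
    """Fraction of NaN values per column, highest first."""
    return X.isna().mean().sort_values(ascending=False)

toy = pd.DataFrame({
    "pct_a": [0.1, 0.2, 0.3, 0.4],
    "pct_b": [np.nan, 0.5, np.nan, 0.9],
    "pct_c": [np.nan, 0.2, 0.3, 0.4],
})

rates = missing_rate(toy)
print(rates)
```

Applied to `X_df_modern`, this would show how sparse the external columns are once they are present.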


## 6) Save feature matrices and column lists


In [23]:
X_ERA_PATH = DATA_INTERIM / "X_df_era.parquet"
X_MOD_PATH = DATA_INTERIM / "X_df_modern.parquet"
ERA_COLS_PATH = DATA_INTERIM / "features_era_pct_only.csv"
MOD_COLS_PATH = DATA_INTERIM / "features_modern_pct_plus_externals.csv"

X_df_era.to_parquet(X_ERA_PATH, index=False)
X_df_modern.to_parquet(X_MOD_PATH, index=False)

pd.Series(X_df_era.columns, name="feature").to_csv(ERA_COLS_PATH, index=False)
pd.Series(X_df_modern.columns, name="feature").to_csv(MOD_COLS_PATH, index=False)

print("Saved:", X_ERA_PATH)
print("Saved:", X_MOD_PATH)
print("Saved:", ERA_COLS_PATH)
print("Saved:", MOD_COLS_PATH)


Saved: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\interim\X_df_era.parquet
Saved: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\interim\X_df_modern.parquet
Saved: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\interim\features_era_pct_only.csv
Saved: c:\Users\Luc\Documents\projets-data\nba-awards-predictor\data\interim\features_modern_pct_plus_externals.csv
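
The saved feature lists can be used downstream to restore the exact column order of a matrix. A minimal round-trip sketch on toy data, written to a temporary directory (the file name `features_demo.csv` is illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Pos": ["PG", "C"],
    "pct_pts": [0.8, 0.3],
    "pct_ast": [0.9, 0.1],
})

with tempfile.TemporaryDirectory() as tmp:
    cols_path = Path(tmp) / "features_demo.csv"
    # Same pattern as the cell above: one column named "feature".
    pd.Series(toy.columns, name="feature").to_csv(cols_path, index=False)
    saved_features = pd.read_csv(cols_path)["feature"].tolist()

# A frame whose columns arrive in a different order can be realigned.
shuffled = toy[["pct_ast", "Pos", "pct_pts"]]
realigned = shuffled[saved_features]

print(saved_features)
```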
