# Notebook 02 — Talent Score v1 (Interpretable) + PCA Benchmark

**Project:** Risk-Adjusted Scouting  
**Last updated:** 2026-02-27

## Objective
Build an initial **Talent Score v1** for young **LW / RW / AM (18–25)** using **FBref Standard** metrics integrated with Transfermarkt IDs in DuckDB.

We implement:
1. **Interpretable weighted z-score composite** (baseline scout-friendly score)
2. **PCA 1-component score** as a technical benchmark
3. Minimal validation checks and exports (Top 50 ranking)

## Data Source
DuckDB table:
- `fact_player_season_fbref_tm`

> Notes  
- Coverage is not 100% by design; integration is entity-resolution based and auditable.
- This notebook is intentionally simple and explainable; richer features arrive after adding FBref Shooting/Passing/Possession/GCA.


In [None]:
from pathlib import Path
import duckdb
import pandas as pd
import numpy as np

# ML
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

# --- Robust project root detection ---
PROJECT_ROOT = Path.cwd().parent  # because notebook lives in /notebooks
DB_PATH = PROJECT_ROOT / "db" / "scouting.duckdb"

assert DB_PATH.exists(), f"DuckDB not found at: {DB_PATH.resolve()}"

SEASON = "2023-2024"
MIN_MINUTES = 900
AGE_MIN, AGE_MAX = 18, 25
TARGET_POS = {"LW", "RW", "AM"}

print("Using DB:", DB_PATH)


In [None]:
con = duckdb.connect(str(DB_PATH))

df = con.execute("""
SELECT *
FROM fact_player_season_fbref_tm
WHERE season = ?
""", [SEASON]).fetchdf()

con.close()

print("Loaded rows:", len(df))
display(df.head(3))


## Define scouting universe

Filters:
- Season: `2023–2024`
- Age: 18–25
- Minutes: ≥ 900
- Positions: LW / RW / AM (using primary position from FBref `pos` field)


In [None]:
# --- Define scouting universe (FBref positions) ---
df = df.copy()

# Basic filters
df = df[(df["age"] >= AGE_MIN) & (df["age"] <= AGE_MAX)]
df["playing_time_min"] = pd.to_numeric(df["playing_time_min"], errors="coerce")
df = df[df["playing_time_min"] >= MIN_MINUTES]

# FBref: pos can be "MF,FW", "FW,MF", etc.
pos_clean = df["pos"].astype(str).str.replace(" ", "")
df["is_fw"] = pos_clean.str.contains("FW")
df["is_mf"] = pos_clean.str.contains("MF")

# v1 attacking universe: any MF or FW (excludes pure DF/GK)
df = df[df["is_fw"] | df["is_mf"]].reset_index(drop=True)

print("Universe rows:", len(df))
print("pos breakdown (top 15):")
display(df["pos"].astype(str).value_counts().head(15))
display(df[["player","squad","comp","age","playing_time_min","pos","is_fw","is_mf"]].head(10))

## Feature selection (Talent Score v1)

We keep v1 intentionally **compact and explainable**:

- **Quality (per90):**
  - `per_90_minutes_gls` (goals/90)
  - `per_90_minutes_ast` (assists/90)

- **Contribution (season totals):**
  - `performance_gls`
  - `performance_ast`

- **Robustness / trust proxy:**
  - `playing_time_min`

> Later iterations will incorporate richer creation and ball progression features once additional FBref tables are integrated.


In [None]:
features = [
    "per_90_minutes_gls",
    "per_90_minutes_ast",
    "performance_gls",
    "performance_ast",
    "playing_time_min",
]

missing = [c for c in features if c not in df.columns]
assert not missing, f"Missing required columns: {missing}"

# Keep only relevant columns + identifiers
id_cols = ["player_id", "club_id", "player", "squad", "comp", "age", "pos", "pos_primary", "season"]
work = df[id_cols + features].copy()

# Basic cleanup
for c in features:
    work[c] = pd.to_numeric(work[c], errors="coerce")

before = len(work)
work = work.dropna(subset=features).reset_index(drop=True)
print(f"Dropped rows with missing feature values: {before - len(work)}")
print("Rows available for scoring:", len(work))

display(work.head(5))


## Talent Score v1 — Weighted z-score composite

We standardize features (z-score) and compute a weighted sum.

Baseline weights (tunable):
- Goals/90: 0.30
- Assists/90: 0.30
- Season goals: 0.15
- Season assists: 0.15
- Minutes: 0.10


In [None]:
# Standardize
scaler = StandardScaler()
Z = scaler.fit_transform(work[features])

z_cols = [f"{c}_z" for c in features]
Z = pd.DataFrame(Z, columns=z_cols)
scored = pd.concat([work, Z], axis=1)

weights = {
    "per_90_minutes_gls_z": 0.30,
    "per_90_minutes_ast_z": 0.30,
    "performance_gls_z": 0.15,
    "performance_ast_z": 0.15,
    "playing_time_min_z": 0.10,
}

# Safety: ensure weights sum to 1
w_sum = sum(weights.values())
assert abs(w_sum - 1.0) < 1e-9, f"Weights must sum to 1. Found {w_sum}"

scored["talent_score_v1"] = sum(scored[col] * w for col, w in weights.items())

display(scored[["player","squad","age","pos_primary","playing_time_min","talent_score_v1"]].sort_values("talent_score_v1", ascending=False).head(20))


## PCA Benchmark (1 component)

We compute a 1-component PCA score on the standardized features and compare it to the interpretable score.

This is not intended as the final scoring method—it's a **technical benchmark** to validate whether the linear weighted score aligns with the dominant latent structure in the data.


In [None]:
pca = PCA(n_components=1, random_state=42)
pc1 = pca.fit_transform(scored[z_cols].values).reshape(-1)

scored["talent_score_pca"] = pc1

print("Explained variance ratio (PC1):", float(pca.explained_variance_ratio_[0]))

corr = scored[["talent_score_v1","talent_score_pca"]].corr().iloc[0,1]
print("Correlation (v1 vs PCA):", float(corr))

# If PCA direction is inverted, correlation might be negative; align sign for ranking comparison
if corr < 0:
    scored["talent_score_pca"] = -scored["talent_score_pca"]
    corr2 = scored[["talent_score_v1","talent_score_pca"]].corr().iloc[0,1]
    print("Correlation after sign alignment:", float(corr2))

display(scored[["player","squad","age","pos_primary","talent_score_v1","talent_score_pca"]].sort_values("talent_score_pca", ascending=False).head(20))


## Position-adjusted Talent Score

In [None]:
# --- Position-adjusted score (within pos_primary) ---
# This removes role bias (FW vs MF) and makes rankings comparable across roles.

pos_grp = "pos_primary"

scored["talent_score_pos_adj"] = (
    scored.groupby(pos_grp)["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 15 — position-adjusted (overall):")
display(
    scored.sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(15)
)

print("\nTop 10 — MF only:")
display(
    scored[scored["pos_primary"]=="MF"]
    .sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(10)
)

print("\nTop 10 — FW only:")
display(
    scored[scored["pos_primary"]=="FW"]
    .sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(10)
)

## League-adjusted Talent Score

In [None]:
# --- League-adjusted score (within comp) ---
# Controls for league environment differences (pace, dominance, average scoring).

league_grp = "comp"

scored["talent_score_league_adj"] = (
    scored.groupby(league_grp)["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 15 — league-adjusted:")
display(
    scored.sort_values("talent_score_league_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min","talent_score_v1","talent_score_league_adj"]
    ].head(15)
)

## Position + League adjusted

In [None]:
# --- Position + league adjustment ---
# Standardize within (comp, pos_primary) for the fairest comparison.

scored["talent_score_comp_pos_adj"] = (
    scored.groupby(["comp","pos_primary"])["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 20 — league+position adjusted:")
display(
    scored.sort_values("talent_score_comp_pos_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min",
         "talent_score_v1","talent_score_pos_adj","talent_score_league_adj","talent_score_comp_pos_adj"]
    ].head(20)
)

## Comparison of rankings

In [None]:
def top_players(df, score_col, n=30):
    return df.sort_values(score_col, ascending=False)["player"].head(n).tolist()

top_v1 = set(top_players(scored, "talent_score_v1", 30))
top_pos = set(top_players(scored, "talent_score_pos_adj", 30))
top_league = set(top_players(scored, "talent_score_league_adj", 30))
top_comp_pos = set(top_players(scored, "talent_score_comp_pos_adj", 30))

print("Top30 overlap (v1 vs pos_adj):", len(top_v1 & top_pos))
print("Top30 overlap (v1 vs league_adj):", len(top_v1 & top_league))
print("Top30 overlap (v1 vs comp_pos_adj):", len(top_v1 & top_comp_pos))

# Show biggest movers: comp+pos adjusted vs v1 rank
tmp = scored.copy()
tmp["rank_v1"] = tmp["talent_score_v1"].rank(ascending=False, method="min")
tmp["rank_comp_pos"] = tmp["talent_score_comp_pos_adj"].rank(ascending=False, method="min")
tmp["rank_delta_comp_pos_minus_v1"] = tmp["rank_comp_pos"] - tmp["rank_v1"]

print("\nBiggest risers (comp+pos adjusted):")
display(
    tmp.sort_values("rank_delta_comp_pos_minus_v1")[[
        "player","squad","comp","pos_primary","playing_time_min",
        "talent_score_v1","talent_score_comp_pos_adj","rank_v1","rank_comp_pos","rank_delta_comp_pos_minus_v1"
    ]].head(15)
)

print("\nBiggest fallers (comp+pos adjusted):")
display(
    tmp.sort_values("rank_delta_comp_pos_minus_v1", ascending=False)[[
        "player","squad","comp","pos_primary","playing_time_min",
        "talent_score_v1","talent_score_comp_pos_adj","rank_v1","rank_comp_pos","rank_delta_comp_pos_minus_v1"
    ]].head(15)
)

## QA: correlation with minutes (before and after)

In [None]:
def corr_with_minutes(df, score_col):
    return df[[score_col, "playing_time_min"]].corr().iloc[0,1]

for col in ["talent_score_v1", "talent_score_pos_adj", "talent_score_league_adj", "talent_score_comp_pos_adj"]:
    print(col, "corr(score, minutes) =", float(corr_with_minutes(scored, col)))

## Export outputs

We export:
- Top 50 ranking (Talent Score v1)  
- A compact dataset with both scores for downstream use

Files:
- `reports/tables/talent_score_v1_top50.csv`
- `reports/tables/talent_score_v1_scored_universe.csv`


In [None]:
PROJECT_ROOT = Path.cwd().parent
REPORTS_DIR = PROJECT_ROOT / "reports" / "tables"
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

export_cols = [
    "season","player_id","club_id","player","squad","comp","age","pos_primary","playing_time_min",
    "per_90_minutes_gls","per_90_minutes_ast","performance_gls","performance_ast",
    "talent_score_v1","talent_score_pca",
    "talent_score_pos_adj","talent_score_league_adj","talent_score_comp_pos_adj",
]

top50 = scored.sort_values("talent_score_comp_pos_adj", ascending=False).loc[:, export_cols].head(50)

out_top50 = REPORTS_DIR / "talent_score_v1_top50_comp_pos_adj.csv"
top50.to_csv(out_top50, index=False)

out_full = REPORTS_DIR / "talent_score_v1_scored_universe.csv"
scored.to_csv(out_full, index=False)

print("Saved:", out_top50)
print("Saved:", out_full)

display(top50.head(10))


## Quick QA checks (sanity)

A few fast checks you can mention in interviews:
- Distribution of scores
- Minutes vs score relationship (should not be extreme)
- Top players are plausible (domain sense check)


In [None]:
summary = scored[["talent_score_v1","talent_score_pca","playing_time_min"]].describe().T
display(summary)

# Correlation with minutes (sanity: should be mild because minutes weight is small)
corr_minutes = scored[["talent_score_v1","playing_time_min"]].corr().iloc[0,1]
print("Correlation (talent_score_v1 vs minutes):", float(corr_minutes))


# Conclusion — Talent Score v1 (Interpretability + Robustness Check)

This notebook implemented an interpretable Talent Score (v1) for attacking profiles (MF/FW, 18–25 years, ≥900 minutes) using per-90 offensive production and a light minutes adjustment.

Key outcomes:

* The handcrafted score is **highly aligned with PCA structure** (≈0.99 correlation), validating that the weighted design captures the dominant variance in the data.
* League and position adjustments confirm that **positional bias is stronger than league bias**, with meaningful but controlled changes in Top 30 overlap.
* Correlation with minutes remains **moderate (~0.38–0.41)**, indicating that playing time influences ranking but does not dominate the score.
* Top-ranked players are domain-plausible, and contextual standardization highlights “within-league” and “within-role” standouts.

Overall, Talent Score v1 is:

* Interpretable
* Statistically coherent
* Context-aware
* Portfolio-ready

The next step is to extend this framework into a **risk-adjusted decision layer**, incorporating age trajectory, volatility, and budget constraints.