# Notebook 02 — Talent Score v1 (Interpretable) + PCA Benchmark

**Project:** Risk-Adjusted Scouting  
**Last updated:** 2026-02-27

## Objective
Build an initial **Talent Score v1** for young **LW / RW / AM (18–25)** using **FBref Standard** metrics integrated with Transfermarkt IDs in DuckDB.

We implement:
1. **Interpretable weighted z-score composite** (baseline scout-friendly score)
2. **PCA 1-component score** as a technical benchmark
3. Minimal validation checks and exports (Top 50 ranking)

## Data Source
DuckDB table:
- `fact_player_season_fbref_tm`

> **Notes**
> - Coverage is not 100% by design; integration is entity-resolution based and auditable.
> - This notebook is intentionally simple and explainable; richer features will follow once the FBref Shooting, Passing, Possession, and GCA tables are added.


In [1]:
from pathlib import Path
import duckdb
import pandas as pd
import numpy as np

# ML
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

# --- Project root: assumes the notebook is executed from the /notebooks directory ---
PROJECT_ROOT = Path.cwd().parent  # parent of /notebooks; adjust if the working directory differs
DB_PATH = PROJECT_ROOT / "db" / "scouting.duckdb"

assert DB_PATH.exists(), f"DuckDB not found at: {DB_PATH.resolve()}"

SEASON = "2023-2024"
MIN_MINUTES = 900
AGE_MIN, AGE_MAX = 18, 25
TARGET_POS = {"LW", "RW", "AM"}  # target roles; not applied below because `pos` only holds coarse FW/MF/DF labels

print("Using DB:", DB_PATH)


Using DB: c:\Users\manue\Projects\risk-adjusted-scouting\db\scouting.duckdb


In [2]:
con = duckdb.connect(str(DB_PATH))

df = con.execute("""
SELECT *
FROM fact_player_season_fbref_tm
WHERE season = ?
""", [SEASON]).fetchdf()

con.close()

print("Loaded rows:", len(df))
display(df.head(3))


Loaded rows: 1796


Unnamed: 0,rk,player,nation,pos,squad,comp,age,born,playing_time_mp,playing_time_starts,playing_time_min,playing_time_90s,performance_gls,performance_ast,performance_ga,performance_gpk,performance_pk,performance_pkatt,performance_crdy,performance_crdr,per_90_minutes_gls,per_90_minutes_ast,per_90_minutes_ga,per_90_minutes_gpk,per_90_minutes_gapk,season,player_id,club_id,match_method
0,1,Max Aarons,eng ENG,DF,Bournemouth,eng Premier League,23.0,2000.0,20.0,13.0,1237.0,13.7,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.07,0.07,0.0,0.07,2023-2024,471690,989,alias_clubid_plus_exact_norm_name
1,7,Nabil Aberdin,fr FRA,DF,Getafe,es La Liga,20.0,2002.0,2.0,2.0,180.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2023-2024,908621,3709,alias_clubid_plus_exact_norm_name
2,8,Laurent Abergel,fr FRA,MF,Lorient,fr Ligue 1,30.0,1993.0,33.0,32.0,2860.0,31.8,2.0,1.0,3.0,2.0,0.0,0.0,4.0,0.0,0.06,0.03,0.09,0.06,0.09,2023-2024,238626,1158,alias_clubid_plus_exact_norm_name


## Define scouting universe

Filters:
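- Season: `2023-2024`
- Age: 18–25
- Minutes: ≥ 900
- Positions: the target roles are LW / RW / AM, but the FBref `pos` field in this table only carries coarse labels (`FW`, `MF`, `DF` and combinations such as `MF,FW`), so v1 keeps any player whose `pos` includes `MF` or `FW`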
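
The position filter can be illustrated on a toy frame. This is a minimal sketch with hypothetical players, assuming `pos` holds comma-separated labels as in the table above; it matches whole tokens rather than substrings, which is a slightly stricter variant of the `str.contains` approach used in the notebook.

```python
import pandas as pd

# Toy rows (hypothetical players) mimicking FBref's comma-separated `pos` strings.
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "pos": ["FW", "MF,FW", "DF,MF", "DF", "GK"],
})

# Split "MF,FW" into tokens so each label is matched exactly, not as a substring.
tokens = toy["pos"].astype(str).str.replace(" ", "").str.split(",")
toy["is_fw"] = tokens.apply(lambda t: "FW" in t)
toy["is_mf"] = tokens.apply(lambda t: "MF" in t)
toy["pos_primary"] = tokens.str[0]  # first listed label is treated as primary

# Keep anyone listing MF or FW; pure DF and GK rows drop out.
universe = toy[toy["is_fw"] | toy["is_mf"]].reset_index(drop=True)
print(universe)
```

A `DF,MF` player stays in the universe with `DF` as primary position, which is why DF-primary rows can appear in later rankings.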


In [3]:
# --- Define scouting universe (FBref positions) ---
df = df.copy()

# Basic filters
df = df[(df["age"] >= AGE_MIN) & (df["age"] <= AGE_MAX)]
df["playing_time_min"] = pd.to_numeric(df["playing_time_min"], errors="coerce")
df = df[df["playing_time_min"] >= MIN_MINUTES]

# FBref: pos can be "MF,FW", "FW,MF", etc.
pos_clean = df["pos"].astype(str).str.replace(" ", "")
df["is_fw"] = pos_clean.str.contains("FW")
df["is_mf"] = pos_clean.str.contains("MF")

# v1 attacking universe: any MF or FW (excludes pure DF/GK)
df = df[df["is_fw"] | df["is_mf"]].reset_index(drop=True)

print("Universe rows:", len(df))
print("pos breakdown (top 15):")
display(df["pos"].astype(str).value_counts().head(15))
display(df[["player","squad","comp","age","playing_time_min","pos","is_fw","is_mf"]].head(10))

Universe rows: 361
pos breakdown (top 15):


pos
MF       163
MF,FW     67
FW        50
DF,MF     34
FW,MF     24
MF,DF     23
Name: count, dtype: int64

Unnamed: 0,player,squad,comp,age,playing_time_min,pos,is_fw,is_mf
0,Akor Adams,Montpellier,fr Ligue 1,23.0,2252.0,FW,True,False
1,Elijah Adebayo,Luton Town,eng Premier League,25.0,1419.0,FW,True,False
2,Simon Adingra,Brighton,eng Premier League,21.0,2222.0,MF,False,True
3,Yacine Adli,Milan,it Serie A,23.0,1407.0,MF,False,True
4,Rayan Aït-Nouri,Wolves,eng Premier League,22.0,2329.0,"MF,DF",False,True
5,Ilias Akhomach,Villarreal,es La Liga,19.0,1511.0,MF,False,True
6,Sergio Akieme,Almería,es La Liga,25.0,1675.0,"DF,MF",False,True
7,Carles Aleñá,Getafe,es La Liga,25.0,1020.0,MF,False,True
8,Nabil Alioui,Le Havre,fr Ligue 1,24.0,1009.0,"MF,FW",True,True
9,Pontus Almqvist,Lecce,it Serie A,24.0,2107.0,"FW,MF",True,True


## Feature selection (Talent Score v1)

We keep v1 intentionally **compact and explainable**:

- **Quality (per90):**
  - `per_90_minutes_gls` (goals/90)
  - `per_90_minutes_ast` (assists/90)

- **Contribution (season totals):**
  - `performance_gls`
  - `performance_ast`

- **Robustness / trust proxy:**
  - `playing_time_min`

> Later iterations will incorporate richer creation and ball progression features once additional FBref tables are integrated.


In [4]:
# ----------------------------
# Feature selection (Talent Score v1)
# ----------------------------
features = [
    "per_90_minutes_gls",
    "per_90_minutes_ast",
    "performance_gls",
    "performance_ast",
    "playing_time_min",
]

# 1) Validate required feature columns
missing = [c for c in features if c not in df.columns]
assert not missing, f"Missing required columns: {missing}"

# 2) Ensure pos_primary exists (derive if missing)
if "pos_primary" not in df.columns:
    if "pos" in df.columns:
        # FBref often provides positions like "FW,MF" -> take first as primary
        df["pos_primary"] = (
            df["pos"]
            .astype(str)
            .str.split(",")
            .str[0]
            .str.strip()
        )
    elif "position" in df.columns:
        df["pos_primary"] = (
            df["position"]
            .astype(str)
            .str.split(",")
            .str[0]
            .str.strip()
        )
    else:
        raise KeyError(
            "pos_primary is missing and cannot be derived because neither 'pos' nor 'position' exists in df."
        )

# 3) Keep only relevant columns + identifiers (validate IDs exist)
id_cols = ["player_id", "club_id", "player", "squad", "comp", "age", "pos", "pos_primary", "season"]
missing_ids = [c for c in id_cols if c not in df.columns]
assert not missing_ids, f"Missing identifier columns: {missing_ids}"

work = df[id_cols + features].copy()

# 4) Basic cleanup: numeric coercion
for c in features:
    work[c] = pd.to_numeric(work[c], errors="coerce")

# 5) Debug prints (optional but useful)
before = len(work)
print("df shape:", df.shape)
print("pos_primary in df?", "pos_primary" in df.columns)
print("Columns containing 'pos':", [c for c in df.columns if "pos" in c.lower()])
print("First 40 columns:", df.columns.tolist()[:40])

# 6) Drop rows with missing feature values
work = work.dropna(subset=features).reset_index(drop=True)
print(f"Dropped rows with missing feature values: {before - len(work)}")
print("Rows available for scoring:", len(work))

display(work.head(5))

df shape: (361, 32)
pos_primary in df? True
Columns containing 'pos': ['pos', 'pos_primary']
First 40 columns: ['rk', 'player', 'nation', 'pos', 'squad', 'comp', 'age', 'born', 'playing_time_mp', 'playing_time_starts', 'playing_time_min', 'playing_time_90s', 'performance_gls', 'performance_ast', 'performance_ga', 'performance_gpk', 'performance_pk', 'performance_pkatt', 'performance_crdy', 'performance_crdr', 'per_90_minutes_gls', 'per_90_minutes_ast', 'per_90_minutes_ga', 'per_90_minutes_gpk', 'per_90_minutes_gapk', 'season', 'player_id', 'club_id', 'match_method', 'is_fw', 'is_mf', 'pos_primary']
Dropped rows with missing feature values: 0
Rows available for scoring: 361


Unnamed: 0,player_id,club_id,player,squad,comp,age,pos,pos_primary,season,per_90_minutes_gls,per_90_minutes_ast,performance_gls,performance_ast,playing_time_min
0,607034,969,Akor Adams,Montpellier,fr Ligue 1,23.0,FW,FW,2023-2024,0.32,0.04,8.0,1.0,2252.0
1,319900,1031,Elijah Adebayo,Luton Town,eng Premier League,25.0,FW,FW,2023-2024,0.63,0.0,10.0,0.0,1419.0
2,658536,1237,Simon Adingra,Brighton,eng Premier League,21.0,MF,MF,2023-2024,0.24,0.04,6.0,1.0,2222.0
3,395236,5,Yacine Adli,Milan,it Serie A,23.0,MF,MF,2023-2024,0.06,0.13,1.0,2.0,1407.0
4,578391,543,Rayan Aït-Nouri,Wolves,eng Premier League,22.0,"MF,DF",MF,2023-2024,0.08,0.04,2.0,1.0,2329.0


## Talent Score v1 — Weighted z-score composite

We standardize features (z-score) and compute a weighted sum.

Baseline weights (tunable):
- Goals/90: 0.30
- Assists/90: 0.30
- Season goals: 0.15
- Season assists: 0.15
- Minutes: 0.10
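
As a worked illustration of the composite, the sketch below standardizes a small hypothetical feature matrix and applies the baseline weights listed above. The player values are made up; only the weights come from this notebook.

```python
import numpy as np

# Hypothetical feature matrix: 4 players x 5 features, ordered as
# goals/90, assists/90, season goals, season assists, minutes.
X = np.array([
    [0.60, 0.30, 15.0, 8.0, 2400.0],
    [0.20, 0.10,  4.0, 2.0, 1800.0],
    [0.40, 0.20,  6.0, 3.0, 1200.0],
    [0.10, 0.40,  2.0, 9.0, 2000.0],
])
w = np.array([0.30, 0.30, 0.15, 0.15, 0.10])  # baseline weights, summing to 1

# Population z-score (ddof=0), the same convention StandardScaler uses.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Weighted sum of z-scores per player.
score = Z @ w
print(np.round(score, 3))
```

Because every standardized column has mean zero, the composite is also centered on zero, so a positive score means above-average output within the universe.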


In [5]:
# Standardize
scaler = StandardScaler()
Z = scaler.fit_transform(work[features])

z_cols = [f"{c}_z" for c in features]
Z = pd.DataFrame(Z, columns=z_cols)
scored = pd.concat([work, Z], axis=1)

weights = {
    "per_90_minutes_gls_z": 0.30,
    "per_90_minutes_ast_z": 0.30,
    "performance_gls_z": 0.15,
    "performance_ast_z": 0.15,
    "playing_time_min_z": 0.10,
}

# Safety: ensure weights sum to 1
w_sum = sum(weights.values())
assert abs(w_sum - 1.0) < 1e-9, f"Weights must sum to 1. Found {w_sum}"

scored["talent_score_v1"] = sum(scored[col] * w for col, w in weights.items())

display(scored[["player","squad","age","pos_primary","playing_time_min","talent_score_v1"]].sort_values("talent_score_v1", ascending=False).head(20))


Unnamed: 0,player,squad,age,pos_primary,playing_time_min,talent_score_v1
219,Kylian Mbappé,Paris Saint-Germain,24.0,FW,2158.0,2.925381
262,Cole Palmer,Chelsea,21.0,MF,2607.0,2.766822
45,Victor Boniface,Leverkusen,22.0,FW,1546.0,2.503525
142,Erling Haaland,Manchester City,23.0,FW,2552.0,2.286919
254,Loïs Openda,RB Leipzig,23.0,FW,2697.0,2.245995
349,Florian Wirtz,Leverkusen,20.0,MF,2372.0,1.97145
118,Phil Foden,Manchester City,23.0,MF,2857.0,1.923273
249,Michael Olise,Crystal Palace,21.0,MF,1275.0,1.904319
302,Gianluca Scamacca,Atalanta,24.0,FW,1453.0,1.902406
35,Jude Bellingham,Real Madrid,20.0,MF,2315.0,1.872948


## PCA Benchmark (1 component)

We compute a 1-component PCA score on the standardized features and compare it to the interpretable score.

This is not intended as the final scoring method—it's a **technical benchmark** to validate whether the linear weighted score aligns with the dominant latent structure in the data.
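
The comparison can be sketched on synthetic data. The example below is an assumption-laden illustration: it uses plain SVD from NumPy instead of scikit-learn's `PCA`, and the features are simulated from one shared latent factor, so the numbers say nothing about the real dataset. It also shows why sign alignment is needed, since the direction of a principal component is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: one shared latent factor plus noise (illustrative only).
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.5 * rng.normal(size=200) for _ in range(5)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PC1 via SVD of the standardized matrix; rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ Vt[0]
explained = S[0] ** 2 / np.sum(S ** 2)

# Hand-weighted composite using the same weights as the baseline score.
w = np.array([0.30, 0.30, 0.15, 0.15, 0.10])
composite = Z @ w

# The sign of a principal component is arbitrary, so align it with the composite.
corr = np.corrcoef(composite, pc1)[0, 1]
if corr < 0:
    pc1, corr = -pc1, -corr

print(f"PC1 explained variance ratio: {explained:.3f}")
print(f"corr(composite, PC1): {corr:.3f}")
```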


In [6]:
pca = PCA(n_components=1, random_state=42)
pc1 = pca.fit_transform(scored[z_cols].values).reshape(-1)

scored["talent_score_pca"] = pc1

print("Explained variance ratio (PC1):", float(pca.explained_variance_ratio_[0]))

corr = scored[["talent_score_v1","talent_score_pca"]].corr().iloc[0,1]
print("Correlation (v1 vs PCA):", float(corr))

# If PCA direction is inverted, correlation might be negative; align sign for ranking comparison
if corr < 0:
    scored["talent_score_pca"] = -scored["talent_score_pca"]
    corr2 = scored[["talent_score_v1","talent_score_pca"]].corr().iloc[0,1]
    print("Correlation after sign alignment:", float(corr2))

display(scored[["player","squad","age","pos_primary","talent_score_v1","talent_score_pca"]].sort_values("talent_score_pca", ascending=False).head(20))


Explained variance ratio (PC1): 0.5529927969713575
Correlation (v1 vs PCA): 0.9932000687096845


Unnamed: 0,player,squad,age,pos_primary,talent_score_v1,talent_score_pca
219,Kylian Mbappé,Paris Saint-Germain,24.0,FW,2.925381,6.340812
262,Cole Palmer,Chelsea,21.0,MF,2.766822,6.333577
254,Loïs Openda,RB Leipzig,23.0,FW,2.245995,5.239658
142,Erling Haaland,Manchester City,23.0,FW,2.286919,5.230227
45,Victor Boniface,Leverkusen,22.0,FW,2.503525,4.91165
118,Phil Foden,Manchester City,23.0,MF,1.923273,4.595732
349,Florian Wirtz,Leverkusen,20.0,MF,1.97145,4.425157
35,Jude Bellingham,Real Madrid,20.0,MF,1.872948,4.189788
215,Lautaro Martínez,Inter,25.0,FW,1.650417,3.886901
135,Anthony Gordon,Newcastle United,22.0,FW,1.564692,3.804228


## Position-adjusted Talent Score

In [7]:
# --- Position-adjusted score (within pos_primary) ---
# Standardizes within each primary position to reduce role bias (FW vs MF vs DF-primary players)
# and make rankings more comparable across roles.

pos_grp = "pos_primary"

scored["talent_score_pos_adj"] = (
    scored.groupby(pos_grp)["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 15 — position-adjusted (overall):")
display(
    scored.sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(15)
)

print("\nTop 10 — MF only:")
display(
    scored[scored["pos_primary"]=="MF"]
    .sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(10)
)

print("\nTop 10 — FW only:")
display(
    scored[scored["pos_primary"]=="FW"]
    .sort_values("talent_score_pos_adj", ascending=False)[
        ["player","squad","comp","age","playing_time_min","talent_score_v1","talent_score_pos_adj"]
    ].head(10)
)

Top 15 — position-adjusted (overall):


Unnamed: 0,player,squad,comp,age,pos_primary,playing_time_min,talent_score_v1,talent_score_pos_adj
262,Cole Palmer,Chelsea,eng Premier League,21.0,MF,2607.0,2.766822,4.208412
349,Florian Wirtz,Leverkusen,de Bundesliga,20.0,MF,2372.0,1.97145,3.048487
118,Phil Foden,Manchester City,eng Premier League,23.0,MF,2857.0,1.923273,2.978228
249,Michael Olise,Crystal Palace,eng Premier League,21.0,MF,1275.0,1.904319,2.950587
35,Jude Bellingham,Real Madrid,es La Liga,20.0,MF,2315.0,1.872948,2.904837
219,Kylian Mbappé,Paris Saint-Germain,fr Ligue 1,24.0,FW,2158.0,2.925381,2.857876
23,Leon Bailey,Aston Villa,eng Premier League,25.0,MF,2068.0,1.716945,2.677332
22,Alex Baena,Villarreal,es La Liga,22.0,MF,2579.0,1.559905,2.448313
311,Xavi Simons,RB Leipzig,de Bundesliga,20.0,MF,2653.0,1.552077,2.436897
45,Victor Boniface,Leverkusen,de Bundesliga,22.0,FW,1546.0,2.503525,2.343962



Top 10 — MF only:


Unnamed: 0,player,squad,comp,age,playing_time_min,talent_score_v1,talent_score_pos_adj
262,Cole Palmer,Chelsea,eng Premier League,21.0,2607.0,2.766822,4.208412
349,Florian Wirtz,Leverkusen,de Bundesliga,20.0,2372.0,1.97145,3.048487
118,Phil Foden,Manchester City,eng Premier League,23.0,2857.0,1.923273,2.978228
249,Michael Olise,Crystal Palace,eng Premier League,21.0,1275.0,1.904319,2.950587
35,Jude Bellingham,Real Madrid,es La Liga,20.0,2315.0,1.872948,2.904837
23,Leon Bailey,Aston Villa,eng Premier League,25.0,2068.0,1.716945,2.677332
22,Alex Baena,Villarreal,es La Liga,22.0,2579.0,1.559905,2.448313
311,Xavi Simons,RB Leipzig,de Bundesliga,20.0,2653.0,1.552077,2.436897
348,Nico Williams,Athletic Club,es La Liga,21.0,2263.0,1.483106,2.336314
277,Christian Pulisic,Milan,it Serie A,24.0,2602.0,1.468153,2.314508



Top 10 — FW only:


Unnamed: 0,player,squad,comp,age,playing_time_min,talent_score_v1,talent_score_pos_adj
219,Kylian Mbappé,Paris Saint-Germain,fr Ligue 1,24.0,2158.0,2.925381,2.857876
45,Victor Boniface,Leverkusen,de Bundesliga,22.0,1546.0,2.503525,2.343962
142,Erling Haaland,Manchester City,eng Premier League,23.0,2552.0,2.286919,2.080088
254,Loïs Openda,RB Leipzig,de Bundesliga,23.0,2697.0,2.245995,2.030234
302,Gianluca Scamacca,Atalanta,it Serie A,24.0,1453.0,1.902406,1.611666
215,Lautaro Martínez,Inter,it Serie A,25.0,2656.0,1.650417,1.304688
246,Darwin Núñez,Liverpool,eng Premier League,24.0,2047.0,1.634349,1.285114
163,Vinicius Júnior,Real Madrid,es La Liga,23.0,1864.0,1.609752,1.255149
84,Charles De Ketelaere,Atalanta,it Serie A,22.0,2026.0,1.566157,1.20204
135,Anthony Gordon,Newcastle United,eng Premier League,22.0,2890.0,1.564692,1.200256


## League-adjusted Talent Score

In [8]:
# --- League-adjusted score (within comp) ---
# Controls for league environment differences (pace, dominance, average scoring).

league_grp = "comp"

scored["talent_score_league_adj"] = (
    scored.groupby(league_grp)["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 15 — league-adjusted:")
display(
    scored.sort_values("talent_score_league_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min","talent_score_v1","talent_score_league_adj"]
    ].head(15)
)

Top 15 — league-adjusted:


Unnamed: 0,player,squad,comp,age,pos_primary,playing_time_min,talent_score_v1,talent_score_league_adj
219,Kylian Mbappé,Paris Saint-Germain,fr Ligue 1,24.0,FW,2158.0,2.925381,4.179576
262,Cole Palmer,Chelsea,eng Premier League,21.0,MF,2607.0,2.766822,3.256699
302,Gianluca Scamacca,Atalanta,it Serie A,24.0,FW,1453.0,1.902406,3.05523
142,Erling Haaland,Manchester City,eng Premier League,23.0,FW,2552.0,2.286919,2.670923
215,Lautaro Martínez,Inter,it Serie A,25.0,FW,2656.0,1.650417,2.669903
35,Jude Bellingham,Real Madrid,es La Liga,20.0,MF,2315.0,1.872948,2.624244
45,Victor Boniface,Leverkusen,de Bundesliga,22.0,FW,1546.0,2.503525,2.606249
84,Charles De Ketelaere,Atalanta,it Serie A,22.0,FW,2026.0,1.566157,2.541056
196,Ademola Lookman,Atalanta,it Serie A,25.0,FW,1894.0,1.554535,2.523285
277,Christian Pulisic,Milan,it Serie A,24.0,MF,2602.0,1.468153,2.391195


## Position + League adjusted

In [9]:
# --- Position + league adjustment ---
# Standardize within (comp, pos_primary). Cells with few players give unstable z-scores,
# so extreme ranks from small groups should be read with care.

scored["talent_score_comp_pos_adj"] = (
    scored.groupby(["comp","pos_primary"])["talent_score_v1"]
    .transform(lambda x: (x - x.mean()) / (x.std(ddof=0) if x.std(ddof=0) != 0 else 1.0))
)

print("Top 20 — league+position adjusted:")
display(
    scored.sort_values("talent_score_comp_pos_adj", ascending=False)[
        ["player","squad","comp","age","pos_primary","playing_time_min",
         "talent_score_v1","talent_score_pos_adj","talent_score_league_adj","talent_score_comp_pos_adj"]
    ].head(20)
)

Top 20 — league+position adjusted:


Unnamed: 0,player,squad,comp,age,pos_primary,playing_time_min,talent_score_v1,talent_score_pos_adj,talent_score_league_adj,talent_score_comp_pos_adj
262,Cole Palmer,Chelsea,eng Premier League,21.0,MF,2607.0,2.766822,4.208412,3.256699,3.407091
277,Christian Pulisic,Milan,it Serie A,24.0,MF,2602.0,1.468153,2.314508,2.391195,3.160788
190,Rafael Leão,Milan,it Serie A,24.0,MF,2512.0,1.377479,2.182273,2.25254,2.991558
35,Jude Bellingham,Real Madrid,es La Liga,20.0,MF,2315.0,1.872948,2.904837,2.624244,2.643636
349,Florian Wirtz,Leverkusen,de Bundesliga,20.0,MF,2372.0,1.97145,3.048487,2.00004,2.617927
219,Kylian Mbappé,Paris Saint-Germain,fr Ligue 1,24.0,FW,2158.0,2.925381,2.857876,4.179576,2.438431
118,Phil Foden,Manchester City,eng Premier League,23.0,MF,2857.0,1.923273,2.978228,2.227053,2.366545
249,Michael Olise,Crystal Palace,eng Premier League,21.0,MF,1275.0,1.904319,2.950587,2.203918,2.343164
178,Teun Koopmeiners,Atalanta,it Serie A,25.0,MF,2627.0,0.968762,1.586224,1.627553,2.228754
22,Alex Baena,Villarreal,es La Liga,22.0,MF,2579.0,1.559905,2.448313,2.188898,2.217729
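
Within-group z-scores become unstable when a `(comp, pos_primary)` cell holds only a handful of players, which may explain some of the DF-primary players that appear among the biggest risers later in the notebook. A minimal sketch of a group-size guard follows; the data and the `MIN_GROUP_SIZE` threshold are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical scores; the ("Serie A", "DF") cell has only two players.
toy = pd.DataFrame({
    "comp": ["Serie A"] * 6,
    "pos_primary": ["MF", "MF", "MF", "MF", "DF", "DF"],
    "talent_score_v1": [1.4, 0.9, 0.1, -0.4, -0.2, -0.5],
})

MIN_GROUP_SIZE = 4  # assumed threshold, not taken from the notebook


def zscore_or_nan(x: pd.Series) -> pd.Series:
    """Population z-score within a group; NaN if the group is too small or has no spread."""
    sd = x.std(ddof=0)
    if len(x) < MIN_GROUP_SIZE or sd == 0:
        return pd.Series(float("nan"), index=x.index)
    return (x - x.mean()) / sd


grp = toy.groupby(["comp", "pos_primary"])["talent_score_v1"]
toy["group_n"] = grp.transform("count")          # players per cell
toy["score_adj_guarded"] = grp.transform(zscore_or_nan)
print(toy)
```

Without the guard, the two-player DF cell would yield exactly +1 and -1 regardless of how far apart the two raw scores are, so flagging or excluding such cells keeps the adjusted ranking honest.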


## Comparison of rankings

In [10]:
def top_players(df, score_col, n=30):
    return df.sort_values(score_col, ascending=False)["player"].head(n).tolist()

top_v1 = set(top_players(scored, "talent_score_v1", 30))
top_pos = set(top_players(scored, "talent_score_pos_adj", 30))
top_league = set(top_players(scored, "talent_score_league_adj", 30))
top_comp_pos = set(top_players(scored, "talent_score_comp_pos_adj", 30))

print("Top30 overlap (v1 vs pos_adj):", len(top_v1 & top_pos))
print("Top30 overlap (v1 vs league_adj):", len(top_v1 & top_league))
print("Top30 overlap (v1 vs comp_pos_adj):", len(top_v1 & top_comp_pos))

# Show biggest movers: comp+pos adjusted vs v1 rank
tmp = scored.copy()
tmp["rank_v1"] = tmp["talent_score_v1"].rank(ascending=False, method="min")
tmp["rank_comp_pos"] = tmp["talent_score_comp_pos_adj"].rank(ascending=False, method="min")
tmp["rank_delta_comp_pos_minus_v1"] = tmp["rank_comp_pos"] - tmp["rank_v1"]

print("\nBiggest risers (comp+pos adjusted):")
display(
    tmp.sort_values("rank_delta_comp_pos_minus_v1")[[
        "player","squad","comp","pos_primary","playing_time_min",
        "talent_score_v1","talent_score_comp_pos_adj","rank_v1","rank_comp_pos","rank_delta_comp_pos_minus_v1"
    ]].head(15)
)

print("\nBiggest fallers (comp+pos adjusted):")
display(
    tmp.sort_values("rank_delta_comp_pos_minus_v1", ascending=False)[[
        "player","squad","comp","pos_primary","playing_time_min",
        "talent_score_v1","talent_score_comp_pos_adj","rank_v1","rank_comp_pos","rank_delta_comp_pos_minus_v1"
    ]].head(15)
)

Top30 overlap (v1 vs pos_adj): 20
Top30 overlap (v1 vs league_adj): 24
Top30 overlap (v1 vs comp_pos_adj): 17

Biggest risers (comp+pos adjusted):


Unnamed: 0,player,squad,comp,pos_primary,playing_time_min,talent_score_v1,talent_score_comp_pos_adj,rank_v1,rank_comp_pos,rank_delta_comp_pos_minus_v1
355,Gabriele Zappa,Cagliari,it Serie A,DF,2561.0,-0.245326,1.884691,200.0,23.0,-177.0
192,Pol Lirola,Frosinone,it Serie A,DF,1536.0,-0.516754,0.398981,253.0,111.0,-142.0
91,Josh Doig,Sassuolo,it Serie A,DF,1343.0,-0.538063,0.282345,262.0,122.0,-140.0
264,Felix Passlack,Bochum,de Bundesliga,DF,960.0,-0.078377,1.24752,160.0,46.0,-114.0
318,Josip Stanišić,Leverkusen,de Bundesliga,DF,1261.0,-0.364118,0.255168,224.0,127.0,-97.0
320,Gabriel Suazo,Toulouse,fr Ligue 1,DF,2357.0,-0.198978,0.537197,184.0,94.0,-90.0
333,Tuta,Eintracht Frankfurt,de Bundesliga,DF,2600.0,-0.426339,0.039083,238.0,154.0,-84.0
68,Cristian Cásseres Jr.,Toulouse,fr Ligue 1,MF,1946.0,-0.244972,0.298739,199.0,118.0,-81.0
140,Malo Gusto,Chelsea,eng Premier League,DF,1751.0,0.272228,1.763283,103.0,24.0,-79.0
8,Nabil Alioui,Le Havre,fr Ligue 1,MF,1009.0,-0.088532,0.670824,161.0,82.0,-79.0



Biggest fallers (comp+pos adjusted):


Unnamed: 0,player,squad,comp,pos_primary,playing_time_min,talent_score_v1,talent_score_comp_pos_adj,rank_v1,rank_comp_pos,rank_delta_comp_pos_minus_v1
0,Akor Adams,Montpellier,fr Ligue 1,FW,2252.0,0.038989,-0.975752,136.0,309.0,173.0
106,Emanuel Emegha,Strasbourg,fr Ligue 1,FW,2078.0,0.05654,-0.954991,131.0,302.0,171.0
272,Marvin Pieringer,Heidenheim,de Bundesliga,FW,1513.0,0.059335,-0.926931,130.0,297.0,167.0
207,Frank Magri,Toulouse,fr Ligue 1,FW,1438.0,0.108821,-0.893151,126.0,289.0,163.0
52,Moritz Broschinski,Bochum,de Bundesliga,FW,1247.0,0.035909,-0.949956,138.0,300.0,162.0
103,Odsonne Édouard,Crystal Palace,eng Premier League,FW,1555.0,-0.140762,-0.968273,169.0,308.0,139.0
338,Oscar Vilhelmsson,Darmstadt 98,de Bundesliga,FW,1374.0,-0.237831,-1.219021,192.0,331.0,139.0
12,Zeki Amdouni,Burnley,eng Premier League,FW,1953.0,-0.214184,-1.062565,185.0,319.0,134.0
114,Evan Ferguson,Brighton,eng Premier League,FW,1367.0,-0.218354,-1.067922,186.0,320.0,134.0
49,Ben Brereton,Sheffield United,eng Premier League,FW,1105.0,0.160603,-0.581246,115.0,242.0,127.0
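
Top-N overlap only looks at the head of the ranking. A rank correlation over the whole universe is a possible complement; the sketch below computes Spearman's coefficient as the Pearson correlation of ranks, on hypothetical scores. The helper name `top_n_overlap` and the column names are illustrative.

```python
import pandas as pd

# Hypothetical scores for six players under two scoring variants.
toy = pd.DataFrame({
    "player": list("ABCDEF"),
    "score_a": [2.9, 2.7, 2.5, 1.9, 0.1, -0.3],
    "score_b": [2.4, 3.4, 2.3, 2.6, -0.9, 0.2],
})


def top_n_overlap(df: pd.DataFrame, col_a: str, col_b: str, n: int) -> int:
    """Number of players shared by the top-n lists of two score columns."""
    top_a = set(df.nlargest(n, col_a)["player"])
    top_b = set(df.nlargest(n, col_b)["player"])
    return len(top_a & top_b)


# Spearman correlation computed as the Pearson correlation of the ranks.
spearman = toy["score_a"].rank().corr(toy["score_b"].rank())

print("Top-3 overlap:", top_n_overlap(toy, "score_a", "score_b", 3))
print("Spearman rank correlation:", round(spearman, 3))
```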


## QA: correlation with minutes (before and after)

In [11]:
def corr_with_minutes(df, score_col):
    return df[[score_col, "playing_time_min"]].corr().iloc[0,1]

for col in ["talent_score_v1", "talent_score_pos_adj", "talent_score_league_adj", "talent_score_comp_pos_adj"]:
    print(col, "corr(score, minutes) =", float(corr_with_minutes(scored, col)))

talent_score_v1 corr(score, minutes) = 0.36575367673980647
talent_score_pos_adj corr(score, minutes) = 0.3942872562016533
talent_score_league_adj corr(score, minutes) = 0.35883478694123
talent_score_comp_pos_adj corr(score, minutes) = 0.3846296941133272


## Export outputs

We export:
- Top 50 ranking by the league + position adjusted score (`talent_score_comp_pos_adj`)
- The full scored universe with all score variants for downstream use

Files:
- `reports/tables/talent_score_v1_top50_comp_pos_adj.csv`
- `reports/tables/talent_score_v1_scored_universe.csv`


In [12]:
PROJECT_ROOT = Path.cwd().parent
REPORTS_DIR = PROJECT_ROOT / "reports" / "tables"
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

export_cols = [
    "season","player_id","club_id","player","squad","comp","age","pos_primary","playing_time_min",
    "per_90_minutes_gls","per_90_minutes_ast","performance_gls","performance_ast",
    "talent_score_v1","talent_score_pca",
    "talent_score_pos_adj","talent_score_league_adj","talent_score_comp_pos_adj",
]

top50 = scored.sort_values("talent_score_comp_pos_adj", ascending=False).loc[:, export_cols].head(50)

out_top50 = REPORTS_DIR / "talent_score_v1_top50_comp_pos_adj.csv"
top50.to_csv(out_top50, index=False)

out_full = REPORTS_DIR / "talent_score_v1_scored_universe.csv"
scored.to_csv(out_full, index=False)

print("Saved:", out_top50)
print("Saved:", out_full)

display(top50.head(10))


Saved: c:\Users\manue\Projects\risk-adjusted-scouting\reports\tables\talent_score_v1_top50_comp_pos_adj.csv
Saved: c:\Users\manue\Projects\risk-adjusted-scouting\reports\tables\talent_score_v1_scored_universe.csv


Unnamed: 0,season,player_id,club_id,player,squad,comp,age,pos_primary,playing_time_min,per_90_minutes_gls,per_90_minutes_ast,performance_gls,performance_ast,talent_score_v1,talent_score_pca,talent_score_pos_adj,talent_score_league_adj,talent_score_comp_pos_adj
262,2023-2024,568177,631,Cole Palmer,Chelsea,eng Premier League,21.0,MF,2607.0,0.76,0.38,22.0,11.0,2.766822,6.333577,4.208412,3.256699,3.407091
277,2023-2024,315779,5,Christian Pulisic,Milan,it Serie A,24.0,MF,2602.0,0.42,0.28,12.0,8.0,1.468153,3.437785,2.314508,2.391195,3.160788
190,2023-2024,357164,5,Rafael Leão,Milan,it Serie A,24.0,MF,2512.0,0.32,0.32,9.0,9.0,1.377479,3.2139,2.182273,2.25254,2.991558
35,2023-2024,581678,418,Jude Bellingham,Real Madrid,es La Liga,20.0,MF,2315.0,0.74,0.23,19.0,6.0,1.872948,4.189788,2.904837,2.624244,2.643636
349,2023-2024,598577,15,Florian Wirtz,Leverkusen,de Bundesliga,20.0,MF,2372.0,0.42,0.42,11.0,11.0,1.97145,4.425157,3.048487,2.00004,2.617927
219,2023-2024,342229,583,Kylian Mbappé,Paris Saint-Germain,fr Ligue 1,24.0,FW,2158.0,1.13,0.29,27.0,7.0,2.925381,6.340812,2.857876,4.179576,2.438431
118,2023-2024,406635,281,Phil Foden,Manchester City,eng Premier League,23.0,MF,2857.0,0.6,0.25,19.0,8.0,1.923273,4.595732,2.978228,2.227053,2.366545
249,2023-2024,566723,873,Michael Olise,Crystal Palace,eng Premier League,21.0,MF,1275.0,0.71,0.42,10.0,6.0,1.904319,3.489609,2.950587,2.203918,2.343164
178,2023-2024,360518,800,Teun Koopmeiners,Atalanta,it Serie A,25.0,MF,2627.0,0.41,0.17,12.0,5.0,0.968762,2.354741,1.586224,1.627553,2.228754
22,2023-2024,548111,1050,Alex Baena,Villarreal,es La Liga,22.0,MF,2579.0,0.07,0.49,2.0,14.0,1.559905,3.639349,2.448313,2.188898,2.217729


## Quick QA checks (sanity)

A few fast sanity checks:
- Distribution of scores
- Minutes vs score relationship (should not be extreme)
- Top players are plausible (domain sense check)
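
These checks can also be written as assertions so they fail loudly when the data changes. The following is a minimal sketch: the function name `qa_checks` and the `max_abs_corr` tolerance are assumptions for illustration, and the toy data is hypothetical.

```python
import numpy as np
import pandas as pd


def qa_checks(df: pd.DataFrame, score_col: str, minutes_col: str,
              max_abs_corr: float = 0.6) -> dict:
    """Return basic sanity metrics and raise if a check fails.

    max_abs_corr is an assumed tolerance, not a value from the notebook.
    """
    scores = df[score_col]
    assert scores.notna().all(), "score contains missing values"
    assert np.isfinite(scores).all(), "score contains non-finite values"

    # Minutes vs score relationship should not be extreme.
    corr = float(df[[score_col, minutes_col]].corr().iloc[0, 1])
    assert abs(corr) <= max_abs_corr, f"score is too tied to minutes: {corr:.2f}"

    return {"mean": float(scores.mean()), "std": float(scores.std()), "corr_minutes": corr}


# Hypothetical data, only to show the call.
toy = pd.DataFrame({
    "talent_score_v1": [1.2, -0.4, 0.3, -1.1, 0.0],
    "playing_time_min": [1500, 2400, 1000, 1800, 2100],
})
report = qa_checks(toy, "talent_score_v1", "playing_time_min")
print(report)
```

The plausibility of the top-ranked names still needs a human look; it is the one check in the list that does not reduce to an assertion.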


In [13]:
summary = scored[["talent_score_v1","talent_score_pca","playing_time_min"]].describe().T
display(summary)

# Correlation with minutes (sanity check: minutes carry only 0.10 weight,
# but season totals also scale with playing time, so some correlation is expected)
corr_minutes = scored[["talent_score_v1","playing_time_min"]].corr().iloc[0,1]
print("Correlation (talent_score_v1 vs minutes):", float(corr_minutes))


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
talent_score_v1,361.0,6.15082e-17,0.757634,-1.091289,-0.560115,-0.183692,0.411886,2.925381
talent_score_pca,361.0,-9.841312e-18,1.665126,-2.379424,-1.201204,-0.438687,0.909683,6.340812
playing_time_min,361.0,1844.723,621.590729,905.0,1297.0,1822.0,2312.0,3325.0


Correlation (talent_score_v1 vs minutes): 0.36575367673980647


In [14]:
import duckdb
from pathlib import Path

DB_PATH = Path("../db/scouting.duckdb")

with duckdb.connect(str(DB_PATH)) as con:
    con.register("scored_df", scored)
    con.execute("""
        CREATE OR REPLACE TABLE talent_score_v1_scored_universe AS
        SELECT * FROM scored_df
    """)

print("Saved to DuckDB: talent_score_v1_scored_universe")
print("Shape (rows, cols):", scored.shape)

Saved to DuckDB: talent_score_v1_scored_universe
Shape (rows, cols): (361, 24)


# Conclusion — Talent Score v1 (Interpretability + Robustness Check)

This notebook implemented an interpretable Talent Score (v1) for attacking profiles (MF/FW, 18–25 years, ≥900 minutes) using per-90 and season-total goals and assists, plus a lightly weighted minutes term.

Key outcomes:

* The handcrafted score is **highly aligned with PCA structure** (≈0.99 correlation with PC1, which explains ≈55% of the variance), indicating that the weighted design tracks the dominant axis of variation in the data.
* The position adjustment reshuffles the Top 30 more than the league adjustment does (20 vs. 24 of 30 players retained relative to v1; 17 with both applied), suggesting that **positional bias is stronger than league bias**.
* Correlation with minutes remains **moderate (~0.36–0.39)**, indicating that playing time influences ranking but does not dominate the score.
* Top-ranked players are domain-plausible, and contextual standardization highlights “within-league” and “within-role” standouts.

Overall, Talent Score v1 is:

* Interpretable
* Statistically coherent
* Context-aware
* Portfolio-ready

The next step is to extend this framework into a **risk-adjusted decision layer**, incorporating age trajectory, volatility, and budget constraints.