# Player Archetypes in Modern Football

Not all centre-backs are the same. Not all strikers play like each other. Modern football has evolved far beyond simple position labels — a "midfielder" could be a deep-lying playmaker, a box-to-box runner, or a creative number 10.

This analysis identifies **distinct player archetypes** within each position by examining 20 quality dimensions that capture how a player actually plays. We analyse **93,000+ player-seasons** across **129 top-flight leagues** (2018–2025).

We use two approaches:
- **K-Means**: Hard assignment — each player belongs to exactly one archetype
- **Gaussian Mixture (GMM)**: Soft assignment — each player has a *probability* of belonging to each archetype (e.g., "91% Clinical Finisher, 6% Target Man, 3% False Nine")

The GMM approach is richer because real players don’t fit neatly into one box. A player like Harry Kane might be 70% Clinical Finisher and 30% Playmaking Striker — and that mix changes over seasons.
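
To make the hard/soft distinction concrete, here is a minimal sketch on synthetic two-dimensional data. It does not use the notebook's dataset: the array `X` and the choice of two clusters are illustrative assumptions. K-Means returns a single label per row, while the GMM returns a row of membership probabilities that sums to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for standardised quality scores: two overlapping groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=+1.0, scale=1.0, size=(200, 2)),
])

# Hard assignment: one integer label per row.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft assignment: one probability per (row, component); each row sums to 1.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_proba = gmm.predict_proba(X)

print(km_labels[:5])           # integer cluster labels
print(gmm_proba[:5].round(2))  # membership probabilities per component
```

Rows that sit between the two groups get probabilities near 50/50 from the GMM, which is the information a hard label discards.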

In [15]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import Patch
import matplotlib.patheffects as pe
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'figure.facecolor': '#FAFAFA',
    'axes.facecolor': '#FAFAFA',
    'axes.grid': False,
    'font.family': 'sans-serif',
    'font.size': 11,
})

# Paths
docs = Path("/Users/jorgepadilla/Documents")
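for d in docs.iterdir():
    if "Jorge" in d.name and "MacBook" in d.name and d.is_dir():
        RAW = d / "thesis_data" / "raw_data"
        PROCESSED = d / "thesis_data" / "processed_data"
        break
else:
    # Fail early instead of hitting a NameError on RAW / PROCESSED further down
    raise FileNotFoundError(f"No matching data directory found under {docs}")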

# Load + filter
tf = pd.read_parquet(PROCESSED / "master_dataset" / "transfers_model_v2_2018_2025.parquet")
comps = pd.read_parquet(RAW / "Wyscout" / "competitions_wyscout.parquet")
div1_ids = comps[comps["division"] == 1]["competition_id"].unique()
tf = tf[tf["from_competition"].isin(div1_ids)].copy()

# De-duplicate: keep first per (player, team, season)
df = tf.drop_duplicates(subset=["wy_player_id", "from_team_id", "from_season"]).copy()

# Exclude goalkeepers
df = df[df["from_position"] != "Goalkeeper"].copy()

# Player name: prefer short_name
df["player_name"] = df["short_name"].fillna(df["wyscout_last_name"]).fillna("Unknown")

POSITIONS = ["Central Defender", "Full Back", "Midfielder", "Winger", "Striker"]
EPL_COMP_ID = 364

print(f"Loaded: {len(df):,} player-seasons from {df.from_competition.nunique()} top-flight leagues")
print(f"Seasons: {sorted(int(s) for s in df.from_season.unique())}")
print(f"\nPositions:")
for pos in POSITIONS:
    print(f"  {pos:20s}: {(df.from_position == pos).sum():,}")

Loaded: 93,098 player-seasons from 129 top-flight leagues
Seasons: [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]

Positions:
  Central Defender    : 21,841
  Full Back           : 19,105
  Midfielder          : 26,614
  Winger              : 14,067
  Striker             : 11,471


## The 20 Qualities That Define a Footballer

Every player-season in our dataset is described by 20 **quality scores** developed by Twelve Football. These are not raw counting stats — they are contextualised ratings that capture *how well* a player does something, not just how often. A quality score of +1.0 means the player is roughly one level above average for their position; −1.0 is one level below.

We group them into five families:

### Defending
| Quality | What it captures |
|---------|-----------------|
| **Active defence** | Willingness and success at winning the ball back — tackles, interceptions, recoveries |
| **Pressing** | Intensity and effectiveness of pressing the opponent when out of possession |
| **Intelligent defence** | Positional awareness, reading passing lanes, anticipation |
| **Winning duels** | Success in 1v1 ground and aerial duels in defensive situations |
| **Chance prevention** | Ability to block shots, reduce xG through defensive positioning |
| **Territorial dominance** | Control of defensive territory — how much area a defender commands |
| **Defensive heading** | Aerial prowess in defensive situations — clearing crosses and set pieces |

### Ball Progression
| Quality | What it captures |
|---------|-----------------|
| **Dribbling** | Carrying ability — beating opponents on the ball, retaining under pressure |
| **Progression** | Moving the ball forward effectively through passes and carries |
| **Run quality** | Movement off the ball — making intelligent runs into dangerous areas |
| **Composure** | Retaining possession under pressure, avoiding turnovers |

### Creating
| Quality | What it captures |
|---------|-----------------|
| **Passing quality** | Technical range and accuracy of passing — short, long, through balls |
| **Providing teammates** | Creating chances for teammates — key passes, assists, pre-assists |
| **Involvement** | Overall contribution to team build-up — touches, receptions, link-up play |

### Scoring
| Quality | What it captures |
|---------|-----------------|
| **Box threat** | Presence in the penalty area — getting into dangerous positions |
| **Finishing** | Converting chances — goals relative to xG, shot technique |
| **Poaching** | Pure goalscoring instinct — anticipation, positioning for tap-ins |
| **Effectiveness** | Clinical efficiency — doing a lot with limited touches |

### Physical / Aerial
| Quality | What it captures |
|---------|-----------------|
| **Aerial threat** | Offensive aerial presence — winning headers in advanced areas |
| **Hold-up play** | Ability to receive with back to goal and retain possession for the team |

> **Position-specific filtering**: Not every quality is relevant for every position. For example, "Chance prevention" has less than 80% data coverage for midfielders, wingers, and strikers — so we exclude it for those positions. This avoids distorting the archetype detection with noisy or missing data.
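
The notebook hard-codes the per-position quality lists, but the 80% rule described above can be sketched as a small helper. This is an illustration on a toy frame, not the code used to build `POSITION_QUALITIES`: the function name `qualities_with_coverage`, the toy values, and the assumption that every quality column is prefixed with `from_` are mine.

```python
import numpy as np
import pandas as pd

# Toy frame; column names mimic the notebook's "from_<quality>" convention.
toy = pd.DataFrame({
    "from_position": ["Striker"] * 5 + ["Central Defender"] * 5,
    "from_Finishing": [0.4, 1.1, -0.2, 0.7, 0.0, -0.5, -0.9, 0.1, -0.3, -0.6],
    "from_Chance prevention": [np.nan, np.nan, 0.1, np.nan, np.nan,
                               0.8, 0.3, -0.1, 0.5, 0.9],
})

MIN_COVERAGE = 0.80  # threshold quoted in the text


def qualities_with_coverage(frame, position, min_coverage=MIN_COVERAGE):
    """Return quality names whose non-null share meets the threshold."""
    rows = frame[frame["from_position"] == position]
    q_cols = [c for c in rows.columns
              if c.startswith("from_") and c != "from_position"]
    coverage = rows[q_cols].notna().mean()  # share of non-missing values
    kept = coverage[coverage >= min_coverage].index
    return [c[len("from_"):] for c in kept]


print(qualities_with_coverage(toy, "Striker"))
print(qualities_with_coverage(toy, "Central Defender"))
```

In the toy data only one of five strikers has a "Chance prevention" value (20% coverage), so the quality is dropped for strikers and kept for central defenders.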

In [16]:
ALL_QUALITIES = [
    "Active defence", "Aerial threat", "Box threat", "Chance prevention",
    "Composure", "Defensive heading", "Dribbling", "Effectiveness",
    "Finishing", "Hold-up play", "Intelligent defence", "Involvement",
    "Passing quality", "Poaching", "Pressing", "Progression",
    "Providing teammates", "Run quality", "Territorial dominance", "Winning duels"
]

POSITION_QUALITIES = {
    "Central Defender": [q for q in ALL_QUALITIES if q not in ["Hold-up play", "Poaching"]],
    "Full Back":        [q for q in ALL_QUALITIES if q not in ["Poaching"]],
    "Midfielder":       [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Poaching", "Territorial dominance"]],
    "Winger":           [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Territorial dominance"]],
    "Striker":          [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Territorial dominance"]],
}

# Colour palette for archetypes (up to 8)
ARCHETYPE_COLORS = ['#1B4F72', '#922B21', '#1E8449', '#B7950B', '#6C3483', '#1A5276', '#7B241C', '#196F3D']

for pos, quals in POSITION_QUALITIES.items():
    print(f"{pos:20s}: {len(quals)} qualities")

Central Defender    : 18 qualities
Full Back           : 19 qualities
Midfielder          : 17 qualities
Winger              : 18 qualities
Striker             : 18 qualities


In [None]:
# Quality name → short adjective for archetype naming
_Q_ADJ = {
    "Passing quality": "Passing", "Progression": "Progressive",
    "Active defence": "Aggressive", "Pressing": "Pressing",
    "Aerial threat": "Aerial", "Defensive heading": "Headed-Duel",
    "Intelligent defence": "Positional", "Territorial dominance": "Commanding",
    "Winning duels": "Duel-Winning", "Composure": "Composed",
    "Providing teammates": "Creative", "Box threat": "Goal-Threat",
    "Finishing": "Clinical", "Dribbling": "Dribbling",
    "Run quality": "Dynamic", "Involvement": "High-Volume",
    "Hold-up play": "Hold-Up", "Poaching": "Poaching",
    "Effectiveness": "Efficient",
    # Without this entry the fallback in name_archetype() yields "Chance CB"
    "Chance prevention": "Shot-Blocking",
}

_Q_DESC = {
    "Passing quality": "precise distribution",
    "Progression": "progressive ball-carrying",
    "Active defence": "proactive defending and interventions",
    "Pressing": "high pressing intensity",
    "Aerial threat": "aerial dominance",
    "Defensive heading": "headed duel winning",
    "Intelligent defence": "reading the game and positioning",
    "Territorial dominance": "commanding the defensive zone",
    "Winning duels": "winning ground duels",
    "Composure": "composure under pressure",
    "Providing teammates": "creating chances for teammates",
    "Box threat": "dangerous runs into the box",
    "Finishing": "clinical finishing",
    "Dribbling": "ball-carrying and dribbling",
    "Run quality": "dynamic off-ball movement",
    "Involvement": "high involvement across the pitch",
    "Hold-up play": "hold-up play and link-up",
    "Poaching": "poacher instinct in the box",
    "Effectiveness": "efficiency in final actions",
}

_QUALITY_SHORT = {
    "Passing quality": "Passing", "Progression": "Progr.",
    "Active defence": "Active Def", "Pressing": "Press",
    "Aerial threat": "Aerial", "Defensive heading": "Def Head",
    "Intelligent defence": "Intel Def", "Territorial dominance": "Territory",
    "Winning duels": "Duels", "Composure": "Composure",
    "Providing teammates": "Creativity", "Box threat": "Box Threat",
    "Finishing": "Finishing", "Dribbling": "Dribbling",
    "Run quality": "Runs", "Involvement": "Involve.",
    "Hold-up play": "Hold-Up", "Poaching": "Poaching",
    "Effectiveness": "Effective",
}

_POS_SUFFIX = {
    "Central Defender": "CB", "Full Back": "Full Back",
    "Midfielder": "Midfielder", "Winger": "Winger", "Striker": "Striker",
}

# Curated exemplars: (wy_player_id, display name) for well-known players in the top European leagues
POSITION_EXEMPLARS = {
    "Central Defender": [
        (370, "V. van Dijk"), (480777, "W. Saliba"), (397178, "R. Dias"),
        (382592, "D. Upamecano"), (345691, "I. Konaté"), (520098, "J. Gvardiol"),
        (493178, "A. Bastoni"), (397213, "J. Araujo"),
    ],
    "Full Back": [
        (346101, "T. Alexander-Arnold"), (397249, "A. Robertson"),
        (382318, "J. Cancelo"), (397205, "A. Davies"),
        (345689, "T. Hernandez"), (489419, "M. Cucurella"),
        (520126, "P. Grimaldo"),
    ],
    "Midfielder": [
        (364860, "Rodri"), (379209, "D. Rice"), (38021, "K. De Bruyne"),
        (522051, "C. Palmer"), (397240, "J. Bellingham"), (397241, "Pedri"),
        (520251, "F. Valverde"), (397218, "Gavi"),
    ],
    "Winger": [
        (120353, "M. Salah"), (520291, "B. Saka"), (14911, "Son Heung-Min"),
        (493295, "Vinícius Jr"), (397177, "P. Foden"), (391355, "Raphinha"),
        (345699, "O. Dembélé"), (520236, "L. Diaz"),
    ],
    "Striker": [
        (8717, "H. Kane"), (353833, "K. Mbappé"),
        # NOTE: 520236 is also listed for L. Diaz under "Winger"; one of the two IDs needs verifying
        (397194, "L. Martínez"), (520236, "O. Watkins"),
        (406417, "V. Osimhen"), (493232, "A. Isak"),
        (520128, "Julián Álvarez"),
    ],
}


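def cluster_position(data, qualities, k_range=range(3, 8), random_state=42, forced_k=None):
    """Cluster one position's player-seasons with K-Means and a GMM.

    Rows with any missing quality are dropped and the rest are standardised.
    Unless forced_k is given, k is the largest value in k_range whose smallest
    K-Means cluster holds at least 8% of the rows. GMM components are then
    matched to K-Means clusters so both share the same archetype order.
    """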
    q_cols = [f"from_{q}" for q in qualities]
    X_raw = data[q_cols].dropna()
    valid_idx = X_raw.index
    scaler = StandardScaler()
    X = scaler.fit_transform(X_raw)
    if forced_k is not None:
        best_k = forced_k
    else:
        km_results = {}
        for k in k_range:
            km = KMeans(n_clusters=k, n_init=15, random_state=random_state)
            labels = km.fit_predict(X)
            sizes = np.bincount(labels)
            smallest_pct = sizes.min() / len(labels) * 100
            km_results[k] = {"smallest_pct": smallest_pct}
        valid_ks = [k for k, v in km_results.items() if v["smallest_pct"] >= 8]
        best_k = max(valid_ks) if valid_ks else min(k_range)
    km_model = KMeans(n_clusters=best_k, n_init=15, random_state=random_state)
    km_labels = km_model.fit_predict(X)
    gmm = GaussianMixture(n_components=best_k, random_state=random_state, n_init=5, covariance_type="full")
    gmm.fit(X)
    gmm_proba = gmm.predict_proba(X)

    km_centers_orig = scaler.inverse_transform(km_model.cluster_centers_)
    gmm_centers_orig = scaler.inverse_transform(gmm.means_)

    # ── ALIGN GMM components to KMeans clusters ──
    # Greedy matching by center distance (ensures naming consistency)
    gmm_to_km = {}
    used_km = set()
    dists = np.zeros((best_k, best_k))
    for g in range(best_k):
        for k in range(best_k):
            dists[g, k] = np.sum((gmm_centers_orig[g] - km_centers_orig[k]) ** 2)
    for _ in range(best_k):
        g, k = np.unravel_index(np.argmin(dists), dists.shape)
        gmm_to_km[g] = k
        dists[g, :] = np.inf
        dists[:, k] = np.inf

    # Reorder GMM proba columns to match KMeans cluster order
    gmm_proba_aligned = np.zeros_like(gmm_proba)
    for g_idx, k_idx in gmm_to_km.items():
        gmm_proba_aligned[:, k_idx] = gmm_proba[:, g_idx]

    return {
        "valid_idx": valid_idx, "X": X, "scaler": scaler,
        "best_k": best_k, "km_labels": km_labels, "km_model": km_model,
        "km_centers": km_centers_orig, "gmm_model": gmm,
        "gmm_proba": gmm_proba_aligned,
        "qualities": qualities, "q_cols": [f"from_{q}" for q in qualities],
    }


def name_archetype(center, qualities, position):
    scores = dict(zip(qualities, center))
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    top1_q = ranked[0][0]
    suffix = _POS_SUFFIX.get(position, "Player")
    adj1 = _Q_ADJ.get(top1_q, top1_q.split()[0])

    if position == "Central Defender":
        if top1_q == "Passing quality" and scores.get("Progression", 0) > 0.3:
            return "Ball-Playing CB"
        if top1_q in ("Aerial threat", "Defensive heading"):
            return "Aerial CB"
        if top1_q == "Active defence" and scores.get("Pressing", 0) > 0.3:
            return "Aggressive CB"
        if top1_q == "Intelligent defence":
            return "Positional CB"
        if top1_q == "Territorial dominance":
            return "Commanding CB"
        if top1_q == "Pressing":
            return "Pressing CB"
    elif position == "Full Back":
        if top1_q == "Providing teammates" or (top1_q == "Passing quality" and scores.get("Providing teammates", 0) > 0.3):
            return "Creative Full Back"
        if top1_q == "Progression" and scores.get("Dribbling", 0) > 0.3:
            return "Attacking Wingback"
        if top1_q == "Active defence":
            return "Defensive Full Back"
        if top1_q == "Dribbling":
            return "Dribbling Wingback"
        if top1_q == "Pressing":
            return "Pressing Full Back"
        if top1_q in ("Finishing", "Effectiveness", "Box threat"):
            return "Attacking Full Back"
        if top1_q == "Involvement":
            return "High-Volume Full Back"
        if top1_q == "Run quality":
            return "Dynamic Full Back"
    elif position == "Midfielder":
        if top1_q == "Active defence" and scores.get("Winning duels", 0) > 0.3:
            return "Destroyer"
        if top1_q == "Passing quality" and scores.get("Composure", 0) > 0.3:
            return "Deep Playmaker"
        if top1_q in ("Box threat", "Finishing"):
            return "Goal-Threat Midfielder"
        if top1_q == "Pressing":
            return "Pressing Midfielder"
        if top1_q == "Progression" and scores.get("Dribbling", 0) > 0.3:
            return "Progressive Carrier"
        if top1_q == "Providing teammates":
            return "Creative Playmaker"
        if top1_q == "Involvement":
            return "Box-to-Box Engine"
        if top1_q == "Dribbling":
            return "Ball-Carrying Midfielder"
    elif position == "Winger":
        if top1_q == "Finishing" and scores.get("Box threat", 0) > 0.3:
            return "Inside Forward"
        if top1_q == "Run quality":
            return "Pace Winger"
        if top1_q == "Providing teammates":
            return "Creative Winger"
        if top1_q == "Dribbling":
            return "Explosive Dribbler"
        if top1_q == "Pressing":
            return "Pressing Winger"
        if top1_q == "Progression":
            return "Progressive Winger"
        if top1_q == "Box threat":
            return "Goal-Threat Winger"
        if top1_q in ("Winning duels", "Defensive heading", "Aerial threat"):
            return "Physical Winger"
        if top1_q == "Involvement":
            return "Workrate Winger"
    elif position == "Striker":
        if top1_q == "Finishing" or top1_q == "Effectiveness":
            return "Clinical Finisher"
        if top1_q == "Pressing" and scores.get("Active defence", 0) > 0.3:
            return "Pressing Forward"
        if top1_q in ("Hold-up play", "Aerial threat"):
            return "Target Man"
        if top1_q == "Providing teammates":
            return "Playmaking Striker"
        if top1_q == "Poaching":
            return "Poacher"
        if top1_q == "Run quality":
            return "Mobile Striker"
        if top1_q == "Dribbling":
            return "Dribbling Striker"
    return f"{adj1} {suffix}"


def describe_archetype(center, qualities, position):
    scores = dict(zip(qualities, center))
    ranked = sorted(scores.items(), key=lambda x: -x[1])
    top3 = ranked[:3]
    strengths = " and ".join([_Q_DESC.get(q, q.lower()) for q, _ in top3[:2]])
    third = _Q_DESC.get(top3[2][0], top3[2][0].lower())
    # ranked is sorted high-to-low, so the last entry is the lowest-scoring quality
    weakest_q = ranked[-1][0]
    weakness = _Q_DESC.get(weakest_q, weakest_q.lower())
    return f"Defined by {strengths}, with solid {third}. Typically limited in {weakness}."


def find_exemplar(position, archetype_id, pos_data, results):
    """Find a recognizable current star for this archetype."""
    exemplars = POSITION_EXEMPLARS.get(position, [])
    # First: check our curated list of famous players
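    for pid, name in exemplars:
        matches = pos_data[pos_data["wy_player_id"] == pid]
        # Judge the player on their most recent season rather than an arbitrary row
        if len(matches) > 0:
            latest = matches.sort_values("from_season").iloc[-1]
            if latest.get("archetype_id") == archetype_id:
                return name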
    # Fallback: find a recent player from top 5 leagues (PL, Liga, Serie A, Buli, L1)
    TOP5 = [364, 87, 89, 82, 34]
    arch_data = pos_data[pos_data["archetype_id"] == archetype_id].copy()
    if "from_competition" in arch_data.columns and "from_season" in arch_data.columns:
        recent = arch_data[
            (arch_data["from_competition"].isin(TOP5)) &
            (arch_data["from_season"] >= 2022)
        ]
        if len(recent) > 0:
            # Prefer most recent season, then most minutes
            if "from_Minutes" in recent.columns:
                recent = recent.sort_values(["from_season", "from_Minutes"], ascending=[False, False])
            else:
                recent = recent.sort_values("from_season", ascending=False)
            return recent.iloc[0].get("player_name", "")
    # Last resort
    if len(arch_data) > 0:
        return arch_data.iloc[0].get("player_name", "")
    return ""


def show_position_results(data, results, position):
    valid_idx = results["valid_idx"]
    pos_data = data.loc[valid_idx].copy()
    pos_data["archetype_id"] = results["km_labels"]
    qualities = results["qualities"]
    centers = results["km_centers"]
    best_k = results["best_k"]
    gmm_proba = results["gmm_proba"]

    archetype_names = {}
    used_names = set()
    for aid in range(best_k):
        name = name_archetype(centers[aid], qualities, position)
        if name in used_names:
            scores = dict(zip(qualities, centers[aid]))
            ranked = sorted(scores.items(), key=lambda x: -x[1])
            adj2 = _Q_ADJ.get(ranked[1][0], ranked[1][0].split()[0])
            name = f"{adj2} {_POS_SUFFIX.get(position, 'Player')}"
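            if name in used_names:
                adj3 = _Q_ADJ.get(ranked[2][0], ranked[2][0].split()[0])
                name = f"{adj3} {_POS_SUFFIX.get(position, 'Player')}"
            if name in used_names:
                # Last resort: keep names unique so prob_ columns are not overwritten
                name = f"{name} {aid + 1}"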
        used_names.add(name)
        archetype_names[aid] = name

    pos_data["archetype"] = pos_data["archetype_id"].map(archetype_names)
    for aid in range(best_k):
        pos_data[f"prob_{archetype_names[aid]}"] = gmm_proba[:, aid]

    n_total = len(pos_data)
    print(f"\n{'=' * 60}")
    print(f"{position}: {best_k} archetypes from {n_total:,} player-seasons")
    print(f"{'=' * 60}")
    for aid in range(best_k):
        n = (pos_data["archetype_id"] == aid).sum()
        pct = n / n_total * 100
        desc = describe_archetype(centers[aid], qualities, position)
        exemplar = find_exemplar(position, aid, pos_data, results)
        exemplar_str = f" (e.g. {exemplar})" if exemplar else ""
        print(f"\n  {archetype_names[aid]}{exemplar_str}")
        print(f"    {n:,} players ({pct:.1f}%) — {desc}")

    return pos_data, archetype_names


def _wrap(text, width=48):
    """Wrap text for annotation display."""
    import textwrap
    return "<br>".join(textwrap.wrap(text, width=width))


def plotly_radar_grid(results, archetype_names, position, pos_data=None):
    best_k = results["best_k"]
    quals = results["qualities"]
    centers = results["km_centers"]
    quals_short = [_QUALITY_SHORT.get(q, q) for q in quals]

    fig = make_subplots(
        rows=best_k, cols=2,
        specs=[[{"type": "polar"}, {"type": "xy"}]] * best_k,
        column_widths=[0.50, 0.50],
        vertical_spacing=0.02, horizontal_spacing=0.06,
    )

    for aid in range(best_k):
        row = aid + 1
        vals = list(centers[aid])
        vals_closed = vals + [vals[0]]
        dims_closed = quals_short + [quals_short[0]]
        color = ARCHETYPE_COLORS[aid % len(ARCHETYPE_COLORS)]
        name = archetype_names[aid]
        desc = describe_archetype(centers[aid], quals, position)
        exemplar = find_exemplar(position, aid, pos_data, results) if pos_data is not None else ""
        n = (results["km_labels"] == aid).sum()
        pct = n / len(results["km_labels"]) * 100

        fig.add_trace(go.Scatterpolar(
            r=vals_closed, theta=dims_closed, fill="toself",
            fillcolor=color, opacity=0.15,
            line=dict(color=color, width=2.5),
            name=name, showlegend=False,
        ), row=row, col=1)

        # Invisible trace for the info column
        fig.add_trace(go.Scatter(
            x=[0], y=[0], mode="markers",
            marker=dict(size=0, opacity=0), showlegend=False, hoverinfo="skip",
        ), row=row, col=2)

        # Build info text with word wrapping
        info_lines = [
            f"<b>{name}</b>",
            f"<i>{n:,} players ({pct:.0f}%)</i>",
            "",
            _wrap(desc, 50),
        ]
        if exemplar:
            info_lines.append(f"<br><b>Think:</b> {exemplar}")

        # Compute vertical position for this row
        # Each row occupies 1/best_k of the figure height
        row_center = 1.0 - (aid + 0.5) / best_k

        fig.add_annotation(
            text="<br>".join(info_lines), xref="paper", yref="paper",
            x=0.58, y=row_center, showarrow=False, align="left",
            font=dict(size=12, color="#333"), xanchor="left", yanchor="middle",
        )

    for aid in range(best_k):
        fig.update_xaxes(visible=False, row=aid + 1, col=2)
        fig.update_yaxes(visible=False, row=aid + 1, col=2)

    fig.update_polars(radialaxis=dict(range=[-1.5, 1.5], tickfont=dict(size=8)))
    fig.update_layout(
        height=380 * best_k, width=1100, showlegend=False,
        template="plotly_white",
        title_text=f"{position} Archetypes", title_font_size=18,
        margin=dict(t=80, b=40, l=40, r=40),
    )
    fig.show()


print("Helper functions defined (with GMM-KMeans alignment and text wrapping).")
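
The alignment step inside `cluster_position` pairs each GMM component with a K-Means cluster greedily, by squared distance between centres. The sketch below isolates that logic on made-up centres. As a point of comparison that is not part of the notebook, it also solves the same pairing with SciPy's `linear_sum_assignment`, which minimises the total distance instead of taking the closest pair first. The function names `greedy_match` and `optimal_match` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def _sq_dists(gmm_centers, km_centers):
    """Pairwise squared Euclidean distances, shape (n_gmm, n_km)."""
    diff = gmm_centers[:, None, :] - km_centers[None, :, :]
    return (diff ** 2).sum(axis=2)


def greedy_match(gmm_centers, km_centers):
    """Repeatedly take the closest remaining (GMM, K-Means) pair."""
    dists = _sq_dists(gmm_centers, km_centers)
    mapping = {}
    for _ in range(len(km_centers)):
        g, c = np.unravel_index(np.argmin(dists), dists.shape)
        mapping[int(g)] = int(c)
        dists[g, :] = np.inf  # this GMM component is taken
        dists[:, c] = np.inf  # this K-Means cluster is taken
    return mapping


def optimal_match(gmm_centers, km_centers):
    """One-to-one pairing that minimises the summed distance."""
    rows, cols = linear_sum_assignment(_sq_dists(gmm_centers, km_centers))
    return dict(zip(rows.tolist(), cols.tolist()))


# Made-up centres: the GMM found the same three groups in a different order.
km_centers = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 1.0]])
gmm_centers = np.array([[1.9, 2.1], [-2.1, 0.9], [0.1, -0.1]])

greedy = greedy_match(gmm_centers, km_centers)
optimal = optimal_match(gmm_centers, km_centers)

# Reorder probability columns so column j refers to K-Means cluster j.
proba = np.array([[0.7, 0.2, 0.1]])
aligned = np.zeros_like(proba)
for g_idx, k_idx in greedy.items():
    aligned[:, k_idx] = proba[:, g_idx]
print(greedy, aligned)
```

When the two models find clearly corresponding centres, as here, both strategies agree. They can differ when centres are close together, in which case the greedy result depends on which pair happens to be matched first.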


## Central Defenders: From Ball-Players to Destroyers

The centre-back position has changed more than almost any other in the past decade. Pep Guardiola’s Barcelona demanded centre-backs who could pass like midfielders; Antonio Conte’s systems wanted aggressive markers who press high and win duels. Today, a "Central Defender" could be:

- A **ball-playing CB** who starts attacks from the back (think Virgil van Dijk or John Stones)
- An **aerial colossus** who dominates set pieces and crosses
- A **no-nonsense defender** who prioritises keeping it simple and staying solid

Let’s see what the data reveals.
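
The call below fixes the number of archetypes with `forced_k`. When that argument is omitted, `cluster_position` falls back to a size rule: it keeps the largest k whose smallest K-Means cluster still contains at least 8% of the rows. A minimal sketch of that rule on synthetic data follows; the helper name `pick_k` and the random blobs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def pick_k(X, k_range=range(3, 8), min_share_pct=8.0, random_state=42):
    """Largest k whose smallest K-Means cluster holds >= min_share_pct of rows."""
    valid = []
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=15,
                        random_state=random_state).fit_predict(X)
        smallest_pct = np.bincount(labels, minlength=k).min() / len(labels) * 100
        if smallest_pct >= min_share_pct:
            valid.append(k)
    # Mirror the notebook's fallback: smallest k if no candidate passes.
    return max(valid) if valid else min(k_range)


# Synthetic data: three loose blobs in four dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 4)) for m in (-3, 0, 3)])

best_k = pick_k(X)
print(best_k)
```

The rule favours finer splits as long as no archetype becomes a tiny niche, so it tends to return a value towards the top of `k_range` on large datasets. That may be one reason to override it with `forced_k`.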

In [None]:
cb_data = df[df["from_position"] == "Central Defender"].copy()
cb_results = cluster_position(cb_data, POSITION_QUALITIES["Central Defender"], forced_k=4)
cb_df, cb_names = show_position_results(cb_data, cb_results, "Central Defender")
plotly_radar_grid(cb_results, cb_names, "Central Defender", pos_data=cb_df)

## Full Backs: The Most Evolved Position in Modern Football

No position in football has changed as dramatically as the full back. In the 1990s, a full back’s job was simple: defend the flank, overlap occasionally, put in a cross. Today, the role is unrecognisable:

- **Inverted full backs** tuck inside like midfielders (João Cancelo under Guardiola)
- **Attacking wingbacks** function almost as wingers (Trent Alexander-Arnold, Achraf Hakimi)
- **Defensive full backs** still exist: solid, positional, no-frills (César Azpilicueta)
- **Creative full backs** provide the primary chance creation from wide areas

This diversity makes full backs one of the most interesting positions for archetype analysis.

In [None]:
fb_data = df[df["from_position"] == "Full Back"].copy()
fb_results = cluster_position(fb_data, POSITION_QUALITIES["Full Back"], forced_k=3)
fb_df, fb_names = show_position_results(fb_data, fb_results, "Full Back")
plotly_radar_grid(fb_results, fb_names, "Full Back", pos_data=fb_df)

## Midfielders: The Engine Room

Midfield is the broadest positional category in football. It contains everything from purely defensive shields (Casemiro at his best) to creative maestros (Kevin De Bruyne) to tireless box-to-box runners (N’Golo Kanté).

The key qualities for midfielders span the full spectrum:
- **Defensive** midfielders score high on Active defence, Pressing, Winning duels
- **Deep playmakers** score high on Passing quality, Composure, Involvement
- **Creative playmakers** score high on Providing teammates, Progression
- **Goal-threat midfielders** stand out on Box threat, Finishing, Run quality

Note: We exclude Chance prevention, Poaching, and Territorial dominance for midfielders — these qualities have insufficient data coverage at this position.

In [None]:
mid_data = df[df["from_position"] == "Midfielder"].copy()
mid_results = cluster_position(mid_data, POSITION_QUALITIES["Midfielder"], forced_k=5)
mid_df, mid_names = show_position_results(mid_data, mid_results, "Midfielder")
plotly_radar_grid(mid_results, mid_names, "Midfielder", pos_data=mid_df)

## Wingers: Inside Forwards, Creators, and Dribblers

The traditional winger — chalk on boots, hugging the touchline, putting in crosses — has largely disappeared from elite football. In its place, we have:

- **Inside forwards** who cut in from the flank to shoot (Mohamed Salah, Kylian Mbappé)
- **Creative wingers** who stay wide and deliver the final ball (Bukayo Saka in creative mode)
- **Explosive dribblers** who beat their man and create chaos (Vinícius Jr.)
- **Pressing wingers** who do the dirty work out of possession

The inside forward has become the dominant attacking type at the elite level, but creative and pressing wingers remain essential for team balance.

In [None]:
wing_data = df[df["from_position"] == "Winger"].copy()
wing_results = cluster_position(wing_data, POSITION_QUALITIES["Winger"], forced_k=5)
wing_df, wing_names = show_position_results(wing_data, wing_results, "Winger")
plotly_radar_grid(wing_results, wing_names, "Winger", pos_data=wing_df)

## Strikers: Poachers, Target Men, and Playmakers

The number 9 role is one of the most debated in football. For years, elite teams tried to play without a traditional striker (the "false nine" era). But the pendulum has swung back: Erling Haaland’s arrival at Manchester City showed that a pure goal machine can fit into even the most possession-based systems.

Today’s strikers broadly fall into:

- **Clinical finishers** who convert chances at elite rates (Haaland, Robert Lewandowski)
- **Target men** who use physicality and aerial prowess (Romelu Lukaku, Alexander Isak)
- **Playmaking strikers** who drop deep and create (Harry Kane, Lautaro Martínez)
- **Pressing forwards** who lead the press from the front (Roberto Firmino, Kai Havertz)
- **Mobile strikers** who run the channels (Jamie Vardy-style)

In [None]:
st_data = df[df["from_position"] == "Striker"].copy()
st_results = cluster_position(st_data, POSITION_QUALITIES["Striker"], forced_k=4)
st_df, st_names = show_position_results(st_data, st_results, "Striker")
plotly_radar_grid(st_results, st_names, "Striker", pos_data=st_df)

## The Full Picture

Now let’s combine every position’s archetypes into a single summary. This gives us a bird’s-eye view of the archetype landscape across modern football.

In [23]:
all_results = []
position_summaries = []

for pos, (pos_df, arch_names) in [
    ("Central Defender", (cb_df, cb_names)),
    ("Full Back",        (fb_df, fb_names)),
    ("Midfielder",       (mid_df, mid_names)),
    ("Winger",           (wing_df, wing_names)),
    ("Striker",          (st_df, st_names)),
]:
    pos_out = pos_df[[
        "wy_player_id", "player_name", "from_position", "from_team_id",
        "from_season", "from_competition", "from_Minutes",
        "archetype_id", "archetype"
    ]].copy()

    # Attach GMM probabilities
    for col in pos_df.columns:
        if col.startswith("prob_"):
            pos_out[col] = pos_df[col]

    all_results.append(pos_out)

    for aid, name in arch_names.items():
        n = (pos_df["archetype_id"] == aid).sum()
        position_summaries.append({
            "position": pos, "archetype": name, "n_players": n
        })

final_df = pd.concat(all_results, ignore_index=True)
summary  = pd.DataFrame(position_summaries)

print(f"Total player-seasons with archetypes: {len(final_df):,}\n")
print(f"Archetypes per position:")
for pos in POSITIONS:
    pos_archs = summary[summary["position"] == pos]
    total = pos_archs["n_players"].sum()
    print(f"\n  {pos} ({total:,} player-seasons):")
    for _, row in pos_archs.iterrows():
        pct = row["n_players"] / total * 100
        print(f"    {row['archetype']:35s}  {row['n_players']:>6,}  ({pct:.0f}%)")

Total player-seasons with archetypes: 87,692

Archetypes per position:

  Central Defender (19,739 player-seasons):
    Ball-Playing CB                       2,975  (15%)
    No-Nonsense CB (Passing)              4,531  (23%)
    No-Nonsense CB (Effectiveness)        6,057  (31%)
    Aggressive Presser CB                 6,176  (31%)

  Full Back (17,764 player-seasons):
    High-Energy Full Back                 5,799  (33%)
    Balanced Full Back                    7,643  (43%)
    Creative Full Back                    4,322  (24%)

  Midfielder (26,148 player-seasons):
    Balanced Midfielder (Intelligent)     7,589  (29%)
    Deep Playmaker                        5,115  (20%)
    Balanced Midfielder (Run)             4,999  (19%)
    Goal-Threat Midfielder                2,844  (11%)
    Destroyer/Anchor                      5,601  (21%)

  Winger (13,635 player-seasons):
    Versatile Winger (Defensive)          3,324  (24%)
    Creative Winger                       1,863  (14%)
  

In [24]:
# Archetype distribution — interactive pie with position buttons

fig = go.Figure()

for pos_idx, pos in enumerate(POSITIONS):
    pos_archs = summary[summary["position"] == pos].sort_values("n_players", ascending=False)
    visible = (pos_idx == 0)

    fig.add_trace(go.Pie(
        labels=pos_archs["archetype"].values,
        values=pos_archs["n_players"].values,
        marker=dict(colors=ARCHETYPE_COLORS[:len(pos_archs)]),
        name=pos,
        visible=visible,
        textinfo='label+percent',
        textposition='auto',
        textfont=dict(size=12),
        hovertemplate='%{label}: %{value:,} player-seasons (%{percent})<extra>' + pos + '</extra>',
        hole=0.35,
    ))

buttons = []
for pos_idx, pos in enumerate(POSITIONS):
    visibility = [i == pos_idx for i in range(len(POSITIONS))]
    buttons.append(dict(
        label=pos,
        method='update',
        args=[{'visible': visibility},
              {'title': dict(text=f'Archetype Distribution — {pos}', font=dict(size=18))}],
    ))

fig.update_layout(
    updatemenus=[dict(
        type='buttons', direction='right',
        x=0.5, xanchor='center', y=1.18,
        buttons=buttons,
        showactive=True,
        bgcolor='#E8E8E8', bordercolor='#888',
        font=dict(size=12),
        active=0,
    )],
    height=550, width=700,
    template='plotly_white',
    title=dict(text=f'Archetype Distribution — {POSITIONS[0]}', font=dict(size=18)),
    legend=dict(font=dict(size=11)),
    margin=dict(t=100),
)
fig.show()


## Famous Players: Role Evolution Over Time

One of the most powerful features of GMM-based archetypes is that we can track how a player’s role *evolves* across seasons. A player is not just "a Clinical Finisher" — they might be 70% Clinical Finisher in one season, then shift towards 50% Clinical Finisher / 30% Playmaking Striker the next.

This captures real tactical evolution:
- **Harry Kane** transitioned from a pure finisher to a playmaking striker over time
- **Trent Alexander-Arnold** evolved from an overlapping full back to a creative/inverted role
- **Kevin De Bruyne** might shift between deep playmaker and goal-threat midfielder depending on Guardiola’s setup

Below, stacked-bar charts show how archetype probabilities change season by season for the six players from the list with the most seasons in the data.
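
The size of a role shift can also be expressed as a single number. The sketch below is illustrative only: the probabilities are made up, and the `role_shift` measure (half the summed absolute change between consecutive seasons, i.e. the total variation distance) is an assumption of this example, not something computed elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical archetype probabilities for one striker across three seasons.
seasons = pd.DataFrame({
    "from_season": [2021, 2022, 2023],
    "prob_Clinical Finisher": [0.70, 0.50, 0.35],
    "prob_Playmaking Striker": [0.20, 0.30, 0.55],
    "prob_Complete Forward (Finishing)": [0.10, 0.20, 0.10],
}).sort_values("from_season").set_index("from_season")

# Dominant archetype per season: the column with the highest probability.
dominant = seasons.idxmax(axis=1).str.replace("prob_", "", regex=False)

# Season-on-season shift: half the summed absolute change in probabilities.
# 0 means no change; 1 means a complete switch of role.
role_shift = seasons.diff().abs().sum(axis=1).div(2)
role_shift.iloc[0] = float("nan")  # the first season has nothing to compare with

print(pd.DataFrame({"dominant": dominant, "role_shift": role_shift}))
```

A value near 0 suggests a stable role, while larger values flag seasons worth inspecting in the charts.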

In [25]:
# Famous players - direct PID lookup (no fuzzy name matching)
# All current Premier League / La Liga / Bundesliga stars
FAMOUS_PLAYERS = {
    "Mohamed Salah":    {"pid": 120353, "position": "Winger"},
    "E. Haaland":       {"pid": 427097, "position": "Striker"},
    "K. De Bruyne":     {"pid": 38021,  "position": "Midfielder"},
    "V. van Dijk":      {"pid": 370,    "position": "Central Defender"},
    "T. Alexander-Arnold": {"pid": 346101, "position": "Full Back"},
    "B. Saka":          {"pid": 520291, "position": "Winger"},
    "H. Kane":          {"pid": 8717,   "position": "Striker"},
    "Son Heung-Min":    {"pid": 14911,  "position": "Winger"},
    "C. Palmer":        {"pid": 522051, "position": "Midfielder"},
    "Rodri":            {"pid": 364860, "position": "Midfielder"},
    "D. Rice":          {"pid": 379209, "position": "Midfielder"},
    "Vinícius Júnior":  {"pid": 493295, "position": "Winger"},
    "K. Mbappé":        {"pid": 353833, "position": "Striker"},
}

found_players = []
for display_name, info in FAMOUS_PLAYERS.items():
    pid = info["pid"]
    pos = info["position"]
    matches = final_df[final_df["wy_player_id"] == pid]
    if len(matches) > 0:
        n_seasons = matches["from_season"].nunique()
        if n_seasons >= 2:
            found_players.append({
                "name": display_name, "pid": pid,
                "position": pos, "n_seasons": n_seasons,
            })
            print(f"  Found: {display_name} (pid={pid}, {pos}, {n_seasons} seasons)")
        else:
            # A single season cannot show any evolution, so the player is skipped
            print(f"  SKIPPED: {display_name} (pid={pid}, only {n_seasons} season)")
    else:
        print(f"  NOT FOUND: {display_name} (pid={pid})")

# Pick top 6 with most seasons (stable sort: ties keep the order of FAMOUS_PLAYERS)
found_df = pd.DataFrame(found_players).drop_duplicates(subset="pid")
found_df = found_df.sort_values("n_seasons", ascending=False, kind="stable")
showcase_players = found_df.head(6)
n_show = len(showcase_players)

print(f"\nShowcasing {n_show} players for role evolution charts.")


  Found: Mohamed Salah (pid=120353, Winger, 7 seasons)
  Found: E. Haaland (pid=427097, Striker, 6 seasons)
  Found: K. De Bruyne (pid=38021, Midfielder, 6 seasons)
  Found: V. van Dijk (pid=370, Central Defender, 5 seasons)
  Found: T. Alexander-Arnold (pid=346101, Full Back, 6 seasons)
  Found: B. Saka (pid=520291, Winger, 6 seasons)
  Found: H. Kane (pid=8717, Striker, 7 seasons)
  Found: Son Heung-Min (pid=14911, Winger, 7 seasons)
  Found: C. Palmer (pid=522051, Midfielder, 5 seasons)
  Found: Rodri (pid=364860, Midfielder, 5 seasons)
  Found: D. Rice (pid=379209, Midfielder, 7 seasons)
  Found: Vinícius Júnior (pid=493295, Winger, 6 seasons)
  Found: K. Mbappé (pid=353833, Striker, 5 seasons)

Showcasing 6 players for role evolution charts.


In [None]:
# Role probability evolution — one chart per player
# ONLY use archetypes from the player's own position (not other positions)

# Build a lookup: position → list of archetype names
pos_archetype_names = {
    pos: list(arch_names.values())
    for pos, arch_names in [
        ("Central Defender", cb_names),
        ("Full Back",        fb_names),
        ("Midfielder",       mid_names),
        ("Winger",           wing_names),
        ("Striker",          st_names),
    ]
}

for _, player in showcase_players.iterrows():
    p_data = final_df[final_df["wy_player_id"] == player["pid"]].sort_values("from_season")

    # Only use prob_ columns that match THIS player's position archetypes
    player_pos = player["position"]
    valid_archetypes = pos_archetype_names.get(player_pos, [])
    prob_cols = [f"prob_{a}" for a in valid_archetypes if f"prob_{a}" in p_data.columns]

    if len(prob_cols) == 0 or len(p_data) == 0:
        continue

    season_labels = [
        f"{int(s) % 100:02d}/{(int(s) + 1) % 100:02d}"
        for s in p_data["from_season"].values
    ]

    fig = go.Figure()
    for i, col_name in enumerate(prob_cols):
        archetype_name = col_name.replace("prob_", "")
        probs = p_data[col_name].values * 100
        color = ARCHETYPE_COLORS[i % len(ARCHETYPE_COLORS)]
        text_vals = [f"{v:.0f}%" if v >= 12 else "" for v in probs]
        fig.add_trace(go.Bar(
            x=season_labels, y=probs, name=archetype_name,
            marker_color=color, text=text_vals,
            textposition="inside", textfont=dict(size=11, color="white"),
            hovertemplate=f"{archetype_name}: %{{y:.0f}}%<extra>{player['name']}</extra>",
        ))

    for s_idx, (_, s_row) in enumerate(p_data.iterrows()):
        probs_row = {col.replace("prob_", ""): s_row[col] * 100
                     for col in prob_cols if col in s_row.index}
        if probs_row:
            dominant = max(probs_row, key=probs_row.get)
            fig.add_annotation(
                x=season_labels[s_idx], y=105,
                text=f"<b>{dominant}</b>",
                showarrow=False, font=dict(size=9, color="#333"), yref="y",
            )

    fig.update_layout(
        barmode="stack", height=400, width=700,
        template="plotly_white",
        title_text=f"{player['name']} — Role Evolution ({player_pos})",
        title_font_size=16,
        yaxis=dict(range=[0, 115], title="Role probability (%)", tickvals=[0, 25, 50, 75, 100]),
        xaxis=dict(title="Season"),
        legend=dict(orientation="h", yanchor="bottom", y=-0.25, xanchor="center", x=0.5, font=dict(size=10)),
        margin=dict(t=60, b=80),
    )
    fig.show()


## K-Means vs. GMM: Which Approach Is Better?

Both methods are included for the supervisor to review. Here is a concise comparison:

| Aspect | K-Means | GMM |
|--------|---------|-----|
| **Assignment** | Hard: each player belongs to exactly one archetype | Soft: each player has a probability for every archetype |
| **Shape of archetypes** | Spherical (equal weight in all directions) | Ellipsoidal (can capture correlations between qualities) |
| **Interpretability** | Simpler: "Player X is a Ball-Playing CB" | Richer: "Player X is 72% Ball-Playing CB, 20% Aggressive Presser CB" |
| **For transfer modelling** | Easy to use as a categorical feature | Probability columns can be used directly as continuous features |
| **Robustness** | More stable with small clusters | Can overfit with too many components or small data |

**Recommendation**: Use GMM probabilities as features in the transfer model. The soft assignment captures the reality that most players are *blends* of archetypes, not pure types. The K-Means labels are useful for communication and labelling (e.g., dashboard filters).
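
The difference between hard and soft assignment can be seen on a toy example. This is a minimal sketch on synthetic data: the two overlapping groups, the four dimensions, and `covariance_type="full"` are assumptions chosen for illustration and do not reproduce the notebook's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for quality scores: two overlapping groups, 4 dimensions.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 4)),
    rng.normal(loc=2.0, scale=1.0, size=(200, 4)),
])
X = StandardScaler().fit_transform(X)

# Hard assignment: exactly one integer label per row.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft assignment: one probability per component, each row summing to 1.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)

print(hard_labels[:3])
print(soft_probs[:3].round(2))
```

Rows near the boundary between the two groups receive split probabilities from the GMM, whereas K-Means still forces them into a single cluster.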

## Export

Save the enriched dataset with archetype labels and GMM probabilities for downstream use in the transfer model.

In [27]:
out_path = PROCESSED / "player_archetypes" / "player_archetypes.parquet"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder on first run
final_df.to_parquet(out_path, index=False)

print(f"Saved: {out_path}")
print(f"Shape: {final_df.shape}")
print("\nColumns:")
for c in final_df.columns:
    print(f"  {c}")
print("\nSample archetype distribution:")
print(final_df["archetype"].value_counts().head(10))

Saved: /Users/jorgepadilla/Documents/Documents - Jorge’s MacBook Air/thesis_data/raw_data/Transfers/player_archetypes.parquet
Shape: (87692, 30)

Columns:
  wy_player_id
  player_name
  from_position
  from_team_id
  from_season
  from_competition
  from_Minutes
  archetype_id
  archetype
  prob_Ball-Playing CB
  prob_No-Nonsense CB (Passing)
  prob_No-Nonsense CB (Effectiveness)
  prob_Aggressive Presser CB
  prob_High-Energy Full Back
  prob_Balanced Full Back
  prob_Creative Full Back
  prob_Balanced Midfielder (Intelligent)
  prob_Deep Playmaker
  prob_Balanced Midfielder (Run)
  prob_Goal-Threat Midfielder
  prob_Destroyer/Anchor
  prob_Versatile Winger (Defensive)
  prob_Creative Winger
  prob_Inside Forward
  prob_Versatile Winger (Progression)
  prob_Pressing Winger
  prob_Playmaking Striker
  prob_Clinical Finisher
  prob_Complete Forward (Defensive)
  prob_Complete Forward (Finishing)

Sample archetype distribution:
archetype
Balanced Full Back                   7643
Balanced
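
For downstream use, the `prob_` columns could be turned into a feature matrix roughly as follows. This is a sketch on a hypothetical two-row frame; it assumes that a player-season only carries probabilities for its own position's archetypes and that the remaining `prob_` columns are missing (NaN), which should be verified against the exported file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the exported layout. NaN outside a player's own
# position is an assumption about how the per-position frames were combined.
df = pd.DataFrame({
    "wy_player_id": [1, 2],
    "from_position": ["Striker", "Full Back"],
    "prob_Clinical Finisher": [0.91, np.nan],
    "prob_Playmaking Striker": [0.09, np.nan],
    "prob_Creative Full Back": [np.nan, 0.60],
    "prob_Balanced Full Back": [np.nan, 0.40],
})

prob_cols = [c for c in df.columns if c.startswith("prob_")]

# Archetypes from other positions get probability 0, so every row sums to 1.
features = df[prob_cols].fillna(0.0)
print(features)
```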

## Assumptions & Methodological Notes

| # | Assumption | Rationale |
|---|-----------|-----------|
| 1 | **Division 1 only** | We restrict to top-flight leagues to ensure quality scores are comparable. Second-division players face different opposition quality. |
| 2 | **No goalkeepers** | Goalkeepers have a fundamentally different quality profile. They are excluded entirely. |
| 3 | **20 Twelve Football quality scores** | These are the core player evaluation dimensions provided by Twelve Football’s framework. |
| 4 | **Position-specific quality exclusions** | Qualities with <80% data coverage for a position are excluded to avoid distortion from missing data (e.g., Poaching excluded for CBs). |
| 5 | **Both K-Means and GMM presented** | K-Means gives clean labels; GMM gives richer probabilities. Both are included for the supervisor to decide which to use downstream. |
| 6 | **Number of archetypes auto-selected** | For each position, we test k=3 to 7 and pick the largest k where the smallest archetype still contains at least 8% of players. This avoids micro-clusters. |
| 7 | **De-duplication** | First occurrence per (player, team, season) is kept. A player who transferred mid-season appears once per club-season. |
| 8 | **Minimum 500 minutes** | Already enforced in the source data — only player-seasons with at least 500 minutes are included. |
| 9 | **Re-standardisation before clustering** | Quality scores are standardised (mean=0, std=1) within each position group before clustering, so that all qualities contribute equally. |
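
Assumptions 6 and 9 can be sketched together as follows. This is a minimal illustration, not the notebook's own implementation: the helper name `select_k`, the use of K-Means for the size check, and the synthetic input are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def select_k(X, k_values=range(3, 8), min_share=0.08, random_state=0):
    """Return the largest k whose smallest cluster holds >= min_share of rows.

    Returns None if no candidate k satisfies the size constraint.
    """
    # Assumption 9: standardise each quality to mean 0, std 1 before clustering.
    X_std = StandardScaler().fit_transform(X)
    best_k = None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_std)
        smallest_share = np.bincount(labels, minlength=k).min() / len(labels)
        if smallest_share >= min_share:
            best_k = k  # candidates are tried in ascending order, so the largest valid k wins
    return best_k


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))  # synthetic stand-in for one position group
k_chosen = select_k(X_demo)
print(k_chosen)
```

Raising `min_share` favours fewer, broader archetypes; lowering it allows more specialised ones at the cost of smaller clusters.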

In [28]:
# Assumptions checklist - verify key filters
print("ASSUMPTIONS CHECKLIST")
print("=" * 50)
checks = []

def report(label, ok=True):
    """Print a status line and record whether the check passed."""
    checks.append(bool(ok))
    print(f"[{'OK' if ok else 'FAIL'}] {label}")

# 1. Division 1 only (informational: the filter itself is applied upstream)
n_comps = final_df["from_competition"].nunique()
report(f"Division 1 only: {n_comps} competitions")

# 2. No goalkeepers
gk_count = (final_df["from_position"] == "Goalkeeper").sum()
report(f"No goalkeepers: {gk_count} goalkeeper rows (should be 0)", gk_count == 0)

# 3. Positions covered
for pos in POSITIONS:
    pos_rows = final_df[final_df["from_position"] == pos]
    n_arch = pos_rows["archetype"].nunique()
    report(f"{pos}: {len(pos_rows):,} player-seasons, {n_arch} archetypes", len(pos_rows) > 0)

# 4. GMM probabilities present
prob_cols = [c for c in final_df.columns if c.startswith("prob_")]
report(f"GMM probability columns: {len(prob_cols)}", len(prob_cols) > 0)

# 5. Season range (informational)
report(f"Seasons: {sorted(final_df['from_season'].unique())}")

print("\nAll checks passed." if all(checks) else "\nSome checks FAILED.")

ASSUMPTIONS CHECKLIST
[OK] Division 1 only: 128 competitions
[OK] No goalkeepers: 0 goalkeeper rows (should be 0)
[OK] Central Defender: 19,739 player-seasons, 4 archetypes
[OK] Full Back: 17,764 player-seasons, 3 archetypes
[OK] Midfielder: 26,148 player-seasons, 5 archetypes
[OK] Winger: 13,635 player-seasons, 5 archetypes
[OK] Striker: 10,406 player-seasons, 4 archetypes
[OK] GMM probability columns: 21
[OK] Seasons: [np.int16(2018), np.int16(2019), np.int16(2020), np.int16(2021), np.int16(2022), np.int16(2023), np.int16(2024), np.int16(2025)]

All checks passed.
