# Player Archetypes in Modern Football

Not all centre-backs are the same, and no two strikers play alike. Modern football has evolved far beyond simple position labels: a "midfielder" could be a deep-lying playmaker, a box-to-box runner, or a creative number 10.

This analysis identifies **distinct player archetypes** within each position by examining 20 quality dimensions that capture how a player actually plays (17 to 19 of them are used for any one position, depending on relevance and data coverage). We analyse **93,000+ outfield player-seasons** across **129 top-flight leagues** (2018–2025).

We use two approaches:
- **K-Means**: hard assignment, where each player belongs to exactly one archetype
- **Gaussian Mixture Model (GMM)**: soft assignment, where each player has a *probability* of belonging to each archetype (e.g., "91% Clinical Finisher, 6% Target Man, 3% False Nine")

The GMM approach is richer because real players don’t fit neatly into one box. A player like Harry Kane might be 70% Clinical Finisher and 30% Playmaking Striker — and that mix changes over seasons.
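To make the hard/soft distinction concrete, here is a minimal sketch in plain Python. It assumes two hypothetical one-dimensional components with made-up means, spreads, and weights (they are not fitted to the dataset). `hard_assignment` mimics K-Means by picking the nearest centre, while `soft_assignment` computes the kind of posterior probabilities a fitted mixture model reports.

```python
import math

# Hypothetical 1-D components: (mean, std, mixing weight). Illustrative only.
COMPONENTS = {
    "Clinical Finisher": (1.0, 0.5, 0.5),
    "Playmaking Striker": (-1.0, 0.5, 0.5),
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def hard_assignment(x):
    """K-Means-style: the single nearest centre wins."""
    return min(COMPONENTS, key=lambda name: abs(x - COMPONENTS[name][0]))


def soft_assignment(x):
    """GMM-style: posterior probability of each component given x."""
    weighted = {
        name: weight * gaussian_pdf(x, mean, std)
        for name, (mean, std, weight) in COMPONENTS.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


score = 0.2  # a player sitting between the two profiles
print(hard_assignment(score))
print({name: round(p, 2) for name, p in soft_assignment(score).items()})
```

For this in-between score the hard label is simply "Clinical Finisher", whereas the soft assignment keeps roughly a one-in-six chance on "Playmaking Striker". That residual probability is the information a hard label throws away.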

In [15]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import Patch
import matplotlib.patheffects as pe
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'figure.facecolor': '#FAFAFA',
    'axes.facecolor': '#FAFAFA',
    'axes.grid': False,
    'font.family': 'sans-serif',
    'font.size': 11,
})

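# Paths: locate the data folder (machine-specific location)
docs = Path("/Users/jorgepadilla/Documents")
for d in docs.iterdir():
    if "Jorge" in d.name and "MacBook" in d.name and d.is_dir():
        RAW = d / "thesis_data" / "raw_data"
        PROCESSED = d / "thesis_data" / "processed_data"
        break
else:
    # Fail early rather than hit a NameError on RAW/PROCESSED further down
    raise FileNotFoundError(f"No matching data folder found in {docs}")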

# Load + filter
tf = pd.read_parquet(PROCESSED / "master_dataset" / "transfers_model_v2_2018_2025.parquet")
comps = pd.read_parquet(RAW / "Wyscout" / "competitions_wyscout.parquet")
div1_ids = comps[comps["division"] == 1]["competition_id"].unique()
tf = tf[tf["from_competition"].isin(div1_ids)].copy()

# De-duplicate: keep first per (player, team, season)
df = tf.drop_duplicates(subset=["wy_player_id", "from_team_id", "from_season"]).copy()

# Exclude goalkeepers
df = df[df["from_position"] != "Goalkeeper"].copy()

# Player name: prefer short_name
df["player_name"] = df["short_name"].fillna(df["wyscout_last_name"]).fillna("Unknown")

POSITIONS = ["Central Defender", "Full Back", "Midfielder", "Winger", "Striker"]
EPL_COMP_ID = 364

print(f"Loaded: {len(df):,} player-seasons from {df.from_competition.nunique()} top-flight leagues")
print(f"Seasons: {sorted(df.from_season.unique())}")
print(f"\nPositions:")
for pos in POSITIONS:
    print(f"  {pos:20s}: {(df.from_position == pos).sum():,}")

Loaded: 93,098 player-seasons from 129 top-flight leagues
Seasons: [np.int16(2018), np.int16(2019), np.int16(2020), np.int16(2021), np.int16(2022), np.int16(2023), np.int16(2024), np.int16(2025)]

Positions:
  Central Defender    : 21,841
  Full Back           : 19,105
  Midfielder          : 26,614
  Winger              : 14,067
  Striker             : 11,471


## The 20 Qualities That Define a Footballer

Every player-season in our dataset is described by 20 **quality scores** developed by Twelve Football. These are not raw counting stats — they are contextualised ratings that capture *how well* a player does something, not just how often. A quality score of +1.0 means the player is roughly one level above average for their position; −1.0 is one level below.

We group them into five families:

### Defending
| Quality | What it captures |
|---------|-----------------|
| **Active defence** | Willingness and success at winning the ball back — tackles, interceptions, recoveries |
| **Pressing** | Intensity and effectiveness of pressing the opponent when out of possession |
| **Intelligent defence** | Positional awareness, reading passing lanes, anticipation |
| **Winning duels** | Success in 1v1 ground and aerial duels in defensive situations |
| **Chance prevention** | Ability to block shots, reduce xG through defensive positioning |
| **Territorial dominance** | Control of defensive territory — how much area a defender commands |
| **Defensive heading** | Aerial prowess in defensive situations — clearing crosses and set pieces |

### Ball Progression
| Quality | What it captures |
|---------|-----------------|
| **Dribbling** | Carrying ability — beating opponents on the ball, retaining under pressure |
| **Progression** | Moving the ball forward effectively through passes and carries |
| **Run quality** | Movement off the ball — making intelligent runs into dangerous areas |
| **Composure** | Retaining possession under pressure, avoiding turnovers |

### Creating
| Quality | What it captures |
|---------|-----------------|
| **Passing quality** | Technical range and accuracy of passing — short, long, through balls |
| **Providing teammates** | Creating chances for teammates — key passes, assists, pre-assists |
| **Involvement** | Overall contribution to team build-up — touches, receptions, link-up play |

### Scoring
| Quality | What it captures |
|---------|-----------------|
| **Box threat** | Presence in the penalty area — getting into dangerous positions |
| **Finishing** | Converting chances — goals relative to xG, shot technique |
| **Poaching** | Pure goalscoring instinct — anticipation, positioning for tap-ins |
| **Effectiveness** | Clinical efficiency — doing a lot with limited touches |

### Physical / Aerial
| Quality | What it captures |
|---------|-----------------|
| **Aerial threat** | Offensive aerial presence — winning headers in advanced areas |
| **Hold-up play** | Ability to receive with back to goal and retain possession for the team |

> **Position-specific filtering**: Not every quality is relevant for every position. For example, "Chance prevention" has less than 80% data coverage for midfielders, wingers, and strikers — so we exclude it for those positions. This avoids distorting the archetype detection with noisy or missing data.
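
The coverage check described in the note above can be sketched as follows. The 80% threshold comes from the text; the helper names and the toy rows are illustrative assumptions, not the notebook's own code or data.

```python
def coverage(rows, quality):
    """Share of rows in which the quality has a non-missing value."""
    present = sum(1 for row in rows if row.get(quality) is not None)
    return present / len(rows)


def qualities_with_coverage(rows, qualities, min_coverage=0.80):
    """Keep only the qualities observed in at least min_coverage of rows."""
    return [q for q in qualities if coverage(rows, q) >= min_coverage]


# Toy midfielder rows: "Chance prevention" is missing for most of them
toy_midfielders = [
    {"Pressing": 0.4, "Chance prevention": None},
    {"Pressing": -0.1, "Chance prevention": 0.2},
    {"Pressing": 0.9, "Chance prevention": None},
    {"Pressing": 0.0, "Chance prevention": None},
    {"Pressing": -0.6, "Chance prevention": 0.1},
]

kept = qualities_with_coverage(toy_midfielders, ["Pressing", "Chance prevention"])
print(kept)  # ['Pressing']
```

Dropping a sparsely covered quality up front also keeps more rows in the analysis, because the clustering step discards any player-season with a missing value in one of the selected qualities.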

In [16]:
ALL_QUALITIES = [
    "Active defence", "Aerial threat", "Box threat", "Chance prevention",
    "Composure", "Defensive heading", "Dribbling", "Effectiveness",
    "Finishing", "Hold-up play", "Intelligent defence", "Involvement",
    "Passing quality", "Poaching", "Pressing", "Progression",
    "Providing teammates", "Run quality", "Territorial dominance", "Winning duels"
]

POSITION_QUALITIES = {
    "Central Defender": [q for q in ALL_QUALITIES if q not in ["Hold-up play", "Poaching"]],
    "Full Back":        [q for q in ALL_QUALITIES if q not in ["Poaching"]],
    "Midfielder":       [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Poaching", "Territorial dominance"]],
    "Winger":           [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Territorial dominance"]],
    "Striker":          [q for q in ALL_QUALITIES if q not in ["Chance prevention", "Territorial dominance"]],
}

# Colour palette for archetypes (up to 8)
ARCHETYPE_COLORS = ['#1B4F72', '#922B21', '#1E8449', '#B7950B', '#6C3483', '#1A5276', '#7B241C', '#196F3D']

for pos, quals in POSITION_QUALITIES.items():
    print(f"{pos:20s}: {len(quals)} qualities")

Central Defender    : 18 qualities
Full Back           : 19 qualities
Midfielder          : 17 qualities
Winger              : 18 qualities
Striker             : 18 qualities


In [17]:
def cluster_position(data, qualities, k_range=range(3, 8), random_state=42, forced_k=None):
    """
    Cluster a position using both K-Means and GMM.
    If forced_k is provided, skip auto-selection and use that k directly.
    Returns a results dict with models, labels, probabilities, and metadata.
    """
    q_cols = [f"from_{q}" for q in qualities]
    X_raw = data[q_cols].dropna()
    valid_idx = X_raw.index

    scaler = StandardScaler()
    X = scaler.fit_transform(X_raw)

    if forced_k is not None:
        best_k = forced_k
    else:
        # K-Means: fit every k in the range, then pick the largest k whose smallest cluster holds >= 8% of rows
        km_results = {}
        for k in k_range:
            km = KMeans(n_clusters=k, n_init=15, random_state=random_state)
            labels = km.fit_predict(X)
            sizes = np.bincount(labels)
            smallest_pct = sizes.min() / len(labels) * 100
            km_results[k] = {"labels": labels, "model": km, "smallest_pct": smallest_pct}

        valid_ks = [k for k, v in km_results.items() if v["smallest_pct"] >= 8]
        best_k = max(valid_ks) if valid_ks else min(k_range)

    # K-Means at best_k
    km_model = KMeans(n_clusters=best_k, n_init=15, random_state=random_state)
    km_labels = km_model.fit_predict(X)

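    # GMM at same k, initialised from the K-Means centres so that component i
    # starts at K-Means cluster i. Without this, the GMM component order is
    # arbitrary and the prob_<archetype> columns built later could be mislabelled.
    gmm = GaussianMixture(
        n_components=best_k, random_state=random_state,
        n_init=5, covariance_type='full',
        means_init=km_model.cluster_centers_,
    )
    gmm.fit(X)
    gmm_labels = gmm.predict(X)
    gmm_proba  = gmm.predict_proba(X)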

    # K-Means centres in original (quality-score) scale
    km_centers_orig = scaler.inverse_transform(km_model.cluster_centers_)

    return {
        "valid_idx": valid_idx,
        "X": X,
        "scaler": scaler,
        "best_k": best_k,
        "km_labels": km_labels,
        "km_model": km_model,
        "km_centers": km_centers_orig,
        "gmm_model": gmm,
        "gmm_labels": gmm_labels,
        "gmm_proba": gmm_proba,
        "qualities": qualities,
        "q_cols": [f"from_{q}" for q in qualities],
    }


def name_archetype(center, qualities, position):
    """Auto-name an archetype based on its dominant qualities."""
    scores = dict(zip(qualities, center))

    if position == "Central Defender":
        if scores.get("Passing quality", 0) > 0.5 and scores.get("Progression", 0) > 0.3:
            return "Ball-Playing CB"
        if scores.get("Aerial threat", 0) > 0.5 and scores.get("Defensive heading", 0) > 0.3:
            return "Aerial Dominant CB"
        if scores.get("Active defence", 0) > 0.5 and scores.get("Pressing", 0) > 0.3:
            return "Aggressive Presser CB"
        if scores.get("Intelligent defence", 0) > 0.5:
            return "Reading-the-Game CB"
        if scores.get("Territorial dominance", 0) > 0.5:
            return "Commanding CB"
        if scores.get("Winning duels", 0) > 0.3:
            return "Duel-Winning CB"
        return "No-Nonsense CB"

    elif position == "Full Back":
        if scores.get("Providing teammates", 0) > 0.5 or scores.get("Passing quality", 0) > 0.5:
            return "Creative Full Back"
        if scores.get("Progression", 0) > 0.5 and scores.get("Dribbling", 0) > 0.3:
            return "Attacking Wingback"
        if scores.get("Active defence", 0) > 0.5 and scores.get("Territorial dominance", 0) > 0.3:
            return "Defensive Full Back"
        if scores.get("Run quality", 0) > 0.5:
            return "Overlapping Full Back"
        if scores.get("Composure", 0) > 0.5:
            return "Inverted Full Back"
        if scores.get("Pressing", 0) > 0.3:
            return "High-Energy Full Back"
        return "Balanced Full Back"

    elif position == "Midfielder":
        # Destroyer/Anchor: high Active defence + high Winning duels + low Providing teammates
        if (scores.get("Active defence", 0) > 0.4
            and scores.get("Winning duels", 0) > 0.3
            and scores.get("Providing teammates", 0) < 0.1):
            return "Destroyer/Anchor"
        if scores.get("Passing quality", 0) > 0.5 and scores.get("Composure", 0) > 0.3:
            return "Deep Playmaker"
        if scores.get("Box threat", 0) > 0.3 and scores.get("Finishing", 0) > 0.3:
            return "Goal-Threat Midfielder"
        if scores.get("Pressing", 0) > 0.5 and scores.get("Active defence", 0) > 0.3:
            return "Pressing Midfielder"
        if scores.get("Progression", 0) > 0.5 and scores.get("Dribbling", 0) > 0.3:
            return "Progressive Carrier"
        if scores.get("Providing teammates", 0) > 0.5:
            return "Creative Playmaker"
        if scores.get("Involvement", 0) > 0.5:
            return "Box-to-Box Engine"
        if scores.get("Active defence", 0) > 0.3 and scores.get("Winning duels", 0) > 0.3:
            return "Holding Midfielder"
        return "Balanced Midfielder"

    elif position == "Winger":
        if scores.get("Finishing", 0) > 0.5 and scores.get("Box threat", 0) > 0.3:
            return "Inside Forward"
        # Runner/Speedster: high Run quality + high Progression + moderate/low Providing teammates
        if (scores.get("Run quality", 0) > 0.4
            and scores.get("Progression", 0) > 0.3
            and scores.get("Providing teammates", 0) < 0.3):
            return "Runner/Speedster"
        if scores.get("Providing teammates", 0) > 0.5:
            return "Creative Winger"
        if scores.get("Dribbling", 0) > 0.5 and scores.get("Run quality", 0) > 0.3:
            return "Explosive Dribbler"
        if scores.get("Pressing", 0) > 0.5:
            return "Pressing Winger"
        if scores.get("Progression", 0) > 0.5:
            return "Progressive Winger"
        if scores.get("Involvement", 0) > 0.3:
            return "Link-Up Winger"
        return "Versatile Winger"

    elif position == "Striker":
        if scores.get("Finishing", 0) > 0.5 and scores.get("Effectiveness", 0) > 0.3:
            return "Clinical Finisher"
        # Pressing Forward: high Pressing + high Active defence
        if scores.get("Pressing", 0) > 0.4 and scores.get("Active defence", 0) > 0.3:
            return "Pressing Forward"
        if scores.get("Hold-up play", 0) > 0.5 and scores.get("Aerial threat", 0) > 0.3:
            return "Target Man"
        if scores.get("Providing teammates", 0) > 0.5 and scores.get("Involvement", 0) > 0.3:
            return "Playmaking Striker"
        if scores.get("Dribbling", 0) > 0.5:
            return "Dribbling Striker"
        if scores.get("Poaching", 0) > 0.5:
            return "Poacher"
        if scores.get("Run quality", 0) > 0.5:
            return "Mobile Striker"
        if scores.get("Box threat", 0) > 0.3:
            return "Box Presence Striker"
        return "Complete Forward"

    return "Unknown"


def show_position_results(data, results, position):
    """
    Print archetype summaries and example players for one position.
    Returns the annotated DataFrame and archetype name mapping.
    """
    valid_idx = results["valid_idx"]
    pos_data = data.loc[valid_idx].copy()
    pos_data["archetype_id"] = results["km_labels"]
    qualities = results["qualities"]
    centers   = results["km_centers"]
    best_k    = results["best_k"]
    gmm_proba = results["gmm_proba"]

    # Name archetypes — ensure no duplicates
    archetype_names = {}
    used_names = {}
    for aid in range(best_k):
        name = name_archetype(centers[aid], qualities, position)
        if name in used_names:
            # Differentiate using the strongest secondary trait
            prev_aid = used_names[name]
            # Rename previous one if not already renamed
            if archetype_names[prev_aid] == name:
                prev_scores = dict(zip(qualities, centers[prev_aid]))
                prev_top = sorted(prev_scores.items(), key=lambda x: -x[1])
                prev_secondary = prev_top[1][0] if len(prev_top) > 1 else prev_top[0][0]
                archetype_names[prev_aid] = f"{name} ({prev_secondary.split()[0]})"
            # Name current one with its secondary trait
            curr_scores = dict(zip(qualities, centers[aid]))
            curr_top = sorted(curr_scores.items(), key=lambda x: -x[1])
            curr_secondary = curr_top[1][0] if len(curr_top) > 1 else curr_top[0][0]
            name = f"{name} ({curr_secondary.split()[0]})"
        used_names[name] = aid
        archetype_names[aid] = name

    pos_data["archetype"] = pos_data["archetype_id"].map(archetype_names)

    # Store GMM probabilities
    for aid in range(best_k):
        pos_data[f"prob_{archetype_names[aid]}"] = gmm_proba[:, aid]

    # Print summary
    print(f"\n{'=' * 70}")
    print(f"  {position.upper()}: {best_k} ARCHETYPES IDENTIFIED")
    print(f"{'=' * 70}")

    for aid in range(best_k):
        mask = pos_data["archetype_id"] == aid
        n    = mask.sum()
        pct  = n / len(pos_data) * 100
        name = archetype_names[aid]

        # Prefer EPL examples
        epl = pos_data[mask & (pos_data["from_competition"] == EPL_COMP_ID)]
        if len(epl) >= 3:
            top = epl.nlargest(5, "from_Minutes")
        else:
            top = pos_data[mask].nlargest(5, "from_Minutes")
        top_names = list(top["player_name"].unique()[:5])

        print(f"\n  {name}  ({n:,} player-seasons, {pct:.0f}%)")
        print(f"  Examples: {', '.join(top_names)}")

        # Show standout qualities (|centre value| > 0.3)
        for q, val in zip(qualities, centers[aid]):
            if abs(val) > 0.3:
                arrow = "^" if val > 0 else "v"
                print(f"    {arrow} {q}: {val:+.2f}")

    return pos_data, archetype_names


# Abbreviation map for long quality names on radar axes
_QUALITY_SHORT = {
    "Active defence": "Act. Def.",
    "Aerial threat": "Aerial",
    "Box threat": "Box Thr.",
    "Chance prevention": "Ch. Prev.",
    "Composure": "Compos.",
    "Defensive heading": "Def. Head.",
    "Dribbling": "Dribble",
    "Effectiveness": "Effect.",
    "Finishing": "Finish",
    "Hold-up play": "Hold-up",
    "Intelligent defence": "Int. Def.",
    "Involvement": "Involv.",
    "Passing quality": "Pass Qual.",
    "Poaching": "Poach",
    "Pressing": "Press",
    "Progression": "Progr.",
    "Providing teammates": "Prov. TM",
    "Run quality": "Run Qual.",
    "Territorial dominance": "Territ.",
    "Winning duels": "Win Duels",
}


def plotly_radar_grid(results, archetype_names, position):
    """Build a Plotly radar subplot for every archetype in a position."""
    best_k  = results["best_k"]
    quals   = results["qualities"]
    centers = results["km_centers"]

    # Use abbreviated names for theta axis
    quals_short = [_QUALITY_SHORT.get(q, q) for q in quals]

    fig = make_subplots(
        rows=1, cols=best_k,
        specs=[[{'type': 'polar'}] * best_k],
        subplot_titles=[archetype_names[i] for i in range(best_k)],
        horizontal_spacing=0.08,
    )

    for aid in range(best_k):
        vals = list(centers[aid])
        vals_closed = vals + [vals[0]]
        dims_closed = quals_short + [quals_short[0]]

        fig.add_trace(
            go.Scatterpolar(
                r=vals_closed,
                theta=dims_closed,
                fill='toself',
                fillcolor=ARCHETYPE_COLORS[aid % len(ARCHETYPE_COLORS)],
                opacity=0.2,
                line=dict(color=ARCHETYPE_COLORS[aid % len(ARCHETYPE_COLORS)], width=2.5),
                name=archetype_names[aid],
            ),
            row=1, col=aid + 1,
        )

    fig.update_polars(radialaxis=dict(range=[-1.5, 1.5], tickfont=dict(size=6)))
    fig.update_layout(
        height=450,
        width=max(300 * best_k, 700),
        showlegend=False,
        template='plotly_white',
        title_text=f'{position} Archetypes — Quality Profiles',
        margin=dict(t=80, b=60),
    )
    fig.show()

print("Helper functions defined.")

Helper functions defined.
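
When `forced_k` is not supplied, `cluster_position` picks the largest k whose smallest K-Means cluster still holds at least 8% of the rows, and falls back to the smallest candidate k if none qualifies. A minimal sketch of that rule in isolation, with made-up cluster sizes (the `pick_k` helper and the numbers are illustrative, not part of the notebook):

```python
def pick_k(sizes_by_k, min_share_pct=8.0):
    """Return the largest k whose smallest cluster holds >= min_share_pct of rows.

    sizes_by_k maps each candidate k to its list of cluster sizes.
    Falls back to the smallest candidate k when no k passes the check.
    """
    valid = [
        k for k, sizes in sizes_by_k.items()
        if min(sizes) / sum(sizes) * 100 >= min_share_pct
    ]
    return max(valid) if valid else min(sizes_by_k)


# Made-up cluster sizes for 1,000 rows
toy_sizes = {
    3: [400, 350, 250],
    4: [300, 300, 250, 150],
    5: [300, 300, 250, 100, 50],  # smallest cluster is 5% -> rejected
}
print(pick_k(toy_sizes))  # 4
```

Every position in this notebook is clustered with an explicit `forced_k`, so this automatic rule is bypassed in the cells that follow.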


## Central Defenders: From Ball-Players to Destroyers

The centre-back position has changed more than almost any other since the late 2000s. Pep Guardiola’s Barcelona demanded centre-backs who could pass like midfielders; Antonio Conte’s systems wanted aggressive markers who pressed high and won duels. Today, a "Central Defender" could be:

- A **ball-playing CB** who starts attacks from the back (think Virgil van Dijk or John Stones)
- An **aerial colossus** who dominates set pieces and crosses
- A **no-nonsense defender** who prioritises keeping it simple and staying solid

Let’s see what the data reveals.

In [18]:
cb_data = df[df["from_position"] == "Central Defender"].copy()
cb_results = cluster_position(cb_data, POSITION_QUALITIES["Central Defender"], forced_k=4)
cb_df, cb_names = show_position_results(cb_data, cb_results, "Central Defender")
plotly_radar_grid(cb_results, cb_names, "Central Defender")


  CENTRAL DEFENDER: 4 ARCHETYPES IDENTIFIED

  Ball-Playing CB  (2,975 player-seasons, 15%)
  Examples: H. Maguire, J. Tarkowski, V. van Dijk, C. Basham
    ^ Active defence: +0.63
    ^ Aerial threat: +0.40
    ^ Box threat: +0.60
    ^ Composure: +0.44
    ^ Defensive heading: +0.38
    ^ Dribbling: +0.51
    ^ Effectiveness: +0.34
    ^ Intelligent defence: +0.55
    ^ Involvement: +0.89
    ^ Passing quality: +1.10
    ^ Pressing: +0.53
    ^ Progression: +1.28
    ^ Providing teammates: +0.77
    ^ Run quality: +0.73
    ^ Territorial dominance: +0.57

  No-Nonsense CB (Passing)  (4,531 player-seasons, 23%)
  Examples: B. Mee, L. Dunk, M. Guéhi, T. Mings, Gabriel Magalhaes
    v Defensive heading: -0.41
    ^ Passing quality: +0.45
    ^ Progression: +0.55

  No-Nonsense CB (Effectiveness)  (6,057 player-seasons, 31%)
  Examples: J. Tarkowski, M. Kilman, C. Coady, M. Guéhi, I. Zabarnyi
    v Active defence: -0.45
    v Composure: -0.30
    v Intelligent defence: -0.38
    v Invol

## Full Backs: The Most Evolved Position in Modern Football

No position in football has changed as dramatically as the full back. In the 1990s, a full back’s job was simple: defend the flank, overlap occasionally, put in a cross. Today, the role is unrecognisable:

- **Inverted full backs** tuck inside like midfielders (João Cancelo under Guardiola)
- **Attacking wingbacks** function almost as wingers (Trent Alexander-Arnold, Achraf Hakimi)
- **Defensive full backs** still exist: solid, positional, no-frills (César Azpilicueta)
- **Creative full backs** provide the primary chance creation from wide areas

This diversity makes full backs one of the most interesting positions for archetype analysis.

In [19]:
fb_data = df[df["from_position"] == "Full Back"].copy()
fb_results = cluster_position(fb_data, POSITION_QUALITIES["Full Back"], forced_k=3)
fb_df, fb_names = show_position_results(fb_data, fb_results, "Full Back")
plotly_radar_grid(fb_results, fb_names, "Full Back")


  FULL BACK: 3 ARCHETYPES IDENTIFIED

  High-Energy Full Back  (5,799 player-seasons, 33%)
  Examples: M. Targett, K. Trippier, César Azpilicueta, A. Robinson, T. Mitchell
    ^ Active defence: +0.68
    ^ Aerial threat: +0.34
    ^ Defensive heading: +0.53
    ^ Intelligent defence: +0.59
    ^ Involvement: +0.45
    ^ Pressing: +0.50
    ^ Winning duels: +0.41

  Balanced Full Back  (7,643 player-seasons, 43%)
  Examples: G. Baldock, M. Cash, M. Kerkez, C. Taylor, M. Aarons
    v Active defence: -0.33
    v Intelligent defence: -0.31
    v Involvement: -0.39
    v Passing quality: -0.37
    v Progression: -0.35

  Creative Full Back  (4,322 player-seasons, 24%)
  Examples: A. Robertson, E. Stevens, B. Chilwell, D. Muñoz, João Cancelo
    ^ Box threat: +0.76
    ^ Dribbling: +0.58
    ^ Effectiveness: +0.36
    ^ Finishing: +0.33
    ^ Hold-up play: +0.41
    ^ Involvement: +0.36
    ^ Passing quality: +0.87
    ^ Progression: +0.71
    ^ Providing teammates: +0.91
    ^ Run quality:

## Midfielders: The Engine Room

Midfield is the broadest positional category in football. It contains everything from purely defensive shields (Casemiro at his best) to creative maestros (Kevin De Bruyne) to tireless box-to-box runners (N’Golo Kanté).

The key qualities for midfielders span the full spectrum:
- **Defensive** midfielders score high on Active defence, Pressing, Winning duels
- **Deep playmakers** score high on Passing quality, Composure, Involvement
- **Creative playmakers** score high on Providing teammates, Progression
- **Goal-threat midfielders** stand out on Box threat, Finishing, Run quality

Note: We exclude Chance prevention, Poaching, and Territorial dominance for midfielders — these qualities have insufficient data coverage at this position.
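
The profiles above are turned into labels by threshold rules applied to each cluster centre, checked in order with the first match winning. The sketch below reproduces a simplified subset of the midfielder rules from `name_archetype` and applies them to the standout values this notebook reports for its Deep Playmaker cluster. Qualities not listed in that output are treated as 0 here, which is a simplifying assumption.

```python
def name_midfielder(scores):
    """Simplified subset of the notebook's midfielder naming rules."""

    def get(quality):
        # Qualities absent from the dict count as 0 (average)
        return scores.get(quality, 0.0)

    # Rules are checked top to bottom; the first match wins
    if (get("Active defence") > 0.4
            and get("Winning duels") > 0.3
            and get("Providing teammates") < 0.1):
        return "Destroyer/Anchor"
    if get("Passing quality") > 0.5 and get("Composure") > 0.3:
        return "Deep Playmaker"
    if get("Box threat") > 0.3 and get("Finishing") > 0.3:
        return "Goal-Threat Midfielder"
    return "Balanced Midfielder"


# Standout centre values printed for the Deep Playmaker cluster
deep_playmaker_centre = {
    "Composure": 0.39,
    "Involvement": 0.61,
    "Passing quality": 0.76,
    "Progression": 0.92,
    "Providing teammates": 0.34,
}
print(name_midfielder(deep_playmaker_centre))  # Deep Playmaker

# Hypothetical centre that satisfies two rules: the earlier rule takes priority
hybrid_centre = {
    "Active defence": 0.6,
    "Winning duels": 0.5,
    "Passing quality": 0.7,
    "Composure": 0.4,
}
print(name_midfielder(hybrid_centre))  # Destroyer/Anchor
```

Because the order of the rules decides ties, a centre that is strong in several families gets the label of whichever rule comes first, which is worth keeping in mind when reading the archetype names.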

In [20]:
mid_data = df[df["from_position"] == "Midfielder"].copy()
mid_results = cluster_position(mid_data, POSITION_QUALITIES["Midfielder"], forced_k=5)
mid_df, mid_names = show_position_results(mid_data, mid_results, "Midfielder")
plotly_radar_grid(mid_results, mid_names, "Midfielder")


  MIDFIELDER: 5 ARCHETYPES IDENTIFIED

  Balanced Midfielder (Intelligent)  (7,589 player-seasons, 29%)
  Examples: L. Milivojević, J. Cork, M. Sissoko, J. Ward-Prowse, A. Westwood
    v Box threat: -0.50
    v Dribbling: -0.38
    v Hold-up play: -0.31
    v Passing quality: -0.55
    v Progression: -0.52
    v Providing teammates: -0.53
    v Run quality: -0.49

  Deep Playmaker  (5,115 player-seasons, 20%)
  Examples: A. Westwood, Y. Tielemans, J. McGinn, Bruno Guimarães
    ^ Composure: +0.39
    ^ Involvement: +0.61
    ^ Passing quality: +0.76
    ^ Progression: +0.92
    ^ Providing teammates: +0.34

  Balanced Midfielder (Run)  (4,999 player-seasons, 19%)
  Examples: A. Doucouré, G. Wijnaldum, G. Sigurdsson, C. Gallagher, D. Rice
    v Active defence: -0.63
    v Aerial threat: -0.39
    ^ Box threat: +0.35
    v Defensive heading: -0.50
    v Intelligent defence: -0.62
    v Involvement: -0.42
    v Pressing: -0.54
    v Winning duels: -0.44

  Goal-Threat Midfielder  (2,844 

## Wingers: Inside Forwards, Creators, and Dribblers

The traditional winger — chalk on boots, hugging the touchline, putting in crosses — has largely disappeared from elite football. In its place, we have:

- **Inside forwards** who cut in from the flank to shoot (Mohamed Salah, Kylian Mbappé)
- **Creative wingers** who stay wide and deliver the final ball (Bukayo Saka in creative mode)
- **Explosive dribblers** who beat their man and create chaos (Vinícius Jr.)
- **Pressing wingers** who do the dirty work out of possession

The inside forward has become the dominant attacking type at the elite level, but creative and pressing wingers remain essential for team balance.

In [21]:
wing_data = df[df["from_position"] == "Winger"].copy()
wing_results = cluster_position(wing_data, POSITION_QUALITIES["Winger"], forced_k=5)
wing_df, wing_names = show_position_results(wing_data, wing_results, "Winger")
plotly_radar_grid(wing_results, wing_names, "Winger")


  WINGER: 5 ARCHETYPES IDENTIFIED

  Versatile Winger (Defensive)  (3,324 player-seasons, 24%)
  Examples: J. Bowen, R. Pereyra, A. Elanga, T. Walcott, C. Hudson-Odoi
    v Box threat: -0.44
    v Dribbling: -0.41
    v Hold-up play: -0.35
    v Involvement: -0.43
    v Passing quality: -0.55
    v Progression: -0.51
    v Providing teammates: -0.49
    v Run quality: -0.34

  Creative Winger  (1,863 player-seasons, 14%)
  Examples: Mohamed Salah, B. Saka, Felipe Anderson, Gabriel Martinelli
    ^ Box threat: +0.73
    ^ Composure: +0.36
    ^ Dribbling: +0.78
    ^ Effectiveness: +0.35
    ^ Finishing: +0.36
    ^ Hold-up play: +0.53
    ^ Involvement: +0.74
    ^ Passing quality: +1.24
    ^ Progression: +1.13
    ^ Providing teammates: +1.18
    ^ Run quality: +0.72

  Inside Forward  (2,472 player-seasons, 18%)
  Examples: Mohamed Salah, S. Mané, H. Barnes
    ^ Aerial threat: +0.32
    ^ Box threat: +0.86
    ^ Effectiveness: +0.40
    ^ Finishing: +0.76
    ^ Poaching: +0.85
   

## Strikers: Poachers, Target Men, and Playmakers

The number 9 role is among the most debated in football. For years, elite teams tried to play without a traditional striker (the "false nine" era). But the pendulum has swung back: Erling Haaland’s arrival at Manchester City showed that a pure goal machine can fit into even the most possession-based systems.

Today’s strikers broadly fall into:

- **Clinical finishers** who convert chances at elite rates (Haaland, Robert Lewandowski)
- **Target men** who use physicality and aerial prowess (Romelu Lukaku, Alexander Isak)
- **Playmaking strikers** who drop deep and create (Harry Kane, Lautaro Martínez)
- **Pressing forwards** who lead the press from the front (Roberto Firmino, Kai Havertz)
- **Mobile strikers** who run the channels (Jamie Vardy-style)

In [22]:
st_data = df[df["from_position"] == "Striker"].copy()
st_results = cluster_position(st_data, POSITION_QUALITIES["Striker"], forced_k=4)
st_df, st_names = show_position_results(st_data, st_results, "Striker")
plotly_radar_grid(st_results, st_names, "Striker")


  STRIKER: 4 ARCHETYPES IDENTIFIED

  Playmaking Striker  (1,644 player-seasons, 16%)
  Examples: H. Kane, R. Jiménez
    ^ Active defence: +0.49
    v Aerial threat: -0.37
    ^ Composure: +0.46
    v Defensive heading: -0.33
    ^ Dribbling: +0.63
    ^ Intelligent defence: +0.42
    ^ Involvement: +0.78
    ^ Passing quality: +1.20
    v Poaching: -0.36
    ^ Pressing: +0.37
    ^ Progression: +1.19
    ^ Providing teammates: +0.95
    ^ Run quality: +0.44

  Clinical Finisher  (2,518 player-seasons, 24%)
  Examples: O. Watkins, P. Bamford, J. Vardy, M. Antonio
    ^ Box threat: +0.90
    ^ Effectiveness: +0.46
    ^ Finishing: +0.69
    ^ Poaching: +0.71
    ^ Run quality: +0.42

  Complete Forward (Defensive)  (2,939 player-seasons, 28%)
  Examples: A. Mitrović, I. Toney, D. Solanke, T. Deeney
    ^ Active defence: +0.39
    ^ Aerial threat: +0.67
    ^ Defensive heading: +0.62
    ^ Hold-up play: +0.30
    ^ Intelligent defence: +0.32
    ^ Pressing: +0.36
    ^ Winning duels: +

## The Full Picture

Now let’s combine every position’s archetypes into a single summary. This gives us a bird’s-eye view of the archetype landscape across modern football.
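
The summary reports each archetype's share within its own position, not of the whole dataset. A minimal sketch of that arithmetic, using the Full Back counts printed earlier in this notebook (the `shares` helper is illustrative):

```python
def shares(counts):
    """Convert archetype counts into whole-number percentage shares of the position total."""
    total = sum(counts.values())
    return {name: round(n / total * 100) for name, n in counts.items()}


# Full Back counts as printed in the clustering output
fb_counts = {
    "High-Energy Full Back": 5799,
    "Balanced Full Back": 7643,
    "Creative Full Back": 4322,
}
print(sum(fb_counts.values()))  # 17764
print(shares(fb_counts))
```

The position total used as the denominator counts only player-seasons that received an archetype, so it can be smaller than the raw position count when rows with missing quality scores were dropped before clustering.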

In [23]:
all_results = []
position_summaries = []

for pos, (pos_df, arch_names) in [
    ("Central Defender", (cb_df, cb_names)),
    ("Full Back",        (fb_df, fb_names)),
    ("Midfielder",       (mid_df, mid_names)),
    ("Winger",           (wing_df, wing_names)),
    ("Striker",          (st_df, st_names)),
]:
    pos_out = pos_df[[
        "wy_player_id", "player_name", "from_position", "from_team_id",
        "from_season", "from_competition", "from_Minutes",
        "archetype_id", "archetype"
    ]].copy()

    # Attach GMM probabilities
    for col in pos_df.columns:
        if col.startswith("prob_"):
            pos_out[col] = pos_df[col]

    all_results.append(pos_out)

    for aid, name in arch_names.items():
        n = (pos_df["archetype_id"] == aid).sum()
        position_summaries.append({
            "position": pos, "archetype": name, "n_players": n
        })

final_df = pd.concat(all_results, ignore_index=True)
summary  = pd.DataFrame(position_summaries)

print(f"Total player-seasons with archetypes: {len(final_df):,}\n")
print(f"Archetypes per position:")
for pos in POSITIONS:
    pos_archs = summary[summary["position"] == pos]
    total = pos_archs["n_players"].sum()
    print(f"\n  {pos} ({total:,} player-seasons):")
    for _, row in pos_archs.iterrows():
        pct = row["n_players"] / total * 100
        print(f"    {row['archetype']:35s}  {row['n_players']:>6,}  ({pct:.0f}%)")

Total player-seasons with archetypes: 87,692

Archetypes per position:

  Central Defender (19,739 player-seasons):
    Ball-Playing CB                       2,975  (15%)
    No-Nonsense CB (Passing)              4,531  (23%)
    No-Nonsense CB (Effectiveness)        6,057  (31%)
    Aggressive Presser CB                 6,176  (31%)

  Full Back (17,764 player-seasons):
    High-Energy Full Back                 5,799  (33%)
    Balanced Full Back                    7,643  (43%)
    Creative Full Back                    4,322  (24%)

  Midfielder (26,148 player-seasons):
    Balanced Midfielder (Intelligent)     7,589  (29%)
    Deep Playmaker                        5,115  (20%)
    Balanced Midfielder (Run)             4,999  (19%)
    Goal-Threat Midfielder                2,844  (11%)
    Destroyer/Anchor                      5,601  (21%)

  Winger (13,635 player-seasons):
    Versatile Winger (Defensive)          3,324  (24%)
    Creative Winger                       1,863  (14%)
  

In [24]:
# Stacked bar chart: archetype distribution per position
fig = go.Figure()

all_archetypes_added = set()

for pos in POSITIONS:
    pos_archs = summary[summary["position"] == pos].sort_values("n_players", ascending=False)
    for i, (_, row) in enumerate(pos_archs.iterrows()):
        show_legend = row["archetype"] not in all_archetypes_added
        all_archetypes_added.add(row["archetype"])

        fig.add_trace(go.Bar(
            x=[pos],
            y=[row["n_players"]],
            name=row["archetype"],
            marker_color=ARCHETYPE_COLORS[i % len(ARCHETYPE_COLORS)],
            showlegend=show_legend,
            legendgroup=row["archetype"],
            hovertemplate=f"{row['archetype']}: {row['n_players']:,}<extra></extra>",
        ))

fig.update_layout(
    barmode='stack',
    height=500, width=950,
    template='plotly_white',
    title_text="Archetype Distribution by Position",
    yaxis_title="Player-Seasons",
    xaxis_title="Position",
    legend=dict(font=dict(size=9), x=1.02, y=1),
)
fig.show()

## Famous Players: Role Evolution Over Time

One of the most powerful features of GMM-based archetypes is that we can track how a player’s role *evolves* across seasons. A player is not just "a Clinical Finisher": they might be 70% Clinical Finisher in one season, then shift towards 50% Clinical Finisher / 30% Playmaking Striker the next.

This can capture real tactical evolution, for example:
- **Harry Kane** moving from a pure finisher towards a playmaking striker over time
- **Trent Alexander-Arnold** evolving from an overlapping full back to a creative/inverted role
- **Kevin De Bruyne** shifting between deep playmaker and goal-threat midfielder depending on Guardiola’s setup

Below we plot stacked-bar charts for up to six well-known players (the six from the list with the most seasons in the data), showing how their archetype probabilities change season by season.
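
As a rough illustration of what "role evolution" means numerically, the sketch below takes a small table of made-up probabilities for one hypothetical striker (not real model output) and computes the season-on-season change per archetype in percentage points, plus the dominant archetype each season. The column names follow the `prob_` convention used in this notebook; the numbers are assumptions for the example only.

```python
import pandas as pd

# Made-up probabilities for one hypothetical striker (NOT real model output)
seasons = pd.DataFrame({
    "from_season": [2021, 2022, 2023],
    "prob_Clinical Finisher": [0.70, 0.50, 0.45],
    "prob_Playmaking Striker": [0.10, 0.30, 0.40],
    "prob_Complete Forward (Finishing)": [0.20, 0.20, 0.15],
}).set_index("from_season")

# Season-on-season change per archetype, in percentage points
delta_pp = seasons.diff().mul(100).round(1)

# Most probable archetype in each season
dominant = seasons.idxmax(axis=1).str.replace("prob_", "", regex=False)

print(delta_pp)
print(dominant)
```

In this toy case the dominant label never changes, yet the probabilities show a steady drift towards the playmaking role, which is exactly the information a hard label would hide.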

In [25]:
# Famous players - direct PID lookup (no fuzzy name matching)
FAMOUS_PLAYERS = {
    "Mohamed Salah": {"pid": 120353, "position": "Winger"},
    "E. Haaland": {"pid": 427097, "position": "Striker"},
    "K. De Bruyne": {"pid": 38021, "position": "Midfielder"},
    "V. van Dijk": {"pid": 370, "position": "Central Defender"},
    "T. Alexander-Arnold": {"pid": 346101, "position": "Full Back"},
    "B. Saka": {"pid": 520291, "position": "Winger"},
    "H. Kane": {"pid": 8717, "position": "Striker"},
    "Son Heung-Min": {"pid": 14911, "position": "Winger"},
    "C. Palmer": {"pid": 522051, "position": "Midfielder"},
    "Rodri": {"pid": 364860, "position": "Midfielder"},
    "D. Rice": {"pid": 379209, "position": "Midfielder"},
    "Vinícius Júnior": {"pid": 493295, "position": "Winger"},
    "K. Mbappé": {"pid": 353833, "position": "Striker"},
}

found_players = []
for display_name, info in FAMOUS_PLAYERS.items():
    pid = info["pid"]
    pos = info["position"]
    matches = final_df[final_df["wy_player_id"] == pid]
    if matches.empty:
        print(f"  NOT FOUND: {display_name} (pid={pid})")
        continue
    n_seasons = matches["from_season"].nunique()
    if n_seasons < 2:
        # A single season cannot show any evolution, so report and skip
        print(f"  SKIPPED: {display_name} (pid={pid}, only {n_seasons} season)")
        continue
    found_players.append({
        "name": display_name, "pid": pid,
        "position": pos, "n_seasons": n_seasons,
    })
    print(f"  Found: {display_name} (pid={pid}, {pos}, {n_seasons} seasons)")

# Pick the 6 players with the most seasons (stable sort makes tie-breaks deterministic)
found_df = pd.DataFrame(found_players).drop_duplicates(subset="pid")
found_df = found_df.sort_values("n_seasons", ascending=False, kind="stable")
showcase_players = found_df.head(6)
n_show = len(showcase_players)

print(f"\nShowcasing {n_show} players for role evolution charts.")

  Found: Mohamed Salah (pid=120353, Winger, 7 seasons)
  Found: E. Haaland (pid=427097, Striker, 6 seasons)
  Found: K. De Bruyne (pid=38021, Midfielder, 6 seasons)
  Found: V. van Dijk (pid=370, Central Defender, 5 seasons)
  Found: T. Alexander-Arnold (pid=346101, Full Back, 6 seasons)
  Found: B. Saka (pid=520291, Winger, 6 seasons)
  Found: H. Kane (pid=8717, Striker, 7 seasons)
  Found: Son Heung-Min (pid=14911, Winger, 7 seasons)
  Found: C. Palmer (pid=522051, Midfielder, 5 seasons)
  Found: Rodri (pid=364860, Midfielder, 5 seasons)
  Found: D. Rice (pid=379209, Midfielder, 7 seasons)
  Found: Vinícius Júnior (pid=493295, Winger, 6 seasons)
  Found: K. Mbappé (pid=353833, Striker, 5 seasons)

Showcasing 6 players for role evolution charts.


In [26]:
# Role probability evolution - stacked bar charts (one subplot per player)
n_show = len(showcase_players)
n_cols = 3
n_rows = max(1, (n_show + n_cols - 1) // n_cols)

fig = make_subplots(
    rows=n_rows, cols=n_cols,
    subplot_titles=showcase_players["name"].tolist(),
    shared_yaxes=True,
    vertical_spacing=0.15,
    horizontal_spacing=0.06,
)

legend_added = set()

for idx, (_, player) in enumerate(showcase_players.iterrows()):
    r = idx // n_cols + 1
    c = idx % n_cols + 1

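    p_data = final_df[final_df["wy_player_id"] == player["pid"]]
    # A mid-season transfer yields two rows for one season, which would stack
    # past 100%. Assumption: keep the club-season with the most minutes.
    p_data = (
        p_data.sort_values("from_Minutes", ascending=False)
        .drop_duplicates(subset="from_season")
        .sort_values("from_season")
    )
    # final_df stacks all positions, so probability columns of other positions
    # are all-NaN for this player; keep only the columns that carry values.
    prob_cols = [
        col for col in p_data.columns
        if col.startswith("prob_") and p_data[col].notna().any()
    ]

    if len(prob_cols) == 0 or len(p_data) == 0:
        continue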

    # Format season labels: "18/19", "19/20", etc.
    season_labels = [
        f"{int(s) % 100:02d}/{(int(s) + 1) % 100:02d}"
        for s in p_data["from_season"].values
    ]

    for i, col_name in enumerate(prob_cols):
        archetype_name = col_name.replace("prob_", "")
        probs = p_data[col_name].values * 100
        color = ARCHETYPE_COLORS[i % len(ARCHETYPE_COLORS)]

        show_legend = archetype_name not in legend_added
        if show_legend:
            legend_added.add(archetype_name)

        # Add percentage text inside bars where space allows (>= 15%)
        text_vals = [f"{v:.0f}%" if v >= 15 else "" for v in probs]

        fig.add_trace(
            go.Bar(
                x=season_labels,
                y=probs,
                name=archetype_name,
                marker_color=color,
                showlegend=show_legend,
                legendgroup=archetype_name,
                text=text_vals,
                textposition="inside",
                textfont=dict(size=9, color="white"),
                hovertemplate=f"{archetype_name}: %{{y:.0f}}%<extra>{player['name']}</extra>",
            ),
            row=r, col=c,
        )

fig.update_layout(
    barmode='stack',
    height=300 * n_rows,
    width=1100,
    template='plotly_white',
    title_text="How Player Roles Evolve Over Seasons (GMM Probabilities)",
    legend=dict(
        orientation="h", yanchor="bottom", y=-0.18,
        xanchor="center", x=0.5, font=dict(size=9),
    ),
)
fig.update_yaxes(range=[0, 100], title_text="Role probability (%)", col=1)
fig.update_xaxes(tickangle=45, tickfont=dict(size=8))
fig.show()

## K-Means vs. GMM: Which Approach Is Better?

Both methods are included for the supervisor to review. Here is a concise comparison:

| Aspect | K-Means | GMM |
|--------|---------|-----|
| **Assignment** | Hard: each player belongs to exactly one archetype | Soft: each player has a probability for every archetype |
| **Shape of archetypes** | Spherical (equal weight in all directions) | Ellipsoidal (can capture correlations between qualities) |
| **Interpretability** | Simpler: "Player X is a Ball-Playing CB" | Richer: "Player X is 72% Ball-Playing CB, 20% Aggressive Presser CB" |
| **For transfer modelling** | Easy to use as a categorical feature | Probability columns can be used directly as continuous features |
| **Robustness** | More stable with small clusters | Can overfit with too many components or small data |

**Recommendation**: Use GMM probabilities as features in the transfer model. The soft assignment captures the reality that most players are *blends* of archetypes, not pure types. The K-Means labels are useful for communication and labelling (e.g., dashboard filters).
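
To make the hard/soft distinction concrete, here is a minimal sketch on synthetic data (not the thesis dataset). The two overlapping blobs, the sample sizes and the `random_state` values are assumptions chosen for illustration; `covariance_type="full"` is the scikit-learn setting that allows ellipsoidal components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for standardised quality scores (assumption: 2 qualities,
# 2 overlapping archetypes). This is NOT the thesis data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-1.5, 0.0], scale=1.0, size=(300, 2)),
    rng.normal(loc=[1.5, 0.0], scale=1.0, size=(300, 2)),
])

# K-Means: one hard label per row
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = km.labels_                       # shape (600,)

# GMM: one probability per archetype, each row summing to 1
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)              # shape (600, 2)

# A point halfway between the two blobs is a genuine "blend"
midpoint = np.array([[0.0, 0.0]])
print("K-Means label at midpoint:", km.predict(midpoint)[0])
print("GMM probabilities at midpoint:", gmm.predict_proba(midpoint).round(2)[0])
```

K-Means has to commit the midpoint to one cluster, while the GMM reports how the probability is split between the two, which is the behaviour the recommendation relies on.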

## Export

Save the enriched dataset with archetype labels and GMM probabilities for downstream use in the transfer model.

In [27]:
out_path = PROCESSED / "player_archetypes" / "player_archetypes.parquet"
out_path.parent.mkdir(parents=True, exist_ok=True)  # to_parquet does not create missing folders
final_df.to_parquet(out_path, index=False)

print(f"Saved: {out_path}")
print(f"Shape: {final_df.shape}")
print("\nColumns:")
for c in final_df.columns:
    print(f"  {c}")
print("\nSample archetype distribution:")
print(final_df["archetype"].value_counts().head(10))

Saved: /Users/jorgepadilla/Documents/Documents - Jorge’s MacBook Air/thesis_data/raw_data/Transfers/player_archetypes.parquet
Shape: (87692, 30)

Columns:
  wy_player_id
  player_name
  from_position
  from_team_id
  from_season
  from_competition
  from_Minutes
  archetype_id
  archetype
  prob_Ball-Playing CB
  prob_No-Nonsense CB (Passing)
  prob_No-Nonsense CB (Effectiveness)
  prob_Aggressive Presser CB
  prob_High-Energy Full Back
  prob_Balanced Full Back
  prob_Creative Full Back
  prob_Balanced Midfielder (Intelligent)
  prob_Deep Playmaker
  prob_Balanced Midfielder (Run)
  prob_Goal-Threat Midfielder
  prob_Destroyer/Anchor
  prob_Versatile Winger (Defensive)
  prob_Creative Winger
  prob_Inside Forward
  prob_Versatile Winger (Progression)
  prob_Pressing Winger
  prob_Playmaking Striker
  prob_Clinical Finisher
  prob_Complete Forward (Defensive)
  prob_Complete Forward (Finishing)

Sample archetype distribution:
archetype
Balanced Full Back                   7643
Balanced
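
For downstream use, note that the exported table stacks all positions, so a row only has values in the `prob_` columns of its own position and the rest are missing. The sketch below shows one possible way to turn the export into model features, using a toy frame with made-up values. Filling the missing probabilities with 0 is an assumption, not something the notebook prescribes, and the one-hot prefix `is` is a hypothetical naming choice.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the exported schema (values are made up for illustration)
toy = pd.DataFrame({
    "wy_player_id": [1, 2, 3],
    "from_position": ["Striker", "Striker", "Full Back"],
    "archetype": ["Clinical Finisher", "Playmaking Striker", "Creative Full Back"],
    "prob_Clinical Finisher": [0.70, 0.25, np.nan],
    "prob_Playmaking Striker": [0.30, 0.75, np.nan],
    "prob_Creative Full Back": [np.nan, np.nan, 1.00],
})

prob_cols = [c for c in toy.columns if c.startswith("prob_")]

# Option A (soft): probabilities as continuous features.
# Assumption: an archetype from another position gets probability 0.
soft_features = toy[prob_cols].fillna(0.0)

# Option B (hard): the single label stored in `archetype`, one-hot encoded
hard_features = pd.get_dummies(toy["archetype"], prefix="is", dtype=int)

print(soft_features)
print(hard_features)
```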

## Assumptions & Methodological Notes

| # | Assumption | Rationale |
|---|-----------|-----------|
| 1 | **Division 1 only** | We restrict to top-flight leagues to ensure quality scores are comparable. Second-division players face different opposition quality. |
| 2 | **No goalkeepers** | Goalkeepers have a fundamentally different quality profile. They are excluded entirely. |
| 3 | **20 Twelve Football quality scores** | These are the core player evaluation dimensions provided by Twelve Football’s framework. |
| 4 | **Position-specific quality exclusions** | Qualities with <80% data coverage for a position are excluded to avoid distortion from missing data (e.g., Poaching excluded for CBs). |
| 5 | **Both K-Means and GMM presented** | K-Means gives clean labels; GMM gives richer probabilities. Both are included for the supervisor to decide which to use downstream. |
| 6 | **Number of archetypes auto-selected** | For each position, we test k=3 to 7 and pick the largest k where the smallest archetype still contains at least 8% of players. This avoids micro-clusters. |
| 7 | **De-duplication** | First occurrence per (player, team, season) is kept. A player who transferred mid-season appears once per club-season. |
| 8 | **Minimum 500 minutes** | Already enforced in the source data — only player-seasons with at least 500 minutes are included. |
| 9 | **Re-standardisation before clustering** | Quality scores are standardised (mean=0, std=1) within each position group before clustering, so that all qualities contribute equally. |
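
Assumptions 4, 6, 7 and 9 describe small algorithms, so a sketch may help. The code below is an illustration on synthetic data, not the notebook's actual implementation. The helper names `prepare_position` and `select_k`, the quality names other than Poaching, dropping rows with missing kept qualities, measuring cluster shares on K-Means labels, and falling back to the smallest k when nothing passes are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def prepare_position(df, quality_cols, min_coverage=0.80):
    """Hypothetical helper covering assumptions 7 and 4 for one position group."""
    # 7: keep the first row per (player, team, season)
    df = df.drop_duplicates(subset=["wy_player_id", "from_team_id", "from_season"], keep="first")
    # 4: keep only qualities with at least `min_coverage` non-missing values
    coverage = df[quality_cols].notna().mean()
    kept = coverage[coverage >= min_coverage].index.tolist()
    # Assumption: rows still missing a kept quality are dropped before clustering
    return df.dropna(subset=kept), kept

def select_k(X, k_range=range(3, 8), min_share=0.08, random_state=0):
    """Hypothetical helper covering assumptions 9 and 6."""
    X_std = StandardScaler().fit_transform(X)           # 9: mean 0, std 1 per quality
    best_k, shares = None, {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_std)
        shares[k] = np.bincount(labels, minlength=k).min() / len(labels)
        if shares[k] >= min_share:
            best_k = k                                   # 6: keep the largest k that passes
    # Assumption: fall back to the smallest k if no candidate passes
    return (best_k if best_k is not None else min(k_range)), shares

# Synthetic demo data (NOT the thesis data)
rng = np.random.default_rng(42)
n = 1500
cols = ["Passing", "Aerial", "Pressing", "Poaching"]
demo = pd.DataFrame(rng.normal(size=(n, 4)), columns=cols)
demo.loc[rng.random(n) < 0.6, "Poaching"] = np.nan      # roughly 40% coverage, so excluded
demo["wy_player_id"] = np.arange(n)
demo["from_team_id"] = 1
demo["from_season"] = 2022
demo = pd.concat([demo, demo.iloc[:50]], ignore_index=True)  # inject 50 duplicate rows

clean, kept = prepare_position(demo, cols)
k, shares = select_k(clean[kept].to_numpy())
print("rows after de-duplication:", len(clean))
print("qualities kept:", kept)
print("chosen k:", k, {kk: round(float(s), 3) for kk, s in shares.items()})
```

Iterating upward and overwriting `best_k` is what implements "largest k that still passes": a smaller k that passes is replaced whenever a larger one also passes.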

In [28]:
# Assumptions checklist - verify key filters
print("ASSUMPTIONS CHECKLIST")
print("=" * 50)

failures = []

def report(ok, message):
    """Print one checklist line and record the message if the check failed."""
    print(f"[{'OK' if ok else 'FAIL'}] {message}")
    if not ok:
        failures.append(message)

# 1. Division 1 only (this only counts competitions; it does not itself check the division)
n_comps = final_df["from_competition"].nunique()
report(n_comps > 0, f"Division 1 only: {n_comps} competitions")

# 2. No goalkeepers
gk_count = (final_df["from_position"] == "Goalkeeper").sum()
report(gk_count == 0, f"No goalkeepers: {gk_count} goalkeeper rows (should be 0)")

# 3. Positions covered
for pos in POSITIONS:
    pos_rows = final_df[final_df["from_position"] == pos]
    n, n_arch = len(pos_rows), pos_rows["archetype"].nunique()
    report(n > 0 and n_arch > 0, f"{pos}: {n:,} player-seasons, {n_arch} archetypes")

# 4. GMM probabilities present
prob_cols = [c for c in final_df.columns if c.startswith("prob_")]
report(len(prob_cols) > 0, f"GMM probability columns: {len(prob_cols)}")

# 5. Season range
seasons = sorted(final_df['from_season'].unique())
report(len(seasons) > 0, f"Seasons: {seasons}")

print("\nAll checks passed." if not failures else f"\n{len(failures)} check(s) failed.")

ASSUMPTIONS CHECKLIST
[OK] Division 1 only: 128 competitions
[OK] No goalkeepers: 0 goalkeeper rows (should be 0)
[OK] Central Defender: 19,739 player-seasons, 4 archetypes
[OK] Full Back: 17,764 player-seasons, 3 archetypes
[OK] Midfielder: 26,148 player-seasons, 5 archetypes
[OK] Winger: 13,635 player-seasons, 5 archetypes
[OK] Striker: 10,406 player-seasons, 4 archetypes
[OK] GMM probability columns: 21
[OK] Seasons: [np.int16(2018), np.int16(2019), np.int16(2020), np.int16(2021), np.int16(2022), np.int16(2023), np.int16(2024), np.int16(2025)]

All checks passed.
