# Decoding Team Tactical DNA

Every football team has a tactical fingerprint — a unique combination of how they defend, transition, attack, and create chances. By mapping six dimensions of playing style, we can identify the **tactical families** of world football.

This analysis covers **17,000+ team-seasons** across **173 top-flight leagues** worldwide (2014–2026). Only first-division leagues are included to keep the comparison meaningful.

**Why this matters for transfers:** When a player moves clubs, the tactical environment changes. Understanding a team’s tactical DNA lets us predict how well a player fits their new team’s style.

In [12]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import Patch, FancyBboxPatch
from matplotlib.lines import Line2D
import matplotlib.patheffects as pe
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# -- Style --
plt.rcParams.update({
    'figure.facecolor': '#FAFAFA',
    'axes.facecolor': '#FAFAFA',
    'axes.grid': False,
    'font.family': 'sans-serif',
    'font.size': 11,
    'axes.titlesize': 14,
    'axes.titleweight': 'bold',
})

FAMILY_COLORS = ['#1B4F72', '#922B21', '#1E8449', '#B7950B', '#6C3483']

# -- Paths --
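docs = Path("/Users/jorgepadilla/Documents")
RAW = PROCESSED = None
for d in docs.iterdir():
    if "Jorge" in d.name and "MacBook" in d.name and d.is_dir():
        RAW = d / "thesis_data" / "raw_data"
        PROCESSED = d / "thesis_data" / "processed_data"
        break
# Fail early with a clear message instead of a NameError further down
if RAW is None:
    raise FileNotFoundError(f"No thesis data folder found under {docs}")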

ts = pd.read_parquet(RAW / "Teams_stats" / "team_stats_season.parquet")
comps = pd.read_parquet(RAW / "Wyscout" / "competitions_wyscout.parquet")

div1_ids = comps[comps["division"] == 1]["competition_id"].unique()
ts = ts[ts["competition_id"].isin(div1_ids)].copy()

print(f"Loaded: {len(ts):,} team-seasons from {ts.competition_id.nunique()} top-flight leagues")
print(f"Seasons: {ts.season.min()}–{ts.season.max()}")

Loaded: 17,262 team-seasons from 173 top-flight leagues
Seasons: 2014–2026


## The Six Dimensions of Playing Style

Every team’s tactical DNA is captured by six style dimensions plus one outcome measure. Think of them as the six sides of a radar chart that describes *how* a team plays, separate from *how well* they perform.

> **Important:** These are *style* dimensions, not quality ratings. A low "Chance Creation" score doesn't mean a team creates few chances; it means they create chances through **sustained possession** (like Man City). A high score means they create through **direct, fast attacks** (like Wolves). The same applies to all six dimensions: they measure *how*, not *how well*.

| Dimension | What it measures | High score example | Low score example |
|-----------|------------------|--------------------|-------------------|
| **Defence** | Pressing intensity, defensive line height, where the ball is won | Man City’s high press | Burnley sitting deep |
| **Def. Transition** | Speed of reaction after losing the ball | Liverpool’s counter-press | Teams that retreat slowly |
| **Att. Transition** | Directness after winning the ball — fast breaks vs. secure possession | Counter-attacking sides | Teams that slow down after recovery |
| **Attack** | How the team builds up — long balls vs. short passes, bypassing midfield | Direct long-ball teams | Possession-based build-up |
| **Penetration** | How the final third is entered — carries vs. crosses | Dribble-heavy sides | Cross-heavy sides |
| **Chance Creation** | How shots are generated — quick direct attacks vs. sustained possession | Fast-break finishers | Patient build-up sides |

A seventh dimension, **Outcome**, measures results (expected points and actual points). It is kept separate so that style clustering is purely about *how* teams play, not how many points they collect.
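
To make the scoring concrete, here is a minimal sketch of how one dimension score is assembled from standardised metrics. It uses the Penetration weights and directions from the cell below; the three teams and their z-scores are hypothetical values chosen for easy arithmetic, and `penetration_spec` is an illustrative name rather than part of the notebook.

```python
import pandas as pd

# Hypothetical z-scores for three teams in one league-season (illustrative only).
z = pd.DataFrame(
    {
        "z_box_entries_from_carries_pct": [1.0, 0.0, -1.0],
        "z_box_entries_from_crosses_pct": [-1.0, 0.0, 1.0],
        "z_crosses_per_final_third_possession": [-0.5, 0.0, 0.5],
    },
    index=["Carry FC", "Average FC", "Cross FC"],
)

# (weight, sign): sign is -1 where a higher raw value points to the low pole.
penetration_spec = {
    "z_box_entries_from_carries_pct": (2.0, +1.0),
    "z_box_entries_from_crosses_pct": (2.0, -1.0),
    "z_crosses_per_final_third_possession": (1.0, -1.0),
}

# Signed, weighted sum of z-scores divided by the total weight.
total_weight = sum(w for w, _ in penetration_spec.values())
penetration = (
    sum(sign * w * z[col] for col, (w, sign) in penetration_spec.items())
    / total_weight
)
print(penetration.round(2))
```

A team that enters the box through carries and rarely crosses lands on the positive side, its mirror image lands on the negative side, and a league-average team sits at zero.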

In [13]:
# -- Define the seven tactical qualities --
QUALITIES = {
    "Defence": {
        "metrics": {
            "defensive_intensity": {"weight": 1.0, "higher_is_better": True},
            "ppda": {"weight": 1.0, "higher_is_better": False},
            "final_third_recoveries_pct": {"weight": 1.0, "higher_is_better": True},
            "defensive_action_height_m": {"weight": 1.0, "higher_is_better": True},
        },
    },
    "Def. Transition": {
        "metrics": {
            "recoveries_within_5s_pct": {"weight": 1.0, "higher_is_better": True},
            "time_to_defensive_action_after_loss_att_half_s": {"weight": 2.0, "higher_is_better": False},
            "time_to_defensive_action_after_loss_own_half_s": {"weight": 1.0, "higher_is_better": False},
        },
    },
    "Att. Transition": {
        "metrics": {
            "possessions_retained_after_5s_pct": {"weight": 0.5, "higher_is_better": False},
            "final_third_entry_within_10s_after_recovery_own_half_pct": {"weight": 0.5, "higher_is_better": True},
            "first_pass_forward_after_recovery_own_half_pct": {"weight": 1.0, "higher_is_better": True},
            "median_time_to_first_forward_pass_own_half_s": {"weight": 0.5, "higher_is_better": False},
        },
    },
    "Attack": {
        "metrics": {
            "long_ball_pct": {"weight": 2.0, "higher_is_better": True},
            "forward_passes_from_middle_third_pct": {"weight": 1.0, "higher_is_better": True},
            "buildups_from_goalkicks_pct": {"weight": 1.0, "higher_is_better": False},
        },
    },
    "Penetration": {
        "metrics": {
            "box_entries_from_carries_pct": {"weight": 2.0, "higher_is_better": True},
            "box_entries_from_crosses_pct": {"weight": 2.0, "higher_is_better": False},
            "crosses_per_final_third_possession": {"weight": 1.0, "higher_is_better": False},
        },
    },
    "Chance Creation": {
        "metrics": {
            "shots_per_final_third_pass": {"weight": 1.0, "higher_is_better": True},
            "shots_from_direct_attacks_pct": {"weight": 2.0, "higher_is_better": True},
            "shots_from_sustained_attacks_pct": {"weight": 2.0, "higher_is_better": False},
        },
    },
    "Outcome": {
        "metrics": {
            "xpts": {"weight": 1.5, "higher_is_better": True},
            "points": {"weight": 1.0, "higher_is_better": True},
        },
    },
}

# -- Compute tactical scores within (competition, season) groups --
all_metrics = []
for q_info in QUALITIES.values():
    all_metrics.extend(q_info["metrics"].keys())
all_metrics = list(set(all_metrics))

# Check which metrics exist
missing = [m for m in all_metrics if m not in ts.columns]
if missing:
    print(f"WARNING -- missing columns: {missing}")

# Z-score within each (competition_id, season) group, min 3 teams
def compute_group_zscores(group, metrics):
    """Standardise metrics within a league-season group."""
    if len(group) < 3:
        return None
    result = group.copy()
    for m in metrics:
        if m in result.columns:
            col = result[m]
            mu, sigma = col.mean(), col.std()
            if sigma > 0:
                result[f"z_{m}"] = (col - mu) / sigma
            else:
                result[f"z_{m}"] = 0.0
    return result

groups = []
for (cid, season), grp in ts.groupby(["competition_id", "season"]):
    out = compute_group_zscores(grp, all_metrics)
    if out is not None:
        groups.append(out)

qdf = pd.concat(groups, ignore_index=True)
print(f"After z-scoring: {len(qdf):,} team-seasons (dropped groups with <3 teams)")

# -- Weighted average per quality --
for quality_name, q_info in QUALITIES.items():
    total_w = 0.0
    qdf[quality_name] = 0.0
    for metric, meta in q_info["metrics"].items():
        zcol = f"z_{metric}"
        if zcol not in qdf.columns:
            continue
        sign = 1.0 if meta["higher_is_better"] else -1.0
        w = meta["weight"]
        qdf[quality_name] += sign * w * qdf[zcol]
        total_w += w
    if total_w > 0:
        qdf[quality_name] /= total_w

style_dims = [q for q in QUALITIES if q != "Outcome"]
print(f"\nStyle dimensions: {style_dims}")

# Display names for radar axes and visualizations (showing both poles)
DISPLAY_NAMES = {
    "Defence": "Low Block \u2192 High Press",
    "Def. Transition": "Regroup \u2192 Counter-Press",
    "Att. Transition": "Build-Up \u2192 Counter-Attack",
    "Attack": "Short Passing \u2192 Direct/Long",
    "Penetration": "Crosses \u2192 Carries",
    "Chance Creation": "Sustained \u2192 Direct Chances",
}

# Short display names (for tight spaces like radar axes)
SHORT_DISPLAY = list(DISPLAY_NAMES.keys())
# Full display names (for heatmap columns, evolution charts)
FULL_DISPLAY = [DISPLAY_NAMES[d] for d in style_dims]
print(f"Sample quality scores:\n{qdf[list(QUALITIES.keys())].describe().round(2)}")

After z-scoring: 17,196 team-seasons (dropped groups with <3 teams)

Style dimensions: ['Defence', 'Def. Transition', 'Att. Transition', 'Attack', 'Penetration', 'Chance Creation']
Sample quality scores:
        Defence  Def. Transition  Att. Transition    Attack  Penetration  \
count  17196.00         17192.00         17196.00  17194.00     17194.00   
mean       0.00            -0.00            -0.00     -0.00        -0.00   
std        0.80             0.70             0.67      0.79         0.78   
min       -4.11            -5.56            -3.24     -2.78        -3.98   
25%       -0.55            -0.45            -0.44     -0.55        -0.54   
50%        0.01             0.04             0.02     -0.00        -0.03   
75%        0.56             0.48             0.46      0.54         0.53   
max        2.88             2.33             2.52      3.18         5.11   

       Chance Creation   Outcome  
count         17185.00  17196.00  
mean              0.00      0.00  
std   

## Reading a Tactical Radar

Each team’s style is plotted on a **radar chart** with six axes — one per dimension. A score of **0** means the team is average *within its own league and season*. Positive values mean the team scores higher on that dimension; negative values mean lower.

This league-relative approach is essential: a pressing team in Albania and a pressing team in England both get high Defence scores — we are comparing *style within context*, not raw numbers across vastly different leagues.
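
The league-relative idea can be sketched on a toy table. The two leagues and their PPDA values below are hypothetical: League A presses more than League B in raw terms, yet the most aggressive presser in each league ends up with the same standardised score.

```python
import pandas as pd

# Hypothetical PPDA values; lower PPDA means more pressing.
toy = pd.DataFrame({
    "competition_id": ["A"] * 3 + ["B"] * 3,
    "season": [2024] * 6,
    "team": ["A1", "A2", "A3", "B1", "B2", "B3"],
    "ppda": [8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
})

# Standardise within each (competition_id, season) group.
grouped = toy.groupby(["competition_id", "season"])["ppda"]
toy["z_ppda"] = (toy["ppda"] - grouped.transform("mean")) / grouped.transform("std")

# Flip the sign so that a higher score means more pressing.
toy["pressing"] = -toy["z_ppda"]
print(toy[["team", "ppda", "pressing"]])
```

Teams A1 and B1 have very different raw PPDA but identical `pressing` scores, because each is measured against its own league-season.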

Let’s look at three very different tactical profiles to see how the radar works.

In [14]:
# -- Three contrasting tactical profiles --
examples = [
    (1625, 2024, 'Man City 24/25', '#2563EB'),
    (1646, 2023, 'Burnley 23/24', '#DC2626'),
    (1614, 2024, 'Aston Villa 24/25', '#7C3AED'),
]

fig = make_subplots(
    rows=1, cols=3,
    specs=[[{'type': 'polar'}] * 3],
    subplot_titles=[e[2] for e in examples],
)

for i, (tid, season, label, color) in enumerate(examples):
    row = qdf[(qdf['team_id'] == tid) & (qdf['season'] == season)]
    if len(row) == 0:
        print(f'WARNING: no data for {label} (team_id={tid}, season={season})')
        continue
    vals = row[style_dims].values[0]
    vals_closed = list(vals) + [vals[0]]
    dims_closed = FULL_DISPLAY + [FULL_DISPLAY[0]]  # repeat the first axis to close the polygon

    fig.add_trace(go.Scatterpolar(
        r=vals_closed,
        theta=dims_closed,
        fill='toself',
        fillcolor=color,
        opacity=0.15,
        line=dict(color=color, width=2.5),
        name=label,
        hovertemplate='%{theta}: %{r:.2f}<extra>' + label + '</extra>',
    ), row=1, col=i + 1)

fig.update_polars(
    radialaxis=dict(range=[-2, 2], showticklabels=True, tickfont=dict(size=9)),
)
fig.update_layout(
    height=420, width=1100, showlegend=False,
    template='plotly_white',
    title_text='Three Ways to Play Football',
    title_font_size=18,
    margin=dict(t=80),
)
fig.show()

## Finding Tactical Families

Now the key question: **are there natural groupings** among these 17,000+ team-seasons?

The approach:
1. Draw every team’s tactical radar.
2. Measure how similar two radars look (comparing the shape of each hexagon).
3. Group teams whose radars look alike into **tactical families**.
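
Step 2 can be illustrated with plain Euclidean distance between two six-number profiles, which is the notion of closeness that standard K-Means works with. This is a sketch only: the three profiles are made up, and `radar_distance` is a hypothetical helper that does not appear in the notebook.

```python
import numpy as np

# Hypothetical profiles, ordered as: Defence, Def. Transition,
# Att. Transition, Attack, Penetration, Chance Creation.
press_a = np.array([1.2, 0.9, -0.3, -0.8, 0.4, -0.6])
press_b = np.array([1.0, 1.1, -0.1, -0.9, 0.2, -0.5])
deep_block = np.array([-1.1, -0.7, 0.6, 0.9, -0.4, 0.8])


def radar_distance(p, q):
    """Euclidean distance between two style profiles (smaller = more alike)."""
    return float(np.linalg.norm(p - q))


print(round(radar_distance(press_a, press_b), 2))
print(round(radar_distance(press_a, deep_block), 2))
```

The two pressing profiles sit close together, while the deep-block profile is far from both, so a distance-based grouping would place the first two in the same family.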

We use K-Means clustering — essentially grouping teams by the similarity of their tactical fingerprint. We test several values of *k* (number of families) and pick **k = 5** because it gives the most interpretable football groupings.
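
The notebook picks *k* on interpretability grounds. As a complementary check, one could also compare a quantitative score across values of *k*. The sketch below uses the silhouette score on synthetic data with three well-separated groups; the data, the variable names (`X_toy`, `scores`, `best_k`), and the choice of metric are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic six-dimensional data around three far-apart centres.
rng = np.random.default_rng(0)
centres = np.array([
    [0.0] * 6,
    [10.0] * 6,
    [-10.0, 10.0, -10.0, 10.0, -10.0, 10.0],
])
X_toy = np.vstack([c + rng.normal(scale=1.0, size=(100, 6)) for c in centres])

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On real style data the clusters overlap far more than in this toy case, so the scores are usually close together and a readable set of families can reasonably take priority over a marginal gain in the metric.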

In [15]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

qdf_clean = qdf.dropna(subset=style_dims).copy()
X_raw = qdf_clean[style_dims].values
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

# -- Test k = 3 to 8 --
print(f"Clustering {len(qdf_clean):,} team-seasons...\n")
for k in range(3, 9):
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    labels = km.fit_predict(X)
    sizes = np.bincount(labels)
    print(f"  k={k}: sizes = {sorted(sizes, reverse=True)}")

# -- k = 5 for football interpretability --
BEST_K = 5
km_final = KMeans(n_clusters=BEST_K, n_init=20, random_state=42)
qdf_clean["family_id"] = km_final.fit_predict(X)

# -- Name each family based on its tactical profile --
def name_family(profile, dims):
    """Assign a football-readable name based on the family average profile."""
    s = {d: profile[d] for d in dims}
    if s['Defence'] > 0.4 and s['Def. Transition'] > 0.3:
        if s['Attack'] < -0.3:
            return "Possession & Press"
        return "Total Football"
    if s['Attack'] > 0.3 and s['Chance Creation'] > 0.3:
        if s['Defence'] < -0.3:
            return "Direct & Reactive"
        return "All-Action Intensity"
    if s['Penetration'] > 0.4:
        return "Progressive Carriers"
    if s['Penetration'] < -0.4:
        return "Wing & Cross"
    if s['Att. Transition'] > 0.3 and s['Defence'] < -0.3:
        return "Counter-Attack"
    return "Pragmatic Mid-Block"

FAMILY_NAMES = {}
family_profiles = {}
for fid in range(BEST_K):
    mask = qdf_clean["family_id"] == fid
    family_profiles[fid] = qdf_clean.loc[mask, style_dims].mean()
    FAMILY_NAMES[fid] = name_family(family_profiles[fid], style_dims)

qdf_clean["family_name"] = qdf_clean["family_id"].map(FAMILY_NAMES)

# -- Summary --
print(f"\n{'Family':<25s} {'N':>6s} {'%':>6s}  Profile")
print("-" * 90)
for fid in range(BEST_K):
    mask = qdf_clean["family_id"] == fid
    n = mask.sum()
    pct = n / len(qdf_clean) * 100
    prof = " | ".join(f"{d}: {family_profiles[fid][d]:+.2f}" for d in style_dims)
    outcome = qdf_clean.loc[mask, "Outcome"].mean()
    print(f"{FAMILY_NAMES[fid]:<25s} {n:>6,d} {pct:>5.1f}%  {prof}  | Outcome: {outcome:+.2f}")

Clustering 17,178 team-seasons...

  k=3: sizes = [np.int64(6037), np.int64(5679), np.int64(5462)]
  k=4: sizes = [np.int64(5183), np.int64(4600), np.int64(3886), np.int64(3509)]
  k=5: sizes = [np.int64(3922), np.int64(3661), np.int64(3293), np.int64(3231), np.int64(3071)]
  k=6: sizes = [np.int64(3571), np.int64(3110), np.int64(3024), np.int64(2554), np.int64(2514), np.int64(2405)]
  k=7: sizes = [np.int64(2969), np.int64(2887), np.int64(2602), np.int64(2406), np.int64(2227), np.int64(2166), np.int64(1921)]
  k=8: sizes = [np.int64(2721), np.int64(2715), np.int64(2694), np.int64(2111), np.int64(1939), np.int64(1909), np.int64(1682), np.int64(1407)]

Family                         N      %  Profile
------------------------------------------------------------------------------------------
Progressive Carriers       3,293  19.2%  Defence: -0.42 | Def. Transition: -0.39 | Att. Transition: -0.19 | Attack: -0.43 | Penetration: +0.70 | Chance Creation: +0.02  | Outcome: -0.14
Possession & P

## The Tactical Families

Each family represents a distinct *way of playing football*. The radars below show the average profile of each family; the shape tells you immediately what kind of football these teams play.

Individual radars let us compare each family’s shape without overlap. Each family keeps the same colour throughout the notebook.

In [16]:
# -- One radar per family in a 2x3 grid --
subplot_titles = [
    f"{FAMILY_NAMES.get(i, '')}  ({(qdf_clean['family_id'] == i).sum() / len(qdf_clean) * 100:.0f}%)"
    for i in range(BEST_K)
] + ['']  # empty 6th slot

fig = make_subplots(
    rows=2, cols=3,
    specs=[[{'type': 'polar'}] * 3, [{'type': 'polar'}] * 3],
    subplot_titles=subplot_titles,
    vertical_spacing=0.12,
    horizontal_spacing=0.08,
)

for fid in range(BEST_K):
    row = fid // 3 + 1
    col = fid % 3 + 1
    profile = family_profiles[fid]
    vals = profile[style_dims].values
    vals_closed = list(vals) + [vals[0]]
    dims_closed = style_dims + [style_dims[0]]  # short names for tight radar grid
    n = (qdf_clean["family_id"] == fid).sum()
    pct = n / len(qdf_clean) * 100
    outcome = qdf_clean.loc[qdf_clean["family_id"] == fid, "Outcome"].mean()

    fig.add_trace(go.Scatterpolar(
        r=vals_closed,
        theta=dims_closed,
        fill='toself',
        fillcolor=FAMILY_COLORS[fid],
        opacity=0.2,
        line=dict(color=FAMILY_COLORS[fid], width=3),
        name=f"{FAMILY_NAMES[fid]} ({pct:.0f}%)",
        hovertemplate='%{theta}: %{r:.2f}<extra>' + FAMILY_NAMES[fid] + '</extra>',
    ), row=row, col=col)

    # Add family name + stats annotation BELOW each radar
    fig.add_annotation(
        text=f"n={n:,} team-seasons | Avg outcome: {outcome:+.2f}",
        xref="paper", yref="paper",
        x=(col - 1) / 3 + 1 / 6,
        y=1.0 - (row - 1) * 0.55 - 0.50,
        showarrow=False,
        font=dict(size=9, color='#555'),
    )

fig.update_polars(
    radialaxis=dict(range=[-1.5, 1.5], showticklabels=True, tickfont=dict(size=7)),
    angularaxis=dict(tickfont=dict(size=9)),
)
fig.update_layout(
    height=800, width=1200,
    showlegend=True,
    legend=dict(orientation='h', yanchor='bottom', y=-0.05, xanchor='center', x=0.5),
    template='plotly_white',
    title_text='The Tactical Families of World Football',
    title_font_size=18,
    margin=dict(t=80, b=60),
)
fig.show()

## Premier League: Who Plays Like Who?

Let’s zoom into the **Premier League** — the most-watched league in the world. The heatmap below shows every team’s tactical score across the six style dimensions. Teams are listed in **alphabetical order** — consistent across all seasons — so you can track the same team across years. Teams not present in a given season appear as blank rows. Use the season buttons to switch.

Hover over any cell to see the exact tactical score.

In [17]:
TEAM_NAMES = {
    1609: 'Arsenal', 1610: 'Chelsea', 1611: 'Man United',
    1612: 'Liverpool', 1613: 'Newcastle', 1614: 'Aston Villa',
    1616: 'Fulham', 1619: 'Southampton', 1620: 'West Brom',
    1623: 'Everton', 1624: 'Tottenham', 1625: 'Man City',
    1626: 'Watford', 1627: 'Swansea', 1628: 'Crystal Palace',
    1629: 'Wolves', 1630: 'Leeds', 1631: 'Leicester',
    1632: 'Sunderland', 1633: 'West Ham', 1634: 'Ipswich',
    1636: 'Sheffield Utd', 1639: 'Stoke', 1642: "Nott'm Forest",
    1644: 'Middlesbrough', 1646: 'Burnley', 1650: 'Huddersfield',
    1651: 'Brighton', 1659: 'Bournemouth', 1660: 'Luton',
    1669: 'Brentford', 1672: 'Cardiff', 1673: 'Norwich',
    10529: 'Sheffield Utd (2)', 10531: 'Coventry',
}

# Last 3 seasons available for EPL
epl_seasons = sorted(qdf_clean[qdf_clean["competition_id"] == 364]["season"].unique())
last_3 = epl_seasons[-3:]
print(f"EPL seasons shown: {last_3}")

epl = qdf_clean[(qdf_clean["competition_id"] == 364) & (qdf_clean["season"].isin(last_3))].copy()
epl["team_name"] = epl["team_id"].map(TEAM_NAMES).fillna(epl["team_id"].astype(str))

# -- Build a FIXED alphabetical y-axis across all 3 seasons --
all_teams_sorted = sorted(epl["team_name"].unique())

# Use DISPLAY_NAMES for x-axis columns
display_cols = [DISPLAY_NAMES[d] for d in style_dims]

fig = go.Figure()
buttons = []

for idx, season in enumerate(last_3):
    season_data = epl[epl["season"] == season].copy()
    season_data = season_data.set_index("team_name")

    # Build a full matrix for ALL teams (NaN for missing)
    z_matrix = np.full((len(all_teams_sorted), len(style_dims)), np.nan)
    hover_text = [['' for _ in style_dims] for _ in all_teams_sorted]
    display_text = [['' for _ in style_dims] for _ in all_teams_sorted]

    for i_row, team in enumerate(all_teams_sorted):
        if team in season_data.index:
            r = season_data.loc[team]
            # Handle case where team appears multiple times (shouldn't happen)
            if isinstance(r, pd.DataFrame):
                r = r.iloc[0]
            for j_col, dim in enumerate(style_dims):
                val = r[dim]
                z_matrix[i_row, j_col] = val
                display_text[i_row][j_col] = f"{val:.2f}"
                fam = r.get("family_name", "")
                hover_text[i_row][j_col] = (
                    f"{team}<br>{DISPLAY_NAMES[dim]}: {val:.2f}<br>Family: {fam}"
                )

    visible = idx == 0
    fig.add_trace(go.Heatmap(
        z=z_matrix,
        x=display_cols,
        y=all_teams_sorted,
        colorscale='RdBu_r',
        zmin=-2, zmax=2,
        text=display_text,
        texttemplate="%{text}",
        textfont=dict(size=10),
        hovertext=hover_text,
        hovertemplate='%{hovertext}<extra></extra>',
        visible=visible,
        colorbar=dict(title="Tactical<br>Score", tickvals=[-2, -1, 0, 1, 2]),
    ))

    # Button visibility
    vis = [False] * len(last_3)
    vis[idx] = True
    buttons.append(dict(
        label=f"{season}/{season + 1}",
        method="update",
        args=[{"visible": vis}],
    ))

fig.update_layout(
    updatemenus=[dict(
        type="buttons", direction="right",
        x=0.5, xanchor="center", y=1.12,
        buttons=buttons,
        bgcolor='#E8E8E8',
    )],
    height=700, width=1000,
    template='plotly_white',
    title_text='Premier League Tactical Heatmap',
    title_font_size=18,
    yaxis=dict(autorange='reversed'),
    margin=dict(l=140, t=100),
)
fig.show()

EPL seasons shown: [np.int64(2023), np.int64(2024), np.int64(2025)]


In [23]:
# -- Dimensionality reduction for tactical landscape --
try:
    import umap
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=30, min_dist=0.3)
    embedding = reducer.fit_transform(X)
    dim_label = "UMAP"
except ImportError:
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    embedding = pca.fit_transform(X)
    dim_label = "PCA"

print(f"Using {dim_label} for 2D projection")

# -- Background: all teams, faded --
fig = go.Figure()

for fid in range(BEST_K):
    mask = qdf_clean["family_id"].values == fid
    fig.add_trace(go.Scatter(
        x=embedding[mask, 0], y=embedding[mask, 1],
        mode='markers',
        marker=dict(color=FAMILY_COLORS[fid], size=3, opacity=0.08),
        name=FAMILY_NAMES.get(fid, f"Family {fid}"),
        hoverinfo='skip',
        showlegend=True,
    ))

# -- EPL teams highlighted with names --
season_symbols = dict(zip(last_3, ['circle', 'square', 'diamond']))
# Map each qdf_clean index label to its row position in `embedding`
row_pos = {label: pos for pos, label in enumerate(qdf_clean.index)}

for season in last_3:
    s_data = epl[epl["season"] == season]
    for _, r in s_data.iterrows():
        idx = row_pos.get(r.name)
        if idx is None:
            continue
        fig.add_trace(go.Scatter(
            x=[embedding[idx, 0]], y=[embedding[idx, 1]],
            mode='markers+text',
            marker=dict(
                color=FAMILY_COLORS[r["family_id"]],
                size=10,
                symbol=season_symbols.get(season, 'circle'),
                line=dict(color='black', width=1),
            ),
            text=[r["team_name"]],
            textposition='top center',
            textfont=dict(size=9, color='black'),
            name=f"{r['team_name']} {season}/{season+1}",
            showlegend=False,
            hovertemplate=(
                f"{r['team_name']} ({season}/{season+1})<br>"
                f"Family: {FAMILY_NAMES[r['family_id']]}<extra></extra>"
            ),
        ))

fig.update_layout(
    height=700, width=1000,
    template='plotly_white',
    title=f'Premier League in the Global Tactical Landscape ({dim_label})',
    title_font_size=18,
    xaxis_title=f'{dim_label} 1',
    yaxis_title=f'{dim_label} 2',
    legend=dict(orientation='h', yanchor='bottom', y=-0.12, xanchor='center', x=0.5),
)
fig.show()

Using UMAP for 2D projection


## How Teams Evolve

A team’s tactical DNA is not fixed. Managers change, players arrive and leave, and tactical trends sweep through the game. The charts below track how six Premier League teams have evolved their playing style season by season.

Look for:
- **Arsenal’s progression** under Arteta — gradually becoming more possession-oriented and press-heavy.
- **Brighton’s identity shifts** across different managers.
- **Man City’s consistency** — a tactical profile that barely moves year-on-year.
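
One way to put a number on "barely moves year-on-year" is to measure how far a team's style vector shifts between consecutive seasons. The sketch below is an assumption-laden illustration: the two teams, their scores, and the `drift` column are hypothetical, and only two dimensions are used to keep the arithmetic readable.

```python
import numpy as np
import pandas as pd

# Hypothetical style scores for two teams over three seasons.
toy = pd.DataFrame({
    "team_name": ["Stable FC"] * 3 + ["Shifting FC"] * 3,
    "season": [2022, 2023, 2024] * 2,
    "Defence": [1.0, 1.1, 1.0, -0.5, 0.3, 1.0],
    "Attack": [-0.8, -0.9, -0.8, 0.9, 0.2, -0.6],
})
dims = ["Defence", "Attack"]

toy = toy.sort_values(["team_name", "season"])
# Season-on-season change per dimension, computed within each team.
deltas = toy.groupby("team_name")[dims].diff()
# Length of the change vector; NaN for each team's first season.
toy["drift"] = np.sqrt((deltas ** 2).sum(axis=1, skipna=False))

mean_drift = toy.groupby("team_name")["drift"].mean()
print(mean_drift.round(3))
```

A low average drift indicates a settled identity, while a high value flags a team whose style is being rebuilt. The same calculation could be applied to the six real dimensions by swapping in the full list of style columns.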

In [19]:
focus_teams = [
    (1609, 'Arsenal'), (1612, 'Liverpool'), (1625, 'Man City'),
    (1611, 'Man United'), (1614, 'Aston Villa'), (1651, 'Brighton'),
]

epl_all = qdf_clean[qdf_clean["competition_id"] == 364].copy()
epl_all["team_name"] = epl_all["team_id"].map(TEAM_NAMES).fillna(epl_all["team_id"].astype(str))

focus_ids = [t[0] for t in focus_teams]
focus_data = epl_all[epl_all["team_id"].isin(focus_ids)].copy()

# Melt to long format for plotly facets
melted = focus_data.melt(
    id_vars=["team_id", "team_name", "season"],
    value_vars=style_dims,
    var_name="dimension",
    value_name="score",
)
# Map short dimension names to bipolar display names for the chart
melted["dimension"] = melted["dimension"].map(DISPLAY_NAMES)

fig = px.line(
    melted, x="season", y="score", color="dimension",
    facet_col="team_name", facet_col_wrap=3,
    markers=True,
    color_discrete_sequence=FAMILY_COLORS + ['#6B7280'],
    title="Tactical Evolution of Premier League Teams",
)
fig.update_yaxes(range=[-2.5, 2.5])
fig.add_hline(y=0, line_dash="dash", line_color="gray", opacity=0.5)
fig.update_layout(
    height=600, width=1100,
    template='plotly_white',
    title_font_size=18,
    legend=dict(orientation='h', yanchor='bottom', y=-0.15, xanchor='center', x=0.5),
)
fig.show()

## Style vs Results

Does playing style predict results? Not entirely: tactics are only one part of the picture. But some styles are associated with better outcomes on average. A plausible explanation is that the best-resourced clubs tend to adopt certain approaches (high pressing, possession dominance) while less-resourced clubs gravitate towards reactive, direct football. The figures below show association, not causation.

The chart below shows the average **Outcome score** (based on expected points and actual points) for each tactical family.

In [20]:
# -- Outcome by family --
outcomes = []
for fid in range(BEST_K):
    mask = qdf_clean["family_id"] == fid
    outcomes.append({
        "family": FAMILY_NAMES[fid],
        "outcome": qdf_clean.loc[mask, "Outcome"].mean(),
        "n": mask.sum(),
        "color": FAMILY_COLORS[fid],
    })
outcomes_df = pd.DataFrame(outcomes).sort_values("outcome")

fig = go.Figure(go.Bar(
    x=outcomes_df["outcome"],
    y=outcomes_df["family"],
    orientation='h',
    marker_color=outcomes_df["color"],
    text=outcomes_df["outcome"].apply(lambda x: f"{x:+.2f}"),
    textposition='outside',
))
fig.update_layout(
    title="Do Certain Playing Styles Win More?",
    title_font_size=18,
    xaxis_title="Average Outcome Score",
    height=400, width=700,
    template='plotly_white',
    margin=dict(l=180),
)
fig.show()

# -- Correlation between style and results --
print("\nCorrelation between style and results:")
for dim in style_dims:
    r = qdf_clean[dim].corr(qdf_clean["Outcome"])
    arrow = "→ wins more" if r > 0.1 else ("→ wins less" if r < -0.1 else "→ no clear link")
    print(f"  {dim:20s}  r={r:+.3f}  {arrow}")


Correlation between style and results:
  Defence               r=+0.579  → wins more
  Def. Transition       r=+0.276  → wins more
  Att. Transition       r=-0.156  → wins less
  Attack                r=-0.237  → wins less
  Penetration           r=+0.174  → wins more
  Chance Creation       r=-0.342  → wins less


## Export

Save the processed data for downstream use:
- **team_style_clusters.parquet** — team qualities + family assignments (cleaned rows only)
- **team_qualities.parquet** — full quality scores for all team-seasons (including rows dropped during clustering)

In [21]:
# -- Save clustered data --
out_cols = ["team_id", "competition_id", "season"] + list(QUALITIES.keys()) + ["family_id", "family_name"]
out_df_export = qdf_clean[out_cols].copy()
out_path = PROCESSED / "team_styles" / "team_style_clusters.parquet"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
out_df_export.to_parquet(out_path, index=False)
print(f"Saved: {out_path}")
print(f"Shape: {out_df_export.shape}")

# -- Save full quality scores --
qdf.to_parquet(PROCESSED / "team_styles" / "team_qualities.parquet", index=False)
print(f"\nSaved: team_qualities.parquet ({len(qdf):,} rows)")

Saved: /Users/jorgepadilla/Documents/Documents - Jorge’s MacBook Air/thesis_data/processed_data/team_styles/team_style_clusters.parquet
Shape: (17178, 12)

Saved: team_qualities.parquet (17,196 rows)


## Assumptions & Design Decisions

This analysis makes several assumptions that should be validated with the supervisor. They are listed below and printed in the next cell for easy reference.

1. **Division 1 filter (CRITICAL)** — Only top-flight (first-division) leagues are included (`division == 1` from `competitions_wyscout.parquet`). This is the most important filter: it ensures we compare teams operating at a similar competitive level. Because z-scores are computed within each league-season, including lower divisions would leave top-flight scores unchanged, but it would pull lower-division team-seasons into the clustering and shift the family assignments.

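2. **higher_is_better flags** — Each metric is assigned a direction. For the six style dimensions, "better" only means "pushes the score towards the high pole" (for example, more pressing or more direct), not higher quality. The table below shows the direction for each metric: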

| Quality | Metric | Higher is better? |
|---------|--------|-------------------|
| Defence | defensive_intensity | Yes |
| Defence | ppda | **No** (lower PPDA = more pressing) |
| Defence | final_third_recoveries_pct | Yes |
| Defence | defensive_action_height_m | Yes |
| Def. Transition | recoveries_within_5s_pct | Yes |
| Def. Transition | time_to_defensive_action_after_loss_att_half_s | **No** (faster = better) |
| Def. Transition | time_to_defensive_action_after_loss_own_half_s | **No** |
| Att. Transition | possessions_retained_after_5s_pct | **No** (lower = more direct) |
| Att. Transition | final_third_entry_within_10s_after_recovery_own_half_pct | Yes |
| Att. Transition | first_pass_forward_after_recovery_own_half_pct | Yes |
| Att. Transition | median_time_to_first_forward_pass_own_half_s | **No** (faster = more direct) |
| Attack | long_ball_pct | Yes |
| Attack | forward_passes_from_middle_third_pct | Yes |
| Attack | buildups_from_goalkicks_pct | **No** (lower = more direct) |
| Penetration | box_entries_from_carries_pct | Yes |
| Penetration | box_entries_from_crosses_pct | **No** (higher carries, lower crosses) |
| Penetration | crosses_per_final_third_possession | **No** |
| Chance Creation | shots_per_final_third_pass | Yes |
| Chance Creation | shots_from_direct_attacks_pct | Yes |
| Chance Creation | shots_from_sustained_attacks_pct | **No** (higher direct, lower sustained) |
| Outcome | xpts | Yes |
| Outcome | points | Yes |

3. **defensive_action_height_m** — Used as the defensive line height column.
4. **Quality weights** — Per supervisor’s `teams_qualities.md` specification.
5. **Z-scores within (competition_id, season)** — Each metric is standardised relative to the team’s own league and season to enable cross-league comparison.
6. **Minimum 3 teams per group** — League-seasons with fewer than 3 teams are dropped (insufficient sample for z-scoring).
7. **k = 5 families** — Chosen for football interpretability; tested k = 3 through 8.
8. **Outcome excluded from style clustering** — The six style dimensions drive family assignment; Outcome is reported separately.

In [22]:
# -- Print assumptions checklist for easy review --
assumptions = [
    ("Division 1 filter [CRITICAL]", "Only top-flight leagues (division == 1). Most important filter."),
    ("higher_is_better flags", "See table in markdown cell above"),
    ("Defensive line height", "defensive_action_height_m column"),
    ("Quality weights", "Per supervisor's teams_qualities.md"),
    ("Z-score grouping", "Within (competition_id, season)"),
    ("Minimum group size", "3 teams per (competition, season)"),
    ("Number of families", "k = 5 (tested 3-8)"),
    ("Outcome excluded from clustering", "Only 6 style dims used for K-Means"),
]

print("=" * 70)
print("  ASSUMPTIONS CHECKLIST")
print("=" * 70)
for i, (title, detail) in enumerate(assumptions, 1):
    status = "✅"  # checkmark
    print(f"  {status}  {i}. {title}")
    print(f"       {detail}")
print("=" * 70)
print(f"\n  Total team-seasons: {len(qdf_clean):,}")
print(f"  Total leagues:      {qdf_clean.competition_id.nunique()}")
print(f"  Seasons:            {qdf_clean.season.min()}–{qdf_clean.season.max()}")
print(f"  Families:           {BEST_K}")
print(f"\n  Outputs:")
print(f"    • team_style_clusters.parquet  ({len(out_df_export):,} rows)")
print(f"    • team_qualities.parquet       ({len(qdf):,} rows)")

  ASSUMPTIONS CHECKLIST
  ✅  1. Division 1 filter [CRITICAL]
       Only top-flight leagues (division == 1). Most important filter.
  ✅  2. higher_is_better flags
       See table in markdown cell above
  ✅  3. Defensive line height
       defensive_action_height_m column
  ✅  4. Quality weights
       Per supervisor's teams_qualities.md
  ✅  5. Z-score grouping
       Within (competition_id, season)
  ✅  6. Minimum group size
       3 teams per (competition, season)
  ✅  7. Number of families
       k = 5 (tested 3-8)
  ✅  8. Outcome excluded from clustering
       Only 6 style dims used for K-Means

  Total team-seasons: 17,178
  Total leagues:      168
  Seasons:            2014–2026
  Families:           5

  Outputs:
    • team_style_clusters.parquet  (17,178 rows)
    • team_qualities.parquet       (17,196 rows)
