# Pressure Cooker — FC Barcelona Defensive Analysis

**More than a Hack 2026**

This notebook documents the data exploration behind the project: what the data looks like, what we discovered, why we pivoted from individual player analysis to team defensive analysis, and how the final pipeline works.

**Tactical question:** When and how does Barcelona’s defensive structure break down, and which vulnerability patterns recur across matches?

In [None]:
from data_loader import list_matches, load_events
from notebook_utils import (
    show_event_code_distribution,
    show_tag_overlap_examples,
    show_duration_stats,
    show_time_offset_demonstration,
)
import pandas as pd
import numpy as np

matches = list_matches()
MATCH   = 5  # Barca vs Manchester City (2-2) - change to explore other matches
events  = load_events(MATCH)

print(f"{len(matches)} matches available:")
for i, m in enumerate(matches):
    print(f"  [{i}] {m}")

## Section 1 — Smart Tagging Data

Each match folder has a `*_pattern.xml` file from Metrica’s Smart Tagging tool. It’s a coded event timeline with start/end timestamps and tactical labels. This is the primary signal for the risk engine.

### Data Schema

**Events DataFrame** (`*_pattern.xml` parsed via kloppy sportscode)

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `code_id` | int | Unique ID per event instance | 42 |
| `period_id` | int | Match period (1 = 1st half, 2 = 2nd half) | 1 |
| `timestamp` | Timedelta | Event start time (relative to video start) | 0 days 00:23:14 |
| `end_timestamp` | Timedelta | Event end time | 0 days 00:25:01 |
| `code` | str | Tactical event label | BALL IN THE BOX |
| `Team` | str | Team performing the action (or 'N/A') | FC Barcelona |
| `Half` | str | Which half | 1st Half |
| `Type` | str | Sub-type label (varies by code, often NaN) | Open Play |
| `Side` | str | Side of pitch | Left |
| `Direction of ball entry` | str | Direction ball entered the zone | Through the middle |
| `Max Players in the box` | str | Player count in box during event | 3 |
| `match_name` | str | Folder name of the match | Barca - Manchester City (2-2...) |

### Multiple Events Occur in Parallel

At any given second, several tags may be active simultaneously, each capturing a different layer of the same passage of play. For example, a PROGRESSION event may overlap with a BALL IN FINAL THIRD and an ATTACKING TRANSITION.
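As a minimal sketch of how the overlap can be inspected, the snippet below lists the codes active at a single instant, assuming the `timestamp` and `end_timestamp` columns from the schema above. The helper name `active_codes_at` and the toy rows are illustrative and not part of the project code.

```python
import pandas as pd

def active_codes_at(events: pd.DataFrame, t: pd.Timedelta) -> list[str]:
    """Return the sorted codes whose [timestamp, end_timestamp] interval contains t."""
    mask = (events["timestamp"] <= t) & (events["end_timestamp"] >= t)
    return sorted(events.loc[mask, "code"].unique())

# Toy rows (hypothetical values) mimicking three overlapping tags and one later tag.
toy_events = pd.DataFrame({
    "code": ["PROGRESSION", "BALL IN FINAL THIRD", "ATTACKING TRANSITION", "POSSESSION"],
    "timestamp": pd.to_timedelta(["00:10:00", "00:10:02", "00:09:58", "00:12:00"]),
    "end_timestamp": pd.to_timedelta(["00:10:08", "00:10:15", "00:10:05", "00:12:30"]),
})

active = active_codes_at(toy_events, pd.Timedelta(minutes=10, seconds=4))
print(active)
```

Treating both interval ends as inclusive is a choice made for this sketch; the real helper in `notebook_utils` may handle boundaries differently.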

## Section 2 — Player Tracking Data

Each match folder also has FIFA tracking files (`*_FifaData.xml`, `*_FifaDataRawData.txt`) providing per-frame player positions at 25 fps.

### Data Schema

**Tracking DataFrame** (`*_FifaDataRawData.txt` via kloppy metrica EPTS)

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `period_id` | int | Match period | 1 |
| `timestamp` | Timedelta | Frame timestamp at 25 fps | 0 days 00:01:23.040000 |
| `frame_id` | int | Sequential frame index | 1980 |
| `ball_x` | float | Ball x-coordinate (metres) | 34.21 |
| `ball_y` | float | Ball y-coordinate (metres) | 21.04 |
| `ball_z` | float | Ball height | 0.12 |
| `ball_speed` | float | Ball speed | 8.4 |
| `<player_id>_x` | float | Player x-coordinate (NaN if off-camera) | NaN |
| `<player_id>_y` | float | Player y-coordinate | NaN |
| `<player_id>_d` | float | Distance covered since last frame | NaN |
| `<player_id>_s` | float | Player speed | NaN |

### Tracking Data is Mostly NaN

Broadcast cameras follow the ball, so players frequently go off-screen. When a player is off-camera their position is NaN. For most outfield players the majority of the match has no position data.
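One way to quantify this is to compute, per player ID, the fraction of frames with a known position. The sketch below assumes the `<player_id>_x` column naming from the schema above; the function name `player_coverage` and the toy IDs `p101` and `p102` are hypothetical.

```python
import numpy as np
import pandas as pd

def player_coverage(tracking: pd.DataFrame) -> pd.Series:
    """Fraction of frames with a non-NaN x-position, indexed by player ID."""
    # Every "*_x" column except the ball is treated as a player column.
    x_cols = [c for c in tracking.columns if c.endswith("_x") and c != "ball_x"]
    coverage = tracking[x_cols].notna().mean()
    coverage.index = [c[:-2] for c in x_cols]  # strip the "_x" suffix
    return coverage.sort_values()

# Toy frames (hypothetical values): p101 is visible in 1 of 4 frames, p102 in 3 of 4.
toy_tracking = pd.DataFrame({
    "frame_id": [0, 1, 2, 3],
    "ball_x": [34.2, 34.5, 34.9, 35.1],
    "p101_x": [10.0, np.nan, np.nan, np.nan],
    "p102_x": [20.0, 20.1, 20.3, np.nan],
})

coverage = player_coverage(toy_tracking)
print(coverage)
```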

## Section 3 — Data Exploration

The cells below explore the event data structure for a single match.

### 3.1 — Event Code Distribution

In [None]:
show_event_code_distribution(events)

### 3.2 — Events in Parallel

This cell lists every event active at minute 10 to show how tags overlap at a single moment.

In [None]:
show_tag_overlap_examples(events, minute=10)

### 3.3 — Event Duration Statistics

In [None]:
show_duration_stats(events)

### 3.4 — Time Offset Demonstration

This cell shows raw video timestamps alongside corrected in-game timestamps to illustrate the offset between video time and the match clock.

In [None]:
show_time_offset_demonstration(events)

## Section 4 — Post Data Exploration Insights

**What we found:**

- Events are team-level, not player-level. Smart tags record what a team is doing, not what individual players are doing.
- Events heavily overlap in time. At any given second, 3–6 codes may be active simultaneously.
- Time offsets must be manually calibrated per match video. There is no automatic reference point in the data.

**Data Limitations:**

- Individual player analysis is not feasible. Tracking identity fragmentation (40–60 extra IDs per match) makes it intractable without major manual effort.
- Tracking data does not map numeric IDs to known player names or jersey numbers.

## Section 5 — Time Offset Issue

### Problem

Smart Tagging and tracking timestamps are relative to video start, not the match clock. For example, if the in-game timer only starts at 15:32 in the video, every timestamp in the first half carries a 15:32 offset.

Three offsets need to be removed to align data with match time:
- Pre-match broadcast segment
- Halftime interval
- Post-match segment

### Solution

Manually record four timestamps in the video: game start, halftime start, halftime end, and game end. Use these to offset all event and tracking timestamps. Our dashboard stores per-match video calibration offsets so danger moments seek to the correct video frame.
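The conversion can be sketched as follows. This is an illustration under stated assumptions rather than the dashboard's actual code: the `Calibration` class and `video_to_match_seconds` function are hypothetical names, the result is elapsed playing time (the second half continues from wherever the first half ended instead of restarting at 45:00), and every number except the 15:32 kick-off example is made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Calibration:
    """Four manually recorded video timestamps, in seconds from video start."""
    kickoff_s: float
    first_half_end_s: float
    second_half_start_s: float
    full_time_s: float

def video_to_match_seconds(video_s: float, cal: Calibration) -> Optional[float]:
    """Map a video timestamp to elapsed playing time; None outside both halves."""
    if cal.kickoff_s <= video_s <= cal.first_half_end_s:
        return video_s - cal.kickoff_s
    if cal.second_half_start_s <= video_s <= cal.full_time_s:
        first_half_len = cal.first_half_end_s - cal.kickoff_s
        return first_half_len + (video_s - cal.second_half_start_s)
    return None  # pre-match, halftime, or post-match segment

# Kick-off at 15:32 (932 s) as in the example above; the other values are hypothetical.
cal = Calibration(kickoff_s=932, first_half_end_s=3752,
                  second_half_start_s=4652, full_time_s=7592)

print(video_to_match_seconds(1532, cal))  # ten minutes after kick-off
print(video_to_match_seconds(4000, cal))  # during halftime
```

Returning `None` for the three non-playing segments makes it easy to drop those rows before any time-based analysis.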

## Section 6 — Player Tracking Issue

**Camera tracking limitation:** Players come in and out of frame throughout the match. The tracking software can’t maintain consistent position data when players go off-screen.

**Identity fragmentation:** Metrica prioritised consistency over identity continuity. When a player re-enters frame, they may get a new tracking ID. This inflates player ID counts by 40–60 per match.

**No player identity mapping:** Tracking data uses numeric IDs with no link to player names or jersey numbers.

**Partial recovery:** We built a custom batch parser (`tracking_batch_parser.py`) that extracts usable ball positions and partial player positions from the raw ATD feed across all 11 matches. The parser produces `player_positions.csv`, `ball_positions.csv`, and `team_map.json` per match. This recovered data is used for spatial features (team shape, ball proximity, overload detection) in danger moment explanations, but it is not reliable enough for full defensive shape or compactness analysis.

## Section 7 — Project Pivot

### Original Plan

The original concept was individual player performance analysis — scoring each player’s contribution across matches.

### Why We Pivoted

Tracking identity fragmentation and sparse position data made player-level analysis impractical. Linking numeric IDs to real players would require extensive manual work with no guarantee of accuracy.

### New Direction: Team-Wide Defensive Analysis

Instead of asking "how did player X perform?", we pivoted to: where did Barcelona's defensive shape break down, and why?

This question is fully answerable from the smart tagging data, which is clean and complete. The pipeline runs a continuous defensive risk score over match time, detects fault-line moments (risk peaks linked to goals conceded), and uses an LLM to generate coach-friendly tactical explanations, all surfaced through an interactive dashboard with video seek.

## Section 8 — Pipeline Walkthrough

The following cells run the core analysis pipeline on one match to show how it works.

### 8.1 — Risk Scoring

The risk engine converts Smart Tagging events into a continuous 0–100 risk score. Opponent attacking events (e.g. BALL IN THE BOX: +1.55, DEFENSIVE TRANSITION: +1.35) increase risk. Barcelona possession events (e.g. POSSESSION: −0.35) decrease it. The signal is built on a 0.25-second grid, smoothed with a 3-second moving average, and min-max scaled to 0–100.
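The steps described above can be sketched roughly as below. This is not the code in `risk_engine`: the function name `risk_sketch` is hypothetical, only the three weights quoted in the text are included, and the input is assumed to be already filtered so that attacking codes belong to the opponent and possession codes to Barcelona.

```python
import numpy as np

# Weights quoted in the text; any other code contributes 0 in this sketch.
WEIGHTS = {"BALL IN THE BOX": 1.55, "DEFENSIVE TRANSITION": 1.35, "POSSESSION": -0.35}
GRID_S = 0.25    # grid step in seconds
SMOOTH_S = 3.0   # moving-average window in seconds

def risk_sketch(intervals, duration_s):
    """intervals: iterable of (code, start_s, end_s) in match seconds."""
    time_s = np.arange(0.0, duration_s, GRID_S)
    raw = np.zeros_like(time_s)
    for code, start_s, end_s in intervals:
        active = (time_s >= start_s) & (time_s < end_s)
        raw[active] += WEIGHTS.get(code, 0.0)  # overlapping tags add up

    window = int(round(SMOOTH_S / GRID_S))  # 12 samples for a 3-second window
    smooth = np.convolve(raw, np.ones(window) / window, mode="same")

    span = smooth.max() - smooth.min()
    # Min-max scale to 0-100; a flat signal maps to all zeros.
    score = np.zeros_like(smooth) if span == 0 else 100.0 * (smooth - smooth.min()) / span
    return {"time_s": time_s, "score": score}

# Toy timeline (hypothetical): possession, then a transition that ends in the box.
toy = [("POSSESSION", 0, 20), ("DEFENSIVE TRANSITION", 20, 30), ("BALL IN THE BOX", 25, 30)]
sketch = risk_sketch(toy, duration_s=40)
peak_t = sketch["time_s"][np.argmax(sketch["score"])]
print(f"peak at {peak_t:.2f} s")
```

Because of the min-max step, scores are relative to each match: a score of 100 marks the riskiest moment of that match, not an absolute level comparable across matches.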

In [None]:
from risk_engine import compute_risk_for_match
import matplotlib.pyplot as plt

risk = compute_risk_for_match(MATCH)
print(f"Risk array: {len(risk['time_s'])} samples ({risk['time_s'][-1]:.0f} seconds)")
print(f"Score range: {risk['score'].min():.1f} to {risk['score'].max():.1f}")

fig, ax = plt.subplots(figsize=(14, 3))
ax.fill_between(risk['time_s'] / 60, risk['score'], alpha=0.4, color='#A50044')
ax.plot(risk['time_s'] / 60, risk['score'], linewidth=0.5, color='#A50044')
ax.set_xlabel('Match time (minutes)')
ax.set_ylabel('Risk score (0\u2013100)')
ax.set_title(f'Defensive Risk Timeline \u2014 {events["match_name"].iloc[0]}')
ax.set_xlim(0, risk['time_s'].max() / 60)
ax.set_ylim(0, 105)
plt.tight_layout()
plt.show()

### 8.2 — Danger Detection

Danger moments are continuous segments where risk > 45. Segments shorter than 5 seconds are discarded. Segments separated by < 12 seconds are merged. Each gets a severity label: High (≥80), Moderate (≥50), Low (≥25), Very Low (<25).
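A minimal sketch of these rules follows. It is not the code in `danger_detector`: the name `detect_dangers_sketch` is hypothetical, short segments are assumed to be dropped before merging, and severity is assumed to be derived from the peak score. Under that last assumption the two lowest labels cannot occur, because every segment peaks above 45.

```python
import numpy as np

THRESHOLD = 45.0       # risk level that opens a danger segment (from the text)
MIN_DURATION_S = 5.0   # shorter segments are discarded (from the text)
MERGE_GAP_S = 12.0     # segments closer than this are merged (from the text)

def severity_label(peak: float) -> str:
    """Map a peak score to the labels quoted above (assumes severity uses the peak)."""
    if peak >= 80:
        return "High"
    if peak >= 50:
        return "Moderate"
    if peak >= 25:
        return "Low"
    return "Very Low"

def detect_dangers_sketch(time_s: np.ndarray, score: np.ndarray) -> list[dict]:
    above = score > THRESHOLD
    # Pad with False so every run of True has both a rising and a falling edge.
    edges = np.diff(np.concatenate(([False], above, [False])).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1  # index of the last sample above threshold

    # Assumed order: drop short segments first, then merge near neighbours.
    kept = [(s, e) for s, e in zip(starts, ends)
            if time_s[e] - time_s[s] >= MIN_DURATION_S]
    merged: list[tuple[int, int]] = []
    for s, e in kept:
        if merged and time_s[s] - time_s[merged[-1][1]] < MERGE_GAP_S:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    moments = []
    for s, e in merged:
        peak = float(score[s:e + 1].max())
        moments.append({"start_s": float(time_s[s]), "end_s": float(time_s[e]),
                        "peak_score": peak, "severity": severity_label(peak)})
    return moments

# Toy signal (hypothetical): two nearby bursts, one 2-second spike, one isolated burst.
t = np.arange(0, 120, 0.25)
y = np.full(len(t), 10.0)
y[(t >= 10) & (t < 18)] = 60.0
y[(t >= 24) & (t < 31)] = 85.0
y[(t >= 60) & (t < 62)] = 90.0
y[(t >= 90) & (t < 100)] = 50.0
toy_dangers = detect_dangers_sketch(t, y)
for m in toy_dangers:
    print(m)
```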

In [None]:
from danger_detector import detect_dangers

dangers = detect_dangers(risk)
print(f"Detected {len(dangers)} danger moments:\n")

severity_counts = {}
for d in dangers:
    sev = d.get('severity', 'unknown')
    severity_counts[sev] = severity_counts.get(sev, 0) + 1

for sev, count in sorted(severity_counts.items()):
    print(f"  {sev}: {count}")

print("\nTop 5 by peak score:")
top5 = sorted(dangers, key=lambda d: d['peak_score'], reverse=True)[:5]
for d in top5:
    start_min = d['start_s'] / 60
    end_min = d['end_s'] / 60
    print(f"  {start_min:.1f}' \u2013 {end_min:.1f}' | peak: {d['peak_score']:.1f} | {d.get('severity', 'N/A')}")

### 8.3 — All Matches Summary

Running the pipeline across all 11 matches.

In [None]:
total_dangers = 0

print(f"{'Match':<55} {'Dangers':>8} {'High':>6} {'Mod':>6}")
print("-" * 80)

for i in range(len(matches)):
    r = compute_risk_for_match(i)
    d = detect_dangers(r)
    
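    # Compare case-insensitively so both 'High' and 'high' labels are counted.
    n_high = sum(1 for x in d if str(x.get('severity', '')).lower() == 'high')
    n_mod  = sum(1 for x in d if str(x.get('severity', '')).lower() == 'moderate')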
    total_dangers += len(d)
    
    name = matches[i][:52]
    print(f"  {name:<53} {len(d):>8} {n_high:>6} {n_mod:>6}")

print("-" * 80)
print(f"  Total danger moments: {total_dangers}")

### 8.4 — Pattern Analysis

Comparing danger moment signatures (the sorted set of active event codes at peak time) across matches to find recurring vulnerability patterns.
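The signature idea can be illustrated with a small counting sketch. It is not the logic in `pattern_analyzer`: the `active_codes` key, the `recurring_signatures` function, and the `min_matches` parameter are hypothetical, and the confidence and lift values printed by the real analyzer are not reproduced here because the text does not define them.

```python
from collections import defaultdict

def signature(active_codes):
    """Order-independent signature: the sorted set of codes active at the peak."""
    return tuple(sorted(set(active_codes)))

def recurring_signatures(moments, min_matches=2):
    """Keep signatures seen in at least `min_matches` different matches."""
    occurrences = defaultdict(int)
    matches_seen = defaultdict(set)
    for m in moments:
        sig = signature(m["active_codes"])
        occurrences[sig] += 1
        matches_seen[sig].add(m["match_index"])

    rows = [
        {"codes": sig, "occurrences": occurrences[sig], "n_matches": len(matches_seen[sig])}
        for sig in occurrences
        if len(matches_seen[sig]) >= min_matches
    ]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(rows, key=lambda r: (-r["occurrences"], r["codes"]))

# Toy danger moments (hypothetical): one signature recurs across two matches,
# another appears twice but only within a single match.
toy_moments = [
    {"match_index": 0, "active_codes": ["DEFENSIVE TRANSITION", "BALL IN THE BOX"]},
    {"match_index": 3, "active_codes": ["BALL IN THE BOX", "DEFENSIVE TRANSITION"]},
    {"match_index": 3, "active_codes": ["BALL IN FINAL THIRD"]},
    {"match_index": 3, "active_codes": ["BALL IN FINAL THIRD"]},
]
recurring = recurring_signatures(toy_moments)
print(recurring)
```

Counting distinct matches rather than raw occurrences keeps a single chaotic match from being reported as a recurring pattern.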

In [None]:
from pattern_analyzer import analyze_patterns

# Collect dangers from all matches
all_dangers = []
for i in range(len(matches)):
    r = compute_risk_for_match(i)
    d = detect_dangers(r)
    for moment in d:
        moment['match_index'] = i
        moment['match_name'] = matches[i]
    all_dangers.extend(d)

patterns = analyze_patterns(all_dangers)
print(f"Found {len(patterns)} recurring patterns across {len(matches)} matches:\n")

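# Defined outside the f-string: a backslash inside {...} is a SyntaxError before Python 3.12.
arrow = " \u2192 "
for p in patterns:
    print(f"  Pattern: {arrow.join(p['codes'])}")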
    print(f"    Confidence: {p['confidence']:.3f} | Lift: {p['lift']:.2f}x | Occurrences: {p['occurrences']}")
    print(f"    Matches: {', '.join(p.get('match_names', []))}")
    print()

## Section 9 — Suggestions for Metrica Nexus

**Smart Tagging timeline offset tool:** Let users input the 4 calibration timestamps (game start, halftime start/end, game end) so Smart Tagging data aligns with the match clock automatically.

**Player ID merging:** Let users merge fragmented tracking IDs and connect them to known player names or external databases.

**Player recognition confidence scores:** Include confidence values when the system correlates different tracking IDs, so users know how reliable the ID assignment is.