# FC Barcelona Defensive Analysis - Project Overview

Data exploration and project context for the FCB Defensive Fault Lines project, covering what the data looks like, what we discovered, and why we pivoted from individual player analysis to team defensive analysis.

In [None]:
from data_loader import list_matches, load_events
from notebook_utils import (
    show_event_code_distribution,
    show_tag_overlap_examples,
    show_duration_stats,
    show_time_offset_demonstration,
)
import pandas as pd

matches   = list_matches()
MATCH     = 5  # Barca vs Manchester City - change to explore other matches
events_df = load_events(MATCH)

## Section 1 - Smart Tagging Data

Each match folder contains a `*_pattern.xml` file produced by Metrica's SportsCoding tool. The file is a coded event timeline with start/end timestamps and tactical labels, and it is the primary signal for the risk engine.

### Data Schema

**Events DataFrame** (`*_pattern.xml` via kloppy sportscode)

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `code_id` | int | Unique ID for each event instance | 42 |
| `period_id` | int | Match period (1 = 1st half, 2 = 2nd half) | 1 |
| `timestamp` | Timedelta | Event start time (relative to video start) | 0 days 00:23:14 |
| `end_timestamp` | Timedelta | Event end time | 0 days 00:25:01 |
| `code` | str | Tactical event label | BALL IN THE BOX |
| `Team` | str | Team performing the action (or 'N/A') | FC Barcelona |
| `Half` | str | Which half the event belongs to | 1st Half |
| `Type` | str | Sub-type label (varies by code, often NaN) | Open Play |
| `Side` | str | Side of pitch | Left |
| `Direction of ball entry` | str | Direction ball entered the zone | Through the middle |
| `Max Players in the box` | str | Player count in box during event | 3 |
| `match_name` | str | Folder name of the match | Barca - Manchester City (2-2_...) |

### Multiple Events Occur in Parallel

Event codes are not mutually exclusive. At any given second, several tags may be active in parallel, each capturing a different layer of the same passage of play. For example, a PROGRESSION event may overlap with a BALL IN FINAL THIRD and an ATTACKING TRANSITION.
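
A minimal sketch of how this overlap can be checked, assuming an events DataFrame with the `timestamp`, `end_timestamp`, and `code` columns from the schema above. `events_active_at` is a hypothetical helper (it is not part of `notebook_utils`), and the toy rows are invented for illustration.

```python
import pandas as pd

def events_active_at(events: pd.DataFrame, t: pd.Timedelta) -> pd.DataFrame:
    """Return every event whose [timestamp, end_timestamp] interval contains t."""
    mask = (events["timestamp"] <= t) & (events["end_timestamp"] >= t)
    return events.loc[mask]

# Toy rows invented for illustration (not real match data).
toy_events = pd.DataFrame({
    "code": [
        "PROGRESSION",
        "BALL IN FINAL THIRD",
        "ATTACKING TRANSITION",
        "BALL IN THE BOX",
    ],
    "timestamp": pd.to_timedelta(["00:10:00", "00:10:04", "00:09:58", "00:12:00"]),
    "end_timestamp": pd.to_timedelta(["00:10:12", "00:10:20", "00:10:06", "00:12:05"]),
})

# Which tags are active at video time 00:10:05?
active = events_active_at(toy_events, pd.Timedelta(minutes=10, seconds=5))
print(active["code"].tolist())
```

In the toy data, three of the four tags cover 00:10:05, so a single moment maps to three rows rather than one.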

## Section 2 - Player Tracking Data

Each match folder also contains FIFA tracking files (`*_FifaData.xml`, `*_FifaDataRawData.txt`) providing per-frame player positions at 25 fps.

### Data Schema

**Tracking DataFrame** (`*_FifaDataRawData.txt` via kloppy metrica EPTS)

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `period_id` | int | Match period | 1 |
| `timestamp` | Timedelta | Frame timestamp at 25 fps | 0 days 00:01:23.040000 |
| `frame_id` | int | Sequential frame index | 1980 |
| `ball_x` | float | Ball x-coordinate (metres) | 34.21 |
| `ball_y` | float | Ball y-coordinate (metres) | 21.04 |
| `ball_z` | float | Ball height | 0.12 |
| `ball_speed` | float | Ball speed | 8.4 |
| `<player_id>_x` | float | Player x-coordinate (NaN if off-camera) | NaN |
| `<player_id>_y` | float | Player y-coordinate | NaN |
| `<player_id>_d` | float | Distance covered since last frame | NaN |
| `<player_id>_s` | float | Player speed | NaN |

### Tracking Data is Mostly NaN

Because broadcast cameras follow the ball, players frequently go off-screen. When a player is off-camera, their position data is recorded as NaN. This means that for most players, the majority of their tracking data across a match is missing.
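
One way to quantify the gap is sketched below, assuming the tracking DataFrame uses the `<player_id>_x` column naming from the schema above. `missing_share_by_player` is a hypothetical helper and the toy frames are invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_share_by_player(tracking: pd.DataFrame) -> pd.Series:
    """Fraction of frames with no x-position, per player column (ball excluded)."""
    x_cols = [c for c in tracking.columns if c.endswith("_x") and c != "ball_x"]
    return tracking[x_cols].isna().mean().sort_values(ascending=False)

# Toy frames invented for illustration (not real tracking data).
toy_tracking = pd.DataFrame({
    "frame_id": [0, 1, 2, 3],
    "ball_x": [34.2, 34.5, 34.9, 35.3],
    "p1_x": [10.0, np.nan, np.nan, np.nan],  # visible in 1 of 4 frames
    "p2_x": [20.0, 20.1, np.nan, 20.4],      # visible in 3 of 4 frames
})

share = missing_share_by_player(toy_tracking)
print(share)
```

Running the same function on a real match would show how much of each player's track is missing before any player-level metric is attempted.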

## Section 3 - Data Exploration

The following cells contain supporting data exploration for the sections below.

### 3.1 - Event Code Distribution

In [None]:
show_event_code_distribution(events_df)

### 3.2 - Events in Parallel

The table below shows all events active at minute 10, illustrating how multiple tags overlap at the same moment.

In [None]:
show_tag_overlap_examples(events_df, minute=10)

### 3.3 - Event Duration Statistics

In [None]:
show_duration_stats(events_df)

### 3.4 - Time Offset Demonstration

The table below shows raw video timestamps alongside corrected in-game timestamps, illustrating the offset described in Section 5.

In [None]:
show_time_offset_demonstration(events_df)

## Section 4 - Post Data Exploration Insights

**Key Findings**

- **Team-level data:** Events are team-level, not player-level. Smart tags record what a team is doing, not individual players.
- **Parallel events:** Events heavily overlap in time. At any given second, 3-6 codes may be active simultaneously.
- **Manual time calibration required:** Time offsets must be manually calibrated per match video. There is no automatic reference point in the data.

**Data Limitations**

- **Player analysis not feasible:** Tracking identity fragmentation (see Section 6) makes individual player analysis intractable without significant manual work.
- **No player identity mapping:** Metrica provides numeric IDs but does not link them to known player names or jersey numbers.

## Section 5 - Time Offset Issue

### Issue

**Video time vs in-game time:** The timestamps in the Player Tracking Data and Smart Tagging Data are relative to the start of the video, not to the in-game timer. Because the video does not start at kickoff, the time at which a tag occurs in the XML files is offset from the time shown on the in-game timer.

**Example:** If the in-game timer starts at 15:32 in the video, every first-half timestamp in the data is 15:32 ahead of the in-game timer.

**Three offsets to remove:** To align the data with in-game time, three stretches of video time must be removed:
- Footage before the start of the game
- The halftime interval
- Footage after the game ends

### Solution

**Manual calibration:** For each match video, manually record four video times: the start of the game, the start of the halftime interval, the end of the halftime interval, and the end of the game. Use those times to shift all events in the player tracking and smart tagging data whenever they need to match the in-game timer rather than the video.
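
The calibration can be sketched as a small mapping function. This is an illustration under stated assumptions, not the project's implementation: `video_to_game_time` is a hypothetical helper, and every calibration value below except the 15:32 kickoff from the example above is invented.

```python
import pandas as pd

def video_to_game_time(t, kickoff, halftime_start, halftime_end, full_time):
    """Map a video timestamp to playing time by removing the three offsets.

    Returns pd.NaT for timestamps outside play
    (before the game, during the halftime interval, after the game).
    """
    if kickoff <= t <= halftime_start:
        # First half: remove only the pre-game footage.
        return t - kickoff
    if halftime_end <= t <= full_time:
        # Second half: remove the pre-game footage and the halftime interval.
        return t - kickoff - (halftime_end - halftime_start)
    return pd.NaT

# Hypothetical calibration values recorded by hand for one match video.
cal = dict(
    kickoff=pd.Timedelta("00:15:32"),
    halftime_start=pd.Timedelta("01:02:40"),
    halftime_end=pd.Timedelta("01:18:10"),
    full_time=pd.Timedelta("02:07:30"),
)

first_half = video_to_game_time(pd.Timedelta("00:23:14"), **cal)
second_half = video_to_game_time(pd.Timedelta("01:20:10"), **cal)
interval = video_to_game_time(pd.Timedelta("01:10:00"), **cal)
print(first_half, second_half, interval)
```

To convert a whole column, the same function can be applied row by row, for example `events_df["timestamp"].apply(lambda t: video_to_game_time(t, **cal))`. Note that this yields continuous playing time; if the target is a broadcast clock that restarts at 45:00 for the second half, the second-half branch could instead return 45 minutes plus `t - halftime_end`.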

## Section 6 - Player Tracking Issue

**Camera tracking limitation:** Players move in and out of frame in the match videos. Because not all 22 players are visible at every moment, the computer vision software struggles to keep track of them.

**Identity fragmentation:** Metrica Nexus made the decision to prioritize consistency over identity continuity. As a result, multiple IDs are created for the same player, and a single match can contain 40 to 60 additional player IDs.

**No player identity mapping:** The player tracking data does not specify which player each ID corresponds to. This must be inferred manually.
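
A rough way to inspect fragmentation is to look at when each tracked ID first and last appears. The sketch below assumes the `<player_id>_x` and `frame_id` columns from the tracking schema in Section 2; `track_lifespans` is a hypothetical helper and the toy frames are invented for illustration.

```python
import numpy as np
import pandas as pd

def track_lifespans(tracking: pd.DataFrame) -> pd.DataFrame:
    """For each tracked ID, report first/last frame with a position and the visible frame count."""
    rows = []
    for col in tracking.columns:
        if not col.endswith("_x") or col == "ball_x":
            continue
        seen = tracking.loc[tracking[col].notna(), "frame_id"]
        rows.append({
            "track_id": col[:-2],  # strip the "_x" suffix
            "first_frame": seen.min(),
            "last_frame": seen.max(),
            "frames_visible": len(seen),
        })
    return pd.DataFrame(rows)

# Toy frames invented for illustration: one ID disappears, another appears later.
toy_tracking = pd.DataFrame({
    "frame_id": [0, 1, 2, 3, 4, 5],
    "ball_x": [30.0, 30.5, 31.0, 31.5, 32.0, 32.5],
    "p7_x": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
    "p31_x": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0],
})

lifespans = track_lifespans(toy_tracking)
print(lifespans)
print("distinct track IDs:", len(lifespans))
```

IDs whose lifespans never overlap, as in the toy data, are candidates for being the same player, but confirming that still requires manual review against the video.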

## Section 7 - Project Pivot

### Original Plan

The original concept for this project was an individual player analysis, in which our application would give insights into each player's performance.

### Why We Pivoted

Because each player is split across multiple `player_id`s and those IDs are difficult to relate to real players, we scrapped the player-based analysis and pivoted to a team-level, match-level analysis.

### New Direction: Team-Wide Defensive Analysis

Instead of asking "how did player X perform?", we pivoted to: where did Barcelona's defensive shape break down, and why?

This question is fully answerable from the smart tagging data, which is clean and complete. The pipeline runs a continuous defensive risk score over match time, detects fault-line moments (risk peaks linked to goals conceded), and uses an LLM to generate coach-friendly tactical explanations, all surfaced through an interactive dashboard with video seek.
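
To make the idea of a continuous risk score concrete, here is a deliberately simplified sketch. It is not the project's scoring method: the weights in `RISK_WEIGHTS` are hypothetical, `risk_timeline` is an illustrative helper, the toy events are invented, and the sketch neither filters by team nor links peaks to goals conceded.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for illustration only; the project's actual scoring is not shown here.
RISK_WEIGHTS = {
    "BALL IN FINAL THIRD": 1.5,
    "ATTACKING TRANSITION": 2.0,
    "BALL IN THE BOX": 3.0,
}

def risk_timeline(events: pd.DataFrame, weights: dict) -> pd.Series:
    """Sum the weights of all active codes at each second of video time."""
    grid = pd.timedelta_range(
        start=pd.Timedelta(0),
        end=events["end_timestamp"].max(),
        freq=pd.Timedelta(seconds=1),
    )
    risk = np.zeros(len(grid))
    for row in events.itertuples(index=False):
        active = (grid >= row.timestamp) & (grid <= row.end_timestamp)
        risk[active] += weights.get(row.code, 0.0)  # codes without a weight add nothing
    return pd.Series(risk, index=grid)

# Toy events invented for illustration (not real match data).
toy_events = pd.DataFrame({
    "code": ["PROGRESSION", "BALL IN FINAL THIRD", "ATTACKING TRANSITION", "BALL IN THE BOX"],
    "timestamp": pd.to_timedelta(["00:00:00", "00:00:10", "00:00:12", "00:00:15"]),
    "end_timestamp": pd.to_timedelta(["00:00:30", "00:00:20", "00:00:16", "00:00:18"]),
})

risk = risk_timeline(toy_events, RISK_WEIGHTS)
peak_time = risk.idxmax()  # first second at which the score is highest
print(peak_time, risk.max())
```

Because overlapping tags add up, the score rises where several dangerous codes coincide, which is the property that makes parallel tags useful for locating candidate fault-line moments.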

## Section 8 - Metrica Nexus Improvements

**Smart Tagging Timeline Offset:** Metrica Nexus could give users the option to offset the Smart Tagging Timeline so that it matches the game time. Users would be asked to input the four calibration times for a match video (game start, halftime start, halftime end, game end). Smart tags would then line up with the in-game timer, making analysis more intuitive.

**Player ID Merging:** Metrica Nexus could give users the option to merge player_ids in the player tracking data and connect the player_ids to existing player names or player stats in a database.
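
Until such a feature exists, a merge could be approximated on the user side once fragments have been matched by hand. The sketch below is an assumption-laden illustration: `merge_track_ids` and the `id_map` are hypothetical, the column suffixes follow the tracking schema in Section 2, and the toy frames are invented.

```python
import numpy as np
import pandas as pd

def merge_track_ids(tracking, id_map, suffixes=("_x", "_y", "_d", "_s")):
    """Fold each fragment ID's columns into its target ID, filling the target's NaN gaps."""
    out = tracking.copy()
    for src, dst in id_map.items():
        for suf in suffixes:
            s, d = f"{src}{suf}", f"{dst}{suf}"
            if s not in out.columns:
                continue
            # Keep the target's values where present, otherwise take the fragment's.
            out[d] = out[d].combine_first(out[s]) if d in out.columns else out[s]
            out = out.drop(columns=s)
    return out

# Toy frames invented for illustration: p31 is assumed to be a fragment of p7.
toy_tracking = pd.DataFrame({
    "frame_id": [0, 1, 2, 3],
    "p7_x": [1.0, 2.0, np.nan, np.nan],
    "p31_x": [np.nan, np.nan, 3.0, 4.0],
})

merged = merge_track_ids(toy_tracking, {"p31": "p7"})
print(merged)
```

The hard part is building `id_map` in the first place, which is why merging support inside Metrica Nexus would be valuable.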

**Player Recognition Confidence:** Metrica Nexus could also improve its player recognition software to report confidence values for how likely different player_ids are to belong to the same player.