# Data Exploration - FCB Player Peak

This notebook explores the data for **one match** to understand what we are working with.

**Workflow:** We develop everything on a single match, then automate across all 11.

### What we have per match:
- **Events** (`*_pattern.xml`) - tactical phase annotations (BUILD UP, PROGRESSION, GOALS, etc.) with start/end times. Team-level, not player-level.
- **Tracking metadata** (`*_FifaData.xml`) - teams, player track IDs, pitch dimensions, frame rate
- **Tracking positions** (`*_FifaDataRawData.txt`) - per-frame x/y/speed, but mostly NaN for non-GK players

In [1]:
from data_loader import list_matches, load_events, load_tracking_dataset
import pandas as pd

matches = list_matches()
print(f"{len(matches)} matches available:\n")
for i, m in enumerate(matches):
    print(f"  [{i}] {m}")

11 matches available:

  [0] AC Milan - Barça (0-1) Partit Amistós Gira 2023-2024
  [1] Arsenal - Barça (5-3) Partit Amistós Gira 2023-2024
  [2] Barça - AC Milan (2-2_ 3-4) Partit Amistós Gira 2024-2025
  [3] Barça - AS Mònaco (0-3) Trofeu Joan Gamper 2024-2025
  [4] Barça - Como 1907 (5-0) Trofeu Joan Gamper 2025-2026
  [5] Barça - Manchester City (2-2_ 4-1) Partit Amistós Gira 2024-2025
  [6] Barça - Reial Madrid (3-0) Partit Amistós Gira 2023-2024
  [7] Daegu FC - Barça (0-5) Partit Amistós Gira Pretemporada 2025-2026
  [8] FC Seül - Barça (3-7) Partit Amistós Gira Pretemporada 2025-2026
  [9] Reial Madrid - Barça (1-2) Partit Amistós Gira 2024-2025
  [10] Vissel Kobe - Barça (1-3) Partit Amistós Gira Pretemporada 2025-2026


## 1. Pick a match

Change `MATCH` to explore a different game.

In [2]:
MATCH = 5  # Barca - Manchester City (2-2, 4-1) - lots of action

events = load_events(MATCH)
ds = load_tracking_dataset(MATCH, limit=10)  # just for metadata, not full tracking

print(f"Match: {events['match_name'].iloc[0]}")
print(f"Events: {len(events)} rows")
print(f"Pitch: {ds.metadata.pitch_dimensions.pitch_length}x{ds.metadata.pitch_dimensions.pitch_width}m")
print(f"Frame rate: {ds.metadata.frame_rate} fps")
print()
for team in ds.metadata.teams:
    print(f"  {team.name}: {len(team.players)} player tracks")

Match: Barça - Manchester City (2-2_ 4-1) Partit Amistós Gira 2024-2025
Events: 647 rows
Pitch: 105x68m
Frame rate: 25 fps

  FC Barcelona: 33 player tracks
  Manchester City: 32 player tracks


## 2. Events DataFrame - what does one row look like?

In [3]:
print(f"Columns: {events.columns.tolist()}\n")
events.head(15)

Columns: ['code_id', 'period_id', 'timestamp', 'end_timestamp', 'code', 'Team', 'Half', 'Max Players in the box', 'Side', 'Direction of ball entry', 'Type', 'match_name']



| | code_id | period_id | timestamp | end_timestamp | code | Team | Half | Max Players in the box | Side | Direction of ball entry | Type | match_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 0 days 00:00:14.800000 | 0 days 00:00:49.800000 | PROGRESSION | FC Barcelona | 1st Half | | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 1 | 5 | 1 | 0 days 00:00:14.800000 | 0 days 00:00:49.800000 | DEFENDING IN MIDDLE THIRD | Manchester City | 1st Half | | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 2 | 6 | 1 | 0 days 00:01:00.200000 | 0 days 00:01:35.200000 | BUILD UP | Manchester City | 1st Half | | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 3 | 7 | 1 | 0 days 00:01:00.200000 | 0 days 00:01:35.200000 | DEFENDING IN ATTACKING THIRD | FC Barcelona | 1st Half | | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 4 | 8 | 1 | 0 days 00:01:05 | 0 days 00:02:12.120000 | BALL IN FINAL THIRD | FC Barcelona | 1st Half | 4.0 | Central | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 5 | 9 | 1 | 0 days 00:01:05 | 0 days 00:01:28.960000 | BALL IN THE BOX | FC Barcelona | 1st Half | 1.0 | | Vertical | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 6 | 10 | 1 | 0 days 00:01:06.080000 | 0 days 00:01:28.080000 | PLAYERS IN THE BOX | FC Barcelona | 1st Half | 1.0 | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 7 | 11 | 1 | 0 days 00:01:28.180000 | 0 days 00:01:51.240000 | PLAYERS IN THE BOX | FC Barcelona | 1st Half | 1.0 | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 8 | 12 | 1 | 0 days 00:01:31.400000 | 0 days 00:01:50.840000 | BALL IN THE BOX | FC Barcelona | 1st Half | 1.0 | | Vertical | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |
| 9 | 13 | 1 | 0 days 00:01:35.300000 | 0 days 00:02:06.880000 | BUILD UP | Manchester City | 1st Half | | | | | Barça - Manchester City (2-2_ 4-1) Partit Amis... |


## 3. Event codes - what types of events are tagged?

These are the tactical phases/actions that Metrica coded for this match.

In [4]:
print("Event codes and counts:\n")
print(events['code'].value_counts().to_string())
print(f"\nTotal unique codes: {events['code'].nunique()}")

Event codes and counts:

code
BALL IN FINAL THIRD             121
PLAYERS IN THE BOX              112
BALL IN THE BOX                 107
PROGRESSION                      49
DEFENDING IN MIDDLE THIRD        49
SET PIECES                       49
ATTACKING TRANSITION             36
DEFENSIVE TRANSITION             36
CREATING CHANCES                 22
DEFENDING IN DEFENSIVE THIRD     22
BUILD UP                         15
DEFENDING IN ATTACKING THIRD     15
LONG BALL                        10
GOALS                             4

Total unique codes: 14


In [5]:
# Events split by team
print("Events per team:\n")
for team in events['Team'].dropna().unique():
    team_events = events[events['Team'] == team]
    print(f"--- {team} ({len(team_events)} events) ---")
    print(team_events['code'].value_counts().to_string())
    print()

Events per team:

--- FC Barcelona (303 events) ---
code
BALL IN FINAL THIRD             61
BALL IN THE BOX                 56
PLAYERS IN THE BOX              52
DEFENDING IN MIDDLE THIRD       30
PROGRESSION                     19
ATTACKING TRANSITION            19
DEFENSIVE TRANSITION            17
DEFENDING IN DEFENSIVE THIRD    13
DEFENDING IN ATTACKING THIRD     8
BUILD UP                         7
LONG BALL                        7
SET PIECES                       7
CREATING CHANCES                 5
GOALS                            2

--- Manchester City (303 events) ---
code
PLAYERS IN THE BOX              60
BALL IN FINAL THIRD             60
BALL IN THE BOX                 51
PROGRESSION                     30
DEFENDING IN MIDDLE THIRD       19
DEFENSIVE TRANSITION            19
ATTACKING TRANSITION            17
CREATING CHANCES                13
SET PIECES                       9
BUILD UP                         8
DEFENDING IN ATTACKING THIRD     7
DEFENDING IN DEFENSIVE TH

## 4. Time structure - when do events happen?

Each event has a `timestamp` (start) and an `end_timestamp` (end). Let's see how events are distributed across the match and how long they last.

In [6]:
# Convert timestamps to minutes for readability
events['start_min'] = events['timestamp'].dt.total_seconds() / 60
events['end_min'] = events['end_timestamp'].dt.total_seconds() / 60
events['duration_sec'] = (events['end_timestamp'] - events['timestamp']).dt.total_seconds()

print(f"Match spans: {events['start_min'].min():.1f} min to {events['end_min'].max():.1f} min")
print(f"Event durations: min={events['duration_sec'].min():.1f}s, median={events['duration_sec'].median():.1f}s, max={events['duration_sec'].max():.1f}s")
print(f"\nHalves: {events['Half'].value_counts().to_dict()}")
print(f"\nDuration stats per code:")
events.groupby('code')['duration_sec'].describe()[['count','mean','min','max']].round(1)

Match spans: 0.2 min to 98.8 min
Event durations: min=0.5s, median=31.6s, max=150.0s

Halves: {'1st Half': 345, '2nd Half': 302}

Duration stats per code:


| code | count | mean | min | max |
|---|---|---|---|---|
| ATTACKING TRANSITION | 36.0 | 34.9 | 31.5 | 35.0 |
| BALL IN FINAL THIRD | 121.0 | 27.0 | 6.7 | 67.1 |
| BALL IN THE BOX | 107.0 | 17.3 | 5.4 | 33.2 |
| BUILD UP | 15.0 | 34.8 | 31.6 | 35.0 |
| CREATING CHANCES | 22.0 | 29.6 | 24.0 | 30.0 |
| DEFENDING IN ATTACKING THIRD | 15.0 | 34.8 | 31.6 | 35.0 |
| DEFENDING IN DEFENSIVE THIRD | 22.0 | 29.6 | 24.0 | 30.0 |
| DEFENDING IN MIDDLE THIRD | 49.0 | 34.9 | 30.1 | 35.0 |
| DEFENSIVE TRANSITION | 36.0 | 34.9 | 31.5 | 35.0 |
| GOALS | 4.0 | 150.0 | 150.0 | 150.0 |


## 5. What overlaps? Events happen in parallel

Multiple events can be active at the same time (e.g., BUILD UP + BALL IN FINAL THIRD + PLAYERS IN THE BOX). Let's look at a time slice.

In [7]:
# Show all events active during a specific minute
TARGET_MIN = 10  # change this to explore different moments

active = events[(events['start_min'] <= TARGET_MIN) & (events['end_min'] >= TARGET_MIN)]
print(f"Events active at minute {TARGET_MIN} ({len(active)} events):\n")
active[['code', 'Team', 'start_min', 'end_min', 'duration_sec', 'Type', 'Side']].sort_values('start_min')

Events active at minute 10 (5 events):



| | code | Team | start_min | end_min | duration_sec | Type | Side |
|---|---|---|---|---|---|---|---|
| 75 | PROGRESSION | Manchester City | 9.591667 | 10.170667 | 34.74 | | |
| 76 | DEFENDING IN MIDDLE THIRD | FC Barcelona | 9.591667 | 10.170667 | 34.74 | | |
| 77 | PLAYERS IN THE BOX | Manchester City | 9.822 | 10.26 | 26.28 | | |
| 78 | ATTACKING TRANSITION | FC Barcelona | 9.914667 | 10.498 | 35.0 | | |
| 79 | DEFENSIVE TRANSITION | Manchester City | 9.914667 | 10.498 | 35.0 | | |
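
The filter above inspects one chosen minute at a time. To find where overlap is highest across a whole match, one option is a sweep over start and end points. The following is a minimal sketch in plain Python, not part of `data_loader`; the function name `max_concurrent` is hypothetical, and the input is the five `(start_min, end_min)` pairs from the output above.

```python
from typing import List, Tuple

def max_concurrent(intervals: List[Tuple[float, float]]) -> Tuple[int, float]:
    """Return (peak number of simultaneously active intervals, time the peak is first reached)."""
    # Each start contributes +1 and each end contributes -1. At equal times,
    # starts (tag 0) sort before ends (tag 1), so intervals that merely touch
    # count as overlapping, matching the inclusive <= / >= filter above.
    points = []
    for start, end in intervals:
        points.append((start, 0, +1))
        points.append((end, 1, -1))
    points.sort()

    active = 0
    peak = 0
    peak_time = float("nan")
    for time, _, delta in points:
        active += delta
        if active > peak:
            peak, peak_time = active, time
    return peak, peak_time

# (start_min, end_min) for the five events active at minute 10.
minute_10 = [
    (9.591667, 10.170667),
    (9.591667, 10.170667),
    (9.822, 10.26),
    (9.914667, 10.498),
    (9.914667, 10.498),
]
peak, peak_time = max_concurrent(minute_10)
print(peak, peak_time)  # 5 9.914667
```

On the full DataFrame the same function could be called with `list(zip(events['start_min'], events['end_min']))`, assuming the `start_min` and `end_min` columns created in section 4 exist.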


## 6. Five-minute windows - what does the match look like in chunks?

This is the granularity we will use for the dashboard: users pick a 5-minute window and an LLM explains it.

In [8]:
import numpy as np

# Bin events into 5-minute windows based on their start time
match_end = events['end_min'].max()
bins = np.arange(0, match_end + 5, 5)
events['window'] = pd.cut(events['start_min'], bins=bins, right=False)

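# Count events per window, split by team.
# observed=False keeps empty windows and avoids the pandas FutureWarning
# about the changing default for categorical groupers.
window_counts = events.groupby(['window', 'Team'], observed=False)['code'].count().unstack(fill_value=0)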
print("Events per 5-minute window by team:\n")
print(window_counts.to_string())

Events per 5-minute window by team:

Team           FC Barcelona  Manchester City  N/A
window                                           
[0.0, 5.0)               16               12    2
[5.0, 10.0)              16               33    1
[10.0, 15.0)             24               18    1
[15.0, 20.0)             18               19    2
[20.0, 25.0)             21                8    0
[25.0, 30.0)             10               17    1
[30.0, 35.0)             10               17    1
[35.0, 40.0)             13               33    0
[40.0, 45.0)             12               20    3
[45.0, 50.0)             17                9    3
[50.0, 55.0)             10               19    2
[55.0, 60.0)              2               16    2
[60.0, 65.0)             28               16    1
[65.0, 70.0)             19                5    2
[70.0, 75.0)             15               13    2
[75.0, 80.0)              6                5    1
[80.0, 85.0)             13               15    4
[85.0, 90.0) 

In [9]:
# Breakdown: what codes happen in each window for Barca?
barca_events = events[events['Team'] == 'FC Barcelona']

print("FC Barcelona event codes per 5-min window:\n")
barca_breakdown = barca_events.groupby(['window', 'code'], observed=False).size().unstack(fill_value=0)
print(barca_breakdown.to_string())

FC Barcelona event codes per 5-min window:

code           ATTACKING TRANSITION  BALL IN FINAL THIRD  BALL IN THE BOX  BUILD UP  CREATING CHANCES  DEFENDING IN ATTACKING THIRD  DEFENDING IN DEFENSIVE THIRD  DEFENDING IN MIDDLE THIRD  DEFENSIVE TRANSITION  GOALS  LONG BALL  PLAYERS IN THE BOX  PROGRESSION  SET PIECES
window                                                                                                                                                                                                                                                                           
[0.0, 5.0)                        1                    2                4         0                 0                             2                             0                          3                     0      0          0                   3            1           0
[5.0, 10.0)                       4                    0                0         2                 0                             0   
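
Binning by start time assigns each event to exactly one window, so an event that starts at minute 9.9 and ends at minute 10.5 counts entirely toward `[5.0, 10.0)`. An alternative is to apportion each event's duration to every window it touches. The sketch below illustrates that idea in plain Python; `seconds_per_window` is a hypothetical helper and the two intervals are made-up toy values, not match data.

```python
from collections import defaultdict

def seconds_per_window(intervals, window_min=5.0):
    """Map window start (in minutes) to the seconds of event time inside that window."""
    totals = defaultdict(float)
    for start, end in intervals:
        # First window that can contain part of this event.
        w = int(start // window_min) * window_min
        while w < end:
            overlap_min = min(end, w + window_min) - max(start, w)
            if overlap_min > 0:
                totals[w] += overlap_min * 60.0
            w += window_min
    return dict(totals)

# Hypothetical events as (start_min, end_min); the first one straddles minute 10.
toy_events = [(9.5, 10.5), (12.0, 12.5)]
per_window = seconds_per_window(toy_events)
print(per_window)  # {5.0: 30.0, 10.0: 60.0}
```

Whether counts by start time or apportioned seconds suit the dashboard better depends on how windows are presented to users; this sketch only shows that the two measures can differ at window boundaries.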

## 7. Label columns - what extra info do events carry?

Beyond `code` and `Team`, events have labels like `Type`, `Side`, `Direction of ball entry`, `Max Players in the box`.

In [10]:
label_cols = ['Type', 'Side', 'Direction of ball entry', 'Max Players in the box', 'Half']

for col in label_cols:
    if col in events.columns:
        print(f'--- {col} ---')
        print(events[col].value_counts(dropna=False).to_string())
        print()

--- Type ---
Type
NaN                      598
Throw-in                  31
Goal Kick                  7
Corner Kick                7
Free Kick                  2
Kick Off (Start)           1
Kick Off (After Goal)      1

--- Side ---
Side
NaN        526
Central     91
Right       16
Left        14

--- Direction of ball entry ---
Direction of ball entry
NaN           540
Vertical       42
Horizontal     38
Diagonal       27

--- Max Players in the box ---
Max Players in the box
None    307
1        69
2        68
3        59
4        57
5        53
6        20
7+       14

--- Half ---
Half
1st Half    345
2nd Half    302



In [11]:
# Which codes use which labels?
print('Labels available per event code:\n')
for code in sorted(events['code'].unique()):
    subset = events[events['code'] == code]
    non_null_labels = [col for col in label_cols if col in subset.columns and subset[col].notna().any()]
    print(f'  {code}: {non_null_labels if non_null_labels else "(no labels)"}')

Labels available per event code:

  ATTACKING TRANSITION: ['Half']
  BALL IN FINAL THIRD: ['Side', 'Max Players in the box', 'Half']
  BALL IN THE BOX: ['Direction of ball entry', 'Max Players in the box', 'Half']
  BUILD UP: ['Half']
  CREATING CHANCES: ['Half']
  DEFENDING IN ATTACKING THIRD: ['Half']
  DEFENDING IN DEFENSIVE THIRD: ['Half']
  DEFENDING IN MIDDLE THIRD: ['Half']
  DEFENSIVE TRANSITION: ['Half']
  GOALS: ['Half']
  LONG BALL: ['Half']
  PLAYERS IN THE BOX: ['Max Players in the box', 'Half']
  PROGRESSION: ['Half']
  SET PIECES: ['Type', 'Half']
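
The `Max Players in the box` label mixes plain counts, a capped `7+` bucket, and missing values, so it cannot be cast to a number directly. Below is a minimal sketch of one way to normalise it. It assumes the raw values arrive as strings such as `'3'` or `'7+'`, as floats such as `4.0`, or as `None`/NaN (inferred from the outputs above, not confirmed against the loader); `parse_max_players` is a hypothetical helper.

```python
from typing import Optional, Tuple

def parse_max_players(value) -> Tuple[Optional[int], bool]:
    """Return (count, is_lower_bound) for a 'Max Players in the box' label.

    is_lower_bound is True for capped buckets such as '7+'.
    """
    if value is None:
        return None, False
    # Missing values may also arrive as float NaN; NaN is the only value != itself.
    if isinstance(value, float) and value != value:
        return None, False
    text = str(value).strip()
    if text in ("", "None"):
        return None, False
    if text.endswith("+"):
        return int(text[:-1]), True
    # int(float(...)) accepts both '4' and '4.0'.
    return int(float(text)), False

print(parse_max_players("3"))    # (3, False)
print(parse_max_players("7+"))   # (7, True)
print(parse_max_players(4.0))    # (4, False)
print(parse_max_players(None))   # (None, False)
```

Keeping the lower-bound flag separate avoids silently treating `7+` as exactly 7 when averaging or comparing windows.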


## 8. Tracking data - what is actually usable?

Quick check on how much position data exists per player track in a 1,000-frame sample (40 seconds of play at 25 fps).

In [12]:
from data_loader import load_tracking

tracking = load_tracking(MATCH, limit=1000)
print(f'Tracking shape (1000 frames): {tracking.shape}')

# Check which players have actual position data
player_x_cols = [c for c in tracking.columns if c.endswith('_x') and c != 'ball_x']
non_null = tracking[player_x_cols].notna().sum().sort_values(ascending=False)

print(f'\nPlayers with data (out of {len(player_x_cols)} tracks):')
has_data = non_null[non_null > 0]
no_data = non_null[non_null == 0]
print(f'  With position data: {len(has_data)}')
print(f'  All NaN (no data): {len(no_data)}')

if len(has_data) > 0:
    print(f'\nPlayers with data:')
    for col, count in has_data.items():
        pid = col.removesuffix('_x')  # strip only the trailing '_x' (Python 3.9+)
        print(f'  Player {pid}: {count}/{len(tracking)} frames ({100*count/len(tracking):.0f}%)')

Tracking shape (1000 frames): (1000, 270)

Players with data (out of 65 tracks):
  With position data: 25
  All NaN (no data): 40

Players with data:
  Player 33: 1000/1000 frames (100%)
  Player 65: 1000/1000 frames (100%)
  Player 34: 210/1000 frames (21%)
  Player 2: 163/1000 frames (16%)
  Player 4: 141/1000 frames (14%)
  Player 1: 137/1000 frames (14%)
  Player 3: 137/1000 frames (14%)
  Player 5: 128/1000 frames (13%)
  Player 6: 127/1000 frames (13%)
  Player 7: 127/1000 frames (13%)
  Player 35: 124/1000 frames (12%)
  Player 44: 114/1000 frames (11%)
  Player 36: 96/1000 frames (10%)
  Player 9: 92/1000 frames (9%)
  Player 11: 73/1000 frames (7%)
  Player 8: 72/1000 frames (7%)
  Player 12: 69/1000 frames (7%)
  Player 10: 62/1000 frames (6%)
  Player 13: 57/1000 frames (6%)
  Player 14: 53/1000 frames (5%)
  Player 15: 48/1000 frames (5%)
  Player 42: 17/1000 frames (2%)
  Player 38: 15/1000 frames (2%)
  Player 45: 14/1000 frames (1%)
  Player 41: 7/1000 frames (1%)
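
Frame counts alone do not show whether a track's samples are contiguous or scattered, which matters if positions are later matched to event windows. As a rough sketch, the longest unbroken run of tracked frames can be computed as follows. `longest_tracked_run` is a hypothetical helper, and the two toy tracks are made-up values chosen to have the same frame count.

```python
import math
from itertools import groupby

def longest_tracked_run(xs):
    """Length of the longest run of consecutive non-missing samples."""
    def is_tracked(v):
        return not (v is None or math.isnan(v))

    best = 0
    # groupby splits the sequence into maximal runs of tracked / untracked samples.
    for tracked, run in groupby(xs, key=is_tracked):
        if tracked:
            best = max(best, sum(1 for _ in run))
    return best

nan = float("nan")
scattered = [1.0, nan, 2.0, nan, 3.0, nan, 4.0]   # 4 tracked frames, never adjacent
contiguous = [nan, nan, nan, 1.0, 2.0, 3.0, 4.0]  # 4 tracked frames in one block
print(longest_tracked_run(scattered), longest_tracked_run(contiguous))  # 1 4
```

Applied to the notebook's data, this could be called on a column such as `tracking['33_x']`, assuming the `<track id>_x` column naming implied by the cell above.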


## Summary

**What we learned:**
- Events are **team-level** tactical phase annotations (not player-level)
- Each event has a code (BUILD UP, PROGRESSION, etc.), team, start/end time, and optional labels (Side, Type, etc.)
- Events overlap - multiple phases can be active simultaneously
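- Tracking data is **mostly empty** for non-goalkeeper players: in the 1,000-frame sample, 40 of 65 tracks have no positions, 23 have positions in 1-21% of frames, and only 2 are tracked in every frame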
- The events dataset is our primary signal for building peak windows

**Key question for the team:** Since events are team-level, how do we define a *player's* peak window? Options:
1. Use tracking to link player positions to team events (limited by sparse tracking)
2. Focus on team-level windows and let users pick a player via the Nexus video timestamps
3. Combine both - team windows from events + whatever player tracking is available
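
To judge how far option 1 can go, it may help to measure, for each event, what fraction of its frames actually has a position for a given player. The sketch below does this in plain Python under loudly stated assumptions: sample `i` of a track corresponds to frame `i`, frame 0 sits at t = 0 s on the same clock as the event timestamps, and the frame rate is the 25 fps reported in the metadata. None of these alignments is confirmed by the data shown here, and `event_coverage` is a hypothetical helper.

```python
import math

FPS = 25  # frame rate reported in the tracking metadata

def event_coverage(track_x, start_s, end_s, fps=FPS):
    """Fraction of frames in [start_s, end_s) where the track has a position.

    ASSUMPTION: track_x[i] is the sample for frame i, and frame 0 is at
    t = 0 s on the same clock as the event timestamps.
    """
    first = max(0, int(round(start_s * fps)))
    last = min(len(track_x), int(round(end_s * fps)))
    if last <= first:
        return 0.0
    window = track_x[first:last]
    tracked = sum(1 for v in window if v is not None and not math.isnan(v))
    return tracked / len(window)

# Toy track: 4 seconds of frames, untracked for the first 2 s, tracked afterwards.
nan = float("nan")
toy_track = [nan] * 50 + [1.0] * 50
coverage = event_coverage(toy_track, start_s=1.0, end_s=3.0)
print(coverage)  # 0.5
```

A real track could be passed as `tracking['33_x'].to_list()`. A coverage threshold for deciding when an event can be attributed to a player would still need to be chosen by the team.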