# Data Exploration - FCB Player Peak

This notebook explores the data for **one match** to understand what we are working with.

**Workflow:** We develop everything on a single match, then automate across all 11.

### What we have per match:
- **Events** (`*_pattern.xml`) - tactical phase annotations (BUILD UP, PROGRESSION, GOALS, etc.) with start/end times. Team-level, not player-level.
- **Tracking metadata** (`*_FifaData.xml`) - teams, player track IDs, pitch dimensions, frame rate
- **Tracking positions** (`*_FifaDataRawData.txt`) - per-frame x/y/speed, but mostly NaN for non-GK players

In [None]:
from data_loader import list_matches, load_events, load_tracking_dataset
import pandas as pd

matches = list_matches()
print(f"{len(matches)} matches available:\n")
for i, m in enumerate(matches):
    print(f"  [{i}] {m}")

## 1. Pick a match

Change `MATCH` to explore a different game.

In [None]:
MATCH = 5  # Barca - Manchester City (2-2, 4-1) - lots of action

events = load_events(MATCH)
ds = load_tracking_dataset(MATCH, limit=10)  # just for metadata, not full tracking

print(f"Match: {events['match_name'].iloc[0]}")
print(f"Events: {len(events)} rows")
print(f"Pitch: {ds.metadata.pitch_dimensions.pitch_length}x{ds.metadata.pitch_dimensions.pitch_width}m")
print(f"Frame rate: {ds.metadata.frame_rate} fps")
print()
for team in ds.metadata.teams:
    print(f"  {team.name}: {len(team.players)} player tracks")

## 2. Events DataFrame - what does one row look like?

In [None]:
print(f"Columns: {events.columns.tolist()}\n")
events.head(15)

## 3. Event codes - what types of events are tagged?

These are the tactical phases/actions that Metrica coded for this match.

In [None]:
print("Event codes and counts:\n")
print(events['code'].value_counts().to_string())
print(f"\nTotal unique codes: {events['code'].nunique()}")

In [None]:
# Events split by team
print("Events per team:\n")
for team in events['Team'].dropna().unique():
    team_events = events[events['Team'] == team]
    print(f"--- {team} ({len(team_events)} events) ---")
    print(team_events['code'].value_counts().to_string())
    print()

## 4. Time structure - when do events happen?

Each event has a `timestamp` (start) and `end_timestamp` (end). Let's see how they distribute across the match.

In [None]:
# Convert timestamps to minutes for readability
events['start_min'] = events['timestamp'].dt.total_seconds() / 60
events['end_min'] = events['end_timestamp'].dt.total_seconds() / 60
events['duration_sec'] = (events['end_timestamp'] - events['timestamp']).dt.total_seconds()

print(f"Match spans: {events['start_min'].min():.1f} min to {events['end_min'].max():.1f} min")
print(f"Event durations: min={events['duration_sec'].min():.1f}s, median={events['duration_sec'].median():.1f}s, max={events['duration_sec'].max():.1f}s")
print(f"\nHalves: {events['Half'].value_counts().to_dict()}")
print(f"\nDuration stats per code:")
events.groupby('code')['duration_sec'].describe()[['count','mean','min','max']].round(1)

## 5. What overlaps? Events happen in parallel

Multiple events can be active at the same time (e.g., BUILD UP + BALL IN FINAL THIRD + PLAYERS IN THE BOX). Let's look at a time slice.

In [None]:
# Show all events active during a specific minute
TARGET_MIN = 10  # change this to explore different moments

active = events[(events['start_min'] <= TARGET_MIN) & (events['end_min'] >= TARGET_MIN)]
print(f"Events active at minute {TARGET_MIN} ({len(active)} events):\n")
active[['code', 'Team', 'start_min', 'end_min', 'duration_sec', 'Type', 'Side']].sort_values('start_min')
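
The filter above inspects one moment at a time. To find where overlap is heaviest across the whole match, one option is a sweep over start and end times. The sketch below is a minimal illustration on invented toy data; the helper name `max_concurrency` and the toy values are assumptions, not part of `data_loader`.

```python
import pandas as pd

# Toy stand-in for the events table; all values are invented for illustration.
toy = pd.DataFrame({
    "code": ["BUILD UP", "BALL IN FINAL THIRD", "PLAYERS IN THE BOX", "PROGRESSION"],
    "start_min": [9.5, 9.8, 10.1, 12.0],
    "end_min": [10.4, 10.6, 10.3, 12.5],
})


def max_concurrency(df, start_col="start_min", end_col="end_min"):
    """Return (peak number of simultaneously active events, time the peak is first reached)."""
    starts = pd.DataFrame({"t": df[start_col], "delta": 1})
    ends = pd.DataFrame({"t": df[end_col], "delta": -1})
    # Sort by time; at equal times, starts (+1) come before ends (-1), so events
    # that merely touch count as overlapping, like the inclusive filter above.
    sweep = pd.concat([starts, ends]).sort_values(
        ["t", "delta"], ascending=[True, False], ignore_index=True
    )
    # Running total of +1/-1 is the number of events active after each change point.
    sweep["active"] = sweep["delta"].cumsum()
    peak_row = sweep.loc[sweep["active"].idxmax()]
    return int(peak_row["active"]), float(peak_row["t"])


peak, when = max_concurrency(toy)
print(f"Peak overlap: {peak} events, first reached at minute {when:.1f}")
```

On the real data, the same function could be called as `max_concurrency(events)` once `start_min` and `end_min` exist.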

## 6. Five-minute windows - what does the match look like in chunks?

This is the granularity we will use for the dashboard: users pick a 5-minute window and an LLM explains it.

In [None]:
import numpy as np

# Bin events into 5-minute windows based on their start time
match_end = events['end_min'].max()
bins = np.arange(0, match_end + 5, 5)
events['window'] = pd.cut(events['start_min'], bins=bins, right=False)

# Count events per window, split by team (observed=False keeps empty windows in the table)
window_counts = events.groupby(['window', 'Team'], observed=False)['code'].count().unstack(fill_value=0)
print("Events per 5-minute window by team:\n")
print(window_counts.to_string())

In [None]:
# Breakdown: what codes happen in each window for Barca?
barca_events = events[events['Team'] == 'FC Barcelona']

print("FC Barcelona event codes per 5-min window:\n")
barca_breakdown = barca_events.groupby(['window', 'code'], observed=False).size().unstack(fill_value=0)
print(barca_breakdown.to_string())
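
The binning above assigns each event to the window that contains its start time, so an event that runs across a window boundary is counted only once. If the dashboard should instead show every event that is active in a window, one possible approach is to repeat each event once per window it overlaps. This is a sketch on invented toy data; `explode_over_windows` and the `window_start` column are hypothetical names.

```python
import numpy as np
import pandas as pd

WINDOW_MIN = 5  # window length in minutes, matching the cells above

# Toy stand-in for the events table; all values are invented for illustration.
toy = pd.DataFrame({
    "code": ["BUILD UP", "PROGRESSION", "BALL IN FINAL THIRD"],
    "start_min": [3.0, 4.5, 9.0],
    "end_min": [4.0, 6.0, 16.0],
})


def explode_over_windows(df, width=WINDOW_MIN):
    """Repeat each event once per window it overlaps; adds a 'window_start' column (minutes)."""
    first = np.floor(df["start_min"] / width).astype(int)
    last = np.floor(df["end_min"] / width).astype(int)
    out = df.copy()
    # One list of window start minutes per event, e.g. [5, 10, 15].
    # Note: an event ending exactly on a boundary is also counted in the next window.
    out["window_start"] = [
        list(range(f * width, (l + 1) * width, width)) for f, l in zip(first, last)
    ]
    out = out.explode("window_start").reset_index(drop=True)
    out["window_start"] = out["window_start"].astype(int)
    return out


spread = explode_over_windows(toy)
print(spread.groupby("window_start").size().to_string())
```

Which convention is right depends on the question: start-time binning avoids double counting, while the overlap version reflects everything happening in a window.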

## 7. Label columns - what extra info do events carry?

Beyond `code` and `Team`, events have labels like `Type`, `Side`, `Direction of ball entry`, `Max Players in the box`.

In [None]:
label_cols = ['Type', 'Side', 'Direction of ball entry', 'Max Players in the box', 'Half']

for col in label_cols:
    if col in events.columns:
        print(f'--- {col} ---')
        print(events[col].value_counts(dropna=False).to_string())
        print()

In [None]:
# Which codes use which labels?
print('Labels available per event code:\n')
for code in sorted(events['code'].dropna().unique()):
    subset = events[events['code'] == code]
    non_null_labels = [col for col in label_cols if col in subset.columns and subset[col].notna().any()]
    print(f'  {code}: {non_null_labels if non_null_labels else "(no labels)"}')

## 8. Tracking data - what is actually usable?

Quick check, on a 1,000-frame sample, of how much position data exists per player track.

In [None]:
from data_loader import load_tracking

tracking = load_tracking(MATCH, limit=1000)
print(f'Tracking shape (1000 frames): {tracking.shape}')

# Check which players have actual position data
player_x_cols = [c for c in tracking.columns if c.endswith('_x') and c != 'ball_x']
non_null = tracking[player_x_cols].notna().sum().sort_values(ascending=False)

print(f'\nPlayers with data (out of {len(player_x_cols)} tracks):')
has_data = non_null[non_null > 0]
no_data = non_null[non_null == 0]
print(f'  With position data: {len(has_data)}')
print(f'  All NaN (no data): {len(no_data)}')

if len(has_data) > 0:
    print(f'\nPlayers with data:')
    for col, count in has_data.items():
        pid = col[:-len('_x')]  # strip only the trailing '_x' suffix
        print(f'  Player {pid}: {count}/{len(tracking)} frames ({100*count/len(tracking):.0f}%)')

## Summary

**What we learned:**
- Events are **team-level** tactical phase annotations (not player-level)
- Each event has a code (BUILD UP, PROGRESSION, etc.), team, start/end time, and optional labels (Side, Type, etc.)
- Events overlap - multiple phases can be active simultaneously
- Tracking data is **mostly empty** for non-goalkeeper players
- The events dataset is our primary signal for building peak windows

**Key question for the team:** Since events are team-level, how do we define a *player's* peak window? Options:
1. Use tracking to link player positions to team events (limited by sparse tracking)
2. Focus on team-level windows and let users pick a player via the Nexus video timestamps
3. Combine both - team windows from events + whatever player tracking is available
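
As a starting point for option 2, team-level windows could be ranked by how many events start in them. The sketch below uses invented toy data, and scoring a window by raw event count is only an assumption for illustration, not an agreed definition of a peak; `rank_windows` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the events table; all values are invented for illustration.
toy = pd.DataFrame({
    "Team": ["FC Barcelona"] * 5 + ["Manchester City"] * 2,
    "code": ["BUILD UP", "PROGRESSION", "BALL IN FINAL THIRD",
             "BUILD UP", "PROGRESSION", "BUILD UP", "PROGRESSION"],
    "start_min": [1.0, 2.5, 3.0, 7.0, 12.0, 6.0, 8.0],
})


def rank_windows(df, team, width=5):
    """Count events per window (by start time) for one team, busiest window first."""
    team_df = df[df["Team"] == team]
    # Floor each start time to the beginning of its window, e.g. 7.0 -> 5.
    window_start = (team_df["start_min"] // width * width).astype(int)
    # value_counts returns counts sorted in descending order.
    return window_start.value_counts().rename_axis("window_start")


ranking = rank_windows(toy, "FC Barcelona")
print(ranking.to_string())
```

A weighted score (for example, giving some codes more weight than others) would slot into the same structure, but the weights would need to come from the team.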