# Data Cleaning for xG-NextGen Project

This notebook cleans raw StatsBomb soccer event data to prepare it for modeling.

## Environment Setup

Mount Google Drive and unzip the raw data archive if it has not been extracted yet.

In [None]:
# 1. Mount Google Drive
to_mount = '/content/drive'
from google.colab import drive
drive.mount(to_mount)

# 2. Unzip raw_data.zip if not already extracted
import os
output_raw = f"{to_mount}/MyDrive/xG-NextGen/data/raw"
zip_path   = f"{output_raw}/raw_data.zip"
if not os.path.isdir(os.path.join(output_raw, 'events')):
    print(f"Unzipping {zip_path} to {output_raw} ...")
    get_ipython().system(f'unzip -q "{zip_path}" -d "{output_raw}"')
    print("Unzip complete.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
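
Outside Colab, where `get_ipython().system` and the `unzip` binary may not be available, the same "extract only once" check can be sketched with the standard library. The helper name `extract_if_missing` and its `marker` parameter are hypothetical and not part of the project.

```python
import os
import zipfile


def extract_if_missing(zip_path, dest_dir, marker="events"):
    """Extract zip_path into dest_dir unless dest_dir/marker already exists.

    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.isdir(os.path.join(dest_dir, marker)):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return True
```

In this notebook it could be called as `extract_if_missing(zip_path, output_raw)`.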


## Import Libraries

Install and import the required Python packages, then load the helper functions from the project's `utils` module.

In [None]:
!pip install pandas numpy xgboost scikit-learn shap matplotlib

import os
import sys
import json
import numpy as np
import pandas as pd
import importlib

# Add scripts directory to path & reload utils to pick up changes
tools_path = f"{to_mount}/MyDrive/xG-NextGen/scripts"
sys.path.insert(0, tools_path)
import utils
importlib.reload(utils)
from utils import load_shot_data, get_freeze_frame



## Set File Paths

Define input and output directories for data processing.

In [None]:
# Define input and output directories
input_dir  = f"{to_mount}/MyDrive/xG-NextGen/data/raw/events"
output_dir = f"{to_mount}/MyDrive/xG-NextGen/data/processed"
os.makedirs(output_dir, exist_ok=True)
print(f"Input directory: {input_dir}")
print(f"Output directory: {output_dir}")

Input directory: /content/drive/MyDrive/xG-NextGen/data/raw/events
Output directory: /content/drive/MyDrive/xG-NextGen/data/processed


## Quick utils.py Sanity Check

In [None]:
# Force-reload the module so Colab picks up the latest .py edits
import importlib, utils
importlib.reload(utils)

# Load just 5 matches to verify 'match_id' is present
shots_test = utils.load_shot_data(input_dir, limit=5)
print("Sample shot fields:", shots_test[0].keys())

Sample shot fields: dict_keys(['id', 'index', 'period', 'timestamp', 'minute', 'second', 'type', 'possession', 'possession_team', 'play_pattern', 'team', 'player', 'position', 'location', 'duration', 'related_events', 'shot', 'match_id', 'goal_difference', 'is_home', 'assist_type', 'n_prev_passes'])
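
Printing the keys of one record only confirms that first record. A slightly stricter check, sketched below, counts how many records lack each required key. The function name `missing_keys` and the sample records are hypothetical.

```python
def missing_keys(records, required):
    """Map each missing key to the number of records that lack it."""
    counts = {}
    for record in records:
        for key in required:
            if key not in record:
                counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample records, not real StatsBomb data
sample = [
    {"id": "a", "match_id": 1, "shot": {}},
    {"id": "b", "shot": {}},
]
report = missing_keys(sample, ["id", "match_id", "shot"])
print(report)
```

An empty dict means every record carries every required key, so `missing_keys(shots_test, selected_cols)` could serve as a guard before building the DataFrame.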


## Load Raw Shot Data

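Load shot events from the StatsBomb JSON files. As the sanity check above shows, each shot dict also carries extra fields such as `match_id`, `goal_difference`, `is_home`, `assist_type`, and `n_prev_passes`.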

In [None]:
print("Loading shot data...")
shots = load_shot_data(input_dir)
print(f"Successfully loaded {len(shots)} shot events")

Loading shot data...
Successfully loaded 87111 shot events
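
The source of `load_shot_data` is not shown in this notebook. As a rough illustration of the loading step only, the sketch below assumes each events file is named `<match_id>.json` and holds a JSON list of event dicts, with shots marked by `type.name == "Shot"`. The function `iter_shots` is hypothetical, and it does not compute `goal_difference`, `is_home`, `assist_type`, or `n_prev_passes`.

```python
import json
import os


def iter_shots(events_dir, limit=None):
    """Yield shot events from StatsBomb-style event files in events_dir.

    Assumes one <match_id>.json file per match, each a JSON list of events.
    `limit` caps the number of match files that are read.
    """
    files = sorted(f for f in os.listdir(events_dir) if f.endswith(".json"))
    if limit is not None:
        files = files[:limit]
    for filename in files:
        match_id = os.path.splitext(filename)[0]
        with open(os.path.join(events_dir, filename), encoding="utf-8") as fh:
            events = json.load(fh)
        for event in events:
            if event.get("type", {}).get("name") == "Shot":
                # Attach the match identifier taken from the file name.
                yield {**event, "match_id": match_id}
```

Using a generator keeps only one match file in memory at a time; `list(iter_shots(input_dir, limit=5))` would materialize the shots from the first five files.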


## Create Shots DataFrame

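Convert the shot events into a DataFrame, then clean it: drop rows without a location or shot payload, flatten the nested `shot` fields, split `location` into `x` and `y`, and add a binary `goal` label.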

In [None]:
shots_df = pd.DataFrame(shots)

# 🆕 we now expect these two new keys in each shot dict:
selected_cols = [
    'id','match_id','timestamp','location','shot',
    'goal_difference','is_home','position',
    'assist_type','n_prev_passes'
]
shots_df = shots_df[selected_cols].rename(columns={'id':'shot_id'})

shots_df = shots_df.dropna(subset=['location','shot'])
shots_df['freeze_frame'] = shots_df['shot'].apply(lambda x: x.get('freeze_frame',[]))
shots_df['position'] = shots_df['position'].fillna('Unknown')

shots_df['shot_type']    = shots_df['shot'].apply(lambda x: x.get('type',{}).get('name'))
shots_df['shot_outcome'] = shots_df['shot'].apply(lambda x: x.get('outcome',{}).get('name'))
shots_df['body_part']    = shots_df['shot'].apply(lambda x: x.get('body_part',{}).get('name'))
shots_df = shots_df.dropna(subset=['shot_type','shot_outcome'])

shots_df['x'] = shots_df['location'].apply(lambda loc: loc[0] if isinstance(loc,list) and len(loc)>=2 else np.nan)
shots_df['y'] = shots_df['location'].apply(lambda loc: loc[1] if isinstance(loc,list) and len(loc)>=2 else np.nan)
shots_df = shots_df.dropna(subset=['x','y'])

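# Parse the clock string; with no date in the format, pandas fills in 1900-01-01 as the date part
shots_df['timestamp'] = pd.to_datetime(shots_df['timestamp'],format='%H:%M:%S.%f',errors='coerce')
# Binary target: 1 for a goal, 0 otherwise
shots_df['goal'] = (shots_df['shot_outcome'] == 'Goal').astype(int)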

final_cols = [
    'shot_id','match_id','timestamp','x','y',
    'shot_type','shot_outcome','body_part',
    'goal_difference','is_home','position',
    'assist_type','n_prev_passes',
    'goal','freeze_frame'
]
shots_df = shots_df[final_cols]
print("Cleaned shots DF shape:", shots_df.shape)

Cleaned shots DF shape: (87111, 15)
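
To make the cleaning rules easier to follow, here is a minimal pure-Python sketch of the same steps applied to a single record. The function `flatten_shot`, the `elapsed_s` field, and the example record are hypothetical. Unlike the cell above, the sketch stores the timestamp as elapsed seconds rather than a datetime, on the assumption that the `HH:MM:SS.fff` string is a clock offset rather than a calendar time.

```python
from datetime import timedelta


def flatten_shot(record):
    """Return a flat dict for one shot, or None if it fails the cleaning checks."""
    shot = record.get("shot")
    loc = record.get("location")
    if not isinstance(shot, dict) or not isinstance(loc, list) or len(loc) < 2:
        return None
    shot_type = shot.get("type", {}).get("name")
    outcome = shot.get("outcome", {}).get("name")
    if shot_type is None or outcome is None:
        return None
    # Convert "HH:MM:SS.fff" into a duration instead of a dated timestamp.
    hours, minutes, seconds = record["timestamp"].split(":")
    elapsed = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return {
        "shot_id": record["id"],
        "x": loc[0],
        "y": loc[1],
        "shot_type": shot_type,
        "shot_outcome": outcome,
        "body_part": shot.get("body_part", {}).get("name"),
        "elapsed_s": elapsed.total_seconds(),
        "goal": int(outcome == "Goal"),
    }


# Hypothetical record shaped like a StatsBomb shot event
example = {
    "id": "abc",
    "timestamp": "00:04:21.052",
    "location": [108.0, 36.5],
    "shot": {
        "type": {"name": "Open Play"},
        "outcome": {"name": "Goal"},
        "body_part": {"name": "Right Foot"},
    },
}
row = flatten_shot(example)
```

Rows built this way can be passed straight to `pd.DataFrame` after filtering out the `None` results.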


## Process Freeze Frame Data

Convert freeze frame information into a structured DataFrame.

In [None]:
print("Processing freeze frames...")
freeze_frames = []
for _, row in shots_df.iterrows():
    ff = row['freeze_frame']
    if isinstance(ff, list) and ff:
        try:
            df_ff = get_freeze_frame(row['shot_id'], ff)
            freeze_frames.append(df_ff)
        except Exception as e:
            print(f"Error on shot {row['shot_id']}: {e}")

if freeze_frames:
    freeze_df = pd.concat(freeze_frames, ignore_index=True)
    print(f"Freeze frames DataFrame shape: {freeze_df.shape}")
else:
    freeze_df = pd.DataFrame()
    print("No freeze frame data found.")

Processing freeze frames...
Freeze frames DataFrame shape: (1110839, 6)
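
The implementation of `get_freeze_frame` lives in `utils` and is not shown here. The sketch below is one plausible way to flatten a freeze frame, assuming each entry has the StatsBomb-style fields `location`, `teammate`, `player`, and `position`. The function `freeze_frame_rows` and its output column names are illustrative and may differ from the project's actual columns.

```python
def freeze_frame_rows(shot_id, freeze_frame):
    """Flatten one shot's freeze frame into a list of per-player dicts."""
    rows = []
    for entry in freeze_frame:
        loc = entry.get("location")
        if isinstance(loc, list) and len(loc) >= 2:
            x, y = loc[0], loc[1]
        else:
            x, y = None, None
        rows.append({
            "shot_id": shot_id,
            "x": x,
            "y": y,
            "teammate": entry.get("teammate"),
            "player": entry.get("player", {}).get("name"),
            "position": entry.get("position", {}).get("name"),
        })
    return rows


# Hypothetical freeze frame with two players
frame = [
    {"location": [110.0, 40.0], "teammate": False,
     "player": {"name": "Keeper"}, "position": {"name": "Goalkeeper"}},
    {"location": [100.0, 35.0], "teammate": True,
     "player": {"name": "Striker"}, "position": {"name": "Center Forward"}},
]
ff_rows = freeze_frame_rows("abc", frame)
```

Collecting plain dicts for all shots and building a single DataFrame at the end avoids concatenating one small DataFrame per shot.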


## Save Cleaned Data

Save the processed data to CSV files for future use.

In [None]:
shots_out = os.path.join(output_dir, 'shots.csv')
shots_df.drop('freeze_frame', axis=1).to_csv(shots_out, index=False)
print(f"Saved shots to {shots_out}")

if not freeze_df.empty:
    ff_out = os.path.join(output_dir, 'freeze_frames.csv')
    freeze_df.to_csv(ff_out, index=False)
    print(f"Saved freeze frames to {ff_out}")

Saved shots to /content/drive/MyDrive/xG-NextGen/data/processed/shots.csv
Saved freeze frames to /content/drive/MyDrive/xG-NextGen/data/processed/freeze_frames.csv


## Data Summary

Display summary statistics of the cleaned data.

In [None]:
print("\nShots Data Summary:")
# DataFrame.info() prints its report directly and returns None, so it is not wrapped in print()
shots_df.info()
print("\nGoal Distribution:")
print(shots_df['goal'].value_counts(normalize=True))
if not freeze_df.empty:
    print("\nFreeze Frames Summary:")
    freeze_df.info()


Shots Data Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87111 entries, 0 to 87110
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   shot_id          87111 non-null  object        
 1   match_id         87111 non-null  object        
 2   timestamp        87111 non-null  datetime64[ns]
 3   x                87111 non-null  float64       
 4   y                87111 non-null  float64       
 5   shot_type        87111 non-null  object        
 6   shot_outcome     87111 non-null  object        
 7   body_part        87111 non-null  object        
 8   goal_difference  87111 non-null  int64         
 9   is_home          87111 non-null  bool          
 10  position         87111 non-null  object        
 11  assist_type      87111 non-null  object        
 12  n_prev_passes    87111 non-null  int64         
 13  goal             87111 non-null  int64         
 14  freeze_frame     