### Data Processing Pipeline
  
*Gian Favero and Michael Montemurri, Mila, 2024*

This notebook performs the following steps for the 2025 NFL Big Data Bowl competition:
1. Load raw data from `players.csv`, `plays.csv`, `player_play.csv`, and `tracking_week_X.csv` (weeks 1 through 9).
2. Clean and preprocess the data.
3. Save the processed data for use in downstream tasks.

In [1]:
import random
import os
import torch
import numpy as np

root_dir = os.getcwd()

# Move up one directory so that the `scripts` package can be imported
os.chdir(os.path.join(root_dir, '..'))

from scripts.data_cleaning import clean_data, aggregate_data, strip_unused_data

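# Seed every random number generator in use, for reproducibility
def set_random_seed(value):
    """Seed torch, NumPy, and Python's `random`; return the torch generator."""
    g = torch.manual_seed(value)
    np.random.seed(value)
    random.seed(value)
    # Ask cuDNN to use deterministic algorithms
    torch.backends.cudnn.deterministic = True
    return g

# Set the seed
set_random_seed(42)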

<torch._C.Generator at 0x7ff0ebfcf610>

### Pre-Processing Steps

1. These steps follow "Uncovering Tackle Opportunities and Missed Opportunities", a 2024 NFL Big Data Bowl finalist.
2. All plays are flipped so that xy-coordinate data describes an offense driving left to right.
3. All player orientation and direction angles are measured from 0 degrees (pointing right) and increase counterclockwise.
4. Plays nullified by penalties are removed (the dataset contains none).
5. Plays that are a QB kneel, spike, or sneak are removed.
6. Plays that occur when `preSnapHomeTeamWinProbability` or `preSnapVisitorTeamWinProbability` exceeds 95% are removed. This situation is commonly called "garbage time", and the losing team often pads its stats during it.
7. `player_play.csv` and `tracking_week_X.csv` are merged on the `["gameId", "playId", "nflId"]` keys, and the result is then merged with `plays.csv` on the `["gameId", "playId"]` keys (`plays.csv` has no `nflId` column).
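
Steps 2, 3, and 6 can be sketched in plain Python. This is only an illustration, not the contents of `scripts.data_cleaning`: the helper names (`to_ccw_from_right`, `normalize_frame`, `is_garbage_time`) are hypothetical. It assumes that raw angles are measured clockwise from the positive y-axis, that the field is 120 by 53.3 yards, that a `playDirection` value of `"left"` marks plays to flip, and that win probabilities are stored as fractions in [0, 1].

```python
FIELD_LENGTH = 120.0  # yards, including both end zones (assumed)
FIELD_WIDTH = 53.3    # yards (assumed)


def to_ccw_from_right(angle_deg):
    """Convert a clockwise-from-up angle to counterclockwise-from-right."""
    return (90.0 - angle_deg) % 360.0


def normalize_frame(x, y, angle_deg, play_direction):
    """Return (x, y, angle) as if the offense were driving left to right."""
    theta = to_ccw_from_right(angle_deg)
    if play_direction == "left":
        # Rotate the whole field by 180 degrees about its center
        x = FIELD_LENGTH - x
        y = FIELD_WIDTH - y
        theta = (theta + 180.0) % 360.0
    return x, y, theta


def is_garbage_time(home_wp, visitor_wp, threshold=0.95):
    """True when either team's pre-snap win probability exceeds the threshold."""
    return max(home_wp, visitor_wp) > threshold


# A player facing the left end zone on a right-to-left play
print(normalize_frame(30.0, 10.0, 270.0, "left"))
print(is_garbage_time(0.97, 0.03))
```

Flipping both axes, rather than x alone, is a rotation rather than a mirror image, so left and right sides of the formation are preserved relative to the offense.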

In [2]:
# Set paths to local data files
print(root_dir)
players_fname = os.path.join(root_dir, "../raw/players.csv")
plays_fname = os.path.join(root_dir, "../raw/plays.csv")
player_play_fname = os.path.join(root_dir, "../raw/player_play.csv")
tracking_fname_list = [os.path.join(root_dir, f"../raw/tracking_week_{i}.csv") for i in range(1,10)]

# Aggregate plays.csv, players.csv, player_play.csv, and the tracking data into one dataframe.
df = aggregate_data(
    players_fname=players_fname, 
    plays_fname=plays_fname,
    player_play_fname=player_play_fname, 
    tracking_fname_list=tracking_fname_list,
    )

# Preprocess and clean the data
active_frames = 'at_snap'  # one of: 'at_snap', 'presnap', 'postsnap', 'all'
df_clean = clean_data(df, active_frames)

# Reduce the size of the dataframe by removing unnecessary columns
useful_columns = [
        "gameId",
        "playId",
        "frameId",
        "nflId",
        "displayName",
        "position",
        "club",
        "possessionTeam",
        "defensiveTeam",
        "preSnapHomeScore",
        "preSnapVisitorScore",
        "quarter",
        "gameClock",
        "down",
        "yardsToGo",
        "yardlineNumber",
        "yardlineSide",
        "offenseFormation",
        "receiverAlignment",
        "preSnapHomeTeamWinProbability",
        "preSnapVisitorTeamWinProbability",
        "o_clean",
        "a_clean",
        "s_clean",
        "x_clean",
        "y_clean",
        "dir_clean",
        "playDescription",
        "passLocationType",
        "rushLocationType",
        "pff_runConceptPrimary",
        "yardsGained",
        "wasInitialPassRusher",
    ]

df_reduced = strip_unused_data(df_clean, useful_columns)

print(df_reduced.head())

/cim/faverog/BigData25/data/scripts
INFO: Aggregating data from players, play data, tracking data, and players data into a master dataframe...
INFO: Loaded 16124 rows of plays, 354727 rows of player plays, and 59327373 rows of player tracking data
INFO: Aggregated dataframe has 56747802 rows
INFO: Transforming orientation and direction angles so that 0° points from left to right, and increasing angle goes counterclockwise...
INFO: Flipping plays so that they all run from left to right...
INFO: Removing QB kneels, spikes, sneaks...
INFO: 548174 rows removed
INFO: Removing inactive frames...
INFO: 55508016 rows removed
INFO: Removing garbage time frames...
INFO: 100408 rows removed
INFO: Converting geometry variables from floats to int...
INFO: Removing unused columns from dataframe...
INFO: 95 columns removed
shape: (5, 33)
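
The row counts in the log can be cross-checked with simple arithmetic. The sketch below uses only the numbers printed above and assumes that each "rows removed" message applies to the dataframe left by the previous step.

```python
aggregated_rows = 56_747_802  # "Aggregated dataframe has ... rows"

# Rows removed by each cleaning step, in the order they were logged
removed = {
    "QB kneels, spikes, sneaks": 548_174,
    "inactive frames": 55_508_016,
    "garbage time frames": 100_408,
}

remaining = aggregated_rows
for step, n_removed in removed.items():
    remaining -= n_removed
    print(f"after removing {step}: {remaining:,} rows")

print(f"fraction of aggregated rows kept: {remaining / aggregated_rows:.4f}")
```

Under that assumption, 591,204 rows remain, about 1% of the aggregated dataframe. Nearly all of the reduction comes from dropping inactive frames.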

In [3]:
save_csv = True

if save_csv:
    # Save the cleaned data to a CSV file
    df_reduced.write_csv(os.path.join(root_dir, "../processed/df_clean.csv"))