### Data Processing Pipeline
  
*Gian Favero and Michael Montemurri, Mila, 2024*

This notebook performs the following steps for the 2025 NFL Big Data Bowl competition:
1. Load raw data from `players.csv`, `player_play.csv`, `plays.csv`, `games.csv`, and `tracking_week_X.csv`.
2. Clean and preprocess the data.
3. Save the processed data for use in downstream tasks.

In [1]:
import random
import os
import torch
import numpy as np

root_dir = os.getcwd()
print(root_dir)

# Go back a directory to access the data folder
os.chdir(os.path.join(root_dir, '..'))

from data.scripts.data_cleaning import clean_data, aggregate_data, strip_unused_data

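# Seed every random number generator in use so that results are reproducible
def set_random_seed(value):
    g = torch.manual_seed(value)  # seeds PyTorch and returns its default Generator
    np.random.seed(value)
    random.seed(value)
    torch.backends.cudnn.deterministic = True  # ask cuDNN for deterministic algorithms
    return g

# Set the seed
set_random_seed(42)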

/cim/faverog/BigData25/notebooks


<torch._C.Generator at 0x7fa79cbbf6b0>

### Pre-Processing Steps

1. The steps below are based on "Uncovering Tackle Opportunities and Missed Opportunities", a 2024 NFL Big Data Bowl finalist.
2. All plays are flipped so that the xy-coordinate data describes an offense driving from left to right.
3. All player orientations and directions (angles) are measured from a reference of 0 degrees (pointing right) and increase counterclockwise.
4. Plays nullified by penalties are removed (there are none in this dataset).
5. Plays that are a QB kneel, spike, or sneak are removed.
6. Plays where `preSnapHomeTeamWinProbability` or `preSnapVisitorTeamWinProbability` is greater than 95% are removed. This is commonly referred to as "garbage time", when the losing team often pads its stats.
7. `player_play.csv`, `players.csv`, and `tracking_week_X.csv` are merged on the `["gameId", "playId", "nflId"]` keys, and the result is then merged with `plays.csv` on the `["gameId", "playId"]` keys (`plays.csv` is play-level and has no `nflId` column).
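
Steps 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in `data_cleaning`: it assumes the raw tracking convention of angles measured clockwise from the +y axis, a field of 120 by 53.3 yards, and a `playDirection` value of `"left"` or `"right"`. The function names and constants are hypothetical.

```python
FIELD_LENGTH = 120.0  # yards, including both end zones (assumed)
FIELD_WIDTH = 53.3    # yards (assumed)


def to_standard_angle(raw_deg):
    """Convert an angle measured clockwise from +y (assumed raw convention)
    into one measured counterclockwise from +x, so that 0 points right."""
    return (90.0 - raw_deg) % 360.0


def flip_left_to_right(x, y, angle_deg, play_direction):
    """Rotate a left-moving play by 180 degrees about the centre of the field
    so that every offense drives from left to right."""
    if play_direction == "left":
        return FIELD_LENGTH - x, FIELD_WIDTH - y, (angle_deg + 180.0) % 360.0
    return x, y, angle_deg % 360.0


# A player at (100, 10) on a left-moving play, facing raw 270 degrees (toward -x)
theta = to_standard_angle(270.0)
x_new, y_new, theta_new = flip_left_to_right(100.0, 10.0, theta, "left")
print(x_new, y_new, theta_new)
```

Because the flip is a 180 degree rotation, it can be applied before or after the angle conversion with the same result.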

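The garbage-time rule in step 6 amounts to a row filter on the two pre-snap win probability columns. The sketch below assumes the probabilities are stored as fractions between 0 and 1 and that the comparison is strict; `is_garbage_time` and the sample plays are hypothetical and only illustrate the rule, not the implementation inside `clean_data`.

```python
def is_garbage_time(home_wp, visitor_wp, threshold=0.95):
    """Return True when either team's pre-snap win probability exceeds the threshold."""
    return max(home_wp, visitor_wp) > threshold


# Made-up plays for illustration only
sample_plays = [
    {"playId": 1, "preSnapHomeTeamWinProbability": 0.55, "preSnapVisitorTeamWinProbability": 0.45},
    {"playId": 2, "preSnapHomeTeamWinProbability": 0.97, "preSnapVisitorTeamWinProbability": 0.03},
    {"playId": 3, "preSnapHomeTeamWinProbability": 0.02, "preSnapVisitorTeamWinProbability": 0.98},
]

# Keep only plays that are not in garbage time
kept_plays = [
    play
    for play in sample_plays
    if not is_garbage_time(
        play["preSnapHomeTeamWinProbability"],
        play["preSnapVisitorTeamWinProbability"],
    )
]
kept_ids = [play["playId"] for play in kept_plays]
print(kept_ids)
```
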
In [2]:
# Set paths to the local raw data files
players_fname = os.path.join("data", "raw", "players.csv")
plays_fname = os.path.join("data", "raw", "plays.csv")
player_play_fname = os.path.join("data", "raw", "player_play.csv")
games_fname = os.path.join("data", "raw", "games.csv")
# Tracking data for weeks 1 through 9
tracking_fname_list = [os.path.join("data", "raw", f"tracking_week_{i}.csv") for i in range(1, 10)]

# Aggregate the players, plays, player-play, games, and tracking data into a single dataframe.
df = aggregate_data(
    players_fname=players_fname, 
    plays_fname=plays_fname,
    player_play_fname=player_play_fname, 
    games_fname=games_fname,
    tracking_fname_list=tracking_fname_list,
    )

# Preprocess and clean the data
df_clean = clean_data(df, 'at_snap')  # options: 'at_snap', 'presnap', 'postsnap', 'all'

# Reduce the size of the dataframe by removing unnecessary columns
game_context_columns = [
        "gameId",
        "playId",
        "homeTeamAbbr",
        "visitorTeamAbbr",
        "frameId",
        "nflId",
        "displayName",
        "position",
        "club",
        "down",
        "quarter",
        "yardsToGo",
        "possessionTeam",
        "defensiveTeam",
        "yardlineSide",
        "yardlineNumber",
        "gameClock",
        "preSnapHomeScore",
        "preSnapVisitorScore",
        "event",
    ]

# Offensive formation, receiver alignment, and pre-snap win probabilities related to OC
offense_columns = [
        "offenseFormation",
        "receiverAlignment",
        "preSnapHomeTeamWinProbability",
        "preSnapVisitorTeamWinProbability",
    ]

# Man/zone and pass coverage labels, initial pass rusher flag, and cleaned tracking variables related to DC
defensive_columns = [
        "pff_manZone",
        "pff_passCoverage",
        "wasInitialPassRusher",
        "o_clean",
        "a_clean",
        "s_clean",
        "x_clean",
        "y_clean",
        "dir_clean",
]

# Play description, pass location, rush location, and PFF run concept related to play call
play_columns = [
        "playDescription",
        "playAction",
        "passLocationType",
        "rushLocationType",
        "pff_runConceptPrimary",
    ]

# Yards gained and win probability added related to play outcome
outcome_columns = [
        "yardsGained",
        "homeTeamWinProbabilityAdded",
        "visitorTeamWinProbilityAdded",
    ]

# Combine all columns
useful_columns = game_context_columns + offense_columns + defensive_columns + play_columns + outcome_columns

df_reduced = strip_unused_data(df_clean, useful_columns)

print(df_reduced.head())

print(df_reduced.columns)

INFO: Aggregating data from players, play data, tracking data, and players data into a master dataframe...
INFO: Loaded 16124 rows of plays, 354727 rows of player plays, and 59327373 rows of player tracking data
INFO: Aggregated dataframe has 56747802 rows
INFO: Removing inactive frames...
INFO: 56042924 rows removed
INFO: Removing garbage time frames...
INFO: 107008 rows removed
INFO: Transforming orientation and direction angles so that 0° points from left to right, and increasing angle goes counterclockwise...
INFO: Flipping plays so that they all run from left to right...
INFO: Removing QB kneels, spikes, sneaks...
INFO: 6666 rows removed
INFO: Converting geometry variables from floats to int...
INFO: Removing unused columns from dataframe...
INFO: 95 columns removed
shape: (5, 41)
┌────────────┬────────┬────────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ gameId     ┆ playId ┆ homeTeamAb ┆ visitorTe ┆ … ┆ pff_runCo ┆ yardsGain ┆ homeTeamW ┆ visitorTe │
│

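As a sanity check on the log above, the row and column counts can be tallied by hand. The sketch below uses only the numbers printed in the log and assumes that each "rows removed" message applies to the dataframe left by the previous step; the variable names are hypothetical.

```python
# Counts copied from the log output above
aggregated_rows = 56_747_802
rows_removed = {
    "inactive frames": 56_042_924,
    "garbage time frames": 107_008,
    "QB kneels, spikes, sneaks": 6_666,
}

# Subtract each removal in the order the steps were logged (assumed sequential)
remaining_rows = aggregated_rows
for step, removed in rows_removed.items():
    remaining_rows -= removed
    print(f"after removing {step}: {remaining_rows} rows")

# The reduced dataframe has 41 columns after 95 were removed
columns_before = 41 + 95
print(f"columns before stripping: {columns_before}")
```

Under that assumption, the at-snap filter accounts for nearly all of the reduction, and the later steps remove comparatively few rows.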
In [3]:
save_csv = True  # set to False to skip writing the output file

if save_csv:
    # Save the cleaned data to a csv file
    df_reduced.write_csv(os.path.join("data", "processed", "df_clean.csv"))