# NFL Big Data Bowl 2025 — Anticipating the Defense: A Machine Learning Approach to Run Play Gap Prediction

**Authors**: Andrew Le and Bryan Trinh

**Track**: Metric Track

**Word Count**: ~1,950 (Main text, excluding Appendix)


## Introduction

**Context**

In football, when a team decides to run the ball, specific blocking schemes are designed to create advantageous running lanes for the ball carrier. Offensive linemen execute these schemes with the goal of neutralizing defenders and opening gaps that the runner can exploit to gain positive yardage. However, the success of a run play depends not only on the effectiveness of the blocking but also on the ball carrier's ability to make split-second decisions, react to defensive movements, and choose the optimal path to advance the ball.

Our objective is to develop a model that predicts which gaps defensive players will fill during a run play, based on their pre-snap positioning, and to measure how often those predictions are correct. By analyzing player location data prior to the snap, the model aims to identify defensive tendencies, helping teams anticipate defensive reactions and make more informed decisions in play design and execution. This insight could provide valuable strategic advantages, such as adjusting blocking assignments or influencing play-calling strategies to maximize offensive efficiency.

## Data Overview

**Files:**
- **tracking_week_{1-9}.csv**: 'frameId', 'frameType', 'x', 'y'
- **player_play.csv**: 'gameId', 'playId', 'nflId'
- **plays.csv**: 'gameId', 'playId', 'playDescription', 'yardsToGo', 'playNullifiedByPenalty', 'qbSpike', 'qbKneel', 'absoluteYardlineNumber', 'yardsGained', 'qbSneak', 'pff_runPassOption', 'rushLocationType', 'passResult', 'pff_runConceptPrimary', 'pff_runConceptSecondary'
- **players.csv**: 'nflId', 'position'

**Preprocess Strategy**  
1. Filter out plays nullified by penalty, quarterback kneels, quarterback spikes, and quarterback sneaks.
2. Flip plays moving to the left so that every play runs left to right, and adjust the 'x' and 'y' values to match the flipped field.
3. Isolate the relevant players (i.e., running back, offensive linemen, and all defensive players).
4. Keep running plays only (run-pass option plays are removed as well).
5. Drop columns that are not relevant to the model.
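The filtering and flipping steps above can be sketched with pandas. This is a minimal illustration, not the code in `preprocess_data.py`: the function names are hypothetical, the `playDirection` column and the value encodings ("Y"/"N", 0/1, booleans) are assumptions about the raw files, and the field is assumed to span 120 by 53.3 yards. The criterion for identifying a running play is not stated above, so only the run-pass option filter is shown for step 4.

```python
import pandas as pd

# Assumed field dimensions in yards for the tracking coordinate system.
FIELD_LENGTH = 120.0  # x range, including both end zones
FIELD_WIDTH = 53.3    # y range


def filter_plays(plays: pd.DataFrame) -> pd.DataFrame:
    """Drop nullified plays, kneels, spikes, sneaks, and run-pass options."""
    keep = (
        plays["playNullifiedByPenalty"].eq("N")  # assumed "Y"/"N" encoding
        & plays["qbKneel"].eq(0)                 # assumed 0/1 encoding
        & ~plays["qbSpike"].eq(True)             # missing values count as "not a spike"
        & ~plays["qbSneak"].eq(True)
        & ~plays["pff_runPassOption"].eq(1)
    )
    return plays.loc[keep].reset_index(drop=True)


def flip_to_right(tracking: pd.DataFrame) -> pd.DataFrame:
    """Mirror left-moving plays so that every play runs left to right."""
    flipped = tracking.copy()
    left = flipped["playDirection"].eq("left")  # assumed column name and values
    flipped.loc[left, "x"] = FIELD_LENGTH - flipped.loc[left, "x"]
    flipped.loc[left, "y"] = FIELD_WIDTH - flipped.loc[left, "y"]
    flipped.loc[left, "playDirection"] = "right"
    return flipped


# Toy data standing in for plays.csv and one tracking file.
plays = pd.DataFrame({
    "playId": [1, 2, 3, 4],
    "playNullifiedByPenalty": ["N", "Y", "N", "N"],
    "qbKneel": [0, 0, 1, 0],
    "qbSpike": [False, False, False, False],
    "qbSneak": [False, False, False, False],
    "pff_runPassOption": [0, 0, 0, 1],
})
tracking = pd.DataFrame({
    "playId": [1, 1],
    "playDirection": ["left", "right"],
    "x": [30.0, 30.0],
    "y": [10.0, 10.0],
})

kept = filter_plays(plays)
flipped = flip_to_right(tracking)
```

Mirroring both axes keeps left and right consistent from the offense's point of view, which matters here because the targets distinguish left-side gaps from right-side gaps.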


## Feature Engineering


**Final Variables**

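   - **gameId**: Game identifier.
   - **playId**: Play identifier (unique within a game).
   - **right_c**: Binary target for the right C gap.
   - **right_b**: Binary target for the right B gap.
   - **right_a**: Binary target for the right A gap.
   - **left_a**: Binary target for the left A gap.
   - **left_b**: Binary target for the left B gap.
   - **left_c**: Binary target for the left C gap.
   - **sequence**: Pre-snap input sequence for the play, fed to the LSTM.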


## Model Approach

Train a **Multi-Output LSTM Network** on six binary classification tasks, one per gap, each evaluated with AUC:

1. **right_a_auc** (`binary:logistic`, AUC)  
2. **right_b_auc** (`binary:logistic`, AUC)  
3. **right_c_auc** (`binary:logistic`, AUC)    
4. **left_a_auc** (`binary:logistic`, AUC)
5. **left_b_auc** (`binary:logistic`, AUC)
6. **left_c_auc** (`binary:logistic`, AUC)
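As a rough illustration of how the six AUC metrics above can be computed, the sketch below scores each gap separately from binary labels and predicted probabilities. It uses the pairwise definition of AUC (the probability that a positive example is scored above a negative one, with ties counted as half). The helper names and toy values are hypothetical and are not taken from `model.py`.

```python
GAPS = ["right_a", "right_b", "right_c", "left_a", "left_b", "left_c"]


def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # A correctly ordered pair scores 1, a tie scores 0.5, a wrong order scores 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_gap_auc(labels, scores):
    """Return one AUC per gap, keyed like the task names (e.g. 'right_a_auc')."""
    return {f"{gap}_auc": roc_auc(labels[gap], scores[gap]) for gap in GAPS}


# Toy labels and predicted probabilities, identical for every gap.
toy_labels = {gap: [0, 0, 1, 1] for gap in GAPS}
toy_scores = {gap: [0.1, 0.4, 0.35, 0.8] for gap in GAPS}
results = per_gap_auc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, so a rank-based implementation from a metrics library is the practical choice on the full dataset.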

**Training Process**  
- 80/20 train-test split.  
- **5-Fold Cross-Validation** was used to tune hyperparameters, including the number of LSTM units (256, 64, 32), dropout rates (0.2, 0.5), L2 regularization (0.001), and the learning rate schedule (initial learning rate of 0.001 with exponential decay).
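The split and schedule described above might look like the following sketch, written in plain Python so it runs without a deep learning framework. Only the 80/20 ratio, the five folds, and the initial learning rate of 0.001 come from the text; the shuffling strategy, the seed, and the `decay_rate` and `decay_steps` values are placeholders chosen for illustration, and all function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def exponential_decay(step, initial_lr=0.001, decay_rate=0.96, decay_steps=1000):
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps).

    decay_rate and decay_steps are assumed values, not the tuned ones.
    """
    return initial_lr * decay_rate ** (step / decay_steps)


train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

Cross-validation runs only on the training indices, so the held-out 20% stays untouched until the final evaluation.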

## Conclusions


Our model achieves an accuracy between 66.01% and 75.55%.


## Appendix (Full Code)

https://github.com/letriandrew/Anticipating-the-Defense

https://www.linkedin.com/in/letriandrew/

https://www.linkedin.com/in/-bryantrinh/


***Test***

You must have the following file structure:

Project/
├─ code/
│  ├─ organize_data.py
│  ├─ preprocess_data.py
│  ├─ feature_engineering.py
│  ├─ model.py
│  ├─ report.ipynb
├─ data/
│  ├─ .gitignore
│  ├─ processed/
│  ├─ players.csv
│  ├─ plays.csv
│  ├─ games.csv
│  ├─ tracking_week_{1-9}.csv
│  ├─ player_play.csv

The order of execution is as follows:

1. code/preprocess_data.py
2. code/organize_data.py
3. code/feature_engineering.py
4. code/model.py

In [3]:
import importlib

import preprocess_data as pp_data
import organize_data as o_data
import feature_engineering as fe
import model as m

# Reload the project modules so edits made between notebook runs are picked up.
importlib.reload(pp_data)
importlib.reload(o_data)
importlib.reload(fe)
importlib.reload(m)

# Each pipeline stage runs only if the user answers "y".
pp_data_prompt = input("Do you want to preprocess the data? (y for YES) ").strip().lower()
if pp_data_prompt == "y":
    pp_data.create_final_tracking_week()

o_data_prompt = input("Do you want to organize the data? (y for YES) ").strip().lower()
if o_data_prompt == "y":
    o_data.organize_data()

fe_prompt = input("Do you want to run feature engineering? (y for YES) ").strip().lower()
if fe_prompt == "y":
    fe.feature_engineer()

model_prompt = input("Do you want to train the model? (y for YES) ").strip().lower()
if model_prompt == "y":
    model = m.model()


Augmenting Week 1
Week 1 processing complete.
Augmenting Week 2
Week 2 processing complete.
Augmenting Week 3
Week 3 processing complete.
Augmenting Week 4
Week 4 processing complete.
Augmenting Week 5
Week 5 processing complete.
Augmenting Week 6
Week 6 processing complete.
Augmenting Week 7
Week 7 processing complete.
Augmenting Week 8
Week 8 processing complete.
Augmenting Week 9
Week 9 processing complete.
Iterating over c:\Users\Bryan\Documents\VS Code\BDB2025/data/processed/ tracking files
Total of 36 files to iterate
36/36 100.000 percent complete         
Task took 48.724 seconds
Starting feature engineering
Aggregated data with gap columns has been saved.
Week 1 processed in 108.83 seconds.
Aggregated data with gap columns has been saved.
Week 2 processed in 105.07 seconds.
Aggregated data with gap columns has been saved.
Week 3 processed in 106.03 seconds.
Aggregated data with gap columns has been saved.
Week 4 processed in 113.16 seconds.
Aggregated data with gap columns has

None
Saved model as: c:\Users\Bryan\Documents\VS Code\BDB2025/data/model.keras
