# Data Preprocessing
### Dataset: [NFL Big Data Bowl 2025](https://www.kaggle.com/competitions/nfl-big-data-bowl-2025/overview)

In [15]:
import pandas as pd
import numpy as np
import sklearn as sk
import tqdm
from IPython.display import display, Markdown

In [18]:
games = pd.read_csv("data/games.csv")
players = pd.read_csv("data/players.csv")
plays = pd.read_csv("data/plays.csv")

In [17]:
with open("data/schema.md", "r") as file:
    markdown_content = file.read()

display(Markdown(markdown_content))


Summary of data
---------------

Here, you'll find a summary of each data set in the 2019 Data Bowl, a list of *key* variables to join on, and a description of each variable.

File descriptions
-----------------

Game data: The `games.csv` file contains game-level information for each game from the first 6 weeks of the 2017 season. The *key* variable is **`gameId`**.

Play data: The `plays.csv` file contains play-level information from each game from the first 6 weeks of the 2017 season. The *key* variables are **`gameId`** and **`playId`**.

Player data: The `players.csv` file contains player-level information from players that participated in at least one play during the first six weeks of the 2017 regular season. The *key* variable is **`nflId`**.

Tracking data: Files `tracking_gameId_[gameId].csv` contain player tracking data from game `[gameId]`. Nearly all plays from `[gameId]` are included; certain plays with incomplete or missing data are dropped. The *key* variables are **`gameId`**, **`playId`**, and **`nflId`**.
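
The key variables above determine how the files join. As a minimal sketch (the frames `games_demo` and `plays_demo` and their values are invented stand-ins, not rows from the real files), the following checks that `playId` is only unique within a game and then attaches game-level columns to each play:

```python
import pandas as pd

# Toy stand-ins for games.csv and plays.csv. Only the key column names come
# from the schema; the values are invented for illustration.
games_demo = pd.DataFrame({"gameId": [1, 2], "week": [1, 1]})
plays_demo = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [50, 75, 50],  # playId 50 appears in both games
    "down": [1, 2, 1],
})

# playId alone is ambiguous, so a play is identified by (gameId, playId).
assert plays_demo.duplicated(subset=["playId"]).any()
assert not plays_demo.duplicated(subset=["gameId", "playId"]).any()

# Attach game-level columns to every play. validate= makes pandas raise
# if gameId turns out not to be unique in the games table.
plays_with_games = plays_demo.merge(
    games_demo, on="gameId", how="left", validate="many_to_one"
)
print(plays_with_games)
```

The same pattern should extend to the tracking files by merging on both `gameId` and `playId`.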

Game data
---------

-   `season`: Season of game (numeric)
-   `week`: Week of game, 1 through 6 (numeric)
-   `gameDate`: Game Date (time, mm/dd/yyyy)
-   **`gameId`**: Game identifier, unique (numeric)
-   `gameTimeEastern`: Start time of game (time, HH:MM:SS, EST)
-   `HomeScore`: Final score for the home team (numeric)
-   `VisitorScore`: Final score for the away team (numeric)
-   `homeTeamAbbr`: Home team three-letter code (text)
-   `visitorTeamAbbr`: Visiting team three-letter code (text)
-   `homeDisplayName`: Home team name (text)
-   `visitorDisplayName`: Visiting team name (text)
-   `Stadium`: Stadium (text)
-   `Location`: City (text)
-   `StadiumType`: Type of stadium (text)
-   `Turf`: Surface of stadium (text)
-   `GameLength`: Time the game took to complete (time, HH:MM:SS)
-   `GameWeather`: Game weather (text)
-   `Temperature`: Temperature in Fahrenheit, drawn roughly at the start of the game (numeric)
-   `Humidity`: Humidity (numeric)
-   `WindSpeed`: Wind speed, in miles-per-hour (numeric)
-   `WindDirection`: Direction of wind (text)

Play data
---------

-   **`gameId`**: Game identifier, unique (numeric)
-   **`playId`**: Play identifier, not unique across games (numeric)
-   `quarter`: Game quarter (numeric)
-   `GameClock`: Time on game clock at start of play (time, counting down from 15:00, MM:SS)
-   `down`: Down (numeric)
-   `yardsToGo`: Distance needed for a first down (numeric)
-   `yardlineSide`: 3-letter team code corresponding to line-of-scrimmage (text)
-   `yardlineNumber`: Yard line at line-of-scrimmage (numeric)
-   `personnel.offense`: Personnel used by offensive team (text)
-   `defendersInTheBox`: Number of defenders in close proximity to line-of-scrimmage (numeric)
-   `numberOfPassRushers`: Number of pass rushers (numeric)
-   `personnel.defense`: Personnel used by defensive team (text)
-   `HomeScoreBeforePlay`: Home score prior to the play (numeric)
-   `VisitorScoreBeforePlay`: Visiting team score prior to the play (numeric)
-   `HomeScoreAfterPlay`: Home team points at the end of the play (numeric)
-   `VisitorScoreAfterPlay`: Visitor team points at the end of the play (numeric)
-   `isPenalty`: TRUE/FALSE for whether or not a penalty was called on the play (binary)
-   `isSTPlay`: TRUE/FALSE for whether or not the play is labelled a special teams play (binary)
-   `SpecialTeamsPlayType`: Type of play if `isSTPlay == TRUE` (text)
-   `KickReturnYardage`: Return yardage among special teams plays (numeric)
-   `PassLength`: Pass length, in yards (numeric)
-   `PassResult`: Result of pass play (text, `C`: caught, `I`: incomplete, `IN`: intercepted, `R`: run, `S`: sack)
-   `YardsAfterCatch`: Yardage receiver gained after a pass completion (numeric)
-   `PlayResult`: Result of play, in yards (numeric)
-   `playDescription`: Description of play (text)
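
`quarter` and `GameClock` together locate a play in time. The sketch below converts them to seconds left in regulation. It assumes `GameClock` holds `MM:SS` strings as described above and that a game has four 15-minute quarters; overtime is not handled. The helper names and the `demo` frame are hypothetical.

```python
import pandas as pd


def clock_to_seconds(clock: str) -> int:
    """Convert an 'MM:SS' game-clock string to seconds left in the quarter."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_left_in_regulation(quarter: int, clock: str) -> int:
    """Seconds left in regulation, assuming four 15-minute quarters (1-4)."""
    return (4 - quarter) * 15 * 60 + clock_to_seconds(clock)


# Invented rows for illustration only.
demo = pd.DataFrame({
    "quarter": [1, 2, 4],
    "GameClock": ["15:00", "07:30", "00:05"],
})
demo["secondsLeft"] = [
    seconds_left_in_regulation(q, c)
    for q, c in zip(demo["quarter"], demo["GameClock"])
]
print(demo)
```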

Player data
-----------

-   **`nflId`**: Player identification number, unique across players (numeric)
-   `FirstName`: First name of player (text)
-   `LastName`: Last name of player (text)
-   `PositionAbbr`: Position of player (text)
-   `EntryYear`: Year in which player entered NFL (numeric)
-   `DraftRound`: Round in which player was drafted; `NULL` for players not drafted (numeric)
-   `DraftNumber`: Overall pick number among drafted players (numeric)
-   `Height`: Player height in feet/inches (text)
-   `Weight`: Player weight in pounds (numeric)
-   `College`: Player college (text)
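
`Height` is stored as text, so it needs converting before it can be used as a numeric feature. The schema does not say how feet and inches are separated, so this sketch assumes a form such as `6-2` or `6'2"` (feet, a non-digit separator, inches). The helper `height_to_inches` is hypothetical.

```python
import re


def height_to_inches(height: str) -> int:
    """Parse a feet/inches string such as '6-2' or 6'2" into total inches.

    The separator format is an assumption; check it against the real column.
    """
    match = re.fullmatch(r"\s*(\d+)\D+(\d+)\D*", height)
    if match is None:
        raise ValueError(f"Unrecognised height format: {height!r}")
    feet, inches = (int(part) for part in match.groups())
    return feet * 12 + inches


print(height_to_inches("6-2"))
print(height_to_inches("5'11\""))
```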

Tracking data
-------------

Files `tracking_gameId_[gameId].csv` contain player tracking data from game `[gameId]`. Nearly all plays from `[gameId]` are included; certain plays with insufficient data are dropped.

-   `time`: Time stamp of play (time, yyyy-mm-dd, hh:mm:ss)
-   `x`: Player position along the long axis of the field, 0 - 120 yards. See Figure 1 below. (numeric)
-   `y`: Player position along the short axis of the field, 0 - 53.3 yards. See Figure 1 below. (numeric)
-   `s`: Speed in yards/second (numeric)
-   `dis`: Distance traveled from prior time point, in yards (numeric)
-   `dir`: Angle of player motion (deg), 0 - 360 degrees (numeric)
-   `event`: Tagged play details, including moment of ball snap, pass release, pass catch, tackle, etc (text)
-   **`nflId`**: Player identification number, unique across players (numeric)
-   `displayName`: Player name (text)
-   `jerseyNumber`: Jersey number of player (numeric)
-   `team`: Team (away or home) of corresponding player (text)
-   `frame.id`: Frame identifier for each play, starting at 1 (numeric)
-   **`gameId`**: Game identifier, unique (numeric)
-   **`playId`**: Play identifier, not unique across games (numeric)

<img src="Extras/Fig1.PNG" align="right" />
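
Because `dis` is the distance traveled since the prior time point, summing it over the frames of a play gives each player's total distance on that play. A minimal sketch, with an invented `tracking_demo` frame standing in for one tracking file:

```python
import pandas as pd

# Invented tracking rows: two players, three frames each, one play.
tracking_demo = pd.DataFrame({
    "gameId": [1] * 6,
    "playId": [50] * 6,
    "nflId": [10, 10, 10, 20, 20, 20],
    "frame.id": [1, 2, 3, 1, 2, 3],
    "dis": [0.0, 0.5, 0.75, 0.0, 0.25, 0.25],
})

# Group on the full composite key so plays from different games never mix.
total_distance = (
    tracking_demo
    .groupby(["gameId", "playId", "nflId"], as_index=False)["dis"]
    .sum()
    .rename(columns={"dis": "totalDistance"})
)
print(total_distance)
```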


In [21]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report

# Work on a copy so the derived column is not written back into `plays`
data = plays.copy()

# Define target (1 = completed pass, 0 = any other result) and pre-snap features
data["play_outcome"] = (data["PassResult"] == "C").astype(int)
pre_snap_features = ["quarter", "down", "yardsToGo", "yardlineNumber", "offenseFormation", "defendersInTheBox"]
df = data[pre_snap_features + ["play_outcome"]].dropna()

# Encode categorical variables
df = pd.get_dummies(df, columns=["offenseFormation"], drop_first=True)

# Features and target
X = df.drop("play_outcome", axis=1)
y = df["play_outcome"]

# Scale features
# Note: the scaler is fit on all rows before the split, so test-set statistics influence the scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Mini-batch training with SGDClassifier: a single pass over the training set in batches of 32
# max_iter only affects fit(), not partial_fit(), so it has no effect here
model = SGDClassifier(loss="log_loss", max_iter=1000, learning_rate="optimal", random_state=42)
batch_size = 32
n_samples = X_train.shape[0]
classes = np.array([0, 1])  # partial_fit needs the full label set on the first call

for i in range(0, n_samples, batch_size):
    X_batch = X_train[i:i + batch_size]
    y_batch = y_train.iloc[i:i + batch_size]  # positional slice; y_train keeps the original row labels
    model.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.5071047957371225
Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.51      0.57      1459
           1       0.36      0.51      0.42       793

    accuracy                           0.51      2252
   macro avg       0.51      0.51      0.50      2252
weighted avg       0.55      0.51      0.52      2252



In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load and preprocess data
# Reuse the `plays` frame loaded from data/plays.csv above; the original path
# "nfl_data_2025/plays.csv" does not exist in this project and raised FileNotFoundError.
data = plays.copy()
# Column names match the previous cell, which ran against this file
data["play_outcome"] = (data["PassResult"] == "C").astype(int)
pre_snap_features = ["quarter", "down", "yardsToGo", "yardlineNumber", "offenseFormation", "defendersInTheBox"]
df = data[pre_snap_features + ["play_outcome"]].dropna()
df = pd.get_dummies(df, columns=["offenseFormation"], drop_first=True)

X = df.drop("play_outcome", axis=1)
y = df["play_outcome"]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# --- Kernel Ridge Regression (KRR) with Kernel Tricks ---
# Note: KRR is for regression, so we'll treat play_outcome as a continuous score for demo purposes
# In practice, use logistic regression or SVC for classification

# Polynomial Kernel (degree=2)
krr_poly = KernelRidge(kernel="polynomial", degree=2, alpha=1.0)
krr_poly.fit(X_train, y_train)
y_pred_krr_poly = krr_poly.predict(X_test)
y_pred_krr_poly_class = (y_pred_krr_poly >= 0.5).astype(int)  # Threshold for classification
print("KRR Polynomial Kernel Accuracy:", accuracy_score(y_test, y_pred_krr_poly_class))

# RBF Kernel
# KernelRidge takes a numeric gamma (it has no "scale" option); the default None uses 1 / n_features
krr_rbf = KernelRidge(kernel="rbf", alpha=1.0)
krr_rbf.fit(X_train, y_train)
y_pred_krr_rbf = krr_rbf.predict(X_test)
y_pred_krr_rbf_class = (y_pred_krr_rbf >= 0.5).astype(int)
print("KRR RBF Kernel Accuracy:", accuracy_score(y_test, y_pred_krr_rbf_class))

# --- Support Vector Classifier (SVC) with Kernel Tricks ---
# Polynomial Kernel (degree=2)
svc_poly = SVC(kernel="poly", degree=2, C=1.0)
svc_poly.fit(X_train, y_train)
y_pred_svc_poly = svc_poly.predict(X_test)
print("SVC Polynomial Kernel Accuracy:", accuracy_score(y_test, y_pred_svc_poly))
print("SVC Polynomial Classification Report:\n", classification_report(y_test, y_pred_svc_poly))

# RBF Kernel
svc_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")
svc_rbf.fit(X_train, y_train)
y_pred_svc_rbf = svc_rbf.predict(X_test)
print("SVC RBF Kernel Accuracy:", accuracy_score(y_test, y_pred_svc_rbf))
print("SVC RBF Classification Report:\n", classification_report(y_test, y_pred_svc_rbf))