This notebook is the data setup for an NFL machine learning model. It handles the data ingestion and transformation; I will keep each type of modeling in its own separate file.

# Imports & Housekeeping

In [1]:
# Basic Packages
import pandas as pd
import numpy as np
from functools import reduce
from datetime import datetime as dt

# Visualizations
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Connecting to drive to bring in data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Notebook display options
pd.options.display.float_format = '{:,.2f}'.format
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,8)
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 50)

# Data Import

This is not the full raw data. The data was acquired in its raw form from the nfl_data_py API. The transformations and EPA calculations were done in PyCharm, where classes, functions, etc. are easier to work with. From this point on, Google Colab is a better place for the modeling. Credit: https://github.com/cooperdff/nfl_data_py.

In [5]:
nfl_api = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/API Data/test_2020_to_2023.csv", index_col=0)
nfl_stathead = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/master_data_2017-2022.csv", index_col=0)
schedule = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/API Data/schedule_2020_to_2023.csv", index_col=0)

In [6]:
# We want a df that we will model on without anything that isn't a feature or target
model_df = nfl_api

model_df

Unnamed: 0,game_id,season,week,team,opponent,score,home,...,ewma_dynamic_window_rushing_defense_team,ewma_dynamic_window_passing_defense_team,ewma_dynamic_window_rushing_offense_opp,ewma_dynamic_window_passing_offense_opp,ewma_dynamic_window_rushing_defense_opp,ewma_dynamic_window_passing_defense_opp,team_id
0,2020_01_ARI_SF,2020,1,SF,ARI,20,1,...,,,,,,,2020_01_ARI_SF_H
1,2020_01_ARI_SF,2020,1,ARI,SF,24,0,...,,,,,,,2020_01_ARI_SF_A
2,2020_01_CHI_DET,2020,1,DET,CHI,23,1,...,,,,,,,2020_01_CHI_DET_H
3,2020_01_CHI_DET,2020,1,CHI,DET,27,0,...,,,,,,,2020_01_CHI_DET_A
4,2020_01_CLE_BAL,2020,1,BAL,CLE,38,1,...,,,,,,,2020_01_CLE_BAL_H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1671,2022_21_CIN_KC,2022,21,CIN,KC,20,0,...,-0.08,-0.01,0.02,0.23,0.02,-0.05,2022_21_CIN_KC_A
1672,2022_21_SF_PHI,2022,21,PHI,SF,31,1,...,0.03,-0.11,-0.04,0.24,-0.11,-0.09,2022_21_SF_PHI_H
1673,2022_21_SF_PHI,2022,21,SF,PHI,7,0,...,-0.11,-0.09,0.09,0.05,0.03,-0.11,2022_21_SF_PHI_A
1674,2022_22_KC_PHI,2022,22,PHI,KC,35,1,...,-0.02,-0.13,0.00,0.23,0.03,-0.05,2022_22_KC_PHI_H


OK, this is our starting df. The target is score_diff, the difference in score between the two teams. Here the score diff is taken with respect to the home team, which means a positive diff is a home win. We will keep this in mind when bringing in other features.

The EPA columns are created by bringing in play-by-play data, rolling it up to the game level, and then taking a 10-game rolling average offset by one week, so that each EPA value we see is an average of the previous 10 games and excludes the current one. If the team has played fewer than 10 games in the season, the average uses however many games have been played.
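As a rough illustration of the offset rolling average described above, here is a minimal sketch in pandas. It is not the PyCharm code that produced the EPA columns: the column name `epa`, the helper `add_rolling_epa`, and the choice to restart the window each season are assumptions made for the example, and the toy values are made up.

```python
import pandas as pd

def add_rolling_epa(games, value_col="epa", window=10):
    """Append a per-team, per-season rolling mean that only uses prior games.

    `value_col` and the grouping keys are hypothetical names for this sketch.
    """
    games = games.sort_values(["team", "season", "week"]).copy()
    games[f"{value_col}_rolling"] = (
        games.groupby(["team", "season"])[value_col]
        # shift(1) drops the current game; min_periods=1 uses however many
        # earlier games exist when fewer than `window` have been played
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return games

# Toy data: one team, four games in one season (values are made up)
toy = pd.DataFrame({
    "team": ["SF"] * 4,
    "season": [2020] * 4,
    "week": [1, 2, 3, 4],
    "epa": [0.10, 0.30, -0.20, 0.00],
})
rolled = add_rolling_epa(toy)
print(rolled[["week", "epa", "epa_rolling"]])
```

With this setup the first game of a season has no prior games to average, so its rolling value is missing, which lines up with the empty EPA columns on the week 1 rows in the table above.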

# (Other) Feature Engineering

Adding in a couple of features I think will be valuable to model on.

In [13]:
# Work on a copy so the new feature columns don't get written back into schedule
feature_df = schedule.copy()
feature_df.columns

Index(['game_id', 'season', 'game_type', 'week', 'gameday', 'weekday',
       'gametime', 'away_team', 'away_score', 'home_team', 'home_score',
       'location', 'result', 'total', 'overtime', 'old_game_id', 'gsis',
       'nfl_detail_id', 'pfr', 'pff', 'espn', 'ftn', 'away_rest', 'home_rest',
       'away_moneyline', 'home_moneyline', 'spread_line', 'away_spread_odds',
       'home_spread_odds', 'total_line', 'under_odds', 'over_odds', 'div_game',
       'roof', 'surface', 'temp', 'wind', 'away_qb_id', 'home_qb_id',
       'away_qb_name', 'home_qb_name', 'away_coach', 'home_coach', 'referee',
       'stadium_id', 'stadium', 'windy', 'rest differential'],
      dtype='object')

In [28]:
# Flag games where wind was a factor. The threshold is 15 mph for now, but I may adjust it later
feature_df['windy'] = np.where(schedule.loc[:, 'wind'] > 15, 1, 0)

# Next, let's look at rest differential (home rest minus away rest). A further analysis of this can be seen on my GitHub
feature_df['rest_differential'] = schedule.loc[:, "home_rest"] - schedule.loc[:, "away_rest"]
feature_df

Unnamed: 0,game_id,season,game_type,week,gameday,weekday,gametime,...,referee,stadium_id,stadium,windy,rest differential,implied_points,rest_differential
5583,2020_01_HOU_KC,2020,REG,1,2020-09-10,Thursday,20:20,...,Clete Blakeman,KAN00,Arrowhead Stadium,0,0,22.00,0
5584,2020_01_SEA_ATL,2020,REG,1,2020-09-13,Sunday,13:00,...,Shawn Hochuli,ATL97,Mercedes-Benz Stadium,0,0,24.25,0
5585,2020_01_CLE_BAL,2020,REG,1,2020-09-13,Sunday,13:00,...,Ronald Torbert,BAL00,M&T Bank Stadium,0,0,20.00,0
5586,2020_01_NYJ_BUF,2020,REG,1,2020-09-13,Sunday,13:00,...,Shawn Smith,BUF00,New Era Field,0,0,16.50,0
5587,2020_01_LV_CAR,2020,REG,1,2020-09-13,Sunday,13:00,...,Brad Allen,CAR00,Bank of America Stadium,0,0,25.50,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6688,2023_18_ATL_NO,2023,REG,18,2024-01-07,Sunday,13:00,...,,NOR00,Mercedes-Benz Superdome,0,-1,,-1
6689,2023_18_PHI_NYG,2023,REG,18,2024-01-07,Sunday,13:00,...,,NYC01,MetLife Stadium,0,0,,0
6690,2023_18_LA_SF,2023,REG,18,2024-01-07,Sunday,13:00,...,,SFO01,Levi's Stadium,0,-7,,-7
6691,2023_18_JAX_TEN,2023,REG,18,2024-01-07,Sunday,13:00,...,,NAS00,Nissan Stadium,0,0,,0
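Since the wind threshold may change, one option is to pull it into a named constant and a small helper. This is only a sketch: `WIND_THRESHOLD_MPH` and `flag_windy` are hypothetical names, and the wind values are made up. It also makes one behavior explicit: a missing wind reading compares as False, so it ends up flagged 0, the same as with `np.where`.

```python
import numpy as np
import pandas as pd

WIND_THRESHOLD_MPH = 15  # same cutoff as the cell above; adjust as needed

def flag_windy(wind, threshold=WIND_THRESHOLD_MPH):
    """Return 1 where wind exceeds the threshold, else 0 (missing wind -> 0)."""
    return (wind > threshold).astype(int)

# Made-up wind readings, including one missing value
wind = pd.Series([3.0, 15.0, 22.0, np.nan])
windy = flag_windy(wind)
print(windy.tolist())
```

Whether a missing reading should really count as "not windy" is a modeling decision; this sketch just preserves what the comparison already does.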


OK, let's reduce the columns to the ones we need and merge with our original df.

In [29]:
merge_df = feature_df[['game_id', 'rest_differential', 'windy', 'div_game', 'spread_line', 'implied_points', 'total_line', 'game_type', 'location', 'total']]

In [30]:
final_df = pd.merge(model_df, merge_df, on="game_id", how='left')

# Keep only regular-season games (.copy() so later column assignments don't hit a view)
final_df = final_df.loc[final_df.game_type == "REG"].copy()
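Because the left table has two rows per game (one per team) and the schedule has one row per game, the merge should be many-to-one on game_id. A hedged sketch of how that could be checked, using toy frames rather than the real data: `validate` and `indicator` are standard `pd.merge` parameters, while the frame names here are made up for the example.

```python
import pandas as pd

# Toy stand-ins: two team rows for one game, one for another
team_rows = pd.DataFrame({
    "game_id": ["2020_01_ARI_SF", "2020_01_ARI_SF", "2020_01_CHI_DET"],
    "team": ["SF", "ARI", "DET"],
})
# One schedule row per game
game_rows = pd.DataFrame({
    "game_id": ["2020_01_ARI_SF", "2020_01_CHI_DET"],
    "div_game": [1, 1],
})

merged = pd.merge(
    team_rows, game_rows,
    on="game_id", how="left",
    validate="many_to_one",  # raises MergeError if game_id repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Team rows that found no matching schedule row
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If the schedule ever contained duplicate game_ids, the merge would fail loudly instead of silently duplicating team rows.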

In [31]:
# Since we're now doing 1 team per row, need to make some adjustments to the values.
# spread_line is positive when the home team is favored, so the favored side gets the larger share of the total.
is_home = final_df["home"] == 1
final_df["implied_points"] = np.where(is_home, final_df["total_line"]/2 + final_df["spread_line"]/2, final_df["total_line"]/2 - final_df["spread_line"]/2)
final_df["spread_line"] = np.where(is_home, final_df["spread_line"], final_df["spread_line"]*-1)
final_df["rest_differential"] = np.where(is_home, final_df["rest_differential"], final_df["rest_differential"]*-1)

In [32]:
final_df.head(50)

Unnamed: 0,game_id,season,week,team,opponent,score,home,...,div_game,spread_line,implied_points,total_line,game_type,location,total
0,2020_01_ARI_SF,2020,1,SF,ARI,20,1,...,1,7.0,27.75,48.5,REG,Home,44.0
1,2020_01_ARI_SF,2020,1,ARI,SF,24,0,...,1,-7.0,20.75,48.5,REG,Home,44.0
2,2020_01_CHI_DET,2020,1,DET,CHI,23,1,...,1,2.5,22.5,42.5,REG,Home,50.0
3,2020_01_CHI_DET,2020,1,CHI,DET,27,0,...,1,-2.5,20.0,42.5,REG,Home,50.0
4,2020_01_CLE_BAL,2020,1,BAL,CLE,38,1,...,1,7.0,27.0,47.0,REG,Home,44.0
5,2020_01_CLE_BAL,2020,1,CLE,BAL,6,0,...,1,-7.0,20.0,47.0,REG,Home,44.0
6,2020_01_DAL_LA,2020,1,LA,DAL,20,1,...,0,1.0,26.5,52.0,REG,Home,37.0
7,2020_01_DAL_LA,2020,1,DAL,LA,17,0,...,0,-1.0,25.5,52.0,REG,Home,37.0
8,2020_01_GB_MIN,2020,1,MIN,GB,34,1,...,1,1.0,23.0,45.0,REG,Home,77.0
9,2020_01_GB_MIN,2020,1,GB,MIN,43,0,...,1,-1.0,22.0,45.0,REG,Home,77.0
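The implied-points arithmetic can be sanity-checked on a toy game. This sketch assumes spread_line is positive when the home team is favored (the convention nflverse documents for its schedule data); the team labels and betting lines below are made up. Under that assumption, each side gets half the total, plus half the spread for the favorite and minus half the spread for the underdog.

```python
import numpy as np
import pandas as pd

# One made-up game, one row per team; both rows carry the game-level lines
toy = pd.DataFrame({
    "team": ["HOME", "AWAY"],
    "home": [1, 0],
    "total_line": [50.0, 50.0],
    "spread_line": [6.0, 6.0],  # assumed convention: positive = home team favored
})

is_home = toy["home"] == 1
half_spread = toy["spread_line"] / 2

# Home side adds half the spread, away side subtracts it
toy["implied_points"] = toy["total_line"] / 2 + np.where(is_home, half_spread, -half_spread)
print(toy[["team", "implied_points"]])
```

Two quick checks fall out of this: the two implied totals should add up to total_line, and the home value minus the away value should equal the original spread_line.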


OK, let's write this back out to CSV so that the other files can bring it in to model on without messing with the final result.

In [33]:
final_df.to_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/Model Data/final_data.csv")
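For the modeling files, reading the CSV back in would look roughly like the sketch below. It uses a temporary directory and a made-up one-row frame instead of the Drive path. Because `to_csv` writes the index as the first, unnamed column by default, passing `index_col=0` on the way back in avoids picking up an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for final_df
df = pd.DataFrame({"game_id": ["2020_01_ARI_SF"], "score": [20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_data.csv")  # temporary path for the sketch
    df.to_csv(path)                             # index goes out as the first column
    reloaded = pd.read_csv(path, index_col=0)   # read that column back as the index

print(list(reloaded.columns))
```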

Okay, we should probably do some visualizations to see what this data is actually telling us.