<a href="https://colab.research.google.com/github/nickbohall/NFL_Betting_Model/blob/main/NFL_Model_Setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is the data setup for an NFL machine learning model. It covers data ingestion and transformation; separate notebooks will handle the different types of modeling.

# Imports & Housekeeping

In [None]:
# Basic Packages
import pandas as pd
import numpy as np
from functools import reduce
from datetime import datetime as dt

# Visualizations
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Connecting to drive to bring in data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Notebook display options
pd.options.display.float_format = '{:,.2f}'.format
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,8)
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 50)

# Data Import

This is not the full raw data. The data was acquired in its raw form from the nfl_data_py API. Transformations and EPA calculations were done in PyCharm because classes, functions, etc. are easier to work with there. From this point on, Google Colab is a better place for the modeling. Credit: https://github.com/cooperdff/nfl_data_py.

In [None]:
nfl_api = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/API Data/data_2002_to_2022.csv", index_col=0)
nfl_stathead = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/master_data_2017-2022.csv", index_col=0)
schedule = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/API Data/schedule_2002_to_2022.csv", index_col=0)

In [None]:
# We want a df that we will model on without anything that isn't a feature or target
nfl_api['total_score'] = nfl_api['home_score'] + nfl_api['away_score']
keep = ('score_diff', 'game_id', 'total_score')
model_df_columns = [
    column for column in nfl_api.columns
    if ('ewma' in column and 'dynamic' in column) or any(k in column for k in keep)
]
# .copy() so later edits to model_df do not touch nfl_api
model_df = nfl_api[model_df_columns].copy()

model_df

Unnamed: 0,game_id,season,week,home_team,away_team,home_score,score_diff,...,epa_shifted_rushing_defense_away,ewma_rushing_defense_away,ewma_dynamic_window_rushing_defense_away,epa_passing_defense_away,epa_shifted_passing_defense_away,ewma_passing_defense_away,ewma_dynamic_window_passing_defense_away
0,2002_01_ARI_WAS,2002,1,WAS,ARI,31,8,...,,,,0.18,,,
1,2002_01_ATL_GB,2002,1,GB,ATL,37,3,...,,,,0.31,,,
2,2002_01_BAL_CAR,2002,1,CAR,BAL,10,3,...,,,,0.04,,,
3,2002_01_DAL_HOU,2002,1,HOU,DAL,19,9,...,,,,-0.33,,,
4,2002_01_DET_MIA,2002,1,MIA,DET,49,28,...,,,,0.71,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5639,2022_20_JAX_KC,2022,20,KC,JAX,27,7,...,-0.12,-0.14,-0.10,0.34,0.17,-0.01,0.03
5640,2022_20_NYG_PHI,2022,20,PHI,NYG,38,31,...,-0.02,0.07,0.06,0.24,0.25,-0.00,0.00
5641,2022_21_CIN_KC,2022,21,KC,CIN,23,3,...,0.06,-0.11,-0.08,0.19,-0.02,-0.02,-0.01
5642,2022_21_SF_PHI,2022,21,PHI,SF,31,24,...,0.04,-0.09,-0.11,-0.03,-0.17,-0.11,-0.09


OK, this is our starting df. The target is score_diff, the difference in score between the two teams, taken with respect to the home team. This means that if the diff is positive, the home team won. We will keep this in mind when bringing in other features.
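To make the sign convention concrete, here is a minimal sketch using two games that appear in the output further down (CAR vs. PIT and LA vs. DEN). The `home_win` column is a hypothetical helper added only for illustration; it is not part of the dataset.

```python
import pandas as pd

# Two games taken from the notebook output; home_win is a hypothetical helper.
toy = pd.DataFrame({
    "home_team": ["CAR", "LA"],
    "away_team": ["PIT", "DEN"],
    "home_score": [16, 51],
    "away_score": [24, 14],
})

# score_diff is home minus away, so a positive value means the home team won.
toy["score_diff"] = toy["home_score"] - toy["away_score"]
toy["home_win"] = (toy["score_diff"] > 0).astype(int)
print(toy)
```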

The EPA columns are created by bringing in play-by-play data, rolling it up into games, and then taking a 10-game rolling average offset by one week, so that each EPA value we see is an average of the previous 10 games. If the team has not yet played 10 games in the season, it uses as many games as have been played.
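As a rough illustration of the rolling average described above, the sketch below averages each team's previous games within a season (up to 10), shifted by one game so the current game never feeds its own feature. The column names `team`, `season`, `week`, and `epa` and the values are assumptions for illustration. The actual transformation was done outside this notebook and may differ; the `ewma` column names suggest exponential weighting is involved as well.

```python
import pandas as pd

# Toy per-game EPA for one team (hypothetical values and column names).
games = pd.DataFrame({
    "team": ["KC"] * 4,
    "season": [2022] * 4,
    "week": [1, 2, 3, 4],
    "epa": [0.20, 0.10, -0.10, 0.40],
})

def prior_games_mean(epa: pd.Series, window: int = 10) -> pd.Series:
    # shift(1) drops the current game, so each row only sees earlier games;
    # min_periods=1 uses however many games exist when fewer than `window`.
    return epa.shift(1).rolling(window, min_periods=1).mean()

games = games.sort_values(["team", "season", "week"])
games["epa_rolling"] = (
    games.groupby(["team", "season"])["epa"].transform(prior_games_mean)
)
print(games)
```

The first game of each season has no prior games, so it comes out as NaN, which matches the empty values in the 2002 week 1 rows of the output above.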

We need to adjust the Vegas lines. The ones we have stop in 2018, so we're adding in some other data and averaging the sources to get a new spread column.
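One way the averaging could look, as a sketch only: the column names `spread_source_a` and `spread_source_b` and the values are hypothetical, since the actual sources are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical spreads from two sources; source A has no line for the last game.
lines = pd.DataFrame({
    "game_id": ["g1", "g2", "g3"],
    "spread_source_a": [3.0, -2.5, np.nan],
    "spread_source_b": [3.5, -3.5, 6.5],
})

spread_cols = ["spread_source_a", "spread_source_b"]
# mean(axis=1) skips NaN by default, so a game covered by only one source
# keeps that source's line instead of becoming NaN.
lines["spread_avg"] = lines[spread_cols].mean(axis=1)
print(lines)
```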

# (Other) Feature Engineering

Adding in a couple of features I think will be valuable to model on.

In [None]:
# Work on a copy so the new feature columns don't modify schedule
feature_df = schedule.copy()

In [None]:
# Was wind a factor in the game? I'm setting the threshold at 15 mph, but may adjust later
feature_df['windy'] = np.where(schedule.loc[:, 'wind'] > 15, 1, 0)

# Next, let's look at "rest differential". A further analysis of this can be seen on my GitHub
feature_df['rest differential'] = schedule.loc[:, "home_rest"] - schedule.loc[:, "away_rest"]
feature_df

Unnamed: 0,game_id,season,game_type,week,gameday,weekday,gametime,...,away_coach,home_coach,referee,stadium_id,stadium,windy,rest differential
777,2002_01_SF_NYG,2002,REG,1,2002-09-05,Thursday,20:30,...,Steve Mariucci,Jim Fassel,Gerry Austin,NYC00,Giants Stadium,0,0
778,2002_01_NYJ_BUF,2002,REG,1,2002-09-08,Sunday,13:00,...,Herm Edwards,Gregg Williams,Bob McElwee,BUF00,Ralph Wilson Stadium,0,0
779,2002_01_BAL_CAR,2002,REG,1,2002-09-08,Sunday,13:00,...,Brian Billick,John Fox,Walt Coleman,CAR00,Ericsson Stadium,0,0
780,2002_01_MIN_CHI,2002,REG,1,2002-09-08,Sunday,13:00,...,Mike Tice,Dick Jauron,Mike Carey,CHI99,Memorial Stadium (Champaign),0,0
781,2002_01_SD_CIN,2002,REG,1,2002-09-08,Sunday,13:00,...,Marty Schottenheimer,Dick LeBeau,Johnny Grier,CIN00,Paul Brown Stadium,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6416,2022_20_CIN_BUF,2022,DIV,20,2023-01-22,Sunday,15:00,...,Zac Taylor,Sean McDermott,Carl Cheffers,BUF00,New Era Field,0,0
6417,2022_20_DAL_SF,2022,DIV,20,2023-01-22,Sunday,18:30,...,Mike McCarthy,Kyle Shanahan,Bill Vinovich,SFO01,Levi's Stadium,1,2
6418,2022_21_SF_PHI,2022,CON,21,2023-01-29,Sunday,15:00,...,Kyle Shanahan,Nick Sirianni,John Hussey,PHI00,Lincoln Financial Field,0,1
6419,2022_21_CIN_KC,2022,CON,21,2023-01-29,Sunday,18:30,...,Zac Taylor,Andy Reid,Ron Torbert,KAN00,GEHA Field at Arrowhead Stadium,0,1
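One property of the `windy` flag worth keeping in mind: a comparison against a missing value evaluates to False, so games with no recorded wind are labeled 0, the same as calm games. The sketch below shows this on made-up values; the `wind_missing` column is a hypothetical extra that keeps the two cases apart, not something the notebook creates.

```python
import numpy as np
import pandas as pd

# Made-up rows: a windy game, a calm game, and a game with no wind reading.
toy = pd.DataFrame({
    "wind": [20.0, 5.0, np.nan],
    "home_rest": [7, 7, 10],
    "away_rest": [7, 6, 7],
})

# NaN > 15 is False, so the missing reading is flagged 0 like the calm game.
toy["windy"] = np.where(toy["wind"] > 15, 1, 0)
# Hypothetical helper: record whether the wind value was missing at all.
toy["wind_missing"] = toy["wind"].isna().astype(int)
# Positive means the home team had more days of rest than the away team.
toy["rest differential"] = toy["home_rest"] - toy["away_rest"]
print(toy)
```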


OK, let's reduce the columns to the ones we need and merge with our original df.

In [None]:
merge_df = feature_df[['game_id', 'rest differential', 'windy', 'div_game', 'spread_line', 'total_line', 'game_type', 'location']]

In [None]:
final_df = pd.merge(model_df, merge_df, on="game_id", how='left')

# Only regular season games; .copy() avoids a SettingWithCopyWarning on the assignment below
final_df = final_df.loc[final_df.game_type == "REG"].copy()

# Adding a home/away column
final_df['home_game'] = np.where(final_df.location == "Home", 1, 0)
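A left merge keeps every row of the left frame, so it can be worth checking that `game_id` is unique on both sides and that every game found a schedule match. The sketch below does this on toy stand-ins for `model_df` and `merge_df`; the game ids and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for model_df and merge_df (hypothetical game_ids and values).
model_toy = pd.DataFrame({
    "game_id": ["g1", "g2", "g3", "g4"],
    "score_diff": [3, -7, 10, 1],
})
merge_toy = pd.DataFrame({
    "game_id": ["g1", "g2", "g3"],
    "game_type": ["REG", "REG", "WC"],
    "location": ["Home", "Neutral", "Home"],
})

# validate raises MergeError if game_id is duplicated on either side;
# indicator adds a `_merge` column marking rows with no schedule match.
merged = pd.merge(model_toy, merge_toy, on="game_id", how="left",
                  validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "game_id"].tolist()

# Unmatched rows have NaN game_type, so the REG filter drops them silently.
reg = merged.loc[merged["game_type"] == "REG"].copy()
reg["home_game"] = (reg["location"] == "Home").astype(int)
print(unmatched, len(reg))
```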

In [None]:
# drop game_id & helper columns
final_df.drop(["game_id", 'game_type', 'location'], axis=1, inplace=True)

final_df.tail(50)

Unnamed: 0,season,week,home_team,away_team,home_score,score_diff,away_score,...,ewma_dynamic_window_passing_defense_away,rest differential,windy,div_game,spread_line,total_line,home_game
5581,2022,15,CAR,PIT,16,-8,24,...,0.05,0,0,0,2.5,36.5,1
5582,2022,15,SEA,SF,13,-8,21,...,-0.09,0,0,1,-3.0,43.0,1
5583,2022,15,LAC,TEN,17,3,14,...,0.09,0,0,0,3.0,46.5,1
5584,2022,16,BAL,ATL,17,8,9,...,0.19,1,0,0,6.5,35.0,1
5585,2022,16,CHI,BUF,13,-22,35,...,-0.01,-1,1,0,-8.0,40.0,1
5586,2022,16,NE,CIN,18,-4,22,...,0.01,0,0,0,-3.0,42.0,1
5587,2022,16,LA,DEN,51,37,14,...,-0.09,-1,0,0,-3.0,36.5,1
5588,2022,16,CAR,DET,37,14,23,...,0.09,0,0,0,-2.5,43.5,1
5589,2022,16,MIA,GB,20,-6,26,...,-0.02,2,0,0,3.5,49.0,1
5590,2022,16,TEN,HOU,14,-5,19,...,0.05,0,0,1,3.0,34.0,1


OK, let's output back to CSV so that we can bring it into the other files to model on without messing with the final result.

In [None]:
final_df.to_csv("/content/drive/MyDrive/Colab Notebooks/NFL Model/Data/Model Data/final_data.csv")
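For the modeling notebooks, reading the file back with `index_col=0` restores the saved index instead of adding an extra unnamed column. A minimal round-trip sketch, using a temporary directory and toy values in place of the Drive path:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for final_df; the index mimics the row labels kept after filtering.
final_toy = pd.DataFrame(
    {"score_diff": [-8, 3], "spread_line": [2.5, 3.0]},
    index=[5581, 5583],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_data.csv")
    final_toy.to_csv(path)  # the index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)  # and read back as the index

print(reloaded)
```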

OK, we should probably do some visualizations to see what this data is actually telling us.