This notebook currently tests two baseline models on the 2017 season.  It will be updated with more data as it is acquired.

The first baseline is the average of the point-spread predictions from various casinos and betting outlets. It requires no training, though we could train a regularized linear regression model and achieve a similar result: we'd expect the betting outlets to be about equally predictive on average, so their weights should be highly similar under L2-regularized linear regression.
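The claim about near-equal weights can be illustrated with a minimal sketch. It uses SYNTHETIC data (an assumption, not the notebook's odds file) in which every book posts a shared consensus line plus small book-specific noise. The helper `ridge_weights` and the penalty value `1e4` are hypothetical choices for illustration; how closely the weights cluster depends on the penalty strength.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games, n_books = 500, 5

# SYNTHETIC data (an assumption, not the notebook's odds file):
# every book posts the same consensus line plus small book-specific noise.
consensus = rng.normal(0.0, 6.0, size=n_games)
X = consensus[:, None] + rng.normal(0.0, 0.5, size=(n_games, n_books))
y = consensus + rng.normal(0.0, 13.0, size=n_games)  # realized home margin

def ridge_weights(X, y, lam):
    """Closed-form ridge solution without intercept: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge_weights(X, y, lam=0.0)    # unregularized least squares
w_ridge = ridge_weights(X, y, lam=1e4)  # strong L2 penalty

print("OLS weights:         ", np.round(w_ols, 3))
print("ridge weights:       ", np.round(w_ridge, 3))
print("plain-average weight:", 1.0 / n_books)
```

Because the books' lines are almost collinear, the unregularized weights are poorly determined, while the L2 penalty pulls them toward a common value. That value sits somewhat below `1 / n_books` because ridge also shrinks the overall scale, so the fitted model behaves like a slightly damped average.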

In [1]:
import numpy as np
import pandas as pd

In [32]:
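#Import data
odds = pd.read_csv("base_data/nfl_odds_2017.csv")
game_stats = pd.read_csv("base_data/nflstats2017.csv")
#The CSV labels the away-team PS/PSY columns 'HPS'/'HPSY', so pandas loads the home columns as 'HPS.1'/'HPSY.1'; relabel both pairs
game_stats.rename(columns = {'HPS': 'APS', 'HPSY': 'APSY', 'HPS.1':'HPS', 'HPSY.1':'HPSY'}, inplace = True)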

In [3]:
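#Build a per-game key and the realized home margin, then keep only the Spread_ columns
odds["datestring"] = pd.to_datetime(odds['date']).dt.strftime('%Y%m%d')
odds["Spread_key"] = odds['datestring'] + odds['Home']
odds['Spread_val'] = odds['HomeScore'] - odds['AwayScore']
spreads = odds.filter(regex = "Spread_")

#Compute square loss over 2017 season
#The sportsbook lines are every Spread_ column except the key and the target
X = spreads.drop(columns = ['Spread_key', 'Spread_val']).values.astype('float32')
y = spreads['Spread_val'].values.astype('float32')
scores = np.nanmean(X, axis = 1) #average line per game, skipping books with no line
residual = (scores - y)**2
loss = np.nanmean(residual) #mean squared error, in points squared
print(loss)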

347.285
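
The cell above treats the averaged line as a direct prediction of the home margin. The source does not say which sign convention the `Spread_` columns use; if they are quoted as home-team handicaps (negative when the home team is favored), the average would need to be negated first. A minimal sketch on toy data shows how much the convention matters. The column names `Spread_bookA`/`Spread_bookB` and the helper `consensus_mse` are hypothetical, and the numbers are made up:

```python
import numpy as np
import pandas as pd

def consensus_mse(df, book_cols, target_col, negate_lines=False):
    """Mean squared error of the row-wise average line against the actual margin."""
    lines = df[book_cols].to_numpy(dtype="float64")
    pred = np.nanmean(lines, axis=1)  # ignore books that posted no line
    if negate_lines:
        pred = -pred                  # turn a home handicap into an expected home margin
    actual = df[target_col].to_numpy(dtype="float64")
    return float(np.nanmean((pred - actual) ** 2))

# Toy data (made up): lines quoted as home handicaps, margin = home score - away score
toy = pd.DataFrame({
    "Spread_bookA": [-3.0, 7.0, np.nan],
    "Spread_bookB": [-4.0, 6.0, -1.0],
    "Spread_val":   [3.0, -10.0, 2.0],
})
books = ["Spread_bookA", "Spread_bookB"]

mse_as_is = consensus_mse(toy, books, "Spread_val")
mse_negated = consensus_mse(toy, books, "Spread_val", negate_lines=True)
print(mse_as_is, mse_negated)
```

Running both variants on the real data and comparing the two losses is a quick way to check which convention the odds file follows.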


The second baseline is a regression tree model that uses basic features (home/away team, start time, home/away team stats) to predict the outcome.  For now, it will train on the first 8 weeks of the season and test on the rest (this will change with more data).  This requires much more preprocessing, as we need to aggregate each team's average statistics up to that point in the season.
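
The week-based split described above can be sketched as follows. This is a minimal illustration on toy data, assuming the `Week` column holds numeric week numbers; the helper `split_by_week` is hypothetical, and fitting the tree itself is left out.

```python
import pandas as pd

def split_by_week(games, last_train_week=8, week_col='Week'):
    """Return boolean masks: weeks up to last_train_week train, later weeks test."""
    week = pd.to_numeric(games[week_col], errors='coerce')  # non-numeric labels become NaN
    return week <= last_train_week, week > last_train_week

# Toy data (made up), using the notebook's column names
toy = pd.DataFrame({
    'Week': [1, 8, 9, 17],
    'Home': ['NE', 'KC', 'NE', 'BUF'],
    'Away': ['KC', 'NE', 'BUF', 'NE'],
    'Spread': [-15, 3, 7, -10],
})

# One-hot encode the team columns on the full frame so train and test share columns
features = pd.get_dummies(toy[['Home', 'Away']])
train_mask, test_mask = split_by_week(toy)

X_train, y_train = features[train_mask], toy.loc[train_mask, 'Spread']
X_test, y_test = features[test_mask], toy.loc[test_mask, 'Spread']
print(X_train.shape, X_test.shape)
```

Splitting by week rather than at random keeps the test games strictly later than the training games, which matches how the model would be used during a season.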

In [27]:
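def process_team_stats(stats, team):
    #Select every game the team played, home or away
    t_stats = stats.query('Home == @team | Away == @team')
    agg_stats = pd.DataFrame()
    count = 0
    #Stat names without the H/A prefix used in the game_stats columns
    cols = ['FD','Fum','FumL','PY','PA','PI','PS', 'PSY', 'RA', 'RY']

    for index,week in t_stats.iterrows():
        #The prefix selects this team's columns for the game (e.g. 'HFD' vs 'AFD')
        if team == week['Home']:
            prefix = 'H'
        else:
            prefix = 'A'
        #Aggregating the prefixed columns into agg_stats is not implemented yet

    return agg_stats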
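
The function above stops before the aggregation step. One possible way to compute each team's averages "at that point in the season" is sketched below. This is an assumption about the intended behavior, not the notebook's implementation: the helper `team_running_averages` is hypothetical, only two of the stat names are used for brevity, and games are ordered by a numeric `Week` column.

```python
import pandas as pd

STAT_COLS = ['FD', 'RY']  # subset of the notebook's stat names, for brevity

def team_running_averages(stats, team, cols=STAT_COLS):
    """Average of the team's own stats over all games played before each game."""
    played = (stats['Home'] == team) | (stats['Away'] == team)
    games = stats[played].sort_values('Week')
    is_home = games['Home'] == team
    own = pd.DataFrame(index=games.index)
    for c in cols:
        # Take the H-prefixed column when the team was at home, else the A-prefixed one
        own[c] = games['H' + c].where(is_home, games['A' + c])
    # expanding().mean() includes the current game; shift(1) keeps only earlier games
    avg = own.expanding().mean().shift(1)
    avg.insert(0, 'Week', games['Week'])
    return avg

# Toy data (made up), using the notebook's column naming scheme
toy = pd.DataFrame({
    'Week': [1, 2, 3],
    'Home': ['NE', 'KC', 'NE'],
    'Away': ['KC', 'NE', 'BUF'],
    'HFD': [20, 25, 18], 'AFD': [30, 22, 15],
    'HRY': [100, 150, 90], 'ARY': [120, 80, 60],
})
ne_avg = team_running_averages(toy, 'NE')
print(ne_avg)
```

The first row for each team is NaN because no earlier games exist. Excluding the current game from the average matters here: including it would leak the outcome being predicted into the features.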

In [28]:
game_stats['Datetime'] = pd.to_datetime(game_stats['Start'])
game_stats['Key'] = game_stats['Datetime'].dt.strftime('%Y%m%d') + game_stats['Home']
game_stats['Spread'] = game_stats['HPts'] - game_stats['APts']

teams = game_stats['Home'].unique()

#Keep each team's aggregated stats instead of discarding the return value
team_stats = {t: process_team_stats(game_stats, t) for t in teams}


In [33]:
print(game_stats.columns)

Index(['Season', 'Start', 'Week', 'Away', 'Home', 'APts', 'HPts', 'OverUnder',
       'VegasLine', 'AFD', 'AFum', 'AFumL', 'APY', 'APA', 'API', 'APS', 'APSY',
       'ARA', 'ARY', 'HFD', 'HFum', 'HFumL', 'HPY', 'HPA', 'HPI', 'HPS',
       'HPSY', 'HRA', 'HRY'],
      dtype='object')
