# How Accurate are Bookies at Setting Over/Under Lines?
## Grant Cloud

imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm



loading the data

In [2]:
df = pd.read_csv('data/nfl_over_unders.csv')
df.head(5)

Unnamed: 0,schedule_season,score_home,score_away,over_under_line,total_score,cover
0,2000,36.0,28.0,46.5,64.0,over
1,2000,16.0,13.0,40.0,29.0,under
2,2000,7.0,27.0,38.5,34.0,under
3,2000,14.0,41.0,39.5,55.0,over
4,2000,16.0,20.0,44.0,36.0,under


<p>viewing how many times each possible outcome has occurred</p>

In [3]:
covSer = df.cover.value_counts()
covSer

under    2519
over     2448
push       90
Name: cover, dtype: int64

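<p>calculating the probability of each outcome occurring. the under and over rates are computed among games that did not push, while the push rate is computed over all games</p>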

In [4]:
decided = covSer['under'] + covSer['over']  # games that did not push
u = covSer['under'] / decided
o = covSer['over'] / decided
push = covSer['push'] / covSer.sum()  # pushes as a share of all games
pd.Series([u, o, push], index=['under', 'over', 'push'])

under    0.507147
over     0.492853
push     0.017797
dtype: float64

<p>we see that the under hits slightly more often than the over (50.7% vs. 49.3% of non-push games). presumably, this occurs because bettors have an over bias and Vegas increases its margins by setting lines slightly higher than the total it actually expects</p>
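a quick way to check whether that gap is bigger than chance alone would produce is a two-sided z-test of the under rate against 50%. the sketch below is not part of the original analysis: it uses only the counts printed above and a normal approximation, and the helper name `normal_cdf` is made up for this example.

```python
import math

# Counts taken from the value_counts output above; pushes are excluded.
unders = 2519
overs = 2448
n = unders + overs

# Observed share of unders among games that did not push.
p_hat = unders / n

# Under the null hypothesis p = 0.5, the standard error of p_hat is sqrt(0.25 / n).
se = math.sqrt(0.25 / n)
z = (p_hat - 0.5) / se


def normal_cdf(x):
    """Standard normal CDF written with math.erf (hypothetical helper)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Two-sided p-value: probability of a gap at least this large in either direction.
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

print(f"under rate = {p_hat:.4f}, z = {z:.2f}, two-sided p = {p_value:.2f}")
```

with these counts the z statistic comes out at about 1.0 and the p-value at roughly 0.3, so a gap this size would not be unusual if unders and overs were truly equally likely. the over-bias story is plausible, but this sample alone doesn't establish it.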

we can also fit a simple linear regression model to better understand how well Vegas predicts the total score of NFL games. note that the model below is fit without an intercept, since no constant column is added to X

In [5]:
X = df.loc[:,'over_under_line']
y = df.loc[:,'total_score']

In [6]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            total_score   R-squared:                       0.914
Model:                            OLS   Adj. R-squared:                  0.914
Method:                 Least Squares   F-statistic:                 5.368e+04
Date:                Sat, 14 Dec 2019   Prob (F-statistic):               0.00
Time:                        18:56:19   Log-Likelihood:                -20339.
No. Observations:                5057   AIC:                         4.068e+04
Df Residuals:                    5056   BIC:                         4.069e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
over_under_line     1.0129      0.004    2

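<p>the R<sup>2</sup> value of 0.914 needs some care: because the model has no intercept, statsmodels reports the uncentered R<sup>2</sup>, which measures variation around zero rather than around the mean total score. it therefore overstates how much of the game-to-game variation in total score the Vegas line explains, and should not be read as "the line accounts for 91.4% of the variation". another interesting side note: the over_under_line coefficient is slightly greater than 1 (1.0129), implying that the Vegas line is, on average, slightly lower than the expected total score. this appears to contradict the over/under hit rates we saw above, and may be caused by unusual observations (e.g. games with low expected scores that turn into shootouts): a few very high-scoring games can raise the average total a lot, while each one still counts as only a single over.</p>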
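to make the difference between the two R<sup>2</sup> definitions concrete, here is a minimal sketch in plain Python. the lines and totals are made-up numbers for illustration only, not values from the NFL dataset, and the variable names are assumptions of this example. it fits a regression through the origin, the same form as the model above, and then computes R<sup>2</sup> against both the uncentered and the centered total sum of squares.

```python
# Hypothetical over/under lines and final totals, made up for illustration only.
lines = [37.0, 40.5, 42.0, 44.5, 47.0, 51.5]
totals = [30.0, 55.0, 38.0, 41.0, 62.0, 45.0]

# Least-squares slope for a regression through the origin: sum(x*y) / sum(x*x).
slope = sum(x * y for x, y in zip(lines, totals)) / sum(x * x for x in lines)

# Sum of squared residuals of that no-intercept fit.
ssr = sum((y - slope * x) ** 2 for x, y in zip(lines, totals))

# Uncentered total sum of squares: variation of the totals around zero.
tss_uncentered = sum(y * y for y in totals)

# Centered total sum of squares: variation of the totals around their mean.
mean_total = sum(totals) / len(totals)
tss_centered = sum((y - mean_total) ** 2 for y in totals)

r2_uncentered = 1.0 - ssr / tss_uncentered
r2_centered = 1.0 - ssr / tss_centered

print(f"slope = {slope:.3f}")
print(f"uncentered R^2 = {r2_uncentered:.3f}, centered R^2 = {r2_centered:.3f}")
```

the uncentered value is always at least as large as the centered one, because the totals vary far more around zero than around their own mean. to get the centered R<sup>2</sup> for the real data, one option would be to add an intercept column with `sm.add_constant(X)` before fitting. that fit was not part of the original run, so no numbers for it are shown here.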