# Field Goals

In 2013, in the paper Going For Three, the authors undertook predictive modelling of field goal success in the NFL from 2000 to 2011. The study found that several psychological factors had little to no effect on the likelihood of field goal conversion. In 2018, in the paper Choking Under the Pressure, the authors extended the aforementioned study to include data post-2011, up to and including 2017. They found that some of the psychological factors were in fact significant at the 0.1 level.

Neither study looked into the possibility of interactions between the supposedly (independent) variables, which may dilute the observed main effects. For example, one may reasonably expect that the effect of wind on a kick would increase with distance.

Both studies took a frequentist statistical approach, which although provide frequency guarantees with regards to confidence intervals etc, accept and reject significance of variables at arbitrarly decided values of 0.05 and 0.1. Within Bayesian statistics, by framing parameters as random variables, we are able to give likelihoods of each variable being within some range (credible interval). Arguably more intuitive than the frequentist confidence interval. For small sample sizes, we do forgo some of the frequentist guarantees of confidence intervals, but these concerns disappear for larger samples as the two methods converge asymptotically.

This notebook will include exploring the data, validating the results of the earlier papers, and exploring the possibility of interactions. With the most influential interactions, we will use a frequentist approach to the pre-2011 data, and see how our results differ from Going For Three. In the next notebook, we will incorporate these results into the prior distributions for the independent variables and train a model on post-2011 data.

## Imports

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm as glm_sm
import numpy as np
import mysql.connector
import itertools
from scipy.stats import chi2

## Load Data

Set up a connection to the mysql database served locally.

In [11]:
cnx = mysql.connector.connect(user='root', password='mOntie20!mysql', host='127.0.0.1', database='nfl')

Our base query doesn't change between models so we'll keep it here.

In [12]:
base_query = '''select
p.pid,fg.good,fg.dist, 
g.seas as year, k.seas as seasons,
case when g.temp<50 then 1 else 0 end as cold,
case when g.stad like "%Mile High%" then 1 else 0 end as altitude,
case when g.humd>=60 then 1 else 0 end as humid,
case when g.wspd>=10 then 1 else 0 end as windy,
case when g.v=p.off then 1 else 0 end as away_game,
case when g.wk>=10 then 1 else 0 end as postseason,
case when (pp.qtr=p.qtr) and ((pp.timd-p.timd)>0 or (pp.timo-p.timo)>0) then 1 else 0 end as iced,
case g.surf when 'Grass' then 0 else 1 end as turf,
case when g.cond like "%Snow%" then 1 when g.cond like "%Rain%" and not "Chance Rain" then 1 else 0 end as precipitation,
case when p.qtr=4 and ABS(p.ptso - p.ptsd)>21 then 0
when p.qtr=4 and p.min<2 and ABS(p.ptso - p.ptsd)>8 then 0
when p.qtr=4 and p.min<2 and p.ptso-p.ptsd < -7 then 0
when p.qtr<=3 then 0
when p.qtr=4 and p.min>=2 and ABS(p.ptso - p.ptsd)<21 then 0
when p.qtr=4 and p.min<2 and p.ptso-p.ptsd >=5 and p.ptso-p.ptsd <=8 then 0
when p.qtr=4 and p.min<2 and p.ptso-p.ptsd >=-4 and p.ptso-p.ptsd <=-6 then 0
else 1 end as pressure'''

## Going For Three

Query the database for the data used by Going For Three. i.e. pre-2011

In [13]:
query = base_query+'''
from FGXP fg
left join PLAY p on fg.pid=p.pid
left join game g on p.gid=g.gid
join kicker k on k.player = fg.fkicker and g.gid=k.gid
join PLAY pp on pp.pid=p.pid-1 and pp.gid=p.gid
where fg.fgxp='FG' -- not an xp
and g.seas <= 2011
order by p.pid
'''

df = pd.read_sql(query, cnx, index_col = 'pid')
df.head(10)

Unnamed: 0_level_0,good,dist,year,seasons,cold,altitude,humid,windy,away_game,postseason,iced,turf,precipitation,pressure
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
17,1,43,2000,19,0,0,0,0,0,0,0,1,0,0
34,1,44,2000,19,0,0,0,0,0,0,0,1,0,0
52,1,24,2000,19,0,0,0,0,0,0,0,1,0,0
64,1,44,2000,19,0,0,0,0,0,0,0,1,0,0
95,1,48,2000,19,0,0,0,0,0,0,0,1,0,0
241,1,50,2000,6,0,0,1,0,1,0,1,0,0,0
277,1,25,2000,6,0,0,1,0,1,0,0,0,0,0
375,1,33,2000,3,0,0,0,0,1,0,0,1,0,0
387,1,34,2000,1,0,0,0,0,0,0,0,1,0,0
401,1,38,2000,1,0,0,0,0,0,0,0,1,0,0


Going For Three didn't use the year and seasons variables in their tabulated results, so our first model won't either.

In [14]:
model = glm_sm('good ~ ' + '+'.join(df.drop(['year','seasons','good'], axis=1).columns.values), df, family=sm.families.Binomial())
result = model.fit(method='newton')
print(result.summary())
base_ll = pd.read_html(result.summary().tables[0].as_html())[0].iloc[4,3]

Generalized Linear Model Regression Results                  
Dep. Variable:                   good   No. Observations:                11901
Model:                            GLM   Df Residuals:                    11889
Model Family:                Binomial   Df Model:                           11
Link Function:                  logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:                -5001.9
Date:                Mon, 17 Feb 2020   Deviance:                       10004.
Time:                        09:19:58   Pearson chi2:                 1.14e+04
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         5.6933      0.139     40.916      0.000      

We now add back in the year and seasons of experience and control for kickers that dont make it in the NFL (so >50 kicks).

In [15]:
query = base_query+'''
from FGXP fg
left join PLAY p on fg.pid=p.pid
left join game g on p.gid=g.gid
join kicker k on k.player = fg.fkicker and g.gid=k.gid
join PLAY pp on pp.pid=p.pid-1 and pp.gid=p.gid
where fg.fgxp='FG' -- not an xp
and fg.fkicker in (
select fkicker
from fifty) -- has had at least 50 attempts overall
and fg.pid > (
select pid
from fifty
where fg.fkicker = fkicker) -- this kick came after the 50th attempt
and g.seas <= 2011
order by p.pid
'''

df = pd.read_sql(query, cnx, index_col = 'pid')
df.head(5)

Unnamed: 0_level_0,good,dist,year,seasons,cold,altitude,humid,windy,away_game,postseason,iced,turf,precipitation,pressure
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
47329,1,29,2001,12,0,1,0,0,1,0,0,0,0,0
47352,1,26,2001,12,0,1,0,0,1,0,0,0,0,0
49295,1,26,2001,12,0,0,0,1,0,0,0,0,0,0
49344,1,25,2001,12,0,0,0,1,0,0,0,0,0,0
52241,1,28,2001,12,0,0,1,1,1,0,0,0,1,0


In [16]:
model = glm_sm('good ~ '+'+'.join(df.drop(['good'], axis=1).columns.values), df, family=sm.families.Binomial())
result = model.fit(method='newton')
print(result.summary())

Generalized Linear Model Regression Results                  
Dep. Variable:                   good   No. Observations:                 8572
Model:                            GLM   Df Residuals:                     8558
Model Family:                Binomial   Df Model:                           13
Link Function:                  logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:                -3425.7
Date:                Mon, 17 Feb 2020   Deviance:                       6851.4
Time:                        09:20:01   Pearson chi2:                 8.25e+03
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       -73.0399     20.541     -3.556      0.000    -1

## Choking Under the Pressure

In Choking Under the Pressure, they used similar data now from 2000-2017.
Lets repeat the modelling with this data, again leaving out the seasons and year covariates and not controlling for >50 kicks

In [17]:
query = base_query+'''
from FGXP fg
left join PLAY p on fg.pid=p.pid
left join game g on p.gid=g.gid
join kicker k on k.player = fg.fkicker and g.gid=k.gid
join PLAY pp on pp.pid=p.pid-1 and pp.gid=p.gid
where fg.fgxp='FG' -- not an xp
and g.seas <= 2017
order by p.pid
'''

df = pd.read_sql(query, cnx, index_col = 'pid')
df.head(5)

Unnamed: 0_level_0,good,dist,year,seasons,cold,altitude,humid,windy,away_game,postseason,iced,turf,precipitation,pressure
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
17,1,43,2000,19,0,0,0,0,0,0,0,1,0,0
34,1,44,2000,19,0,0,0,0,0,0,0,1,0,0
52,1,24,2000,19,0,0,0,0,0,0,0,1,0,0
64,1,44,2000,19,0,0,0,0,0,0,0,1,0,0
95,1,48,2000,19,0,0,0,0,0,0,0,1,0,0


In [18]:
model = glm_sm('good ~ '+'+'.join(df.drop(['good','seasons','year'], axis=1).columns.values), df, family=sm.families.Binomial())
result = model.fit(method='newton')
print(result.summary())

Generalized Linear Model Regression Results                  
Dep. Variable:                   good   No. Observations:                18166
Model:                            GLM   Df Residuals:                    18154
Model Family:                Binomial   Df Model:                           11
Link Function:                  logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:                -7370.5
Date:                Mon, 17 Feb 2020   Deviance:                       14741.
Time:                        09:20:04   Pearson chi2:                 1.74e+04
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         5.7011      0.115     49.681      0.000      