## Part 1 - Import and Explore Data 

## Step (1) 
Import the “nfl_game.csv” data file and name the dataframe as “NFL_Game” in Jupyter Notebook.

Descriptions of selected variables:
   - Unit of measurement of weather variables: 
       - temperature (degF)
       - relative humidity (%rh)
       - wind (mph)
   - The variable “score” is the score earned by the team specified in the “team” variable. The variable “opponent_score” is the score earned by the team specified in the “opponent” variable.
   - The variable “score_diff” is defined as “score – opponent_score” for the given team.
   - The variable “stadium_age” is defined as the difference between the year of the season and the year the stadium opened.
   - The variable “stadium_neutral” indicates if the game was played in a third stadium, which is neither the home team’s nor the away team’s own stadium.
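
The two derived variables can be checked directly against their definitions. The sketch below uses a hypothetical two-row frame; the values are illustrative only, and `stadium_open` = 1937 is an assumed opening year, not a value read from the file.

```python
import pandas as pd

# Hypothetical toy rows that mimic the NFL_Game layout (values are made up).
toy = pd.DataFrame({
    "score": [14, 23],
    "opponent_score": [23, 14],
    "season": [1966, 1966],
    "stadium_open": [1937, 1937],  # assumed opening year, for illustration
})

# score_diff = score - opponent_score, from the point of view of "team"
toy["score_diff"] = toy["score"] - toy["opponent_score"]

# stadium_age = year of the season - year the stadium opened
toy["stadium_age"] = toy["season"] - toy["stadium_open"]

print(toy)
```

On the real data, the same two expressions could be compared with the stored `score_diff` and `stadium_age` columns to confirm the descriptions.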

In [1]:
#Import Libraries
import pandas as pd
import numpy as np
import datetime
import statsmodels.formula.api as sm

In [11]:
#Import NFL_Game dataset
NFL_Game=pd.read_csv("Assignment Data/nfl_game.csv")
display(NFL_Game)

Unnamed: 0,game_id,stadium,date,score,weather_temperature,weather_wind_mph,weather_humidity,score_diff,home,win,...,stadium_close,stadium_type,stadium_age,conference,division,season,team_division,team_division_pre2002,team,stadium_neutral
0,1,Orange Bowl,9/2/1966,14,83,6,71,-9,1,0,...,2008.0,outdoor,29,NFC,NFC West,1966,AFC East,AFC East,Miami Dolphins,0
1,1,Orange Bowl,9/2/1966,23,83,6,71,9,0,1,...,2008.0,outdoor,29,AFC,AFC Central,1966,AFC West,AFC West,Oakland Raiders,0
2,2,Rice Stadium,9/3/1966,45,81,7,70,38,1,1,...,,outdoor,16,NFC,NFC West,1966,,AFC Central,Houston Oilers,0
3,2,Rice Stadium,9/3/1966,7,81,7,70,-38,0,0,...,,outdoor,16,NFC,NFC West,1966,AFC West,AFC West,Denver Broncos,0
4,3,Balboa Stadium,9/4/1966,27,70,7,82,20,1,1,...,,outdoor,52,NFC,NFC West,1966,AFC West,AFC West,San Diego Chargers,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24309,12155,MetLife Stadium,12/29/2019,17,42,5,69,-17,1,0,...,,outdoor,9,NFC,NFC East,2019,NFC East,NFC East,New York Giants,0
24310,12156,CenturyLink Field,12/29/2019,26,51,3,70,5,0,1,...,,outdoor,17,AFC,AFC West,2019,NFC West,NFC West,San Francisco 49ers,0
24311,12156,CenturyLink Field,12/29/2019,21,51,3,70,-5,1,0,...,,outdoor,17,AFC,AFC West,2019,NFC West,AFC West,Seattle Seahawks,0
24312,12157,Raymond James Stadium,12/29/2019,28,78,12,77,6,0,1,...,2016.0,outdoor,21,NFC,AFC East,2019,NFC South,NFC West,Atlanta Falcons,0


In [3]:
NFL_Game.columns

Index(['game_id', 'stadium', 'date', 'score', 'weather_temperature',
       'weather_wind_mph', 'weather_humidity', 'score_diff', 'home', 'win',
       'opponent_score', 'opponent', 'team_id', 'team_conference',
       'team_conference_pre2002', 'stadium_location', 'stadium_capacity',
       'stadium_open', 'stadium_close', 'stadium_type', 'stadium_age',
       'conference', 'division', 'season', 'team_division',
       'team_division_pre2002', 'team', 'stadium_neutral'],
      dtype='object')

## Step (2)
Use the “describe” function to calculate summary statistics for the “date” variable. Use the “describe” function to calculate summary statistics for the “score” variable based on whether it is a home or an away game for the team.

In [13]:
# Convert to datetime
NFL_Game['date'] = pd.to_datetime(NFL_Game['date'])

In [14]:
NFL_Game['date'].describe()

count                   24314
unique                   2065
top       2019-12-29 00:00:00
freq                       32
first     1966-09-02 00:00:00
last      2019-12-29 00:00:00
Name: date, dtype: object
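
The `unique`, `top`, `first` and `last` rows above are what older pandas versions print for a datetime column. Newer releases (pandas 2.x, as far as I know) summarize datetimes numerically instead, so the same facts may need to be pulled out explicitly. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical dates in the same M/D/YYYY format as the "date" column.
dates = pd.to_datetime(
    pd.Series(["9/2/1966", "9/2/1966", "12/29/2019"]), format="%m/%d/%Y"
)

# Version-independent equivalents of the rows in the describe() output.
date_summary = {
    "count": int(dates.count()),     # non-missing values
    "unique": int(dates.nunique()),  # distinct dates
    "first": dates.min(),            # earliest date
    "last": dates.max(),             # latest date
}
print(date_summary)
```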

In [21]:
# Home
print('Home:\n',NFL_Game[NFL_Game['home']==1]['score'].describe())

# Away
print('\nAway:\n',NFL_Game[NFL_Game['home']==0]['score'].describe())

Home:
 count    12157.000000
mean        22.254997
std         10.533005
min          0.000000
25%         14.000000
50%         21.000000
75%         29.000000
max         72.000000
Name: score, dtype: float64

Away:
 count    12157.000000
mean        19.643004
std         10.166614
min          0.000000
25%         13.000000
50%         20.000000
75%         27.000000
max         62.000000
Name: score, dtype: float64
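
The home and away summaries can also be produced in one call with `groupby`, which returns one row per value of `home`. A minimal sketch on a hypothetical frame (the scores are made up):

```python
import pandas as pd

# Hypothetical rows: home is 1 for a home game, 0 for an away game.
toy = pd.DataFrame({"home": [1, 0, 1, 0], "score": [45, 7, 27, 7]})

# One describe() row per group; index 0 = away, index 1 = home.
by_home = toy.groupby("home")["score"].describe()
print(by_home)
```

On the real data this would be `NFL_Game.groupby('home')['score'].describe()`.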


## Step (3)
Find the correlation coefficients between the following pairs of variables:
   - 'win' and 'home'       
   - 'score_diff' and 'home'
   - 'score' and 'weather_temperature'
   - 'score' and 'weather_humidity'    
   - 'score' and 'weather_wind_mph'

In [23]:
print('win and home', NFL_Game['win'].corr(NFL_Game['home']))
print('score_diff and home', NFL_Game['score_diff'].corr(NFL_Game['home']))
print('score and weather_temperature', NFL_Game['score'].corr(NFL_Game['weather_temperature']))
print('score and weather_humidity', NFL_Game['score'].corr(NFL_Game['weather_humidity']))
print('score and weather_wind_mph', NFL_Game['score'].corr(NFL_Game['weather_wind_mph']))

win and home 0.14790211753177582
score_diff and home 0.1725059685987271
score and weather_temperature 0.03361690080893201
score and weather_humidity -0.03278832087607436
score and weather_wind_mph -0.07895602955806053
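
The five `print` calls repeat the same pattern, so the pairs can be kept in a list and looped over. A minimal sketch on a hypothetical frame (made-up values, and only two of the pairs, to keep it short):

```python
import pandas as pd

# Hypothetical rows using the same column names as NFL_Game.
toy = pd.DataFrame({
    "win": [0, 1, 1, 0],
    "home": [1, 0, 1, 0],
    "score": [14, 23, 45, 7],
    "weather_wind_mph": [6, 6, 7, 7],
})

pairs = [("win", "home"), ("score", "weather_wind_mph")]

# Series.corr computes the Pearson correlation by default.
corrs = {f"{a} and {b}": toy[a].corr(toy[b]) for a, b in pairs}
for label, value in corrs.items():
    print(label, value)
```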


## Part 2 - Regression Analysis 1 - Test of Home Game Advantage

In [25]:
reg1_1 = sm.ols(formula='score_diff~home', data=NFL_Game).fit()
reg1_1.summary()

Dep. Variable:,score_diff,R-squared:,0.03
Model:,OLS,Adj. R-squared:,0.03
Method:,Least Squares,F-statistic:,745.7
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,9.64e-162
Time:,20:39:24,Log-Likelihood:,-100200.0
No. Observations:,24314,AIC:,200400.0
Df Residuals:,24312,BIC:,200400.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2.6120,0.135,-19.309,0.000,-2.877,-2.347
home,5.2240,0.191,27.307,0.000,4.849,5.599

Omnibus:,30.538,Durbin-Watson:,3.021
Prob(Omnibus):,0.0,Jarque-Bera (JB):,34.843
Skew:,0.0,Prob(JB):,2.72e-08
Kurtosis:,3.185,Cond. No.,2.62
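
With a single 0/1 regressor, the OLS intercept equals the mean of the `home == 0` group and the slope equals the difference between the two group means. In `reg1_1` the home coefficient (5.2240) is exactly twice the size of the intercept (-2.6120), which is what one would expect if every game contributes one home row and one away row with opposite `score_diff`. A minimal sketch with NumPy on hypothetical data (this is a stand-in for the statsmodels fit, not the notebook's own code):

```python
import numpy as np

# Hypothetical score differences; each game appears twice with opposite signs.
home = np.array([1, 0, 1, 0, 1, 0], dtype=float)
score_diff = np.array([-9, 9, 38, -38, 20, -20], dtype=float)

# Design matrix with an intercept column and the home dummy.
X = np.column_stack([np.ones_like(home), home])
(intercept, slope), *_ = np.linalg.lstsq(X, score_diff, rcond=None)

away_mean = score_diff[home == 0].mean()
home_mean = score_diff[home == 1].mean()

print(intercept, away_mean)          # intercept matches the away mean
print(slope, home_mean - away_mean)  # slope matches the gap between means
```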


In [26]:
reg1_2 = sm.ols(formula='score_diff~home+stadium_capacity+stadium_neutral+home*stadium_neutral', 
                data=NFL_Game).fit()
reg1_2.summary()

Dep. Variable:,score_diff,R-squared:,0.031
Model:,OLS,Adj. R-squared:,0.03
Method:,Least Squares,F-statistic:,191.7
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,3.99e-162
Time:,20:42:45,Log-Likelihood:,-100190.0
No. Observations:,24314,AIC:,200400.0
Df Residuals:,24309,BIC:,200400.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2.6386,0.695,-3.796,0.000,-4.001,-1.276
home,5.2765,0.192,27.542,0.000,4.901,5.652
stadium_capacity,5.195e-09,9.91e-06,0.001,1.000,-1.94e-05,1.94e-05
stadium_neutral,7.2518,2.253,3.218,0.001,2.835,11.669
home:stadium_neutral,-14.5038,3.185,-4.554,0.000,-20.746,-8.262

Omnibus:,30.413,Durbin-Watson:,3.021
Prob(Omnibus):,0.0,Jarque-Bera (JB):,35.975
Skew:,0.0,Prob(JB):,1.54e-08
Kurtosis:,3.188,Cond. No.,2650000.0
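
In the formula interface, `home*stadium_neutral` is shorthand for `home + stadium_neutral + home:stadium_neutral`, so the effect of `home` depends on `stadium_neutral`. The sketch below works through that arithmetic with the coefficients printed for `reg1_2` (rounded as shown); `stadium_capacity` is left out because its coefficient (5.195e-09) is negligible. The helper name `predicted_score_diff` is hypothetical.

```python
# Coefficients copied from the reg1_2 summary (rounded as printed).
coef = {
    "Intercept": -2.6386,
    "home": 5.2765,
    "stadium_neutral": 7.2518,
    "home:stadium_neutral": -14.5038,
}

def predicted_score_diff(home, neutral):
    """Fitted score_diff, ignoring the negligible stadium_capacity term."""
    return (coef["Intercept"]
            + coef["home"] * home
            + coef["stadium_neutral"] * neutral
            + coef["home:stadium_neutral"] * home * neutral)

# Effect of home at a regular stadium: the home coefficient alone.
home_effect_regular = predicted_score_diff(1, 0) - predicted_score_diff(0, 0)

# Effect of home at a neutral stadium: home coefficient plus the interaction.
home_effect_neutral = predicted_score_diff(1, 1) - predicted_score_diff(0, 1)

print(round(home_effect_regular, 4), round(home_effect_neutral, 4))
```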


In [27]:
reg1_3 = sm.ols(formula='score_diff~home+stadium_capacity+stadium_neutral+home*stadium_neutral+team*opponent', 
                data=NFL_Game).fit()
reg1_3.summary()

Dep. Variable:,score_diff,R-squared:,0.128
Model:,OLS,Adj. R-squared:,0.073
Method:,Least Squares,F-statistic:,2.321
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,1.34e-137
Time:,20:46:15,Log-Likelihood:,-98906.0
No. Observations:,24314,AIC:,200700.0
Df Residuals:,22866,BIC:,212400.0
Df Model:,1447,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-2.7118,1.598,-1.697,0.090,-5.844,0.420
team[T.Atlanta Falcons],8.9218,3.910,2.282,0.023,1.257,16.586
team[T.Baltimore Colts],6.5544,1.521,4.310,0.000,3.574,9.535
team[T.Baltimore Ravens],1.7530,5.524,0.317,0.751,-9.074,12.580
team[T.Boston Patriots],-8.5357,3.080,-2.771,0.006,-14.573,-2.499
team[T.Buffalo Bills],6.1404,6.475,0.948,0.343,-6.552,18.833
team[T.Carolina Panthers],4.9977,4.032,1.240,0.215,-2.904,12.900
team[T.Chicago Bears],-0.2563,4.914,-0.052,0.958,-9.887,9.375
team[T.Cincinnati Bengals],-2.8055,5.188,-0.541,0.589,-12.975,7.364

Omnibus:,36.545,Durbin-Watson:,3.026
Prob(Omnibus):,0.0,Jarque-Bera (JB):,44.034
Skew:,0.0,Prob(JB):,2.74e-10
Kurtosis:,3.208,Cond. No.,9.2e+21


## Part 3 - Regression Analysis 2 -- Impact of Outside Factors on Scores

In [33]:
# Answer: The estimated coef on the home variable is 2.61. Holding season constant, a team playing at home
# is estimated to score 2.61 more points than a team playing away
reg2_1 = sm.ols(formula='score~season+home',
                data=NFL_Game).fit()
reg2_1.summary()

Dep. Variable:,score,R-squared:,0.022
Model:,OLS,Adj. R-squared:,0.022
Method:,Least Squares,F-statistic:,273.3
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,4.37e-118
Time:,21:08:11,Log-Likelihood:,-91246.0
No. Observations:,24314,AIC:,182500.0
Df Residuals:,24311,BIC:,182500.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-88.2969,8.615,-10.249,0.000,-105.183,-71.411
season,0.0541,0.004,12.530,0.000,0.046,0.063
home,2.6120,0.132,19.736,0.000,2.353,2.871

Omnibus:,480.134,Durbin-Watson:,2.056
Prob(Omnibus):,0.0,Jarque-Bera (JB):,508.905
Skew:,0.354,Prob(JB):,3.11e-111
Kurtosis:,2.992,Cond. No.,260000.0


In [34]:
# Answer: Temperature. Its coef (-0.0006, p = 0.899) is the only weather coef not significant at the 5% level
reg2_2 = sm.ols(formula='score~season+home+weather_temperature+weather_wind_mph+weather_humidity',
                data=NFL_Game).fit()
reg2_2.summary()

Dep. Variable:,score,R-squared:,0.026
Model:,OLS,Adj. R-squared:,0.026
Method:,Least Squares,F-statistic:,129.9
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,2.6e-136
Time:,21:12:18,Log-Likelihood:,-91195.0
No. Observations:,24314,AIC:,182400.0
Df Residuals:,24308,BIC:,182500.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-58.1972,9.142,-6.366,0.000,-76.117,-40.277
season,0.0400,0.005,8.797,0.000,0.031,0.049
home,2.6120,0.132,19.776,0.000,2.353,2.871
weather_temperature,-0.0006,0.005,-0.127,0.899,-0.010,0.009
weather_wind_mph,-0.1188,0.013,-8.937,0.000,-0.145,-0.093
weather_humidity,-0.0150,0.004,-3.506,0.000,-0.023,-0.007

Omnibus:,482.73,Durbin-Watson:,2.061
Prob(Omnibus):,0.0,Jarque-Bera (JB):,511.818
Skew:,0.355,Prob(JB):,7.25e-112
Kurtosis:,2.992,Cond. No.,276000.0
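
One way to read significance off the table is the 95% confidence interval: a coefficient is not significant at the 5% level when its interval contains zero. A minimal sketch using the intervals printed for the weather variables in `reg2_2` (this assumes the question behind the cell is about which weather variable has no significant effect; the prompt itself is not shown):

```python
# 95% confidence intervals copied from the reg2_2 summary (rounded as printed).
ci_95 = {
    "weather_temperature": (-0.010, 0.009),
    "weather_wind_mph": (-0.145, -0.093),
    "weather_humidity": (-0.023, -0.007),
}

# A variable is flagged when zero lies inside its interval.
not_significant = [name for name, (lo, hi) in ci_95.items() if lo <= 0.0 <= hi]
print(not_significant)
```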


In [31]:
# Answer: The estimated coef of stadium_capacity is... It is not significant at the 5% level because its
# p-value (0.075) exceeds 0.05; significance depends on the p-value, not on the size of the estimate
reg2_3 = sm.ols(formula='score~season+home+weather_temperature+weather_wind_mph+weather_humidity+'+
                        'stadium_capacity+stadium_age+stadium_type+stadium_neutral+home*stadium_neutral',
                data=NFL_Game).fit()
reg2_3.summary()

Dep. Variable:,score,R-squared:,0.027
Model:,OLS,Adj. R-squared:,0.026
Method:,Least Squares,F-statistic:,60.8
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,2.09e-134
Time:,21:04:52,Log-Likelihood:,-91186.0
No. Observations:,24314,AIC:,182400.0
Df Residuals:,24302,BIC:,182500.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-56.2290,9.412,-5.974,0.000,-74.677,-37.781
stadium_type[T.outdoor],0.3262,0.246,1.329,0.184,-0.155,0.808
stadium_type[T.retractable],0.5971,0.400,1.493,0.135,-0.187,1.381
season,0.0393,0.005,8.295,0.000,0.030,0.049
home,2.6382,0.132,19.944,0.000,2.379,2.898
weather_temperature,0.0009,0.005,0.183,0.855,-0.008,0.010
weather_wind_mph,-0.1329,0.017,-7.998,0.000,-0.166,-0.100
weather_humidity,-0.0144,0.004,-3.313,0.001,-0.023,-0.006
stadium_capacity,-1.237e-05,6.95e-06,-1.779,0.075,-2.6e-05,1.26e-06

Omnibus:,481.968,Durbin-Watson:,2.061
Prob(Omnibus):,0.0,Jarque-Bera (JB):,510.964
Skew:,0.355,Prob(JB):,1.11e-111
Kurtosis:,2.993,Cond. No.,9920000.0
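
The size of the `stadium_capacity` coefficient mostly reflects its unit (points per seat), so it says little about significance by itself. Rescaling to 10,000 seats gives a more readable number, while the t-statistic (coefficient divided by standard error) does not change with the unit. A minimal sketch with the values printed for `reg2_3` (rounded as shown, so the recomputed t-statistic only approximately matches the table):

```python
# Values copied from the stadium_capacity row of the reg2_3 summary.
coef_per_seat = -1.237e-05
std_err_per_seat = 6.95e-06

# Same effect expressed per 10,000 seats instead of per seat.
coef_per_10k_seats = coef_per_seat * 10_000

# The t-statistic is a ratio, so rescaling the variable leaves it unchanged.
t_stat = coef_per_seat / std_err_per_seat

print(round(coef_per_10k_seats, 4), round(t_stat, 2))
```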


In [35]:
NFL_Game['stadium_type'].unique()

array(['outdoor', 'indoor', 'retractable'], dtype=object)
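
The summaries list `stadium_type[T.outdoor]` and `stadium_type[T.retractable]` but no indoor term. This is consistent with the formula interface's default treatment coding, which drops the first level in sorted order (`indoor`) and uses it as the reference category. The sketch below reproduces the same coding on hypothetical values with `pd.get_dummies`; it illustrates the idea and is not the code statsmodels runs internally.

```python
import pandas as pd

# Hypothetical stadium types covering the three categories in the data.
stadium_type = pd.Series(["outdoor", "indoor", "retractable", "outdoor"])

# drop_first=True removes the first category in sorted order ("indoor"),
# which then acts as the baseline for the remaining dummy columns.
dummies = pd.get_dummies(stadium_type, prefix="stadium_type", drop_first=True)
print(list(dummies.columns))
```

Each remaining coefficient is therefore a comparison with indoor stadiums, not with all other stadiums.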

In [32]:
# Answer 1 (WRONG): The estimated effect of retractable stadium is 0.92, which suggests that
# teams score 0.92 more points at retractable stadiums than at other stadiums
# Answer 2: The estimated effect of retractable stadium is 0.92, which suggests that teams score 0.92 more
# points at retractable stadiums than at indoor stadiums (the omitted baseline), holding other variables constant
reg2_4 = sm.ols(formula='score~season+home+weather_temperature+weather_wind_mph+weather_humidity+'+
                        'stadium_capacity+stadium_age+stadium_type+stadium_neutral+home*stadium_neutral+'+
                        'team+opponent',
                data=NFL_Game).fit()
reg2_4.summary()

Dep. Variable:,score,R-squared:,0.052
Model:,OLS,Adj. R-squared:,0.048
Method:,Least Squares,F-statistic:,14.52
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,9.04e-211
Time:,21:07:25,Log-Likelihood:,-90870.0
No. Observations:,24314,AIC:,181900.0
Df Residuals:,24222,BIC:,182700.0
Df Model:,91,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-76.4276,10.760,-7.103,0.000,-97.518,-55.337
stadium_type[T.outdoor],0.5037,0.282,1.788,0.074,-0.048,1.056
stadium_type[T.retractable],0.9230,0.456,2.024,0.043,0.029,1.817
team[T.Atlanta Falcons],1.8932,0.632,2.996,0.003,0.655,3.132
team[T.Baltimore Colts],3.5594,0.834,4.267,0.000,1.924,5.194
team[T.Baltimore Ravens],3.1741,0.742,4.280,0.000,1.721,4.628
team[T.Boston Patriots],-0.8464,1.695,-0.499,0.618,-4.168,2.476
team[T.Buffalo Bills],1.9598,0.639,3.066,0.002,0.707,3.213
team[T.Carolina Panthers],1.7296,0.729,2.374,0.018,0.301,3.158

Omnibus:,466.658,Durbin-Watson:,2.05
Prob(Omnibus):,0.0,Jarque-Bera (JB):,493.763
Skew:,0.349,Prob(JB):,6.03e-108
Kurtosis:,3.014,Cond. No.,11500000.0
