For this assignment we will produce out-of-sample forecasts using NBA 2018/2019 data. To do that, we split the data into two subsets: one is used to estimate the forecasting model (the training dataset), and the model is then applied to generate forecasts for the remainder of the data. More specifically, we use the games played in 2018 to estimate the regression model and apply it to forecast the results of the games played in 2019. This exercise is slightly easier than the EPL case, since there are no ties and only two possible outcomes for each game: win or lose. This also means that we can use the logit model introduced in Week 1 (which allows for only two possible outcomes), rather than the ordered logit model (which we only need when there are more than two possible outcomes).

In [None]:
# This allows us to show the full screen width

# from IPython.display import display, HTML

# display(HTML(data="""
# <style>
#     div#notebook-container    { width: 95%; }
#     div#menubar-container     { width: 65%; }
#     div#maintoolbar-container { width: 99%; }
# </style>
# """))

In [1]:
import pandas as pd
import numpy as np

## Step 1: Data preparation

In [None]:
# 1.	Load the data

In [2]:
NBAmod = pd.read_excel("Assignment Data/Week 4/NBA prediction model (Assignment).xlsx")
NBAmod

Unnamed: 0,home,visitors,day,month,year,hwinodds,hloseodds,homepts,vispts,overtime,Game(home-away),hwin,hteamsal,opposal
0,Boston Celtics,Atlanta Hawks,16,3,2019,1.16,5.59,129,120,0,Boston Celtics - Atlanta Hawks,1,126822990,83389484
1,Boston Celtics,Atlanta Hawks,15,12,2018,1.10,7.40,129,108,0,Boston Celtics - Atlanta Hawks,1,126822990,83389484
2,Brooklyn Nets,Atlanta Hawks,10,1,2019,1.25,4.13,116,100,0,Brooklyn Nets - Atlanta Hawks,1,123191458,83389484
3,Brooklyn Nets,Atlanta Hawks,16,12,2018,1.33,3.42,144,127,0,Brooklyn Nets - Atlanta Hawks,1,123191458,83389484
4,Charlotte Hornets,Atlanta Hawks,7,11,2018,1.12,6.67,113,102,0,Charlotte Hornets - Atlanta Hawks,1,115503070,83389484
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,Sacramento Kings,Washington Wizards,27,10,2018,2.74,1.47,116,112,0,Sacramento Kings - Washington Wizards,1,94318099,130318434
1226,San Antonio Spurs,Washington Wizards,28,1,2019,1.57,2.47,132,119,0,San Antonio Spurs - Washington Wizards,1,129760213,130318434
1227,Toronto Raptors,Washington Wizards,14,2,2019,1.20,4.76,129,120,0,Toronto Raptors - Washington Wizards,1,133823098,130318434
1228,Toronto Raptors,Washington Wizards,24,11,2018,1.17,5.32,125,107,0,Toronto Raptors - Washington Wizards,1,133823098,130318434


In [None]:
# 2.	Define variables for the probabilities of a home win and away win associated with bookmaker odds

In [3]:
# define the probabilities associated with bookmaker odds

NBAmod['bookieprobH'] = 1/NBAmod['hwinodds']/(1/NBAmod['hwinodds']+1/NBAmod['hloseodds'])
NBAmod['bookieprobA'] = 1/NBAmod['hloseodds']/(1/NBAmod['hwinodds']+1/NBAmod['hloseodds'])
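
Bookmaker odds include a margin, so the raw inverse odds sum to more than 1; dividing each by their sum rescales them into probabilities that add up to 1. A minimal sketch of the same calculation for a single game, using the odds from the first row of the data (the function name `implied_probs` is only illustrative and is not part of the assignment code):

```python
def implied_probs(home_odds, away_odds):
    """Convert decimal odds into normalized home/away win probabilities."""
    raw_home = 1 / home_odds
    raw_away = 1 / away_odds
    total = raw_home + raw_away   # exceeds 1 because of the bookmaker margin
    return raw_home / total, raw_away / total

# Boston Celtics v Atlanta Hawks, 16 March 2019: hwinodds = 1.16, hloseodds = 5.59
prob_home, prob_away = implied_probs(1.16, 5.59)
print(round(prob_home, 6), round(prob_away, 6))   # 0.828148 0.171852
```

These values match `bookieprobH` and `bookieprobA` for that game in the output table.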

In [None]:
# 3.	Define a dummy variable = 1 if the home team loses, and zero otherwise

In [4]:
# the variable 'hwin' has a value of 1 when the home team wins (and zero otherwise). We now create the variable 'hlose',
# which has a value of 1 when the home team loses (and zero otherwise), which we will need later

NBAmod['hlose']= np.where(NBAmod['hwin']==0, 1, 0)

In [None]:
# 4.	Define a variable equal to H if the home team wins and A if the visiting team wins

In [None]:
# 5.	Define a variable equal to H if the home team win probability is greater than 0.5 according to the bookmaker odds and A otherwise

In [5]:
# define the predicted result based on the betting odds 

NBAmod['bookres']= np.where(NBAmod['hwinodds']<NBAmod['hloseodds'], "H", "A")
NBAmod['FTR']= np.where(NBAmod['hwin']==1, "H", "A")
NBAmod

Unnamed: 0,home,visitors,day,month,year,hwinodds,hloseodds,homepts,vispts,overtime,Game(home-away),hwin,hteamsal,opposal,bookieprobH,bookieprobA,hlose,bookres,FTR
0,Boston Celtics,Atlanta Hawks,16,3,2019,1.16,5.59,129,120,0,Boston Celtics - Atlanta Hawks,1,126822990,83389484,0.828148,0.171852,0,H,H
1,Boston Celtics,Atlanta Hawks,15,12,2018,1.10,7.40,129,108,0,Boston Celtics - Atlanta Hawks,1,126822990,83389484,0.870588,0.129412,0,H,H
2,Brooklyn Nets,Atlanta Hawks,10,1,2019,1.25,4.13,116,100,0,Brooklyn Nets - Atlanta Hawks,1,123191458,83389484,0.767658,0.232342,0,H,H
3,Brooklyn Nets,Atlanta Hawks,16,12,2018,1.33,3.42,144,127,0,Brooklyn Nets - Atlanta Hawks,1,123191458,83389484,0.720000,0.280000,0,H,H
4,Charlotte Hornets,Atlanta Hawks,7,11,2018,1.12,6.67,113,102,0,Charlotte Hornets - Atlanta Hawks,1,115503070,83389484,0.856226,0.143774,0,H,H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,Sacramento Kings,Washington Wizards,27,10,2018,2.74,1.47,116,112,0,Sacramento Kings - Washington Wizards,1,94318099,130318434,0.349169,0.650831,0,A,H
1226,San Antonio Spurs,Washington Wizards,28,1,2019,1.57,2.47,132,119,0,San Antonio Spurs - Washington Wizards,1,129760213,130318434,0.611386,0.388614,0,H,H
1227,Toronto Raptors,Washington Wizards,14,2,2019,1.20,4.76,129,120,0,Toronto Raptors - Washington Wizards,1,133823098,130318434,0.798658,0.201342,0,H,H
1228,Toronto Raptors,Washington Wizards,24,11,2018,1.17,5.32,125,107,0,Toronto Raptors - Washington Wizards,1,133823098,130318434,0.819723,0.180277,0,H,H


In [6]:
# optional crosstab showing how often the bookmaker odds correctly predict the result

pd.crosstab(NBAmod['FTR'], NBAmod['bookres'], dropna=True)

FTR (rows) / bookres (columns),A,H
A,262,239
H,163,566
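
The diagonal of the crosstab counts the games where the bookmaker's predicted result matched the actual result. As a rough sketch, the overall hit rate across both calendar years can be recovered from the four counts above (the variable names are illustrative):

```python
# counts copied from the crosstab above:
# keys are (actual result FTR, bookmaker prediction bookres)
counts = {
    ("A", "A"): 262, ("A", "H"): 239,
    ("H", "A"): 163, ("H", "H"): 566,
}

# a prediction is correct when the actual and predicted labels agree
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
total = sum(counts.values())
hit_rate = correct / total
print(correct, total, round(hit_rate, 3))   # 828 1230 0.673
```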


In [None]:
# 6.	Define a variable equal to the logarithm of the ratio of the home team salaries to the visiting team salaries

In [7]:
# create the log of salary ratios 
NBAmod['lhsalratio'] = np.log(NBAmod['hteamsal']/NBAmod['opposal'])
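
Taking the log of the salary ratio makes the variable symmetric around zero: swapping the home and visiting teams flips the sign but keeps the magnitude. A small check using the salaries from the first row of the data (variable names are illustrative):

```python
import math

boston_salary = 126822990    # hteamsal, Boston Celtics
atlanta_salary = 83389484    # opposal, Atlanta Hawks

log_ratio_home = math.log(boston_salary / atlanta_salary)
log_ratio_swapped = math.log(atlanta_salary / boston_salary)

print(round(log_ratio_home, 4))      # 0.4193
print(round(log_ratio_swapped, 4))   # -0.4193
```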

## Step 2: Estimate a logit model of home team wins depending on the log salary ratio, using the data for calendar year 2018 as the “training data”

In [None]:
# 1.	Define a subset for the calendar year 2018 data

In [8]:
NBAtrain = NBAmod[NBAmod.year==2018]

In [None]:
# 2.	Import the logistic regression package (copy the code for this from Week 1)

In [9]:
from sklearn.linear_model import LogisticRegression

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# 3.	Run the logistic regression of hwin on the log salary ratio 
# (copy the code for this from Week 1 while changing the variable names to the ones required here)

In [10]:
model = smf.glm(formula = 'hwin ~ lhsalratio', data=NBAtrain, family=sm.families.Binomial())
result = model.fit()
print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                   hwin   No. Observations:                  542
Model:                            GLM   Df Residuals:                      540
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -359.61
Date:                Fri, 20 Dec 2024   Deviance:                       719.21
Time:                        11:05:50   Pearson chi2:                     543.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4452      0.089      5.026      0.0
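
The summary reports `Method: IRLS`, an iterative procedure for maximizing the logit likelihood. To illustrate the idea only, here is a minimal Newton-Raphson sketch for a logit with one regressor, run on made-up toy data. It is not the statsmodels implementation, and the function name `fit_logit` and the data are hypothetical.

```python
import math

def fit_logit(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# toy data (hypothetical): outcomes become more likely as x increases
toy_x = [-1, -0.5, 0, 0.5, 1, -1, -0.5, 0, 0.5, 1]
toy_y = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
toy_b0, toy_b1 = fit_logit(toy_x, toy_y)
print(toy_b0, toy_b1)
```

At the solution the gradient is zero, which is the condition any maximum likelihood logit fit satisfies regardless of the algorithm used to reach it.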

In [None]:
# Q1: How many games were played in calendar year 2018?
# A1: 542

In [None]:
# Q2: From the logistic model, what is the coefficient of the log salary ratio variable?
# A2: 1.1216 

In [None]:
# Q3: In the logistic model, what can we say about the statistical significance of the variables?
# A3: Both are statistically significant at the 5% level (based on their p-values)

In [None]:
# Q4: In the logistic regression model, what is the interpretation of the constant (intercept)?
# A4: It reflects the value of home advantage: it is the log-odds of a home win when the two teams' salaries are equal (log salary ratio = 0)
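
To make the home-advantage reading of the intercept concrete, the log-odds can be converted into odds and a probability. A small sketch, assuming the intercept of 0.4452 from the regression summary (variable names are illustrative):

```python
import math

b0 = 0.4452   # intercept from the regression summary above

# with equal salaries the log salary ratio is 0, so the linear predictor is just b0
home_win_odds = math.exp(b0)
home_win_prob = home_win_odds / (1 + home_win_odds)
print(round(home_win_odds, 2), round(home_win_prob, 4))   # 1.56 0.6095
```

In other words, the fitted model gives the home team roughly a 61% chance of winning a game between two teams with identical payrolls.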

## Step 3: Define the predicted probabilities and the predicted results, using the entire data set

In [None]:
# 1.	The predicted probability of a home win can be defined using the formula 1/(1+1/exp(b0 + b1*logsalaryratio)),
# where b0 is the constant (the intercept) in the logistic regression and b1 is the coefficient for logsalaryratio

In [11]:
# predicted probabilities

NBAmod['predprobH']= 1/(1+1/(np.exp(.4452+1.1216*NBAmod['lhsalratio'])))
NBAmod

Unnamed: 0,home,visitors,day,month,year,hwinodds,hloseodds,homepts,vispts,overtime,...,hwin,hteamsal,opposal,bookieprobH,bookieprobA,hlose,bookres,FTR,lhsalratio,predprobH
0,Boston Celtics,Atlanta Hawks,16,3,2019,1.16,5.59,129,120,0,...,1,126822990,83389484,0.828148,0.171852,0,H,H,0.419270,0.714115
1,Boston Celtics,Atlanta Hawks,15,12,2018,1.10,7.40,129,108,0,...,1,126822990,83389484,0.870588,0.129412,0,H,H,0.419270,0.714115
2,Brooklyn Nets,Atlanta Hawks,10,1,2019,1.25,4.13,116,100,0,...,1,123191458,83389484,0.767658,0.232342,0,H,H,0.390218,0.707416
3,Brooklyn Nets,Atlanta Hawks,16,12,2018,1.33,3.42,144,127,0,...,1,123191458,83389484,0.720000,0.280000,0,H,H,0.390218,0.707416
4,Charlotte Hornets,Atlanta Hawks,7,11,2018,1.12,6.67,113,102,0,...,1,115503070,83389484,0.856226,0.143774,0,H,H,0.325775,0.692235
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,Sacramento Kings,Washington Wizards,27,10,2018,2.74,1.47,116,112,0,...,1,94318099,130318434,0.349169,0.650831,0,A,H,-0.323308,0.520633
1226,San Antonio Spurs,Washington Wizards,28,1,2019,1.57,2.47,132,119,0,...,1,129760213,130318434,0.611386,0.388614,0,H,H,-0.004293,0.608351
1227,Toronto Raptors,Washington Wizards,14,2,2019,1.20,4.76,129,120,0,...,1,133823098,130318434,0.798658,0.201342,0,H,H,0.026538,0.616558
1228,Toronto Raptors,Washington Wizards,24,11,2018,1.17,5.32,125,107,0,...,1,133823098,130318434,0.819723,0.180277,0,H,H,0.026538,0.616558
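
The expression used for `predprobH` is the standard logistic function written in a slightly different form. A quick check for one game, assuming the coefficients and the `lhsalratio` value shown above (variable names are illustrative):

```python
import math

b0, b1 = 0.4452, 1.1216   # coefficients used in the notebook
lhsalratio = 0.419270     # Boston Celtics v Atlanta Hawks, from the table above

z = b0 + b1 * lhsalratio
prob_notebook_form = 1 / (1 + 1 / math.exp(z))   # form used in the notebook
prob_logistic_form = 1 / (1 + math.exp(-z))      # standard logistic function

print(round(prob_notebook_form, 4), round(prob_logistic_form, 4))   # 0.7141 0.7141
```

Both forms reproduce the `predprobH` value of 0.714115 listed for that game.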


In [None]:
# 2.	Based on the predicted probability, define the predicted result H as the outcome where 
# the predicted home win probability is greater than 0.5, and A otherwise

In [12]:
NBAmod['predres']= np.where(NBAmod['predprobH']>.5, "H", "A")
NBAmod

Unnamed: 0,home,visitors,day,month,year,hwinodds,hloseodds,homepts,vispts,overtime,...,hteamsal,opposal,bookieprobH,bookieprobA,hlose,bookres,FTR,lhsalratio,predprobH,predres
0,Boston Celtics,Atlanta Hawks,16,3,2019,1.16,5.59,129,120,0,...,126822990,83389484,0.828148,0.171852,0,H,H,0.419270,0.714115,H
1,Boston Celtics,Atlanta Hawks,15,12,2018,1.10,7.40,129,108,0,...,126822990,83389484,0.870588,0.129412,0,H,H,0.419270,0.714115,H
2,Brooklyn Nets,Atlanta Hawks,10,1,2019,1.25,4.13,116,100,0,...,123191458,83389484,0.767658,0.232342,0,H,H,0.390218,0.707416,H
3,Brooklyn Nets,Atlanta Hawks,16,12,2018,1.33,3.42,144,127,0,...,123191458,83389484,0.720000,0.280000,0,H,H,0.390218,0.707416,H
4,Charlotte Hornets,Atlanta Hawks,7,11,2018,1.12,6.67,113,102,0,...,115503070,83389484,0.856226,0.143774,0,H,H,0.325775,0.692235,H
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,Sacramento Kings,Washington Wizards,27,10,2018,2.74,1.47,116,112,0,...,94318099,130318434,0.349169,0.650831,0,A,H,-0.323308,0.520633,H
1226,San Antonio Spurs,Washington Wizards,28,1,2019,1.57,2.47,132,119,0,...,129760213,130318434,0.611386,0.388614,0,H,H,-0.004293,0.608351,H
1227,Toronto Raptors,Washington Wizards,14,2,2019,1.20,4.76,129,120,0,...,133823098,130318434,0.798658,0.201342,0,H,H,0.026538,0.616558,H
1228,Toronto Raptors,Washington Wizards,24,11,2018,1.17,5.32,125,107,0,...,133823098,130318434,0.819723,0.180277,0,H,H,0.026538,0.616558,H
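
Because the predicted probability exceeds 0.5 exactly when b0 + b1*lhsalratio is positive, the 0.5 cut-off can be restated as a cut-off on the salary ratio itself. A short sketch using the coefficients from the notebook (variable names are illustrative):

```python
import math

b0, b1 = 0.4452, 1.1216   # coefficients used in the notebook

# predprobH > 0.5  <=>  b0 + b1 * lhsalratio > 0  <=>  lhsalratio > -b0 / b1 (since b1 > 0)
log_ratio_cutoff = -b0 / b1
salary_ratio_cutoff = math.exp(log_ratio_cutoff)
print(round(log_ratio_cutoff, 3), round(salary_ratio_cutoff, 3))   # -0.397 0.672
```

Under these coefficients the model predicts a home win unless the home payroll is below roughly 67% of the visitors' payroll, which is consistent with row 1225 above (log ratio -0.323308) still being classified as H.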


## Step 4: For games played in 2019, compare the bookmaker probabilities and model probabilities in terms of the proportion of correctly predicted outcomes and the Brier scores

In [None]:
# 1.	Define the subset of games played in calendar year 2019

In [13]:
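# .copy() makes NBAfore an independent DataFrame, so adding columns to it below does not trigger a SettingWithCopyWarning
NBAfore = NBAmod[NBAmod.year==2019].copy()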

In [None]:
# 2.	Define a dummy variable equal to 1 when the bookmaker result prediction is correct, and zero otherwise. 
# Define the equivalent variable for the logit model prediction

In [14]:
NBAfore['booktrue']= np.where(NBAfore['FTR']==NBAfore['bookres'], 1, 0)
NBAfore['modeltrue']= np.where(NBAfore['FTR']==NBAfore['predres'], 1, 0)

In [None]:
# 3.	Calculate the means for each of these variables

In [15]:
NBAfore['booktrue'].mean()

0.6918604651162791

In [None]:
# Q5: Based on the bookmaker odds, what fraction of results were correctly predicted?
# A5: 69% 

In [16]:
NBAfore['modeltrue'].mean()

0.5872093023255814

In [None]:
# Q6: Based on the logistic model, what fraction of results were correctly predicted?
# A6: 59%

In [None]:
# 4.	Define the Brier score for the bookmaker probabilities and the Brier score for the logit model probabilities

In [17]:
NBAfore['Brierbookie']= (NBAfore['hwin']-NBAfore['bookieprobH'])**2 +(NBAfore['hlose']-NBAfore['bookieprobA'])**2
NBAfore['Briermodel']= (NBAfore['hwin']-NBAfore['predprobH'])**2 +(NBAfore['hlose']-(1-NBAfore['predprobH']))**2
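
The Brier score used here sums the squared forecast errors over both possible outcomes of a game. A minimal sketch for a single game, assuming the away probability is one minus the home probability (the function name `brier_two_outcomes` is illustrative):

```python
def brier_two_outcomes(hwin, prob_home):
    """Brier score summed over both outcomes (home win, home loss) for one game."""
    hlose = 1 - hwin
    prob_away = 1 - prob_home
    return (hwin - prob_home) ** 2 + (hlose - prob_away) ** 2

# Boston Celtics v Atlanta Hawks, 16 March 2019: home win, bookieprobH = 0.828148
print(round(brier_two_outcomes(1, 0.828148), 4))   # 0.0591

# an uninformative 50/50 forecast scores 0.5 whatever the result
print(brier_two_outcomes(1, 0.5), brier_two_outcomes(0, 0.5))   # 0.5 0.5
```

The 0.5 score of a coin-flip forecast gives a useful benchmark when reading the mean Brier scores for the bookmaker and the logit model.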

In [None]:
# 5.	Calculate the mean of each Brier score

In [19]:
NBAfore['Brierbookie'].mean()

0.3939058021238844

In [None]:
# Q7: What was the Brier score derived from the bookmaker odds?
# A7: 0.394 

In [20]:
NBAfore['Briermodel'].mean()

0.4768713411427079

In [None]:
# Q8: What was the Brier score derived from the logistic model?
# A8: 0.477 

In [None]:
# Q9: A lower Brier score implies
# A9: The probabilities were closer to the actual outcomes

In [None]:
# Q10: Suppose that the logistic model were updated after every game in the season, which of the following is most likely to be true:
# A10: The logistic model would produce more reliable forecasts