# Week 4 Assignment
This week we extended the forecasting model introduced in Week 3 to North American team sports leagues (NHL, NBA, MLB). We also compared its performance with that of the bookmakers’ odds and found results similar to the EPL case: the bookmakers’ predictions were slightly better, but the difference was not large, whether measured by the number of correctly predicted results or by the Brier score, which measures the overall accuracy of the probabilities. Note that we adopted the “within-sample forecast” approach to check the validity of the forecasting model before using it for an “out-of-sample forecast”.


For the assignment we will produce an out-of-sample forecast using NBA 2018/2019 data. To do that, we split the data into two subsets: one is used to estimate the forecasting model (the training dataset), and the fitted model is then applied to the remainder of the dataset. More specifically, we use the games played in 2018 as the basis for the forecast and apply the regression model to predict the results of the games played in 2019.

This exercise is slightly easier than the EPL case, since there are no ties and only two possible outcomes for each game: win or lose. This also means that we can use the logit model introduced in Week 1 (which allows for only two possible outcomes) rather than the ordered logit model (which we only need when there are more than two possible outcomes).

**NOTE**: We don’t use the NBA dataset covered in the instructional video. The dataset for the assignment already contains all the necessary variables: 1) the bookmakers’ odds, 2) the salaries of the two teams, and 3) the game results. So you don’t need three different datasets as we did in the session.

In [15]:
import pandas as pd
import numpy as np
NBAmod = pd.read_excel("Assignment Data/NBA prediction model (Assignment).xlsx")

#### Step 1: Data preparation
1. Load the data 
2. Define variables for the probabilities of a home win and away win associated with bookmaker odds
3. Define a dummy variable = 1 if the home team loses, and zero otherwise
4. Define a variable equal to H if the home team wins and A if the visiting team wins
5. Define a variable equal to H if the home team win probability is greater than 0.5 according to the bookmaker odds and A otherwise
6. Define a variable equal to the logarithm of the ratio of the home team salaries to the visiting team salaries.
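
As a quick check of items 2 and 5, the conversion from odds to probabilities can be sketched in plain Python. This is a minimal sketch that assumes the odds are quoted as decimal odds; the function name `implied_probs` and the odds values are hypothetical and are not taken from the assignment data.

```python
def implied_probs(hwin_odds, hlose_odds):
    """Convert a pair of decimal odds into probabilities that sum to 1."""
    raw_home = 1 / hwin_odds
    raw_away = 1 / hlose_odds
    # The raw inverse odds typically sum to more than 1 (the bookmaker's margin),
    # so divide by their total to normalise.
    total = raw_home + raw_away
    return raw_home / total, raw_away / total

# Hypothetical odds for a single game (illustrative values only)
p_home, p_away = implied_probs(1.5, 2.6)

# Item 5: predict H when the home win probability exceeds 0.5, A otherwise
predicted = 'H' if p_home > 0.5 else 'A'
print(round(p_home, 3), round(p_away, 3), predicted)
```

Normalising by the total is the same calculation applied column-wise to `hwinodds` and `hloseodds` in the DataFrame.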

In [16]:
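# Step 1: work on a copy of the loaded data
nba = NBAmod.copy()

# Step 2: bookmaker probabilities, normalised so the two outcomes sum to 1
nba['HwinBmPr'] = (nba['hwinodds']**-1) / ((nba['hwinodds']**-1) + (nba['hloseodds']**-1))
nba['HloseBmPr'] = (nba['hloseodds']**-1) / ((nba['hwinodds']**-1) + (nba['hloseodds']**-1))

# Step 3: no new column here; the result dummy hwin used below comes with the data

# Step 4
# Skip for now

# Step 5: bookmaker's predicted result, coded 1 (home win) / 0 instead of H / A
nba['HwinBm'] = np.where(nba['HwinBmPr']>nba['HloseBmPr'], 1, 0)

# Step 6: log of the home-to-visitor salary ratio
nba['lgSalRat'] = np.log(nba['hteamsal'] / nba['opposal'])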

#### Step 2: Estimate a logit model of home team wins depending on the log salary ratio, using the data for calendar year 2018 as the “training data”.
1. Define a subset for the calendar year 2018 data 
2. Import the logistic regression package (copy the code for this from Week 1)
3. Run the logistic regression of hwin on the log salary ratio (copy the code for this from Week 1 while changing the variable names to the ones required here)

In [17]:
# Step 1
train = nba[nba.year==2018].copy()  # copy so later column assignments do not act on a slice of nba

# Step 2
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Step 3
logit_mod = smf.glm(formula='hwin~lgSalRat', data=train, family=sm.families.Binomial())
logit_result = logit_mod.fit()
display(logit_result.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | hwin | No. Observations: | 542 |
| Model: | GLM | Df Residuals: | 540 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -359.61 |
| Date: | Tue, 11 Jan 2022 | Deviance: | 719.21 |
| Time: | 22:27:31 | Pearson chi2: | 543.0 |
| No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.01149 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.4452 | 0.089 | 5.026 | 0.000 | 0.272 | 0.619 |
| lgSalRat | 1.1216 | 0.452 | 2.482 | 0.013 | 0.236 | 2.007 |


#### Step 3: Define the predicted probabilities and the predicted results, using the entire data set
1. The predicted probability of home win can be defined using the formula:

$$\frac{1}{1+\frac{1}{\exp(b_0 + b_1 \times \text{logsalaryratio})}}$$

> where $b_0$ is the constant (the intercept) in the logistic regression and $b_1$ is the coefficient for logsalaryratio.

2. Based on the predicted probability, define the predicted result H as the outcome where the predicted home win probability is greater than 0.5, and A otherwise.
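
To make the formula concrete, here is a minimal plain-Python sketch that plugs in the coefficients reported in the Step 2 regression output (intercept 0.4452, lgSalRat 1.1216). The function name `home_win_prob` and the example salary ratio of 1.2 are illustrative assumptions, not part of the assignment.

```python
import math

def home_win_prob(log_salary_ratio, b0=0.4452, b1=1.1216):
    """Predicted home win probability: 1 / (1 + 1 / exp(b0 + b1 * x))."""
    z = b0 + b1 * log_salary_ratio
    return 1 / (1 + 1 / math.exp(z))

# Equal salaries: the log ratio is 0, so only the intercept matters
p_equal = home_win_prob(0.0)

# Hypothetical case: the home team's salary bill is 20% higher than the visitor's
p_richer = home_win_prob(math.log(1.2))

# Predicted result: H when the probability exceeds 0.5, A otherwise
result_equal = 'H' if p_equal > 0.5 else 'A'
print(round(p_equal, 3), round(p_richer, 3), result_equal)
```

Because the intercept is positive, the predicted home win probability is above 0.5 even when the two teams spend the same amount on salaries.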

In [19]:
# Step 1: predicted home win probabilities from the fitted model
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
train = train.assign(lgSalRatWinProbs=logit_result.predict())  # in-sample (2018 training games)
nba['lgSalRatWinProbs'] = logit_result.predict(nba)            # entire data set (2018 and 2019)

# Step 2: predicted result, coded 1 (home win) when the probability exceeds 0.5, else 0
train = train.assign(lgSalRatHwin=np.where(train['lgSalRatWinProbs']>0.5, 1, 0))
nba['lgSalRatHwin'] = np.where(nba['lgSalRatWinProbs']>0.5, 1, 0)


#### Step 4: For games played in 2019, compare the bookmaker probabilities and model probabilities in terms of the mean number of successfully predicted outcomes and the Brier scores.

1. Define the subset of games played in calendar year 2019
2. Define a dummy variable equal to 1 when the bookmaker result prediction is correct, and zero otherwise. Define the equivalent variable for the logit model prediction.
3. Calculate the means for each of these variables
4. Define the Brier score for the bookmaker probabilities and the Brier score for the logit model probabilities
5. Calculate the mean of each Brier score
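
The two measures in this step can be sketched on a few made-up games in plain Python. This sketch assumes the two-outcome form of the Brier score, which sums the squared error over the home-win and home-loss outcomes (twice the single-outcome score); the helper name `brier_two_outcome` and the probabilities are hypothetical.

```python
def brier_two_outcome(p_home, home_won):
    """Squared error summed over both outcomes (home win and home loss)."""
    p_away = 1 - p_home
    away_won = 1 - home_won
    return (p_home - home_won) ** 2 + (p_away - away_won) ** 2

# (predicted home win probability, actual result: 1 = home win, 0 = home loss)
# Illustrative values only, not taken from the NBA data
games = [(0.70, 1), (0.60, 0), (0.45, 0), (0.55, 1)]

# 1 when the predicted result (home win if p > 0.5) matches the actual result
correct = [int((p > 0.5) == bool(y)) for p, y in games]
accuracy = sum(correct) / len(correct)

# Mean Brier score across games: lower means more accurate probabilities
mean_brier = sum(brier_two_outcome(p, y) for p, y in games) / len(games)
print(accuracy, round(mean_brier, 4))
```

Note that the Brier score is computed from the predicted probabilities and the actual results, not from the 0/1 predicted outcomes.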

In [25]:
# Step 1
nba19 = nba[nba.year==2019].copy()
nba19 = nba19.drop(columns=['day','month','Game(home-away)'])

# Step 2
nba19['BmTrue'] = np.where(nba19['hwin']==nba19['HwinBm'], 1, 0)
nba19['LgTrue'] = np.where(nba19['hwin']==nba19['lgSalRatHwin'], 1, 0)

# Step 3
Bm_mean = nba19['BmTrue'].mean()
Lg_mean = nba19['LgTrue'].mean()
print(f'Bookmaker Acc = {Bm_mean}\nLogit Acc = {Lg_mean}')

# Step 4 and Step 5: mean Brier scores, computed on the predicted probabilities
# against the actual results. brier_score_loss scores the home-win outcome only;
# mult by 2 accounts for True Wins and True Losses.
from sklearn.metrics import brier_score_loss
brierBm = 2 * brier_score_loss(nba19['hwin'], nba19['HwinBmPr'])
brierLg = 2 * brier_score_loss(nba19['hwin'], nba19['lgSalRatWinProbs'])
print(f'Bookmaker Brier = {brierBm}\nLogit Brier = {brierLg}')

Bookmaker Acc = 0.6918604651162791
Logit Acc = 0.5872093023255814


In [27]:
# Quiz
# Q1
# 542

# Q2
# 1.1216

# Q3
# Both are at 5%

# Q4
# It is the predicted prob of home win (WRONG)
# No natural interpretation (WRONG)
# It reflects value of home advantage

# Q5
# 69%

# Q6
# 59%

# Q7
# 0.394

# Q8
# 0.399 (WRONG)
# 0.477

# Q9
# Probs closer to actual

# Q10
# Logistic more reliable