In [158]:
import pandas as pd
import numpy as np

results = pd.read_csv("1_bradley_terry_model.csv")
results.head(10)

Unnamed: 0,Date,Away Team,Away Pts,Home Team,Home Pts
0,Mar 22 (Thu 7:25pm),Carlton,95,Richmond,121
1,Mar 23 (Fri 7:50pm),Adelaide,87,Essendon,99
2,Mar 24 (Sat 3:35pm),Brisbane Lions,82,St Kilda,107
3,Mar 24 (Sat 4:35pm),Fremantle,60,Port Adelaide,110
4,Mar 24 (Sat 7:25pm),North Melbourne,39,Gold Coast,55
5,Mar 24 (Sat 7:25pm),Collingwood,67,Hawthorn,101
6,Mar 25 (Sun 1:10pm),Western Bulldogs,51,GWS Giants,133
7,Mar 25 (Sun 3:20pm),Geelong,97,Melbourne,94
8,Mar 25 (Sun 7:20pm),Sydney,115,West Coast,86
9,Mar 29 (Thu 7:50pm),Richmond,82,Adelaide,118


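Add a few basic housekeeping columns:
- `hfa`: home-field advantage placeholder, set to 1 for every game
- `game_total`: combined points scored by both teams
- `home_mov`: the home team's margin of victory (home points minus away points)
- `game_result`: 1 if the home team won, 0 otherwise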

In [159]:
results['hfa'] = 1
results['game_total'] = results['Home Pts'] + results['Away Pts']
results['home_mov'] = results['Home Pts'] - results['Away Pts']
results['game_result'] = (results['Home Pts'] > results['Away Pts']).astype(int)

To run a regression on pairwise data, we need to arrive at the following structure:
- A column for HFA that is set to 1 in every row.
    - After running the regression, the coefficient on that column represents how much HFA matters.

- A column for each team.
- A nonzero entry in a team's column if that team played in that game.
- A sign to designate which of the two teams is which: +1 for the home team, -1 for the away team.
    - After running the regression, the coefficients on these columns represent the logistic_rating for each team.
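
As a minimal sketch of the target structure, the helper below builds one design row for a single game. The function name `design_row` and the three-team list are hypothetical and only for illustration; the notebook itself builds these columns directly on the `results` DataFrame.

```python
def design_row(home, away, teams):
    """Return [hfa, team_1, ..., team_n] for a single game."""
    row = {team: 0 for team in teams}
    row[home] = 1    # home team is marked +1
    row[away] = -1   # away team is marked -1
    return [1] + [row[team] for team in teams]  # leading 1 is the hfa column


example_teams = ["Carlton", "Essendon", "Richmond"]  # hypothetical subset
example_row = design_row("Richmond", "Carlton", example_teams)
print(example_row)  # [1, -1, 0, 1]
```

Teams that did not play in the game stay at 0, so each row has exactly two nonzero team entries plus the constant `hfa` entry.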


Add a column for each team and set it to +1 if that team played at home in that game and -1 if it played away (there should be exactly two nonzero team entries per row).

In [160]:
teams = set(results['Away Team'].unique().tolist() + results['Home Team'].unique().tolist())
teams

{'Adelaide',
 'Brisbane Lions',
 'Carlton',
 'Collingwood',
 'Essendon',
 'Fremantle',
 'GWS Giants',
 'Geelong',
 'Gold Coast',
 'Hawthorn',
 'Melbourne',
 'North Melbourne',
 'Port Adelaide',
 'Richmond',
 'St Kilda',
 'Sydney',
 'West Coast',
 'Western Bulldogs'}

Add a column for each team, initialized to 0.

In [161]:
for team in teams:
    # start every team indicator at 0; the next cell fills in +1/-1
    results[team] = 0
results.head(5)

Unnamed: 0,Date,Away Team,Away Pts,Home Team,Home Pts,hfa,game_total,home_mov,game_result,Richmond,...,Port Adelaide,Melbourne,Adelaide,Essendon,Carlton,Geelong,Fremantle,Western Bulldogs,Brisbane Lions,Collingwood
0,Mar 22 (Thu 7:25pm),Carlton,95,Richmond,121,1,216,26,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Mar 23 (Fri 7:50pm),Adelaide,87,Essendon,99,1,186,12,1,0,...,0,0,0,0,0,0,0,0,0,0
2,Mar 24 (Sat 3:35pm),Brisbane Lions,82,St Kilda,107,1,189,25,1,0,...,0,0,0,0,0,0,0,0,0,0
3,Mar 24 (Sat 4:35pm),Fremantle,60,Port Adelaide,110,1,170,50,1,0,...,0,0,0,0,0,0,0,0,0,0
4,Mar 24 (Sat 7:25pm),North Melbourne,39,Gold Coast,55,1,94,16,1,0,...,0,0,0,0,0,0,0,0,0,0


In [162]:
for idx, row in results.iterrows():
    # +1 marks the home team, -1 marks the away team
    results.loc[idx, row['Home Team']] = 1
    results.loc[idx, row['Away Team']] = -1
results.head(10)

Unnamed: 0,Date,Away Team,Away Pts,Home Team,Home Pts,hfa,game_total,home_mov,game_result,Richmond,...,Port Adelaide,Melbourne,Adelaide,Essendon,Carlton,Geelong,Fremantle,Western Bulldogs,Brisbane Lions,Collingwood
0,Mar 22 (Thu 7:25pm),Carlton,95,Richmond,121,1,216,26,1,1,...,0,0,0,0,-1,0,0,0,0,0
1,Mar 23 (Fri 7:50pm),Adelaide,87,Essendon,99,1,186,12,1,0,...,0,0,-1,1,0,0,0,0,0,0
2,Mar 24 (Sat 3:35pm),Brisbane Lions,82,St Kilda,107,1,189,25,1,0,...,0,0,0,0,0,0,0,0,-1,0
3,Mar 24 (Sat 4:35pm),Fremantle,60,Port Adelaide,110,1,170,50,1,0,...,1,0,0,0,0,0,-1,0,0,0
4,Mar 24 (Sat 7:25pm),North Melbourne,39,Gold Coast,55,1,94,16,1,0,...,0,0,0,0,0,0,0,0,0,0
5,Mar 24 (Sat 7:25pm),Collingwood,67,Hawthorn,101,1,168,34,1,0,...,0,0,0,0,0,0,0,0,0,-1
6,Mar 25 (Sun 1:10pm),Western Bulldogs,51,GWS Giants,133,1,184,82,1,0,...,0,0,0,0,0,0,0,-1,0,0
7,Mar 25 (Sun 3:20pm),Geelong,97,Melbourne,94,1,191,-3,0,0,...,0,1,0,0,0,-1,0,0,0,0
8,Mar 25 (Sun 7:20pm),Sydney,115,West Coast,86,1,201,-29,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Mar 29 (Thu 7:50pm),Richmond,82,Adelaide,118,1,200,36,1,-1,...,0,0,1,0,0,0,0,0,0,0


## Run the Logistic Regression With StatsModels

Simply running a stock logistic regression is not quite right.

We specifically want it to model the probability of a home win with the following function: `1/(1+e**(-(HFA + HT - AT)))`

where:
- `HFA` = home field advantage
- `HT` = the home team's rating
- `AT` = the away team's rating

See [Statsmodels Documentation](https://www.statsmodels.org/dev/examples/notebooks/generated/generic_mle.html) for background on how to do this.
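
To make the target function concrete, here is a small sketch that evaluates it directly. The function name `home_win_probability` is hypothetical, and the team ratings passed in are made-up numbers; only the `hfa` value of 0.2435 is taken from the Logit summary in this notebook.

```python
import math


def home_win_probability(hfa, home_rating, away_rating):
    """Evaluate 1/(1+e**(-(HFA + HT - AT)))."""
    return 1.0 / (1.0 + math.exp(-(hfa + home_rating - away_rating)))


# Two evenly matched teams with no home advantage: a coin flip.
p_neutral = home_win_probability(0.0, 0.0, 0.0)

# Two evenly matched teams with hfa = 0.2435: roughly a 56% home win chance.
p_equal_teams = home_win_probability(0.2435, 0.0, 0.0)

# Hypothetical ratings: a stronger home team (0.5) hosting a weaker one (-0.5).
p_mismatch = home_win_probability(0.2435, 0.5, -0.5)

print(p_neutral, p_equal_teams, p_mismatch)
```

Only the difference `HT - AT` enters the function, which is why the home and away teams are encoded as +1 and -1 in the same row.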

In [163]:
from statsmodels.discrete.discrete_model import Logit

X_cols = ['hfa'] + list(teams)
X = results[X_cols]
y = results['game_result']

model = Logit(y, X)
result = model.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.493019
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:            game_result   No. Observations:                  198
Model:                          Logit   Df Residuals:                      180
Method:                           MLE   Df Model:                           17
Date:                Wed, 15 Jan 2020   Pseudo R-squ.:                  0.2854
Time:                        18:07:09   Log-Likelihood:                -97.618
converged:                       True   LL-Null:                       -136.60
                                        LLR p-value:                 8.826e-10
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
hfa                  0.2435      0.178      1.371      0.171      -0.105       0.592
Richmond   

  bse_ = np.sqrt(np.diag(self.cov_params()))
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
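
One property of this encoding is worth keeping in mind when reading the coefficients: every row has a +1 and a -1 across the team columns, so only rating differences enter the model. Adding the same constant to every team's rating leaves every predicted probability unchanged, which means the ratings are only determined up to a shift, and an unregularized fit may have trouble estimating standard errors. The sketch below illustrates the shift invariance with hypothetical ratings; it is not taken from the fitted model.

```python
import math


def home_win_probability(hfa, home_rating, away_rating):
    return 1.0 / (1.0 + math.exp(-(hfa + home_rating - away_rating)))


ratings = {"A": 0.8, "B": -0.3}  # hypothetical ratings
shifted = {team: r + 5.0 for team, r in ratings.items()}  # add 5 to everyone

p_original = home_win_probability(0.2, ratings["A"], ratings["B"])
p_shifted = home_win_probability(0.2, shifted["A"], shifted["B"])

# The two probabilities agree up to floating-point rounding.
print(math.isclose(p_original, p_shifted))  # True
```

A common way to pin the ratings down is to fix one team's rating at 0 (drop its column) and read the other ratings relative to it.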


## Run the Logistic Regression With Scikit Learn

In [164]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(X_cols)
print(model.coef_)
print(model.intercept_)

['hfa', 'Richmond', 'St Kilda', 'Sydney', 'GWS Giants', 'Hawthorn', 'North Melbourne', 'Gold Coast', 'West Coast', 'Port Adelaide', 'Melbourne', 'Adelaide', 'Essendon', 'Carlton', 'Geelong', 'Fremantle', 'Western Bulldogs', 'Brisbane Lions', 'Collingwood']
[[ 0.10315611  1.28042479 -1.14388266  0.61769174  0.51593326  0.66342965
   0.06734911 -1.33190497  0.87835988  0.19494668  0.43559295  0.22111071
   0.26134314 -1.73973762  0.43584334 -0.48682218 -0.49502591 -1.01990007
   0.64524816]]
[0.10315611]
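
Note that `LogisticRegression()` fits its own intercept by default (`fit_intercept=True`) and applies L2 regularization by default, so the constant `hfa` column and the intercept both act as a home advantage term. In the output above they have the same value, and a prediction adds both. The sketch below reproduces one prediction by hand from the printed coefficients; the choice of Richmond hosting Carlton is just an example.

```python
import math

coef_hfa = 0.10315611    # first entry of model.coef_ above
intercept = 0.10315611   # model.intercept_ above
richmond = 1.28042479    # coefficient printed for Richmond
carlton = -1.73973762    # coefficient printed for Carlton

# The constant column and the intercept both contribute to home advantage.
effective_hfa = coef_hfa + intercept

# Richmond at home (+1) against Carlton away (-1).
log_odds = effective_hfa + richmond - carlton
p_richmond_home_win = 1.0 / (1.0 + math.exp(-log_odds))

print(round(effective_hfa, 4), round(p_richmond_home_win, 3))
```

Passing `fit_intercept=False` is one way to make scikit-learn estimate a single home advantage term through the `hfa` column only; the coefficients would still differ from the statsmodels fit because of the default regularization.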




## Background

#### Squared2020 Article

Just bite the bullet and read up on the Squared2020 Article [here](https://squared2020.com/2017/11/09/bradley-terry-rankings-introduction-to-logistic-regression/).

> We focus on a probabilistic framework for measuring pairwise comparisons of teams through the use of home-away indicators and win-loss results.

> Focuses on pairwise matchups between teams, dependent on location -- the result is whether a particular team came away with a victory.

^^ Yes, this is our use case.

> Each row is a game.

> Home team is delineated as a +1. Away team as a -1.

Interesting, so one row of data looks like: 

__(0,-1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)__

The -1 is the away team, the +1 is the home team, and the last 1 is an intercept.

> Persisting with usual least-squares method will lead us to predicting __real values__ instead of __win-loss__ which is what we are ultimately after. To account for this, enter __logistic regression__.

> Logistic Regression is a methodology for identifying a regression model for __binary response data__ (who won and who lost). 

Idea: Just give the home team a +1 and the away team a -1, and add the HFA into the formula as before.

#### My StackOverflow Post

A comment on [my SO post](https://stackoverflow.com/questions/59741811/retain-data-labels-in-statsmodels-likehood-model) asked:
> Why don't you use the premade models Logit or GLM with family Binomial? Using the formula interface automatically transforms categorical (e.g. string) variables. Without using formula, the user has to provide numeric data, that is directly used in the models.

#### Machine Learning Mastery

[Here](https://machinelearningmastery.com/logistic-regression-for-machine-learning/): Helps to explain that Logistic Regression is predicting the probability of the home team winning.

The probability is expressed in the following way: 

`ln(odds) = b0 + b1 * X`

`odds = e^(b0 + b1 * X)`

So logistic regression assumes a linear relationship between the input variables and the natural logarithm of the odds of the default class (here, a home win).
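
To connect the odds form back to a probability, the sketch below converts log-odds to odds and then to a probability, and checks that this matches the `1/(1+e**(-x))` form used earlier. The function names are hypothetical and the input value is arbitrary.

```python
import math


def probability_from_log_odds(log_odds):
    odds = math.exp(log_odds)    # odds = e^(b0 + b1 * X)
    return odds / (1.0 + odds)   # probability = odds / (1 + odds)


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


x = 0.75  # arbitrary value of b0 + b1 * X
p_from_odds = probability_from_log_odds(x)
p_from_logistic = logistic(x)

# Both routes give the same probability.
print(math.isclose(p_from_odds, p_from_logistic))  # True
```

Log-odds of 0 correspond to odds of 1 and a probability of 0.5, so positive log-odds favor the home team and negative log-odds favor the away team.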

#### Cross Validated Post

[This Cross Validated question](https://stats.stackexchange.com/questions/440242/statsmodels-logistic-regression-adding-intercept) helps to clarify the role of the intercept in a logistic regression equation.

In [171]:
a = ['Richmond',
 'St Kilda',
 'Sydney',
 'GWS Giants',
 'Hawthorn',
 'North Melbourne',
 'Gold Coast',
 'West Coast',
 'Port Adelaide',
 'Melbourne',
 'Adelaide',
 'Essendon',
 'Carlton',
 'Geelong',
 'Fremantle',
 'Western Bulldogs',
 'Brisbane Lions',
 'Collingwood']

b = [1.28042479,
-1.14388266,
0.61769174,
0.51593326,
0.66342965,
0.06734911,
-1.33190497,
0.87835988,
0.19494668,
0.43559295,
0.22111071,
0.26134314,
-1.73973762,
0.43584334,
-0.48682218,
-0.49502591,
-1.01990007,
0.64524816]

In [172]:
zipped = list(zip(a, b))
zipped

[('Richmond', 1.28042479),
 ('St Kilda', -1.14388266),
 ('Sydney', 0.61769174),
 ('GWS Giants', 0.51593326),
 ('Hawthorn', 0.66342965),
 ('North Melbourne', 0.06734911),
 ('Gold Coast', -1.33190497),
 ('West Coast', 0.87835988),
 ('Port Adelaide', 0.19494668),
 ('Melbourne', 0.43559295),
 ('Adelaide', 0.22111071),
 ('Essendon', 0.26134314),
 ('Carlton', -1.73973762),
 ('Geelong', 0.43584334),
 ('Fremantle', -0.48682218),
 ('Western Bulldogs', -0.49502591),
 ('Brisbane Lions', -1.01990007),
 ('Collingwood', 0.64524816)]
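
A possible next step, assuming a higher coefficient means a stronger team, is to sort the pairs into a ranking. The sketch below uses three of the pairs above so that it runs on its own; applying the same `sorted` call to `zipped` would rank all 18 teams.

```python
# Three (team, rating) pairs copied from the list above.
sample = [
    ("Richmond", 1.28042479),
    ("Carlton", -1.73973762),
    ("West Coast", 0.87835988),
]

# Sort by rating, strongest team first.
ranked = sorted(sample, key=lambda pair: pair[1], reverse=True)

for position, (team, rating) in enumerate(ranked, start=1):
    print(position, team, round(rating, 3))
```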