# DATA 335 - Winter 2025 - Lab 6

## Regression modelling for sports analytics

2025.03.04, 14:00-15:50, MS 521

In this lab, we'll study an application of logistic regression to a sports analytics problem. Specifically, we'll try to quantify evidence for the existence of a **home-field advantage** in the [CFL](https://www.cfl.ca) (Canadian Football League).

The file `data/2024CFLScores.csv` contains a record of all 95 CFL games played during the 2024 season, including preseason and playoff games.

In [167]:
import pandas as pd

df = pd.read_csv("../data/2024CFLScores.csv")
df

| Unnamed: 0 | week | visitor | host | visitor_score | host_score |
|---|---|---|---|---|---|
| 0 | Preseason Week 1 | WPG | SSK | 12 | 25 |
| 1 | Preseason Week 2 | OTT | HAM | 31 | 22 |
| 2 | Preseason Week 2 | SSK | EDM | 28 | 27 |
| 3 | Preseason Week 2 | BC | CGY | 6 | 30 |
| 4 | Preseason Week 2 | TOR | MTL | 13 | 30 |
| ... | ... | ... | ... | ... | ... |
| 90 | Eastern Semi-Final | OTT | TOR | 38 | 58 |
| 91 | Western Semi-Final | BC | SSK | 19 | 28 |
| 92 | Eastern Final | TOR | MTL | 30 | 28 |
| 93 | Western Final | SSK | WPG | 22 | 38 |


Suppose that the probability $p_{ij}$ of team $i$, the home team, winning a game against team $j$, the visiting team, is characterized by
$$
\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \alpha + \beta_i - \beta_j.\tag{$*$}
$$

Here, $\alpha$ quantifies the home-field advantage, while $\beta_i$ and $\beta_j$ represent the strengths of teams $i$ and $j$, respectively. Thus, the home team is favored to win (that is, $p_{ij} > 1/2$) if its strength, plus the home-field advantage, exceeds the strength of the visiting team.
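
To make the model concrete, the log-odds in ($*$) can be inverted to give $p_{ij} = 1/(1 + e^{-(\alpha + \beta_i - \beta_j)})$. The following is a minimal sketch with made-up parameter values; the numbers and the function name `home_win_probability` are illustrative assumptions, not estimates from the data.

```python
import math


def home_win_probability(alpha, beta_home, beta_visitor):
    """Invert the log-odds in model (*) to get the home team's win probability."""
    log_odds = alpha + beta_home - beta_visitor
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values, chosen only for illustration.
alpha = 0.3        # home-field advantage
beta_strong = 0.5  # strength of the stronger team
beta_weak = 0.0    # strength of the weaker team

# Evenly matched teams: the home team's edge comes from alpha alone.
p_even = home_win_probability(alpha, 0.0, 0.0)

# Weaker team at home against the stronger team:
# the log-odds are 0.3 + 0.0 - 0.5 < 0, so the home team is the underdog.
p_weak_home = home_win_probability(alpha, beta_weak, beta_strong)

print(f"even matchup: {p_even:.3f}, weaker team at home: {p_weak_home:.3f}")
```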

Model ($*$) is a version of the [*Bradley-Terry ranking model*](https://en.wikipedia.org/wiki/Bradley–Terry_model), modified to include home-field advantage.

Introduce an indicator variable $x_t$ for each team $t$ that takes the value $1$, $-1$, or $0$ for game $k$ according to whether team $t$ is the home team for game $k$, the visiting team for game $k$, or not involved in game $k$. 
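
As a toy illustration of this encoding (three hypothetical teams `A`, `B`, `C` and two made-up games, not the CFL data), each game becomes one row with $1$ in the home team's column and $-1$ in the visiting team's column. The helper name `encode_game` is an assumption made for this sketch.

```python
# Hypothetical team codes and games, used only to illustrate the encoding.
teams = ["A", "B", "C"]
games = [("A", "B"), ("B", "C")]  # (visitor, host) pairs


def encode_game(visitor, host, teams):
    """Return one row: +1 for the host, -1 for the visitor, 0 for everyone else."""
    row = [0] * len(teams)
    row[teams.index(host)] = 1
    row[teams.index(visitor)] = -1
    return row


X_toy = [encode_game(visitor, host, teams) for visitor, host in games]
print(X_toy)  # [[-1, 1, 0], [0, -1, 1]]
```

The same idea should carry over to the nine CFL teams, using the `visitor` and `host` columns of `df`.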

**To do:** Generate the array `X` of shape `(95, 9)` as described above.

Encoded in this way, the right-hand side of ($*$) can be expressed as
$$
\alpha + \sum_t \beta_tx_t.
$$
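
The identity can be checked numerically on a single made-up game. This is a small sketch in which the team codes and parameter values are hypothetical. It also shows that adding the same constant to every $\beta_t$ leaves the expression unchanged, since each game has exactly one $1$ and one $-1$; the strengths are therefore only determined up to a common shift.

```python
import math

# Hypothetical parameter values, for illustration only.
alpha = 0.3
beta = {"A": 0.5, "B": 0.0, "C": -0.4}

# One game: A is the home team, C is the visitor, B is not involved.
x = {"A": 1, "B": 0, "C": -1}

linear_predictor = alpha + sum(beta[t] * x[t] for t in beta)
direct = alpha + beta["A"] - beta["C"]
assert math.isclose(linear_predictor, direct)

# Shifting every strength by the same constant changes nothing,
# because the +1 and -1 entries cancel the shift.
shifted = {t: b + 10.0 for t, b in beta.items()}
shifted_predictor = alpha + sum(shifted[t] * x[t] for t in shifted)
assert math.isclose(shifted_predictor, linear_predictor)
```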

**To do:** Generate a "home-team-wins" binary variable `y` of shape `(95,)`. Fit a logistic regression of `y` on `X` to estimate the parameters $\alpha$ and $\beta_t$. You may run into a problem involving tie games. I'll leave it to you to come up with a solution.

  - Rank the teams from best to worst in terms of decreasing value of $\beta_t$. Does this ranking match the league standings for 2024?

  - For which pairs of teams does the expected result of a game between them depend on where the game is being played? That is, which pairs of teams are sufficiently close in skill level that the estimated home-field advantage makes up the gap, leading to the less skilled team being favored to win if the game is played in its city?
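
One way to think about the second question: under ($*$), the home team $i$ is favored when $\alpha + \beta_i - \beta_j > 0$. The favorite changes with the venue exactly when each team is favored at home, which for $\alpha > 0$ reduces to $|\beta_i - \beta_j| < \alpha$. The sketch below applies this rule to made-up estimates; the names `alpha_hat` and `beta_hat`, the team codes, and all numbers are hypothetical placeholders for whatever your fitted model produces.

```python
from itertools import combinations

# Hypothetical estimates, standing in for the output of a fitted model.
alpha_hat = 0.3
beta_hat = {"A": 0.5, "B": 0.3, "C": -0.1, "D": -0.7}

# Rank teams from strongest to weakest by estimated strength.
ranking = sorted(beta_hat, key=beta_hat.get, reverse=True)

# A pair is venue-dependent when the strength gap is smaller than
# the home-field advantage, so whichever team hosts is favored.
venue_dependent = [
    (s, t)
    for s, t in combinations(ranking, 2)
    if abs(beta_hat[s] - beta_hat[t]) < alpha_hat
]

print(ranking)          # ['A', 'B', 'C', 'D']
print(venue_dependent)  # [('A', 'B')]
```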

### For your interest

I generated the data file by scraping this [page](https://www.cfl.ca/schedule/2024/) on the CFL web site using `bs4`. My code is below.

In [None]:
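from bs4 import BeautifulSoup
import pandas as pd

# Read the saved schedule page and parse it with Python's built-in HTML parser.
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
with open("../data/2024CFLSeasonScheduleAndScores.html", "r") as f:
    html = f.read()
soup = BeautifulSoup(html, "html.parser")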

week = []
visitor = []
visitor_score = []
host = []
host_score = []

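# Each ".schedule-week" block holds the games for one week of the schedule.
for schedule_week in soup.select(".schedule-week"):
    # Keep the first three characters of each team span as the team code; the
    # second strip() removes whitespace left over when the code is shorter
    # than three characters (e.g. "BC").
    visitors = [
        span.text.strip()[:3].strip() for span in schedule_week.select("span.visitor")
    ]
    visitor.extend(visitors)
    visitor_scores = [
        int(span.text.strip()) for span in schedule_week.select("span.visitor-score")
    ]
    visitor_score.extend(visitor_scores)
    hosts = [
        span.text.strip()[:3].strip() for span in schedule_week.select("span.host")
    ]
    host.extend(hosts)
    host_scores = [
        int(span.text.strip()) for span in schedule_week.select("span.host-score")
    ]
    host_score.extend(host_scores)
    # Label every game in this block with the text of the block's <h2> heading.
    h2 = schedule_week.find("h2")
    assert h2 is not None
    weeks = [h2.text for _ in visitors]
    week.extend(weeks)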

df = pd.DataFrame(
    {
        "week": week,
        "visitor": visitor,
        "host": host,
        "visitor_score": visitor_score,
        "host_score": host_score,
    }
)

df