# Analysing the home advantage in the English Premier League

## Notes

### Question of interest

The coronavirus pandemic meant that all games in the English Premier
League were played in empty stadiums.
It is claimed that this diminished the home team advantage.
In the second analysis, we test if the proportion of home team wins
in the 17/18 and 18/19 seasons (*pre-COVID*) was equal to that of the
20/21 season (*post-COVID*).

### Data

- Data comprise the full results for the 17/18, 18/19, and 20/21
  seasons
- Data fields
  - **game_id** `int` : game number, unique per season
  - **season** `int` : the season of the game, one of
    [1718, 1819, 2021]
  - **homegoals** `int` : number of goals scored by the home team
  - **awaygoals** `int` : number of goals scored by the away team

### Method

- Data comprised 1140 game results
- Identified the result of the game, home win or not
  - Calculated the goal difference (homegoals - awaygoals)
  - If goal difference > 0, then home win, else not home win
- Each sample modelled by a normal approximation to the binomial
  - Justified by the sample sizes (380 games per season, so 760
    *pre-COVID* and 380 *post-COVID*)
- Calculated the proportion of games won by the home team, and a 95%
  **z**-interval for the proportion, for the combined 17/18
  and 18/19 seasons (*pre-COVID*) and the 20/21 season (*post-COVID*)
- Tested the hypothesis that the proportion of games won by the
  home team *pre-COVID* was equal to that *post-COVID*

## Full results

### Set up the notebook

In [1]:
# import packages and modules
import statsmodels.stats.proportion as sm
import pandas as pd
import sys

In [2]:
# import custom modules not in root
sys.path.insert(0, "..")  # make the project root importable on any OS
from src import load, describe, summarise  # noqa: E402

In [3]:
# declare functions
def is_home_win(x: int) -> int:
    """Return 1 if the goal difference x is positive (home win), else 0."""
    return 1 if x > 0 else 0

### Import the data

In [4]:
# get data
epl: pd.DataFrame = load.Data.get("epl_results")

In [5]:
# preview data
epl.head()

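|   | game_id | season | homegoals | awaygoals |
|---|---------|--------|-----------|-----------|
| 0 | 1       | 1718   | 4         | 3         |
| 1 | 2       | 1718   | 0         | 2         |
| 2 | 3       | 1718   | 2         | 3         |
| 3 | 4       | 1718   | 0         | 3         |
| 4 | 5       | 1718   | 1         | 0         |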

In [6]:
# check dtypes
epl.dtypes

game_id      int64
season       int64
homegoals    int64
awaygoals    int64
dtype: object

### Transform the data

In [7]:
# get goal difference
epl["goaldiff"] = epl["homegoals"] - epl["awaygoals"]

In [8]:
# identify if home win or not
epl["homewin"] = epl["goaldiff"].apply(is_home_win)
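
The same flag can be built without a row-wise `apply`, by comparing the whole column at once. The sketch below is an alternative, not the notebook's method, and it rebuilds only the five rows shown in the preview as a stand-in for the full data set.

```python
import pandas as pd

# The five previewed rows only; a stand-in for the full data set.
epl = pd.DataFrame(
    {
        "game_id": [1, 2, 3, 4, 5],
        "season": [1718] * 5,
        "homegoals": [4, 0, 2, 0, 1],
        "awaygoals": [3, 2, 3, 3, 0],
    }
)

epl["goaldiff"] = epl["homegoals"] - epl["awaygoals"]
# Boolean comparison cast to int: 1 for a home win, 0 for a draw or away win.
epl["homewin"] = (epl["goaldiff"] > 0).astype(int)
print(epl["homewin"].tolist())  # [1, 0, 0, 0, 1]
```

Either way, a draw (goal difference of 0) counts as "not home win", so the proportion measures outright home wins only.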

### Analyse the data

In [9]:
# get number of games per season
n: int = int(epl.index.size/3)

In [10]:
# get number of home wins per sample
w_pre: int = epl.query('season == 1718 | season == 1819')["homewin"].sum()
w_post: int = epl.query('season == 2021')["homewin"].sum()

In [11]:
describe.Proportion(w_pre/(2*n), sm.proportion_confint(w_pre, 2*n))

Proportion(p_hat=0.465789, zconfint_prop=(0.430325, 0.501254))

In [12]:
describe.Proportion(w_post/(n), sm.proportion_confint(w_post, n))

Proportion(p_hat=0.378947, zconfint_prop=(0.330171, 0.427724))

In [13]:
# get test results
zstat, pval = sm.test_proportions_2indep(w_pre, 2*n, w_post, n)
summarise.PropTest(zstat, pval)

ResultSummary(zstat=2.810184, pval=0.004951)
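
The call above does not pass a `method`, so the statistic depends on the statsmodels default. Assuming that default for a difference in proportions is the Agresti-Caffo adjustment (one success and one failure added to each sample), the result can be checked by hand. The helper name `agresti_caffo` is hypothetical, and the counts are back-calculated from the reported proportions rather than read from the data.

```python
from math import erfc, sqrt


def agresti_caffo(count1: int, nobs1: int, count2: int, nobs2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: p1 == p2, Agresti-Caffo adjusted."""
    # Add one success and one failure to each sample before estimating.
    n1, n2 = nobs1 + 2, nobs2 + 2
    p1, p2 = (count1 + 1) / n1, (count2 + 1) / n2
    # Unpooled standard error of the adjusted difference.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal.
    pval = erfc(abs(z) / sqrt(2))
    return z, pval


# Assumed counts: 354 home wins in 760 games pre-COVID, 144 in 380 post-COVID.
z, pval = agresti_caffo(354, 760, 144, 380)
print(round(z, 4), round(pval, 4))
```

With these counts the sketch gives a statistic of about 2.81 and a p-value of about 0.005, in line with the `ResultSummary` above, so at the 5% level the hypothesis of equal home-win proportions would be rejected.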