## Summary

Perform a one-sample, one-sided $z$-test of a population proportion.

Data on the full-time results of all football games in the 2018/19 English Premier League season were obtained from [www.football-data.co.uk](https://www.football-data.co.uk/englandm.php).

We used a one-sample *z*-test to test the null hypothesis that there is no home advantage.
If there were no home advantage, then we would expect approximately a third of the games to be won by the home team.
(This was the null hypothesis.)
Otherwise, if there is a home advantage, then we would expect the proportion of games won by the home team to be greater than a third.
(This was the alternative hypothesis.)

More formally, we tested the hypotheses:

$$
H_{0}: p = \frac{1}{3};
\hspace{3mm} H_{1}: p > \frac{1}{3},
$$

where $p$ is the proportion of games won by the home team.
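
For reference, a common textbook form of the test statistic for these hypotheses is sketched below. It assumes the usual large-sample normal approximation; the symbol $\hat{p}$ is notation introduced here for the sample proportion $x/n$, where $x$ is the number of home wins and $n$ is the number of games.

$$
z = \frac{\hat{p} - \frac{1}{3}}{\sqrt{\frac{1}{3} \left( 1 - \frac{1}{3} \right) / n}},
\hspace{3mm} \hat{p} = \frac{x}{n}.
$$

Because the alternative is one-sided ($p > \frac{1}{3}$), only large positive values of $z$ count as evidence against $H_{0}$.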

General workflow:

1. Load the data
1. Describe the data
   - Given the data were nominal, we used a frequency table to describe them
1. Plot the data
1. Get an interval estimate[^1]
1. Perform the hypothesis test[^1]

This topic was covered in M248, Units 8 and 9.

## Dependencies

In [None]:
import pandas as pd
from statsmodels.stats import proportion as pr
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
sns.set_theme()

## Functions

In [None]:
def frequency(s: pd.Series, x: object) -> int:
    """Return the number of elements of s that are equal to x."""
    return int((s == x).sum())
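
As a quick check of what the helper is meant to do, here is a minimal sketch that applies it to a small made-up series. The `toy` data and the names `home_wins` and `same_count` are illustrative assumptions, not part of the analysis.

```python
import pandas as pd


def frequency(s: pd.Series, x: object) -> int:
    # Copy of the helper's behaviour: count the elements of s equal to x
    return int(s.map(lambda y: 1 if x == y else 0).sum())


# Made-up results, for illustration only (not the season's data)
toy = pd.Series(['home_win', 'draw', 'home_win', 'away_win'])

home_wins = frequency(toy, 'home_win')

# A boolean comparison followed by a sum gives the same count
same_count = int((toy == 'home_win').sum())

print(home_wins, same_count)
```

Both approaches count the matching elements; the boolean comparison avoids the element-wise `lambda`.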

## Constants

In [None]:
URL = ('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
       + '/main/epl_results/season-1819.csv')

## Main

### Load data

In [None]:
results = pd.read_csv(URL)

### Get data of interest

In [None]:
ftr = results.get(
    'FTR'
).rename(
    'result'
).replace(
    {'H': 'home_win', 'A': 'away_win', 'D': 'draw'}
)
ftr.info()

### Describe the data

In [None]:
# Frequency of each result, and its share of all games played
_g = ftr.groupby(ftr).size().rename('frequency').to_frame()
_g['proportion'] = _g['frequency'] / _g['frequency'].sum()
_g
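
An alternative way to build the same kind of frequency table is with `Series.value_counts`, sketched here on made-up data. The `toy` series and the `table` name are assumptions for illustration; the notebook's own table is the one built above.

```python
import pandas as pd

# Made-up results, for illustration only (not the season's data)
toy = pd.Series(['home_win', 'draw', 'home_win', 'away_win'], name='result')

# value_counts gives frequencies; normalize=True gives proportions instead
table = pd.DataFrame({
    'frequency': toy.value_counts(),
    'proportion': toy.value_counts(normalize=True),
}).sort_index()

print(table)
```

The two columns are aligned on the result labels, so the proportions sum to one by construction.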

### Visualise the data

In [None]:
_gs = sns.countplot(x=ftr)
plt.title('Final results of the English Premier League 2018/19 season')
plt.ylabel('frequency')
plt.show()

### Get the results

Variable `x` is the number of games won by the home team,
and variable `n` is the total number of games in the season.

In [None]:
x = frequency(ftr, 'home_win')
n = ftr.size

### Interval estimate of the proportion

In [None]:
pd.Series(data=pr.proportion_confint(x, n), index=['lcb', 'ucb'])
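
To make the interval less of a black box, the sketch below computes a normal-approximation (Wald) 95% interval by hand using only the standard library. It assumes that `proportion_confint` is being called with its documented defaults (`alpha=0.05`, `method='normal'`). The counts `x_demo` and `n_demo` are hypothetical and are not the season's results.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts, for illustration only (not the season's results)
x_demo, n_demo = 45, 100

p_hat = x_demo / n_demo                  # sample proportion
z_crit = NormalDist().inv_cdf(0.975)     # two-sided 95% critical value
se = sqrt(p_hat * (1 - p_hat) / n_demo)  # estimated standard error

lcb = p_hat - z_crit * se                # lower confidence bound
ucb = p_hat + z_crit * se                # upper confidence bound

print(round(lcb, 4), round(ucb, 4))
```

The interval is symmetric about the sample proportion, and it narrows as the number of games grows.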

### Perform the *z*-test

In [None]:
pd.Series(
    data=pr.proportions_ztest(x, n, value=1/3, alternative='larger'),
    index=['zstat', 'pvalue']
).round(6)
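
The following sketch works through a one-sided *z*-test by hand, using only the standard library and hypothetical counts (`x_demo`, `n_demo`), not the season's results. It shows two choices of standard error: one based on the null proportion, as in many textbook treatments, and one based on the sample proportion, which is what `proportions_ztest` is documented to use unless `prop_var` is supplied.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts, for illustration only (not the season's results)
x_demo, n_demo = 45, 100
p0 = 1 / 3                               # proportion under the null hypothesis
p_hat = x_demo / n_demo                  # sample proportion

# Standard error under the null hypothesis
se_null = sqrt(p0 * (1 - p0) / n_demo)
# Standard error estimated from the sample
se_sample = sqrt(p_hat * (1 - p_hat) / n_demo)

z_null = (p_hat - p0) / se_null
z_sample = (p_hat - p0) / se_sample

# One-sided ("larger") p-value: upper-tail probability of the standard normal
p_null = 1 - NormalDist().cdf(z_null)
p_sample = 1 - NormalDist().cdf(z_sample)

print(round(z_null, 4), round(p_null, 4))
print(round(z_sample, 4), round(p_sample, 4))
```

If the null-based standard error is preferred, the statsmodels documentation describes a `prop_var` argument that could be set to `1/3` in the call above; whether that matches the course's convention is worth checking against the M248 materials.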

## Footnotes

[^1]: See [statsmodels.stats.proportion.proportion_confint](https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html) and [statsmodels.stats.proportion.proportions_ztest](https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportions_ztest.html).