## Summary

Perform a one-sample, one-sided $z$-test of a population proportion.

Data on the full-time results of all football games in the 2018/19 English Premier League season were obtained from [www.football-data.co.uk](https://www.football-data.co.uk/englandm.php).

We used a one-sample *z*-test to test the null hypothesis that there is no home advantage.
If there were no home advantage, we would expect the home team to win approximately a third of the games.
(This was the null hypothesis.)
Otherwise, if there were a home advantage, we would expect the proportion of games won by the home team to be greater than a third.
(This was the alternative hypothesis.)

More formally, we tested the hypotheses:

$$
H_{0}: p = \frac{1}{3};
\hspace{3mm} H_{1}: p > \frac{1}{3},
$$

where $p$ is the proportion of games won by the home team.

General workflow:

1. Load the data
1. Describe the data
   - Given the data were nominal, we used a frequency table to describe them
1. Plot the data
1. Get an interval estimate[^1]
1. Perform the hypothesis test[^2]

This topic was covered in M248, Units 8 and 9.

## Dependencies

In [1]:
import pandas as pd
from statsmodels.stats import proportion as pr

## Constants

In [3]:
URL = ('https://raw.githubusercontent.com/ljk233/laughingrook-datasets'
       + '/main/epl_results/season-1819.csv')

## Main

### Load data

In [4]:
results = pd.read_csv(URL)

### Get data of interest

In [5]:
ftr = results.get(
    'FTR'
).rename(
    'result'
).replace(
    {'H': 'home_win', 'A': 'away_win', 'D': 'draw'}
)
ftr.info()

<class 'pandas.core.series.Series'>
RangeIndex: 380 entries, 0 to 379
Series name: result
Non-Null Count  Dtype 
--------------  ----- 
380 non-null    object
dtypes: object(1)
memory usage: 3.1+ KB


### Describe the data

In [6]:
_g = ftr.groupby(ftr).size().rename('frequency').to_frame()
_g['proportion'] = _g['frequency'] / _g['frequency'].sum()
_g

| result   | frequency | proportion |
|:---------|----------:|-----------:|
| away_win |       128 |   0.336842 |
| draw     |        71 |   0.186842 |
| home_win |       181 |   0.476316 |
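
As a quick sanity check on the table, each proportion is just the frequency divided by the total number of games. The sketch below copies the three frequencies from the table into a plain dictionary (the names `frequency` and `proportion` are illustrative only) and prints a rough text-only bar for each result; it is not the plot used in the original analysis.

```python
# Frequencies copied from the frequency table above.
frequency = {'away_win': 128, 'draw': 71, 'home_win': 181}
total = sum(frequency.values())

# Proportion = frequency / total number of games.
proportion = {result: count / total for result, count in frequency.items()}

# One '#' per 2% of games, as a crude text-only bar chart.
for result, p in proportion.items():
    print(f"{result:<9} {p:.6f} {'#' * round(p * 50)}")
```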


### Get the results

The variable `x` is the number of games won by the home team,
and `n` is the total number of games in the season.

In [11]:
x = len([r for r in ftr if r == 'home_win'])
n = ftr.size
x  # display the number of home wins

181

### Interval estimate of the proportion

In [None]:
pd.Series(data=pr.proportion_confint(x, n), index=['lcb', 'ucb'])

lcb    0.426100
ucb    0.526531
dtype: float64
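
As a rough cross-check, the interval above can be reproduced by hand. The sketch below assumes that `proportion_confint` was left on its default settings, which appear to be a 95% normal-approximation (Wald) interval, and it takes `x = 181` and `n = 380` from the earlier cells. The helper name `wald_interval` is hypothetical and is not part of statsmodels.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = x / n
    # Standard error estimated from the sample proportion.
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, about 1.96 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat - z_crit * se, p_hat + z_crit * se


lcb, ucb = wald_interval(181, 380)
print(f"lcb = {lcb:.6f}, ucb = {ucb:.6f}")
```

The lower bound lies above 1/3, which is in line with the outcome of the hypothesis test.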

### Perform the *z*-test

In [None]:
pd.Series(
    data=pr.proportions_ztest(x, n, value=1/3, alternative='larger'),
    index=['zstat', 'pvalue']
).round(6)

zstat     5.580747
pvalue    0.000000
dtype: float64
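
The *p*-value is shown as zero only because of the rounding to six decimal places. The reported statistic is also consistent with a standard error estimated from the sample proportion, which is what `proportions_ztest` does when its `prop_var` argument is left at `False`. A textbook version of the test would instead use the null value $p_{0} = 1/3$ in the standard error. The sketch below computes both variants by hand; the function name `one_sided_ztest` and its `use_null_var` flag are hypothetical, and `x = 181`, `n = 380` are taken from the earlier cells.

```python
from math import erfc, sqrt


def one_sided_ztest(x: int, n: int, p0: float, use_null_var: bool = False):
    """Upper-tailed one-sample z-test of a proportion (H1: p > p0)."""
    p_hat = x / n
    # Base the variance on the sample proportion by default,
    # or on the null value p0 when use_null_var is True.
    p_var = p0 if use_null_var else p_hat
    se = sqrt(p_var * (1 - p_var) / n)
    zstat = (p_hat - p0) / se
    # Upper-tail probability of the standard normal distribution;
    # erfc avoids the precision loss of computing 1 - cdf(zstat).
    pvalue = 0.5 * erfc(zstat / sqrt(2))
    return zstat, pvalue


zstat, pvalue = one_sided_ztest(181, 380, 1 / 3)
zstat_null, pvalue_null = one_sided_ztest(181, 380, 1 / 3, use_null_var=True)
print(f"sample-variance version: z = {zstat:.6f}, p = {pvalue:.3g}")
print(f"null-variance version:   z = {zstat_null:.6f}, p = {pvalue_null:.3g}")
```

Either way the *p*-value is far below 0.000001, so the choice of standard error does not change the conclusion here.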

[^1]: [statsmodels.stats.proportion.proportion_confint](https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html)
[^2]: [statsmodels.stats.proportion.proportions_ztest](https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportions_ztest.html)