<a href="https://colab.research.google.com/github/mikeogunmakin/river-medway-trading/blob/main/research/how_predictable_is_the_EPL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How Predictable Is the English Premier League?**

**Aim:** This analysis aims to understand how predictable the English Premier League is across the Home Win, Draw, and Away Win markets.

**Background**: This research aims to inform the development of systematic trading strategies as part of a machine learning sports trading project. Follow on [Medium](https://medium.com/river-medway-trading) for updates and insights.

**Methodology**: We assess the predictive accuracy and calibration of Betfair Exchange odds for English Premier League (EPL) matches (Home Win/Draw/Away Win) using the Brier Score.

**Brier Score Explained:**

The [Brier Score](http://en.wikipedia.org/wiki/Brier_score) measures how close predicted probabilities are to the actual outcomes. It is used when a model outputs probabilities for mutually exclusive discrete outcomes - for example, predicting that Manchester United has a 58% chance of winning.

Formally, it’s defined as:

$$
\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2
$$

Where:

- $p_i$ — predicted probability of the positive class  
- $y_i \in \{0, 1\}$ — true label (either 1 or 0)  
- $N$ — number of samples
<br>

A lower Brier Score indicates more accurate and better-calibrated predictions. We will use the Brier Score to evaluate the probabilities implied by Betfair Exchange odds (where we don't have exchange data, we will use the average bookmaker odds as a proxy).
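
As a quick illustration of the formula, the sketch below scores three hypothetical home-win forecasts against their outcomes. The probabilities and results are made up for illustration and do not come from the EPL data.

```python
import numpy as np

# Hypothetical forecasts: predicted probability of a home win in three matches
pred_prob = np.array([0.58, 0.30, 0.75])
# Hypothetical results: 1 if the home team won, 0 otherwise
outcome = np.array([1, 0, 0])

# Brier Score = mean of the squared differences between forecast and outcome
squared_errors = (pred_prob - outcome) ** 2
brier = squared_errors.mean()

print(round(brier, 3))
```

For reference, a perfect forecast scores 0, and predicting 0.5 for every match scores exactly 0.25 whatever the results.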

### Import Packages

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Import Datasets

We import EPL data for the last five seasons. For more on the data sourcing strategy, please read the article [here](https://medium.com/river-medway-trading/sourcing-data-for-sports-trading-research-1a4d5e744378).

In [56]:
epl_2020 = pd.read_csv('/content/drive/MyDrive/RMT/Data/data/2020 football season/E0.csv')
epl_2021 = pd.read_csv('/content/drive/MyDrive/RMT/Data/data/2021 football season/E0.csv')
epl_2022 = pd.read_csv('/content/drive/MyDrive/RMT/Data/data/2022 football season/E0.csv')
epl_2023 = pd.read_csv('/content/drive/MyDrive/RMT/Data/data/2023 football season/E0.csv')
epl_2024 = pd.read_csv('/content/drive/MyDrive/RMT/Data/data/2024 football season/E0.csv')

### Data Cleaning & Inspection

First, we expect each EPL season to have 380 matches (20 teams, each playing the other 19 home and away).

In [57]:
epl_dict = {'2020':epl_2020,
            '2021':epl_2021,
            '2022':epl_2022,
            '2023':epl_2023,
            '2024':epl_2024}

for k,v in epl_dict.items():
  print(f'EPL {k} season:{v.shape[0]} matches')

EPL 2020 season:380 matches
EPL 2021 season:380 matches
EPL 2022 season:380 matches
EPL 2023 season:380 matches
EPL 2024 season:380 matches


Next, we keep only the columns needed for the analysis and check for null or missing values. Duplicates are unlikely since each season has exactly 380 rows.

In [58]:
def clean_dataset(dataset):
  cols = dataset.columns

  cols_to_keep = ['Date', 'Div', 'HomeTeam', 'AwayTeam', 'FTR']

  if 'BFEH' in cols:
    cols_to_keep.append('BFEH')
    cols_to_keep.append('BFED')
    cols_to_keep.append('BFEA')
  else:
    cols_to_keep.append('AvgH')
    cols_to_keep.append('AvgD')
    cols_to_keep.append('AvgA')

  # return a copy so that columns added later do not write into a slice of the raw data
  return dataset[cols_to_keep].copy()

In [59]:
# looping through dictionary to clean all the datasets
for k,v in epl_dict.items():
  v_cleaned = clean_dataset(epl_dict[k])
  epl_dict[k] = v_cleaned


Let's check for missing values in the datasets.

In [60]:
for k,v in epl_dict.items():
  print(f'EPL {k} season: {v.isnull().sum().sum()} nulls')

EPL 2020 season: 0 nulls
EPL 2021 season: 0 nulls
EPL 2022 season: 0 nulls
EPL 2023 season: 0 nulls
EPL 2024 season: 0 nulls
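
The row count alone does not rule out duplicates, so they can also be checked directly. The sketch below assumes a fixture is uniquely identified by `Date`, `HomeTeam` and `AwayTeam`. `count_duplicate_fixtures` is a hypothetical helper, and the small frame is made-up data standing in for one season.

```python
import pandas as pd

def count_duplicate_fixtures(df):
    """Count rows that repeat an earlier (Date, HomeTeam, AwayTeam) combination."""
    key_cols = ['Date', 'HomeTeam', 'AwayTeam']  # assumed to identify a fixture
    return int(df.duplicated(subset=key_cols).sum())

# Made-up rows for illustration; the last row repeats the first fixture
sample = pd.DataFrame({
    'Date': ['01/01/2000', '01/01/2000', '01/01/2000'],
    'HomeTeam': ['Team A', 'Team C', 'Team A'],
    'AwayTeam': ['Team B', 'Team D', 'Team B'],
    'FTR': ['A', 'H', 'A'],
})

print(count_duplicate_fixtures(sample))
```

On the notebook's data, the same helper could be called on each season inside a `for k, v in epl_dict.items()` loop.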


### Helper Functions

To help calculate the Brier Score, we define a few helper functions.

In [61]:
def decimal_odds_to_prob(decimal_odds):
   return 1/decimal_odds

def home_win_outcome(FTR):
  if FTR == 'H':
    return 1
  else:
    return 0

def draw_outcome(FTR):
  if FTR == 'D':
    return 1
  else:
    return 0

def away_win_outcome(FTR):
  if FTR == 'A':
    return 1
  else:
    return 0

def squared_error(pred_prob, outcome):
  return (pred_prob - outcome)**2

def brier_score(pred_prob, outcome):
  return np.mean(np.square(pred_prob - outcome))

### Home Win Market

Loop through the dictionary to calculate the Brier Score for each season and the overall average score.

In [62]:
brier_score_list = []

for k,v in epl_dict.items():
  epl_dict[k]['HomeWinOutcome'] = epl_dict[k]['FTR'].apply(home_win_outcome)

  if 'BFEH' in epl_dict[k].columns:
    epl_dict[k]['HomeWinProb'] = epl_dict[k]['BFEH'].apply(decimal_odds_to_prob)
  else:
    epl_dict[k]['HomeWinProb'] = epl_dict[k]['AvgH'].apply(decimal_odds_to_prob)

  epl_dict[k]['HomeWinSquaredError'] = squared_error(epl_dict[k]['HomeWinProb'], epl_dict[k]['HomeWinOutcome'])

  # named season_brier so it does not shadow the brier_score() helper defined above
  season_brier = np.mean(epl_dict[k]['HomeWinSquaredError'])

  brier_score_list.append(season_brier)

  print(f'EPL {k} season: Home Win Brier Score: {round(season_brier,3)}')


print(f'Average Home Win Brier Score: {round(np.average(brier_score_list),3)}')

EPL 2020 season: Home Win Brier Score: 0.204
EPL 2021 season: Home Win Brier Score: 0.199
EPL 2022 season: Home Win Brier Score: 0.214
EPL 2023 season: Home Win Brier Score: 0.198
EPL 2024 season: Home Win Brier Score: 0.204
Average Home Win Brier Score: 0.204


### Draw Market

In [64]:
brier_score_list = []

for k,v in epl_dict.items():
  epl_dict[k]['DrawOutcome'] = epl_dict[k]['FTR'].apply(draw_outcome)

  if 'BFED' in epl_dict[k].columns:
    epl_dict[k]['DrawProb'] = epl_dict[k]['BFED'].apply(decimal_odds_to_prob)
  else:
    epl_dict[k]['DrawProb'] = epl_dict[k]['AvgD'].apply(decimal_odds_to_prob)

  epl_dict[k]['DrawSquaredError'] = squared_error(epl_dict[k]['DrawProb'], epl_dict[k]['DrawOutcome'])

  # named season_brier so it does not shadow the brier_score() helper defined above
  season_brier = np.mean(epl_dict[k]['DrawSquaredError'])

  brier_score_list.append(season_brier)

  print(f'EPL {k} season: Draw Brier Score: {round(season_brier,3)}')


print(f'Average Draw Brier Score: {round(np.average(brier_score_list),3)}')

EPL 2020 season: Draw Brier Score: 0.171
EPL 2021 season: Draw Brier Score: 0.176
EPL 2022 season: Draw Brier Score: 0.176
EPL 2023 season: Draw Brier Score: 0.166
EPL 2024 season: Draw Brier Score: 0.184
Average Draw Brier Score: 0.175


### Away Win Market

In [66]:
brier_score_list = []

for k,v in epl_dict.items():
  epl_dict[k]['AwayWinOutcome'] = epl_dict[k]['FTR'].apply(away_win_outcome)

  if 'BFEA' in epl_dict[k].columns:
    epl_dict[k]['AwayWinProb'] = epl_dict[k]['BFEA'].apply(decimal_odds_to_prob)
  else:
    epl_dict[k]['AwayWinProb'] = epl_dict[k]['AvgA'].apply(decimal_odds_to_prob)

  epl_dict[k]['AwayWinSquaredError'] = squared_error(epl_dict[k]['AwayWinProb'], epl_dict[k]['AwayWinOutcome'])

  # named season_brier so it does not shadow the brier_score() helper defined above
  season_brier = np.mean(epl_dict[k]['AwayWinSquaredError'])

  brier_score_list.append(season_brier)

  print(f'EPL {k} season: Away Win Brier Score: {round(season_brier,3)}')


print(f'Average Away Win Brier Score: {round(np.average(brier_score_list),3)}')

EPL 2020 season: Away Win Brier Score: 0.223
EPL 2021 season: Away Win Brier Score: 0.179
EPL 2022 season: Away Win Brier Score: 0.184
EPL 2023 season: Away Win Brier Score: 0.169
EPL 2024 season: Away Win Brier Score: 0.19
Average Away Win Brier Score: 0.189
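
The three market loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible refactor, not the notebook's code. `market_brier_score` and the `MARKETS` mapping are hypothetical names, the odds-column fallback mirrors the `BFE*`/`Avg*` logic used above, and the two-row frame is made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: market name -> (FTR code, exchange odds column, bookmaker average column)
MARKETS = {
    'Home Win': ('H', 'BFEH', 'AvgH'),
    'Draw': ('D', 'BFED', 'AvgD'),
    'Away Win': ('A', 'BFEA', 'AvgA'),
}

def market_brier_score(df, market):
    """Brier Score of the odds-implied probability for one market in one season."""
    ftr_code, exchange_col, bookmaker_col = MARKETS[market]
    # Prefer Betfair Exchange odds, fall back to average bookmaker odds
    odds_col = exchange_col if exchange_col in df.columns else bookmaker_col
    pred_prob = 1 / df[odds_col]
    outcome = (df['FTR'] == ftr_code).astype(int)
    return float(np.mean((pred_prob - outcome) ** 2))

# Made-up two-match season with bookmaker-average odds only
sample = pd.DataFrame({
    'FTR': ['H', 'D'],
    'AvgH': [2.0, 4.0],
    'AvgD': [4.0, 4.0],
    'AvgA': [4.0, 2.0],
})

for market in MARKETS:
    print(f'{market}: {market_brier_score(sample, market):.3f}')
```

Because the function does not add columns to the frame, it can be applied to every season and market without modifying `epl_dict`.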


### Conclusion

This analysis indicates that market efficiency varies across markets (Home/Draw/Away). Since the Brier Scores (0.204 for Home, 0.175 for Draw, and 0.189 for Away) reflect market-implied probabilities, the Home market, which has the highest score, appears least efficient and therefore offers the greatest potential for model-based exploitation.

As a proxy, we used bookmaker odds when exchange odds were not available. However, it's important to flag that bookmakers generally shorten their odds slightly so that the implied probabilities sum to more than 100% (the overround). This introduces bias into our analysis.

The Betfair Exchange is not perfect either, but its implied probabilities sum much closer to 100%, as the two examples below show.

In [73]:
# example: a season where we used bookmaker odds as a proxy for the Betfair Exchange
epl_dict['2020']['SumProb'] = epl_dict['2020']['HomeWinProb'] + epl_dict['2020']['DrawProb'] + epl_dict['2020']['AwayWinProb']

epl_dict['2020']['SumProb'].mean()

np.float64(1.0434657583023477)

In [75]:
# example: a season where we used Betfair Exchange data
epl_dict['2024']['SumProb'] = epl_dict['2024']['HomeWinProb'] + epl_dict['2024']['DrawProb'] + epl_dict['2024']['AwayWinProb']

epl_dict['2024']['SumProb'].mean()

np.float64(1.006202295142935)
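
One common way to reduce this bias is to rescale the three implied probabilities so that they sum to 1 for each match. This proportional normalisation is not used in the analysis above, and it assumes the margin is spread in proportion to each outcome's probability. `normalise_probs` is a hypothetical helper and the odds are made up.

```python
import pandas as pd

def normalise_probs(df, prob_cols):
    """Rescale each row's implied probabilities so they sum to 1."""
    probs = df[prob_cols]
    row_sums = probs.sum(axis=1)
    # Divide every column by the row total (proportional normalisation)
    return probs.div(row_sums, axis=0)

# Made-up decimal odds for two matches
odds = pd.DataFrame({'H': [1.9, 3.0], 'D': [3.5, 3.2], 'A': [4.0, 2.4]})
implied = 1 / odds

# Raw implied probabilities sum to more than 1 because of the overround
print(implied.sum(axis=1).round(3).tolist())

normalised = normalise_probs(implied, ['H', 'D', 'A'])
# After rescaling, each row sums to 1
print(normalised.sum(axis=1).round(3).tolist())
```

In the notebook, the same idea could be applied to the `HomeWinProb`, `DrawProb` and `AwayWinProb` columns before the squared errors are computed.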

### Next Steps


*   Automate Brier Score Analysis and complete analysis for other leagues - Nov 2025
*   Research and build heuristic benchmark models based on findings - Nov 2025
*   Research and build data driven ML models - Dec 2025
*   Evaluate models and build a backtesting system - Dec 2025
