Building a football team rating system with separate attacking and defending ratings, which can be used to predict match results

In [1]:
import datetime
import math
from typing import Tuple

import pandas as pd

from autoscout.util import load_csv

Competition scorelines and expected goals data can be downloaded from [fbref](https://fbref.com) using [autoscout](https://github.com/olliestanley/autoscout/)

In [2]:
epl_df = load_csv("../data/raw/epl_2022_matches.csv")
epl_df = epl_df.dropna(axis=0, how="any").reset_index(drop=True)
epl_df.head()

Unnamed: 0,date,start_time,home_team,home_xg,score,away_xg,away_team,attendance,venue,referee,notes
0,2021-08-13,20:00,Brentford,1.3,2–0,1.4,Arsenal,16479.0,Brentford Community Stadium,Michael Oliver,0.0
1,2021-08-14,12:30,Manchester Utd,1.5,5–1,0.6,Leeds United,72732.0,Old Trafford,Paul Tierney,0.0
2,2021-08-14,15:00,Watford,1.0,3–2,1.2,Aston Villa,20051.0,Vicarage Road Stadium,Mike Dean,0.0
3,2021-08-14,15:00,Chelsea,0.9,3–0,0.3,Crystal Palace,38965.0,Stamford Bridge,Jonathan Moss,0.0
4,2021-08-14,15:00,Everton,2.4,3–1,0.7,Southampton,38487.0,Goodison Park,Andy Madley,0.0


We need to split `score` into `home_goals` and `away_goals`

In [3]:
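# fbref separates the goals with an en dash ("–"), not a hyphen; cast to int so the goals are numeric
epl_df[["home_goals", "away_goals"]] = epl_df["score"].str.split("–", expand=True).astype(int)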
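
The separator in the `score` column is an en dash (U+2013) rather than an ASCII hyphen, which is easy to miss, and splitting a string always yields strings rather than numbers. As a minimal sketch of the same idea in plain Python, the helper `parse_score` below is hypothetical (it is not part of autoscout or this notebook) and returns integer goals:

```python
from typing import Tuple

EN_DASH = "\u2013"  # the separator seen in the `score` column above


def parse_score(score: str) -> Tuple[int, int]:
    """Split a scoreline such as '2–0' into integer home and away goals."""
    home, away = score.strip().split(EN_DASH)
    return int(home), int(away)


print(parse_score("2–0"))
print(parse_score("5–1"))
```

Splitting on a plain hyphen instead would leave the scoreline unsplit, so it is worth checking which character the source data actually uses.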

We only need the identifying columns, the expected goals metrics and the goals, so we drop the other columns and parse the date column as datetimes so that dates can be compared

We are going to use an Elo-style system with separate attacking and defending ratings based on xG, so we also create new rating columns with the default value of 1000

In [4]:
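# Take an explicit copy so that the column assignments below do not trigger pandas' SettingWithCopyWarning
epl_df = epl_df[["date", "home_team", "home_xg", "away_xg", "away_team", "home_goals", "away_goals"]].copy()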

epl_df["date"] = pd.to_datetime(epl_df["date"])

epl_df["home_att_rating"] = 1000
epl_df["home_def_rating"] = 1000
epl_df["away_att_rating"] = 1000
epl_df["away_def_rating"] = 1000

epl_df.head()

Unnamed: 0,date,home_team,home_xg,away_xg,away_team,home_goals,away_goals,home_att_rating,home_def_rating,away_att_rating,away_def_rating
0,2021-08-13,Brentford,1.3,1.4,Arsenal,2,0,1000,1000,1000,1000
1,2021-08-14,Manchester Utd,1.5,0.6,Leeds United,5,1,1000,1000,1000,1000
2,2021-08-14,Watford,1.0,1.2,Aston Villa,3,2,1000,1000,1000,1000
3,2021-08-14,Chelsea,0.9,0.3,Crystal Palace,3,0,1000,1000,1000,1000
4,2021-08-14,Everton,2.4,0.7,Southampton,3,1,1000,1000,1000,1000


Now we need to write the code for the system itself

First we need a function which returns a team's attacking and defending ratings as of a particular date, taken from that team's most recent match before the cut-off date (this relies on the rows being sorted by date)

Most of this function deals with edge cases: the team may not have played a home game, an away game, or any game at all before the given date

In [5]:
def get_ratings_before_date(
    df: pd.DataFrame, team: str, date: datetime.datetime, default: float
) -> Tuple[float, float]:
    # Only consider matches played strictly before the cut-off date
    df = df[df["date"] < date]

    # Most recent home match, if any (assumes df is sorted by date)
    match_home = df[df["home_team"] == team]
    last_home = match_home.iloc[-1] if len(match_home) else None

    # Most recent away match, if any
    match_away = df[df["away_team"] == team]
    last_away = match_away.iloc[-1] if len(match_away) else None

    # The team's latest match was at home
    if last_home is not None and (last_away is None or last_home["date"] > last_away["date"]):
        return last_home["home_att_rating"], last_home["home_def_rating"]

    # The team's latest match was away
    if last_away is not None:
        return last_away["away_att_rating"], last_away["away_def_rating"]

    # No previous matches: fall back to the default rating
    return default, default
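
To make the edge cases concrete, here is a small sketch that runs the lookup on a toy table. The teams `A`, `B` and `C` and all rating values are made up for illustration, and the function body is repeated only so that the example runs on its own:

```python
import datetime
from typing import Tuple

import pandas as pd


def get_ratings_before_date(
    df: pd.DataFrame, team: str, date: datetime.datetime, default: float
) -> Tuple[float, float]:
    # Same logic as the notebook function, repeated so this example is self-contained
    df = df[df["date"] < date]
    match_home = df[df["home_team"] == team]
    last_home = match_home.iloc[-1] if len(match_home) else None
    match_away = df[df["away_team"] == team]
    last_away = match_away.iloc[-1] if len(match_away) else None
    if last_home is not None and (last_away is None or last_home["date"] > last_away["date"]):
        return last_home["home_att_rating"], last_home["home_def_rating"]
    if last_away is not None:
        return last_away["away_att_rating"], last_away["away_def_rating"]
    return default, default


# Toy data with made-up ratings, sorted by date as the function expects
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-08-13", "2021-08-21"]),
        "home_team": ["A", "B"],
        "away_team": ["B", "A"],
        "home_att_rating": [1010.0, 990.0],
        "home_def_rating": [1005.0, 995.0],
        "away_att_rating": [980.0, 1020.0],
        "away_def_rating": [985.0, 1015.0],
    }
)

cutoff = pd.Timestamp("2021-08-22")
latest_a = get_ratings_before_date(toy, "A", cutoff, 1000.0)  # A's latest match was away
latest_b = get_ratings_before_date(toy, "B", cutoff, 1000.0)  # B's latest match was at home
latest_c = get_ratings_before_date(toy, "C", cutoff, 1000.0)  # C has no matches: default
# The comparison is strict, so a match on the cut-off date itself is ignored
earlier_a = get_ratings_before_date(toy, "A", pd.Timestamp("2021-08-21"), 1000.0)
print(latest_a, latest_b, latest_c, earlier_a)
```

Because the comparison is strict, calling the function with a match's own date returns the ratings the team held going into that match.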

Second we need a function which predicts outcomes based on the ratings for two teams

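This is a simple function, which starts from a `baseline` level of xG for each team and scales it by a logistic function of the gap between that team's attacking rating and the opponent's defending rating. The logistic term lies between 0 and 1 and equals 0.5 when the two ratings are equal, so `baseline` is the upper limit of a prediction (before the home adjustment) rather than the average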

A home field advantage is also applied, with its strength determined by the value of `home_alpha`

In [6]:
def predict(
    home_att: float, home_def: float, away_att: float, away_def: float, baseline: float, home_alpha: float
) -> Tuple[float, float]:
    # Scale the baseline up for the home team and down for the away team
    home_factor = baseline * (1 + home_alpha)
    away_factor = baseline * (1 - home_alpha)

    # Logistic curve on the gap between attack and opposing defence; 0.5 when the ratings are equal
    home_pred = home_factor / (1 + math.exp((away_def - home_att) / 400))
    away_pred = away_factor / (1 + math.exp((home_def - away_att) / 400))

    return home_pred, away_pred
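
A short worked example may help show how the formula behaves. The sketch below repeats `predict` so that it runs on its own and uses `baseline=2.8` and `home_alpha=0.05`, the values this notebook sets later; the 1400-rated attack is a made-up input chosen only to show the effect of a rating gap:

```python
import math
from typing import Tuple


def predict(
    home_att: float, home_def: float, away_att: float, away_def: float, baseline: float, home_alpha: float
) -> Tuple[float, float]:
    # Same formula as the notebook function, repeated so this example is self-contained
    home_factor = baseline * (1 + home_alpha)
    away_factor = baseline * (1 - home_alpha)
    home_pred = home_factor / (1 + math.exp((away_def - home_att) / 400))
    away_pred = away_factor / (1 + math.exp((home_def - away_att) / 400))
    return home_pred, away_pred


# Two evenly matched teams: the logistic term is 0.5, so each side gets half its factor
even_home, even_away = predict(1000, 1000, 1000, 1000, baseline=2.8, home_alpha=0.05)
print(round(even_home, 2), round(even_away, 2))  # 1.47 1.33

# A home attack rated 400 points above the away defence (hypothetical ratings)
strong_home, _ = predict(1400, 1000, 1000, 1000, baseline=2.8, home_alpha=0.05)
print(round(strong_home, 2))

# However large the rating gap, the home prediction stays below baseline * (1 + home_alpha)
ceiling = 2.8 * (1 + 0.05)
print(strong_home < ceiling)
```

One consequence of this shape is that predictions are capped: with these parameters the model can never predict more than about 2.94 xG for a home team or 2.66 for an away team.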

Finally we need a function which applies updates to ratings based on the delta between predicted and actual outcomes

This function uses a parameter `k` to determine the sensitivity of ratings to new information

In [7]:
def update_rating(rating: float, delta: float, k: int) -> float:
    return rating + (k * delta)
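
As a worked example of the update, consider the first match in the dataset (Brentford 1.3 xG, Arsenal 1.4 xG). The sketch assumes `k = 32` and takes the predictions for two teams rated 1000 to be 1.47 (home) and 1.33 (away), which is what `predict` returns with the baseline of 2.8 and home advantage of 0.05 used later in this notebook:

```python
def update_rating(rating: float, delta: float, k: float) -> float:
    # Same rule as the notebook function, repeated so this example is self-contained
    return rating + (k * delta)


K = 32
home_pred, away_pred = 1.47, 1.33  # predictions for two teams both rated 1000
home_xg, away_xg = 1.3, 1.4  # Brentford vs Arsenal, first row of the dataset

home_delta = home_xg - home_pred  # Brentford created less xG than predicted
away_delta = away_xg - away_pred  # Arsenal created more xG than predicted

# A team's attack moves with its own delta, its defence against the opponent's delta
brentford_att = update_rating(1000, home_delta, K)
brentford_def = update_rating(1000, -away_delta, K)
arsenal_att = update_rating(1000, away_delta, K)
arsenal_def = update_rating(1000, -home_delta, K)

print(round(brentford_att, 2), round(brentford_def, 2))  # 994.56 997.76
print(round(arsenal_att, 2), round(arsenal_def, 2))  # 1002.24 1005.44
```

Each delta is applied twice with opposite signs, so the points one team's attack gains are exactly the points the opposing defence loses.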

To compute ratings, we iterate over every match in the dataset in date order

Each team starts with attacking and defending ratings of `1000` and these are updated after each game, with the new ratings then being used to predict the next game

In [8]:
K = 32
BASELINE = 2.8
HOME_ALPHA = 0.05
DEFAULT = 1000

for idx in range(len(epl_df)):
    row = epl_df.iloc[idx]

    # Ratings each team held going into this match
    home_att, home_def = get_ratings_before_date(epl_df, row["home_team"], row["date"], DEFAULT)
    away_att, away_def = get_ratings_before_date(epl_df, row["away_team"], row["date"], DEFAULT)

    # Predicted xG from the pre-match ratings, and the xG actually recorded
    home_pred, away_pred = predict(home_att, home_def, away_att, away_def, BASELINE, HOME_ALPHA)
    home_true, away_true = row["home_xg"], row["away_xg"]

    home_delta, away_delta = (home_true - home_pred), (away_true - away_pred)

    # Store the predictions alongside the match
    epl_df.loc[idx, "home_pred"] = home_pred
    epl_df.loc[idx, "away_pred"] = away_pred

    # Store the post-match ratings: attack moves with the team's own delta,
    # defence moves against the opponent's delta
    epl_df.loc[idx, "home_att_rating"] = update_rating(home_att, home_delta, K)
    epl_df.loc[idx, "home_def_rating"] = update_rating(home_def, -away_delta, K)
    epl_df.loc[idx, "away_att_rating"] = update_rating(away_att, away_delta, K)
    epl_df.loc[idx, "away_def_rating"] = update_rating(away_def, -home_delta, K)
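
The loop above looks up each team's previous ratings by filtering the whole DataFrame for every match. As an alternative sketch, not the approach used in this notebook, the same pass can keep the current ratings in a dictionary. This assumes the matches are already in chronological order and that no team plays twice on the same date; the teams `A` and `B` and the second match's xG values are made up for illustration:

```python
import math
from typing import Dict, List, Tuple

K, BASELINE, HOME_ALPHA, DEFAULT = 32, 2.8, 0.05, 1000.0


def predict(
    home_att: float, home_def: float, away_att: float, away_def: float, baseline: float, home_alpha: float
) -> Tuple[float, float]:
    # Same formula as the notebook function, repeated so this example is self-contained
    home_pred = baseline * (1 + home_alpha) / (1 + math.exp((away_def - home_att) / 400))
    away_pred = baseline * (1 - home_alpha) / (1 + math.exp((home_def - away_att) / 400))
    return home_pred, away_pred


# (home_team, home_xg, away_xg, away_team) in chronological order; toy values
matches: List[Tuple[str, float, float, str]] = [
    ("A", 1.3, 1.4, "B"),
    ("B", 1.6, 0.9, "A"),
]

# team -> (attacking rating, defending rating), updated after every match
ratings: Dict[str, Tuple[float, float]] = {}
history: List[Tuple[str, str, float, float]] = []

for home, home_xg, away_xg, away in matches:
    # Teams without a previous match start from the default rating
    home_att, home_def = ratings.get(home, (DEFAULT, DEFAULT))
    away_att, away_def = ratings.get(away, (DEFAULT, DEFAULT))

    home_pred, away_pred = predict(home_att, home_def, away_att, away_def, BASELINE, HOME_ALPHA)
    home_delta, away_delta = home_xg - home_pred, away_xg - away_pred

    ratings[home] = (home_att + K * home_delta, home_def - K * away_delta)
    ratings[away] = (away_att + K * away_delta, away_def - K * home_delta)
    history.append((home, away, home_pred, away_pred))

for line in history:
    print(line)
```

In the second toy match, `B` is predicted slightly more xG than an evenly rated home side would be, because its attack gained points and `A`'s defence lost points in the first match.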

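We should expect the ratings to become more accurate as more matches are processed

To evaluate them we will look at the final 10 matches in the dataset. The rating columns hold each team's ratings after the match, while `home_pred` and `away_pred` are the xG predictions made from the ratings each team held before it, so they can be compared directly with `home_xg` and `away_xg`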

In [9]:
epl_df.tail(10)

Unnamed: 0,date,home_team,home_xg,away_xg,away_team,home_goals,away_goals,home_att_rating,home_def_rating,away_att_rating,away_def_rating,home_pred,away_pred
370,2022-05-22,Brighton,1.7,0.3,West Ham,3,1,961.802877,1133.856663,925.577269,980.909917,1.39962,1.068986
371,2022-05-22,Brentford,1.2,1.4,Leeds United,1,2,952.425936,1015.34419,944.859318,757.684318,1.897152,1.1911
372,2022-05-22,Leicester City,2.3,1.2,Southampton,4,1,951.483757,908.599855,889.287755,844.964121,1.580959,1.309552
373,2022-05-22,Arsenal,3.1,1.1,Everton,5,1,1142.081207,1093.668725,860.447395,963.605684,1.623781,0.937004
374,2022-05-22,Crystal Palace,0.5,0.8,Manchester Utd,1,0,906.836478,1173.607814,1020.186945,981.495794,1.443808,1.109959
375,2022-05-22,Norwich City,0.3,3.7,Tottenham,0,5,739.318524,556.050687,1253.449332,1234.527105,0.693683,2.174518
376,2022-05-22,Burnley,1.4,1.8,Newcastle Utd,1,2,864.643178,914.436417,890.489905,1047.653795,1.106918,1.2297
377,2022-05-22,Liverpool,3.2,1.1,Wolves,3,1,1452.659453,1265.463469,752.936773,787.66455,2.419997,0.53847
378,2022-05-22,Manchester City,2.9,0.3,Aston Villa,3,2,1536.736057,1318.960419,909.908249,1055.00749,2.201675,0.740701
379,2022-05-22,Chelsea,2.4,1.0,Watford,2,1,1187.599927,1197.935555,804.555816,821.277279,2.067414,0.698972


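The ratings seem pretty sensible when compared with the 2021-22 Premier League table: Manchester City and Liverpool, who finished first and second, have the highest attacking and defending ratings, while bottom-placed Norwich City have the lowest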
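
One way to put a number on the predictions is the mean absolute error between predicted and actual xG. The sketch below copies the values from the final 10 rows shown above and, as an assumed point of comparison, also scores a naive model that always predicts 1.47 for the home side and 1.33 for the away side (what `predict` gives for two teams rated 1000). Ten matches is far too small a sample to draw firm conclusions from, so treat this as an illustration of the method only:

```python
from typing import List, Tuple

# (actual xG, predicted xG) for the home sides in the final 10 matches above
home: List[Tuple[float, float]] = [
    (1.7, 1.39962), (1.2, 1.897152), (2.3, 1.580959), (3.1, 1.623781), (0.5, 1.443808),
    (0.3, 0.693683), (1.4, 1.106918), (3.2, 2.419997), (2.9, 2.201675), (2.4, 2.067414),
]
# (actual xG, predicted xG) for the away sides in the same matches
away: List[Tuple[float, float]] = [
    (0.3, 1.068986), (1.4, 1.1911), (1.2, 1.309552), (1.1, 0.937004), (0.8, 1.109959),
    (3.7, 2.174518), (1.8, 1.2297), (1.1, 0.53847), (0.3, 0.740701), (1.0, 0.698972),
]


def mean_abs_error(pairs: List[Tuple[float, float]]) -> float:
    """Average absolute gap between actual and predicted values."""
    return sum(abs(actual - pred) for actual, pred in pairs) / len(pairs)


def naive_error(pairs: List[Tuple[float, float]], constant: float) -> float:
    """Error of a model that ignores ratings and always predicts `constant`."""
    return sum(abs(actual - constant) for actual, _ in pairs) / len(pairs)


home_mae, away_mae = mean_abs_error(home), mean_abs_error(away)
home_naive, away_naive = naive_error(home, 1.47), naive_error(away, 1.33)

print(f"home: ratings {home_mae:.3f} vs naive {home_naive:.3f}")
print(f"away: ratings {away_mae:.3f} vs naive {away_naive:.3f}")
```

Running the same comparison over the whole season, or over a held-out set of matches, would give a much more reliable picture of how well the ratings predict xG.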

Finally, we save the data with the added ratings and predictions

In [10]:
epl_df.to_csv("../data/ratings/epl_2022.csv", index=False)