This simulation uses Elo ratings from <https://eloratings.net> to measure team strength and update it after each simulated game. The Elo implementation is based on FiveThirtyEight’s NFL forecasting game (<https://github.com/morales-felix/nfl-elo-game>).

Notes on Elo implementation:  

- Per <https://eloratings.net/about>, the K constant is set to 50 as this is a continental competition.  
- Probabilities given by the Elo rating system are binary. I came up with a workaround to convert them to ternary probabilities, given that association football (a.k.a. soccer) admits three outcomes for a match (win, tie, loss).  
- I did not simulate scorelines. Rather, I simply used the probabilities to decide whether a team would win, tie, or lose. As such, I did not use the goal difference multiplier specified in <https://eloratings.net/about>.  
- I'll be happy to talk about the workaround, but I wouldn't take it as gospel. There might be established ways to do this, but I did not research them. I wanted to have fun, not produce an academic-paper-worthy method or a sellable product.  
- In the end, results aren't that different from other more publicized methods. Brazil is a strong team... always.
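
For reference, the formulas published at <https://eloratings.net/about> can be sketched as below: the expected result is `1 / (10 ** (-dr / 400) + 1)` and the update is `new = old + K * (result - expected)`, with the goal difference multiplier left out as in this notebook. The `ternary_probabilities` and `simulate_result` helpers are illustrative assumptions only, not the workaround implemented in `src.euro_cup_simulator`; the `max_draw` value in particular is made up.

```python
import random

K = 50  # weight for continental championship finals, per eloratings.net/about


def expected_result(rating_a, rating_b):
    """Binary Elo expectation for team A (win = 1, draw = 0.5, loss = 0)."""
    dr = rating_a - rating_b
    return 1.0 / (10 ** (-dr / 400) + 1)


def update_ratings(rating_a, rating_b, result_a, k=K):
    """Return both new ratings; result_a is 1, 0.5 or 0 from A's side."""
    delta = k * (result_a - expected_result(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def ternary_probabilities(rating_a, rating_b, max_draw=0.3):
    """Hypothetical split into (win, draw, loss) probabilities for team A.

    Assumption for illustration only: the draw probability peaks at
    max_draw for equal ratings and shrinks to 0 as the match gets lopsided.
    """
    p = expected_result(rating_a, rating_b)
    p_draw = max_draw * (1 - abs(2 * p - 1))
    # Take half of the draw mass from each side, so that
    # win + 0.5 * draw still equals the binary Elo expectation p.
    return p - p_draw / 2, p_draw, (1 - p) - p_draw / 2


def simulate_result(rating_a, rating_b, rng=random):
    """Draw one outcome for team A: 1.0 (win), 0.5 (tie) or 0.0 (loss)."""
    win, draw, _ = ternary_probabilities(rating_a, rating_b)
    u = rng.random()
    if u < win:
        return 1.0
    if u < win + draw:
        return 0.5
    return 0.0


# Ratings taken from roster rows shown later in this notebook.
portugal, georgia = 2003, 1666
print(ternary_probabilities(portugal, georgia))
result = simulate_result(portugal, georgia)
print(update_ratings(portugal, georgia, result))
```

Splitting the draw mass evenly keeps the ternary probabilities consistent with the binary Elo expectation, which is one reasonable property to ask of any such conversion.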

In [1]:
import pandas as pd
import csv
from tqdm import tqdm

from src.euro_cup_simulator import *

Since I want to simulate the group stage many times to generate a distribution of outcomes, I will use the `joblib` module to parallelize the simulation. This will allow me to run the simulation many times in a reasonable amount of time. That requires me to use a function to simulate the group stage and return the results.
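
As a minimal sketch of the `joblib` pattern, with `simulate_batch` as a made-up stand-in for the real simulation function: `delayed(f)(args)` records a call without running it, and `Parallel` executes the recorded calls across workers and returns the results in input order.

```python
from joblib import Parallel, delayed


def simulate_batch(n, j):
    """Hypothetical stand-in for one batch of n simulations."""
    return j, n


# Four batches of ten simulations each, spread over two workers.
results = Parallel(n_jobs=2)(
    delayed(simulate_batch)(10, j) for j in range(4))
print(results)
```

This is why the work has to live in a function: each worker needs a self-contained call it can run independently of the others.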

In [2]:
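def run_group_stage_simulation(n, j):
    """
    Run n simulations of the Euro 2024 group stage and return the roster
    with one extra column, avg_pts_{j+1}, holding each team's mean points.
    n: number of group-stage simulations in this batch
    j: zero-based batch index, used only to name the output column
    """
    
    teams_pd = pd.read_csv("data/roster.csv")
    
    for i in range(n):
        games = read_games("data/matches.csv")
        teams = {}
    
        # Rebuild the teams dict from scratch so every simulation
        # starts from the pre-tournament ratings and zero points
        with open("data/roster.csv") as roster_file:
            for row in csv.DictReader(roster_file):
                teams[row['team']] = {
                    'name': row['team'],
                    'rating': float(row['rating']),
                    'points': 0
                    }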
    
        simulate_group_stage(
            games,
            teams,
            ternary=True
            )
    
        collector = []
        for key in teams.keys():
            collector.append(
                {"team": key,
                 f"simulation{i+1}": teams[key]['points']}
            )

        temp = pd.DataFrame(collector)
        teams_pd = pd.merge(teams_pd, temp)
    
    sim_cols = [
        a for a in teams_pd.columns if "simulation" in a]
    teams_pd[
        f"avg_pts_{j+1}"
        ] = teams_pd[sim_cols].mean(axis=1)
    not_sim = [
        b for b in teams_pd.columns if "simulation" not in b]
    simulation_result = teams_pd[not_sim]
    
    return simulation_result

### Simulate group stage

#### The gist is to read from two files: One defining the match schedule, the other with teams and their relative strengths (given by Elo ratings prior to the start of the event)

In [3]:
from functools import reduce
from joblib import Parallel, delayed

n = 100 # Simulations per batch (averaged into one avg_pts column)
m = 100 # Number of batches to collect

# Each batch returns the roster plus its own avg_pts column
roster_pd = Parallel(n_jobs=5)(
    delayed(run_group_stage_simulation)(
        n, j) for j in tqdm(range(m)))

# Merge the batch results on their shared columns, one batch at a time
roster = reduce(pd.merge, tqdm(roster_pd))

100%|██████████| 100/100 [00:14<00:00,  6.73it/s]
100%|██████████| 100/100 [00:00<00:00, 287.81it/s]


In [4]:
sim_cols = [i for i in roster.columns if "avg_pts" in i]

In [5]:
roster['avg_sim_pts'] = roster[sim_cols].mean(axis=1)
roster['99%CI_low'] = roster[sim_cols] \
    .quantile(q=0.005, axis=1)
roster['99%CI_high'] = roster[sim_cols] \
    .quantile(q=0.995, axis=1)
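
Note what these bounds describe: each `avg_pts_*` column is already the mean of `n` simulated group stages, so the 0.5% and 99.5% quantiles across the `m` columns bracket the batch averages, not the points a team might score in a single tournament. The dependency-free sketch below runs the same computation on made-up numbers. `percentile` and `fake_points` are hypothetical helpers; `percentile` mirrors the linear interpolation that pandas uses by default.

```python
import random
import statistics


def percentile(values, q):
    """Quantile q (0..1) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


rng = random.Random(0)
n, m = 100, 100  # simulations per batch, number of batches


def fake_points():
    """Made-up stand-in for one team's points over three group matches."""
    return sum(rng.choice([0, 1, 3]) for _ in range(3))


# One mean per batch, then the 0.5% and 99.5% quantiles across batches.
batch_means = [
    statistics.mean(fake_points() for _ in range(n)) for _ in range(m)]
low = percentile(batch_means, 0.005)
high = percentile(batch_means, 0.995)
print(f"mean={statistics.mean(batch_means):.2f}, interval=({low:.2f}, {high:.2f})")
```

Raising `n` narrows this interval, because it makes each batch average more stable; the spread of single-tournament outcomes stays as wide as before.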

In [6]:
not_sim = [
    j for j in roster.columns if "avg_pts" not in j]

#### Simulation is done, now take a look at the results for the group stage  
- All of this is based on Elo ratings from before the competition started (June 13, 2024)

In [7]:
roster[not_sim].sort_values(
    by=[
        'group',
        'avg_sim_pts'
        ],
    ascending=False
    )

| | group | team | rating | avg_sim_pts | 99%CI_low | 99%CI_high |
|---|---|---|---|---|---|---|
| 22 | F | Portugal | 2003 | 4.9749 | 4.63495 | 5.34505 |
| 20 | F | Turkey | 1749 | 4.009 | 3.5299 | 4.53505 |
| 23 | F | Czech Republic | 1777 | 3.2317 | 2.83475 | 3.70555 |
| 21 | F | Georgia | 1666 | 3.1948 | 2.8398 | 3.63585 |
| 16 | E | Belgium | 1988 | 6.6367 | 6.1099 | 6.9603 |
| 19 | E | Ukraine | 1853 | 3.8259 | 3.40495 | 4.3712 |
| 17 | E | Slovakia | 1671 | 2.7756 | 2.4198 | 3.2203 |
| 18 | E | Romania | 1647 | 2.5673 | 2.2199 | 3.02525 |
| 15 | D | France | 2077 | 5.2999 | 4.95465 | 5.86575 |
| 13 | D | Netherlands | 1974 | 4.2941 | 3.89465 | 4.71525 |


This tournament is much harder to forecast than Copa America 2024!  
- Group A: While Germany should come out on top, the 99% confidence intervals for the simulated points of the other three teams overlap. That means it's a toss-up between those three.  
- Group B: Yet another toss-up, this time between the top three teams (rather than the bottom three like in Group A). Albania should definitely end up in last place.  
- Group C: England should definitely qualify, but it's hard to tell how the teams will sort themselves out, since the 99% confidence intervals overlap between England and Denmark, between Denmark and Serbia, and between Serbia and Slovenia. I'll call the shots here and say: England, Denmark, Serbia, Slovenia. While there's overlap, it's not like the overlaps in Groups A and B, where three teams are in the mix. Here, the overlaps are between two teams at a time.  
- Group D: Easy, should be France and Netherlands, in that order. I'll call the shot and say Austria will come in third, but there is a chance Poland might take that spot.  
- Group E: Apparently, Belgium should *easily* top this group... followed by Ukraine... And the fight should be between Slovakia and Romania for third place. But if we learned anything from the first matchday, this might turn out to be super wrong.  
- Group F: Portugal should top this group, and I will have to call the shot with Turkey for second place, since the intervals of the bottom three teams overlap with each other.  

All in all, there's way more variability in the Euros than in Copa America, where the only difficult group is Group A, and only with respect to who will come in second. The rest of the groups sort themselves out pretty easily.
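
The overlap reasoning used above can be checked mechanically. A minimal sketch, using the Group F bounds copied from the results table (`overlaps` is a hypothetical helper, not part of the simulator):

```python
from itertools import combinations

# (99%CI_low, 99%CI_high) copied from the Group F rows of the results table.
intervals = {
    "Portugal": (4.63495, 5.34505),
    "Turkey": (3.5299, 4.53505),
    "Czech Republic": (2.83475, 3.70555),
    "Georgia": (2.8398, 3.63585),
}


def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Compare every pair of teams in the group.
for team_1, team_2 in combinations(intervals, 2):
    verdict = overlaps(intervals[team_1], intervals[team_2])
    print(f"{team_1} vs {team_2}: overlap={verdict}")
```

Portugal's interval sits clear of the other three, while Turkey, Czech Republic, and Georgia all overlap pairwise, which is why second place in Group F is a judgment call.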

### Simulating knockout stage  
Here's where it gets interesting

In [None]:
# Now, doing the Monte Carlo simulations
n = 10000
playoff_results_teams = []
playoff_results_stage = []

for i in tqdm(range(n)):
    overall_result_teams = dict()
    overall_result_stage = dict()
    games = read_games("data/playoff_matches.csv")
    teams = {}
    
    # Read the playoff roster; the with-block closes the file after each pass
    with open("data/playoff_roster.csv") as roster_file:
        for row in csv.DictReader(roster_file):
            teams[row['team']] = {
                'name': row['team'],
                'rating': float(row['rating'])
                }
    
    simulate_playoffs(games, teams, ternary=True)
    
    playoff_pd = pd.DataFrame(games)
    
    # This is for collecting results of simulations per team
    for key in teams.keys():
        overall_result_teams[key] = collect_playoff_results(
            key,
            playoff_pd
            )
    playoff_results_teams.append(overall_result_teams)
    
    # Now, collecting results from stage-perspective
    overall_result_stage['whole_bracket'] = playoff_pd['advances'].to_list()
    overall_result_stage['Quarterfinals'] = playoff_pd.loc[playoff_pd['stage'] == 'eigths_finals', 'advances'].to_list()
    overall_result_stage['Semifinals'] = playoff_pd.loc[playoff_pd['stage'] == 'quarterfinals', 'advances'].to_list()
    overall_result_stage['Final'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'advances'].to_list()
    overall_result_stage['third_place_match'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'loses'].to_list()
    overall_result_stage['fourth_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'loses'].to_list()[0]
    overall_result_stage['third_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'advances'].to_list()[0]
    overall_result_stage['second_place'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'loses'].to_list()[0]
    overall_result_stage['Champion'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'advances'].to_list()[0]
    # Pairings for the matches stored in rows 8 through 15 of playoff_pd
    for match in range(8, 16):
        overall_result_stage[f'match{match}'] = list(
            playoff_pd.loc[match, ['home_team', 'away_team']])
    
    playoff_results_stage.append(overall_result_stage)

In [None]:
results_teams = pd.DataFrame(playoff_results_teams)

In [None]:
results_teams['Italy'].value_counts()

In [None]:
results_stage = pd.DataFrame(playoff_results_stage)

In [None]:
results_stage['Champion'].value_counts()
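
The raw counts become estimated title probabilities once divided by the number of simulations. A dependency-free sketch on made-up data (the `champions` list below is hypothetical, not simulation output):

```python
from collections import Counter

# Hypothetical champions from six simulated brackets (made-up data).
champions = ["France", "Portugal", "France", "Germany", "France", "Portugal"]

counts = Counter(champions)
total = sum(counts.values())

# Share of simulations won by each team, most frequent champion first.
title_probability = {
    team: wins / total for team, wins in counts.most_common()}
print(title_probability)
```

In the notebook itself, `results_stage['Champion'].value_counts(normalize=True)` should give the same shares directly.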