This simulation uses Elo ratings from <https://eloratings.net> to measure team strength and update it after each simulated game. The Elo implementation is based on FiveThirtyEight’s NFL forecasting game (<https://github.com/morales-felix/nfl-elo-game>).

Notes on Elo implementation:  

- Per <https://eloratings.net/about>, the K constant is set to 50 as this is a continental competition.  
- Probabilities given by the Elo rating system are binary. I came up with a workaround to convert them to ternary probabilities, since an association football (a.k.a. soccer) match has three possible outcomes (win, tie, loss).  
- I did not simulate scorelines. Rather, I simply used probabilities to decide whether a team would win, tie, or lose. As such, I did not use the goal difference multiplier specified in <https://eloratings.net/about>.  
- I'll be happy to talk about the workaround, but I wouldn't take it as gospel. There might be established ways to do this, but I did not research them. I wanted to have fun, not produce an academic-paper-worthy method or a sellable product.
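
For reference, here is a minimal sketch of the binary Elo expectation and rating update described at <https://eloratings.net/about>, plus one possible way to split that expectation into win/tie/loss probabilities. The split is only an illustration and is not necessarily what `src.copa_america_simulator` does: `ternary_probabilities` and its `max_draw` value are hypothetical, made up for this sketch.

```python
K = 50  # weight for continental competitions, per eloratings.net


def expected_score(rating_a, rating_b):
    """Binary Elo expectation for team A (win = 1, tie = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a, rating_b, score_a, k=K):
    """Zero-sum Elo update, without the goal difference multiplier."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def ternary_probabilities(rating_a, rating_b, max_draw=0.3):
    """Hypothetical split of the expectation into (win, tie, loss).

    ASSUMPTION: a tie is most likely (max_draw) between evenly matched
    teams and becomes less likely as the rating gap grows. The split
    keeps p_win + 0.5 * p_draw equal to the binary Elo expectation.
    """
    e = expected_score(rating_a, rating_b)
    p_draw = max_draw * 4 * e * (1 - e)
    p_win = e - p_draw / 2
    p_loss = (1 - e) - p_draw / 2
    return p_win, p_draw, p_loss


# Example with two closely rated teams (ratings 2028 and 2015)
p_win, p_draw, p_loss = ternary_probabilities(2028, 2015)
print(round(p_win, 3), round(p_draw, 3), round(p_loss, 3))
```

Whatever the split, the three probabilities should sum to 1, and it's convenient if the expected score stays the same as in the binary case so the rating update still makes sense.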

In [1]:
import numpy as np
import pandas as pd
import csv
from tqdm import tqdm
from joblib import Parallel, delayed

from src.copa_america_simulator import *

Since I want to simulate the group stage many times to generate a distribution of outcomes, I will use the `joblib` module to parallelize the simulation so that it runs in a reasonable amount of time. That requires wrapping the group-stage simulation in a function that returns its results.

In [2]:
def run_group_stage_simulation(n, j):
    """
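    Run n simulations of the Copa America group stage and return the
    roster with each team's average points in column avg_pts_{j+1},
    where j is the index of this batch of simulations.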
    """
    
    teams_pd = pd.read_csv("data/roster.csv")
    
    for i in range(n):
        games = read_games("data/matches.csv")
        teams = {}
    
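        # Rebuild the teams dict on every run so points and ratings start fresh;
        # the context manager closes the file handle after reading
        with open("data/roster.csv") as f:
            for row in csv.DictReader(f):
                teams[row['team']] = {
                    'name': row['team'],
                    'rating': float(row['rating']),
                    'points': 0
                    }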
    
        simulate_group_stage(
            games,
            teams,
            ternary=True
            )
    
        collector = []
        for key in teams.keys():
            collector.append(
                {"team": key,
                 f"simulation{i+1}": teams[key]['points']}
            )

        temp = pd.DataFrame(collector)
        teams_pd = pd.merge(teams_pd, temp)
    
    sim_cols = [
        a for a in teams_pd.columns if "simulation" in a]
    teams_pd[
        f"avg_pts_{j+1}"
        ] = teams_pd[sim_cols].mean(axis=1)
    not_sim = [
        b for b in teams_pd.columns if "simulation" not in b]
    simulation_result = teams_pd[not_sim]
    
    return simulation_result
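
`simulate_group_stage` lives in `src/copa_america_simulator.py` and isn't shown here. To give a rough idea of what a single pass over the schedule might look like, here's a self-contained sketch under my own assumptions: outcomes are drawn from (home win, tie, away win) probabilities supplied by the caller, a win is worth 3 points, and a tie is worth 1. `play_group`, its arguments, and the probabilities are hypothetical, not the module's API, and the sketch skips the Elo update after each game.

```python
import random


def play_group(games, teams, probs, rng=random):
    """Award group-stage points for one simulated pass (hypothetical helper).

    games: list of (home, away) tuples
    teams: dict of team name -> {'points': int}
    probs: dict of (home, away) -> (p_home_win, p_tie, p_away_win)
    """
    for home, away in games:
        # Draw one of the three outcomes according to its probability
        outcome = rng.choices(
            ["home", "tie", "away"], weights=probs[(home, away)]
        )[0]
        if outcome == "home":
            teams[home]["points"] += 3
        elif outcome == "away":
            teams[away]["points"] += 3
        else:
            teams[home]["points"] += 1
            teams[away]["points"] += 1
    return teams


rng = random.Random(0)  # seeded only to make the sketch reproducible
games = [("Colombia", "Brazil"), ("Paraguay", "Costa Rica")]
teams = {name: {"points": 0} for pair in games for name in pair}
probs = {game: (0.4, 0.3, 0.3) for game in games}  # made-up probabilities
play_group(games, teams, probs, rng=rng)
print(teams)
```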

### Simulate group stage

#### The gist is to read from two files: one defining the match schedule, the other listing the teams and their relative strengths (given by Elo ratings prior to the start of the event)

In [3]:
# Reads in the matches and teams as dictionaries and proceeds with that data type
n = 100  # Simulations per batch (averaged into one avg_pts column)
m = 100  # Number of batches, i.e. how many averaged results to collect

roster_pd = Parallel(n_jobs=5)(
    delayed(run_group_stage_simulation)(
        n, j) for j in tqdm(range(m)))

for t in tqdm(range(m)):
    if t == 0:
        roster = pd.merge(
            roster_pd[t],
            roster_pd[t+1]
            )
    elif t >= 2:
        roster = pd.merge(
            roster,
            roster_pd[t]
            )
    else:
        pass

100%|██████████| 100/100 [00:08<00:00, 11.27it/s]
100%|██████████| 100/100 [00:00<00:00, 564.66it/s]
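
The merge loop above folds a list of DataFrames into a single one. The same fold can be written with `functools.reduce`, which also covers the case of a single batch (`m = 1`). Here's a small sketch; the `frames` list and its numbers are toy placeholders, not simulation output.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the per-batch results returned by the workers
frames = [
    pd.DataFrame(
        {"team": ["Uruguay", "Panama"], f"avg_pts_{j + 1}": [6.5 + j, 2.5 + j]}
    )
    for j in range(3)
]

# With no `on` argument, pd.merge joins on the columns both frames share
# (here only "team"), so each step adds one avg_pts column
merged = reduce(pd.merge, frames)
print(merged.columns.tolist())
```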


In [4]:
sim_cols = [i for i in roster.columns if "avg_pts" in i]

In [5]:
roster['avg_sim_pts'] = roster[sim_cols].mean(axis=1)
roster['99%CI_low'] = roster[sim_cols] \
    .quantile(q=0.005, axis=1)
roster['99%CI_high'] = roster[sim_cols] \
    .quantile(q=0.995, axis=1)
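
Note that these bounds are the 0.5th and 99.5th percentiles of the batch averages in each row, i.e. an empirical 99% interval for a team's average points over `n` simulations. A toy example with made-up numbers shows what `quantile(..., axis=1)` does:

```python
import pandas as pd

# Made-up batch averages for two placeholder teams
toy = pd.DataFrame(
    {"avg_pts_1": [6.0, 1.0], "avg_pts_2": [7.0, 2.0], "avg_pts_3": [8.0, 3.0]},
    index=["team_a", "team_b"],
)

# axis=1 takes the quantile across columns, giving one value per row (team)
low = toy.quantile(q=0.005, axis=1)
high = toy.quantile(q=0.995, axis=1)
print(pd.DataFrame({"low": low, "high": high}))
```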

In [6]:
not_sim = [
    j for j in roster.columns if "avg_pts" not in j]

#### Simulation is done, now take a look at the results for the group stage

In [7]:
roster[not_sim].sort_values(
    by=[
        'group',
        'avg_sim_pts'
        ],
    ascending=False
    )

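|    | group | team | rating | avg_sim_pts | 99%CI_low | 99%CI_high |
|----|-------|------|--------|-------------|-----------|------------|
| 13 | D | Colombia | 2015 | 6.6927 | 6.3399 | 7.2001 |
| 12 | D | Brazil | 2028 | 5.3841 | 5.06495 | 5.72 |
| 14 | D | Paraguay | 1710 | 2.6821 | 2.33485 | 3.0601 |
| 15 | D | Costa Rica | 1620 | 1.4478 | 1.23 | 1.76525 |
| 9 | C | Uruguay | 1992 | 6.7087 | 6.2896 | 7.13545 |
| 8 | C | United States | 1790 | 4.7133 | 4.31495 | 5.3002 |
| 10 | C | Panama | 1698 | 2.8759 | 2.4798 | 3.3205 |
| 11 | C | Bolivia | 1592 | 1.8332 | 1.42475 | 2.1803 |
| 5 | B | Ecuador | 1876 | 5.7484 | 5.239 | 6.11535 |
| 4 | B | Mexico | 1791 | 4.59 | 4.0594 | 4.9807 |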


You can see that:  
- Group A should see Argentina easily top the group. However, while Chile has a clear edge over Peru, it's not definitive, since their confidence intervals overlap. Canada should definitely end up in last place.  
- Group B should definitely be: Ecuador, Mexico, Venezuela, and Jamaica. There's no overlap in confidence intervals, and if we look at their team strengths, it's clear that Ecuador tops the group. Although I would say this is the sleeper group... lower overall team strength than the rest of the tournament.  
- Group C should definitely be: Uruguay, United States, Panama, and Bolivia. Panama might at best get a win against Bolivia, just like in the 2016 Copa America (when it was drawn into a group with eventual finalists Argentina and Chile).  
- Group D should definitely be: Colombia, Brazil, Paraguay, and Costa Rica.  
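
The "no overlap" claims can be checked mechanically. Here's a small sketch using the Group D bounds from the table above; `intervals_overlap` is a helper I'm introducing just for this illustration.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (low, high) and b = (low, high) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# (99%CI_low, 99%CI_high) copied from the results table
group_d = {
    "Colombia": (6.3399, 7.2001),
    "Brazil": (5.06495, 5.72),
    "Paraguay": (2.33485, 3.0601),
    "Costa Rica": (1.23, 1.76525),
}

# Compare each team with the one ranked right below it
names = list(group_d)
for first, second in zip(names, names[1:]):
    print(first, second, intervals_overlap(group_d[first], group_d[second]))
```

Only adjacent teams need comparing: if neighbors in the ranking don't overlap, teams further apart won't either.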

I guess the Euros are a bit more interesting...

### Simulating knockout stage  
Here's where it gets interesting.

In [8]:
# Now, doing the Monte Carlo simulations
n = 10000
playoff_results_teams = []
playoff_results_stage = []

for i in tqdm(range(n)):
    overall_result_teams = dict()
    overall_result_stage = dict()
    games = read_games("data/playoff_matches1.csv")
    teams = {}
    
    # Context manager closes the roster file on each of the n iterations
    with open("data/playoff_roster1.csv") as f:
        for row in csv.DictReader(f):
            teams[row['team']] = {
                'name': row['team'],
                'rating': float(row['rating'])
                }
    
    simulate_playoffs(games, teams, ternary=True)
    
    playoff_pd = pd.DataFrame(games)
    
    # This is for collecting results of simulations per team
    for key in teams.keys():
        overall_result_teams[key] = collect_playoff_results(
            key,
            playoff_pd
            )
    playoff_results_teams.append(overall_result_teams)
    
    # Now, collecting results from stage-perspective
    overall_result_stage['whole_bracket'] = playoff_pd['advances'].to_list()
    overall_result_stage['Semifinals'] = playoff_pd.loc[playoff_pd['stage'] == 'quarterfinals', 'advances'].to_list()
    overall_result_stage['Final'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'advances'].to_list()
    overall_result_stage['third_place_match'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'loses'].to_list()
    overall_result_stage['fourth_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'loses'].to_list()[0]
    overall_result_stage['third_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'advances'].to_list()[0]
    overall_result_stage['second_place'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'loses'].to_list()[0]
    overall_result_stage['Champion'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'advances'].to_list()[0]
    overall_result_stage['match4'] = list(playoff_pd.loc[4, ['home_team', 'away_team']])
    overall_result_stage['match5'] = list(playoff_pd.loc[5, ['home_team', 'away_team']])
    
    playoff_results_stage.append(overall_result_stage)

100%|██████████| 10000/10000 [01:22<00:00, 121.79it/s]


In [9]:
results_teams = pd.DataFrame(playoff_results_teams)

In [14]:
results_teams['Argentina'].value_counts(normalize=True)

Argentina
Champion         0.5527
Second_place     0.2545
Quarterfinals    0.0810
Third_place      0.0801
Fourth_place     0.0317
Name: proportion, dtype: float64

In [11]:
results_stage = pd.DataFrame(playoff_results_stage)

In [15]:
results_stage['Champion'].value_counts(normalize=True)

Champion
Argentina        0.5527
Colombia         0.1459
Brazil           0.1133
Uruguay          0.0936
Ecuador          0.0591
Mexico           0.0174
Chile            0.0116
United States    0.0064
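
With 10,000 runs, these proportions still carry some Monte Carlo noise. A quick binomial standard-error check (assuming the runs are independent) gives a sense of how many decimals are meaningful. The helper name `proportion_se` is mine, and the interval uses the usual normal approximation.

```python
import math


def proportion_se(p, n):
    """Standard error of a proportion estimated from n independent runs."""
    return math.sqrt(p * (1 - p) / n)


n = 10000
p_argentina = 0.5527  # Argentina's title share from the output above
se = proportion_se(p_argentina, n)

# Normal-approximation 95% interval around the estimate
low, high = p_argentina - 1.96 * se, p_argentina + 1.96 * se
print(f"{p_argentina:.4f} +/- {1.96 * se:.4f}")
```

So Argentina's title probability is pinned down to within about one percentage point. This only measures simulation noise, not how good the Elo-based model is.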
Name: proportion, dtype: float64