# FIFA World Cup 2026 – Poisson Model & Monte Carlo Simulation

This notebook analyses a full-tournament prediction model for the 2026 FIFA World Cup, based on:

- Match data from international fixtures (2018–2025)
- A Poisson regression model estimating team attack/defence strength and home advantage
- Monte Carlo simulations of:
  - The group stage (match-by-match Poisson score sampling)
  - A 32-team knockout bracket (Round of 32 → Final)

**Key outputs:**

- Group-stage expectations (points, qualification odds, finishing positions)
- Knockout progression probabilities (R16, QF, SF, Final, Champion)
- Visualisations and short interpretations for portfolio / GitHub / LinkedIn.
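
The "match-by-match Poisson score sampling" mentioned above can be illustrated with a minimal sketch. The expected-goals values, the seed and the helper name `simulate_match` are assumptions for illustration only; they are not taken from the fitted model or the project code.

```python
import numpy as np


def simulate_match(lambda_home, lambda_away, n_sims, rng):
    """Draw n_sims scorelines from two independent Poisson distributions."""
    home_goals = rng.poisson(lambda_home, size=n_sims)
    away_goals = rng.poisson(lambda_away, size=n_sims)
    return home_goals, away_goals


rng = np.random.default_rng(seed=2026)

# Hypothetical expected goals, not values from the fitted model
home_goals, away_goals = simulate_match(1.6, 1.1, n_sims=20_000, rng=rng)

# Outcome frequencies across the simulated scorelines
p_home_win = float(np.mean(home_goals > away_goals))
p_draw = float(np.mean(home_goals == away_goals))
p_away_win = float(np.mean(home_goals < away_goals))

print(f"home {p_home_win:.3f} | draw {p_draw:.3f} | away {p_away_win:.3f}")
```

Repeating this for every group fixture and tallying points per simulated tournament is what produces group-stage expectations of the kind analysed below.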


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import display  # explicit import so display() also works outside Jupyter

# Display options
pd.set_option("display.max_columns", 50)
pd.set_option("display.float_format", lambda x: f"{x:0.4f}")

PROJECT_ROOT = "/Users/Ole_Kruse/Desktop/coding_project"
DATA_PROCESSED_DIR = os.path.join(PROJECT_ROOT, "data", "processed")

PROJECT_ROOT, DATA_PROCESSED_DIR

In [None]:
# Group-stage match-level probabilities
group_match_probs_path = os.path.join(
    DATA_PROCESSED_DIR, "wc2026_group_stage_match_probs.csv"
)
group_match_probs = pd.read_csv(group_match_probs_path, parse_dates=["date"])

# Full-tournament simulation summary (group + knockout)
full_tournament_path = os.path.join(
    DATA_PROCESSED_DIR, "wc2026_full_tournament_simulation_summary.csv"
)
tournament_summary = pd.read_csv(full_tournament_path)

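# Show each preview as its own rendered table (a tuple would fall back to plain text)
display(group_match_probs.head())
display(tournament_summary.head())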


## 1. Data overview

We first inspect the two main outputs:

1. `wc2026_group_stage_match_probs.csv`
   - One row per group-stage match
   - Expected goals for each team (`lambda_home`, `lambda_away`)
   - Win/draw/loss probabilities (`p_home_win`, `p_draw`, `p_away_win`)

2. `wc2026_full_tournament_simulation_summary.csv`
   - One row per team
   - Group-stage expectations (points, goal difference, goals scored, position probabilities)
   - Knockout probabilities (reaching R32, R16, QF, SF, Final, Champion)
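
To see how win/draw/loss probabilities relate to the two expected-goals values, here is a hedged sketch that sums the joint probability of every scoreline under independent Poisson goal counts. Whether the CSV was produced exactly this way is an assumption; `poisson_pmf`, `outcome_probs`, the goal cap of 10 and the example rates are illustrative.

```python
import math

import numpy as np


def poisson_pmf(lam, max_goals):
    """P(X = k) for k = 0..max_goals under Poisson(lam), with lam > 0."""
    k = np.arange(max_goals + 1)
    log_factorial = np.array([math.lgamma(i + 1) for i in k])
    return np.exp(-lam + k * np.log(lam) - log_factorial)


def outcome_probs(lambda_home, lambda_away, max_goals=10):
    """Return (p_home_win, p_draw, p_away_win) for independent Poisson goals."""
    # joint[i, j] = P(home scores i) * P(away scores j)
    joint = np.outer(poisson_pmf(lambda_home, max_goals),
                     poisson_pmf(lambda_away, max_goals))
    p_home_win = np.tril(joint, k=-1).sum()  # i > j: below the diagonal
    p_draw = np.trace(joint)                 # i == j: the diagonal
    p_away_win = np.triu(joint, k=1).sum()   # i < j: above the diagonal
    # Renormalise to absorb the small mass lost by capping goals at max_goals
    total = p_home_win + p_draw + p_away_win
    return p_home_win / total, p_draw / total, p_away_win / total


# Hypothetical expected goals, not values from the CSV
p_home_win, p_draw, p_away_win = outcome_probs(1.6, 1.1)
print(f"home {p_home_win:.3f} | draw {p_draw:.3f} | away {p_away_win:.3f}")
```

A quick sanity check on the real file is that `p_home_win + p_draw + p_away_win` is close to 1 in every row.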


In [None]:
print("Group match probabilities:")
display(group_match_probs.describe())

print("\nFull tournament summary:")
display(tournament_summary.describe())

print("\nColumns in tournament_summary:")
print(tournament_summary.columns.tolist())


## 2. Tournament favourites – Win probabilities

We start with a simple ranking of teams by the probability of winning the World Cup (`prob_W`).

This gives a clear picture of which teams the model considers top contenders and how concentrated the title chances are.
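
One way to put a number on "how concentrated" the title chances are is sketched below, using the cumulative share of the top teams and an effective number of contenders (the inverse of the sum of squared probabilities). The toy data and both metrics are assumptions chosen for illustration; only the column names `team` and `prob_W` come from the summary file.

```python
import pandas as pd

# Toy title probabilities, purely illustrative
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E"],
    "prob_W": [0.40, 0.30, 0.15, 0.10, 0.05],
})

ranked = toy.sort_values("prob_W", ascending=False)

# Share of the total title probability held by the three strongest teams
top3_share = ranked["prob_W"].head(3).sum()

# Herfindahl-style concentration index and its inverse:
# the inverse equals n when n teams are equally likely, and 1 for a lone favourite
concentration = (toy["prob_W"] ** 2).sum()
effective_contenders = 1.0 / concentration

print(f"top-3 share: {top3_share:.2f}")
print(f"effective number of contenders: {effective_contenders:.2f}")
```

Applied to `tournament_summary`, the same two lines give a single-number summary to accompany the bar chart.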


In [None]:
# Sort teams by probability of winning the World Cup
favorites = tournament_summary.sort_values("prob_W", ascending=False).reset_index(drop=True)

# Take top 15 for a compact view
top_n = 15
favorites_top = favorites.head(top_n)

display(favorites_top[["team", "group", "prob_W", "prob_F", "prob_SF", "prob_QF"]])

# Bar plot of winner probabilities
plt.figure(figsize=(10, 6))
plt.barh(favorites_top["team"][::-1], favorites_top["prob_W"][::-1])
plt.xlabel("Probability of winning the World Cup")
plt.title("World Cup 2026 – Title probabilities (Top 15 teams)")
plt.tight_layout()
plt.show()


### Interpretation

Some points you might mention in your README / LinkedIn post:

- **Top tier:** Brazil, Spain, England, Argentina form the leading cluster with roughly 9–13% title probability each.
- **Next tier:** Portugal, France, Colombia, Belgium sit just below, reflecting strong recent performance but a slightly tougher path.
- **Dark horses:** Netherlands, Denmark, Uruguay, Croatia, Germany and Italy still have non-trivial paths to the trophy, but require multiple favourable results in high-variance knockout matches.
- The model shows **realistic parity** at the very top: no team has an absurdly high (>30%) title probability, because the knockout bracket is randomly seeded and strong teams can meet early.


## 3. Group-by-group qualification and finishing probabilities

For each group (A–L), we visualise:

- Probability of finishing **1st**, **2nd**, **3rd**, **4th**
- Probability of **qualifying for the knockout phase** (`prob_qual` – reaching the Round of 32)

This illustrates:
- Which groups are “groups of death”
- Which groups have an overwhelming favourite and which are evenly balanced
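
As a rough way to rank groups by balance, one could compare the spread of expected points and the gap in qualification probability within each group. This is only one possible heuristic, and the toy numbers below are invented; the column names `group`, `team`, `exp_points` and `prob_qual` match the summary file.

```python
import pandas as pd

# Toy data: group X has a clear favourite, group Y is tightly packed
toy = pd.DataFrame({
    "group": ["X"] * 4 + ["Y"] * 4,
    "team": ["X1", "X2", "X3", "X4", "Y1", "Y2", "Y3", "Y4"],
    "exp_points": [7.5, 4.5, 3.0, 1.5, 4.6, 4.3, 4.0, 3.6],
    "prob_qual": [0.97, 0.70, 0.45, 0.15, 0.68, 0.62, 0.55, 0.45],
})

balance = (
    toy.groupby("group")
    .agg(
        points_spread=("exp_points", "std"),  # low spread = balanced group
        qual_gap=("prob_qual", lambda s: s.max() - s.min()),
    )
    .sort_values("points_spread")
)

print(balance)
```

Groups at the top of this table are the closest contests; a balanced group that also has a high average `exp_points` would be a candidate "group of death".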


In [None]:
def plot_group_overview(group_id: str, df: pd.DataFrame):
    """
    Plot finishing position & qualification probabilities for one group.
    """
    group_df = df[df["group"] == group_id].copy()
    group_df = group_df.sort_values("prob_1st", ascending=False)

    display(group_df[[
        "team", "exp_points", "exp_gd", "exp_gf",
        "prob_1st", "prob_2nd", "prob_3rd", "prob_4th", "prob_qual"
    ]])

    teams = group_df["team"]

    # Stacked bar for positions
    pos_matrix = group_df[["prob_1st", "prob_2nd", "prob_3rd", "prob_4th"]].values

    plt.figure(figsize=(10, 5))
    bottom = np.zeros(len(teams))
    labels = ["1st", "2nd", "3rd", "4th"]
    for i, label in enumerate(labels):
        plt.bar(teams, pos_matrix[:, i], bottom=bottom, label=label)
        bottom += pos_matrix[:, i]

    plt.ylabel("Probability")
    plt.title(f"Group {group_id} – Finishing position probabilities")
    plt.xticks(rotation=45, ha="right")
    plt.legend(title="Position")
    plt.tight_layout()
    plt.show()

    # Qualification probability
    plt.figure(figsize=(8, 4))
    plt.bar(teams, group_df["prob_qual"])
    plt.ylabel("Probability of reaching Round of 32")
    plt.title(f"Group {group_id} – Qualification probabilities")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()


In [None]:
all_groups = sorted(tournament_summary["group"].unique())
all_groups


In [None]:
for g in all_groups:
    print(f"\n=== Group {g} ===")
    plot_group_overview(g, tournament_summary)


## 4. Deep-run probabilities (QF, SF, Final, Champion)

Next we focus on the probability that teams reach:

- At least the **Quarterfinals** (`prob_QF`)
- At least the **Semifinals** (`prob_SF`)
- The **Final** (`prob_F`)
- Win the **World Cup** (`prob_W`)

This highlights not only favourites to win, but also teams that are likely to make a deep run.
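
Because these columns are "reach at least this round" probabilities, they should be non-increasing from `prob_QF` to `prob_W`, and their ratios can be read as conditional probabilities. The sketch below uses invented numbers and assumes the columns are cumulative in that sense; the derived column names are hypothetical.

```python
import pandas as pd

# Toy progression probabilities, purely illustrative
toy = pd.DataFrame({
    "team": ["A", "B"],
    "prob_QF": [0.60, 0.30],
    "prob_SF": [0.40, 0.12],
    "prob_F": [0.25, 0.05],
    "prob_W": [0.15, 0.02],
})

stages = ["prob_QF", "prob_SF", "prob_F", "prob_W"]

# Each later stage must be no more likely than the one before it
step_changes = toy[stages].diff(axis=1).iloc[:, 1:]
is_monotone = (step_changes <= 1e-12).all(axis=1)

# Conditional probabilities (NaN or inf would appear if the denominator is 0)
toy["p_final_given_sf"] = toy["prob_F"] / toy["prob_SF"]
toy["p_win_given_final"] = toy["prob_W"] / toy["prob_F"]

print(is_monotone.tolist())
print(toy[["team", "p_final_given_sf", "p_win_given_final"]])
```

A team with a high `prob_QF` but a low `p_final_given_sf` is likely to progress early and then struggle against the strongest sides.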


In [None]:
# Select relevant columns
deep_run = tournament_summary[[
    "team", "group", "prob_QF", "prob_SF", "prob_F", "prob_W"
]].copy()

# Sort by probability of reaching semi-finals as a proxy for deep run strength
deep_run_sorted = deep_run.sort_values("prob_SF", ascending=False).reset_index(drop=True)

display(deep_run_sorted.head(20))

# Grouped bar chart (side-by-side bars per team) for the top 12 teams
top_deep = deep_run_sorted.head(12)

x = np.arange(len(top_deep))
width = 0.2

plt.figure(figsize=(12, 6))
plt.bar(x - width, top_deep["prob_QF"], width, label="Reach QF")
plt.bar(x,         top_deep["prob_SF"], width, label="Reach SF")
plt.bar(x + width, top_deep["prob_F"],  width, label="Reach Final")

plt.xticks(x, top_deep["team"], rotation=45, ha="right")
plt.ylabel("Probability")
plt.title("Deep run probabilities – Top 12 teams")
plt.legend()
plt.tight_layout()
plt.show()


## 5. Match-level examples – Upsets & balanced fixtures

The Poisson model also gives **match-level** probabilities for each group-stage game:

- Expected goals per team (`lambda_home`, `lambda_away`)
- Win/draw/loss probabilities (`p_home_win`, `p_draw`, `p_away_win`)

We can use this to:
- Identify potential **upsets**
- Highlight the most **balanced fixtures**


In [None]:
# Copy with helper columns
m = group_match_probs.copy()
m["favoured_team"] = np.where(
    m["p_home_win"] > m["p_away_win"],
    m["home_team"],
    m["away_team"],
)
m["favoured_prob"] = m[["p_home_win", "p_away_win"]].max(axis=1)

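# Most balanced fixtures: the 10 matches where the favourite's win probability is lowest
# (no fixed threshold is applied; add e.g. m[m["favoured_prob"] < 0.6] to filter explicitly)
possible_upsets = m.sort_values("favoured_prob").head(10)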

print("Most balanced / potential upset matches:")
display(possible_upsets[[
    "date", "group", "home_team", "away_team",
    "lambda_home", "lambda_away",
    "p_home_win", "p_draw", "p_away_win",
    "favoured_team", "favoured_prob"
]])

## 6. Limitations & Modelling Choices

Some important caveats and modelling choices to note (for your README):

- **Data window:** Only matches from 2018–2025 are used, which reflects recent form but may overweight short-term trends.
- **Team-level model:** The Poisson model uses team-level attack/defence parameters and a global home advantage. It does not model:
  - Injuries
  - Squad rotation
  - Individual player strengths
- **Poisson assumptions:**
  - Goals are modelled as independent Poisson random variables for home and away teams.
  - This ignores tactical “game state” effects (e.g. teams shutting down at 1–0).
- **Knockout structure:**
  - A simplified 32-team bracket is used with random seeding of qualified teams.
  - Official FIFA 2026 third-place rules and fixed bracket positions are not replicated exactly (by design, to keep the project understandable and reproducible).
- **No betting odds or Elo ratings:**
  - The model is entirely driven by fixture outcomes and goal counts in international matches.
  - You can mention that this is a “purely data-driven Poisson model” without market information.

Despite these limitations, the pipeline is:
- fully reproducible,
- conceptually transparent, and
- strong enough for a GitHub portfolio project.
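
To make the simplified knockout structure concrete, here is a minimal sketch of a randomly seeded single-elimination bracket with Poisson-sampled scores. The scalar `strength` values, the base rate of 1.3 goals, the coin flip for drawn matches (standing in for extra time and penalties) and all function names are assumptions for illustration, not the project's actual implementation.

```python
from collections import Counter

import numpy as np


def simulate_bracket(teams, strength, rng, base_rate=1.3):
    """Randomly seed the teams, then play knockout rounds until one remains."""
    assert len(teams) & (len(teams) - 1) == 0, "needs a power-of-two field"
    order = rng.permutation(len(teams))  # random seeding
    alive = [teams[i] for i in order]
    while len(alive) > 1:
        winners = []
        for a, b in zip(alive[0::2], alive[1::2]):
            # Hypothetical expected goals from a simple strength ratio
            goals_a = rng.poisson(base_rate * strength[a] / strength[b])
            goals_b = rng.poisson(base_rate * strength[b] / strength[a])
            if goals_a == goals_b:
                # Assumption: a drawn match is decided by a fair coin flip
                winner = a if rng.random() < 0.5 else b
            else:
                winner = a if goals_a > goals_b else b
            winners.append(winner)
        alive = winners
    return alive[0]


rng = np.random.default_rng(seed=7)
teams = [f"T{i:02d}" for i in range(32)]
strength = {t: 1.0 for t in teams}
strength["T00"] = 2.0  # one clearly stronger hypothetical team

n_sims = 2_000
champions = Counter(simulate_bracket(teams, strength, rng) for _ in range(n_sims))
title_probs = {t: champions[t] / n_sims for t in teams}

print(max(title_probs, key=title_probs.get), max(title_probs.values()))
```

Because the seeding is redrawn in every simulation, each team's title probability averages over all possible paths, which is one reason no team's chances become extreme.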


## 7. Possible Extensions

If you want to extend the project further:

1. **More realistic knockout bracket**
   - Implement the official FIFA 2026 structure with:
     - Fixed positions for group winners/runners-up
     - Correct mapping of third-place teams

2. **Time-decayed weights**
   - Give more weight to recent matches (e.g. exponential decay based on match date).

3. **Home/continent effects**
   - Add an extra advantage for the host nations (USA, Mexico, Canada), or more broadly for CONCACAF teams.

4. **Interactive dashboard**
   - Build a Streamlit or Plotly Dash app that lets users:
     - Adjust number of simulations
     - Inspect probabilities by team or group
     - Explore hypothetical changes (e.g. injuries, form shocks)

5. **Comparison with betting odds**
   - Collect bookmaker implied probabilities before the tournament and compare:
     - Which teams the model over/underestimates relative to the market.
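
The time-decayed weighting from extension 2 could be sketched as follows. The half-life of 365 days, the reference date and the function name `time_decay_weights` are assumptions for illustration; the resulting weights would then be supplied as observation weights when refitting the Poisson model.

```python
import numpy as np
import pandas as pd


def time_decay_weights(match_dates, reference_date, half_life_days=365.0):
    """Exponential decay: a match half_life_days old gets weight 0.5."""
    dates = pd.to_datetime(pd.Series(match_dates))
    age_days = (pd.Timestamp(reference_date) - dates).dt.days
    return np.power(0.5, age_days / half_life_days)


# Hypothetical match dates: 0, 365 and 730 days before the reference date
matches = pd.DataFrame({"date": ["2026-01-01", "2025-01-01", "2024-01-02"]})
matches["weight"] = time_decay_weights(matches["date"], "2026-01-01")

print(matches)
```

A shorter half-life makes the model react faster to recent form, at the cost of noisier strength estimates for teams that play few matches.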
