<h1>
      <h4>
          <p style="font-size:36pt">MSIN0093: Business Strategy and Analytics</p>
          <p style="font-size:26pt">2022 FIFA World Cup winner predictive model and its further implications for sports businesses</p>
          <p><font color=blue size=26pt>Worksheet Notebook</font></p>
      </h4>
</h1>

# Table of Contents
* [Introduction](#Introduction)
* [Data sourcing & manipulation](#Data-sourcing-&-manipulation)
* [Descriptive analytics and Exploratory Data Analysis (EDA)](#Descriptive-analytics-and-Exploratory-Data-Analysis-(EDA))
   * [Descriptive analytics](#Descriptive-analytics)
   * [EDA](#EDA)
* [Data wrangling](#Data-source-and-wrangling)
   * [Variable selection & transformation](#Variable-selection-&-transformation)
   * [Data split: Training & Testing](#Data-split:-Training-&-Testing)
* [Model methodology](#Model-methodology)
   * [Model structure](#Model-structure)
     * [Group phase model](#Group-phase-model)
     * [Knockout phase model](#Knockout-phase-model)
     * [Estimate model](#Estimate-model)
     * [Define testing metrics](#Define-testing-metrics)
* [Results and model testing](#Results-and-model-testing)
   * [Simulation outcomes](#Simulation-outcomes)
   * [Backtesting](#Backtesting)
   * [K-fold validation](#K-fold-validation)
   * [Parameter tuning](#Parameter-tuning)
* [Model refitting, testing & evaluation](#Model-refitting,-testing-&-evaluation)
* [2022 FIFA World Cup winner prediction](#2022-FIFA-World-Cup-winner-prediction)
* [Discussion](#Discussion)
   * [Subsection 5](#Subsection-5)
   * [Subsection 6](#Subsection-6)
* [Limitations](#Limitations)
* [Appendix](#Appendix)
* [Conclusion](#Conclusion)
* [Reference list](#Reference-list)

## Introduction

Write some description about the background to our idea.

In [1]:
# Autosave
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."
%autosave 120

Autosaving every 120 seconds


In [2]:
# Import useful Python libraries
import os

import pandas as pd
import numpy as np

!pip install pandas_profiling
from pandas_profiling import ProfileReport

from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
#import sklearn.preprocessing as skl 
from scipy.stats import skellam

[1m
         .:::.     .::.       
        ....yy:    .yy.       
        :.  .yy.    y.        
             :y:   .:         
             .yy  .:          
              yy..:           
              :y:.            
              .y.             
             .:.              
        ....:.                
        :::.                  
[0;33m
• Project files and data should be stored in /project. This is shared among everyone
  in the project.
• Personal files and configuration should be stored in /home/faculty.
• Files outside /project and /home/faculty will be lost when this server is terminated.
• Create custom environments to setup your servers reproducibly.
[0m
Collecting pandas_profiling
  Using cached pandas_profiling-3.5.0-py2.py3-none-any.whl (325 kB)
Collecting phik<0.13,>=0.11.1
  Using cached phik-0.12.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (709 kB)
Collecting pydantic<1.11,>=1.8.1
  Using cached pydantic-1.10.2-cp39-cp39-manylinux_2_17_x86_64.

In [3]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [4]:
# Placeholder to define bespoke functions we will use in the notebook later


## Data sourcing & manipulation

#### Read in the original dataset from Kaggle containing match scores for all international matches (incl. friendlies).
[link: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017]
#### The modelling dataset should only contain past competitive matches of teams that participated in the last 2 FIFA World Cups (ie. Brazil 2014 and Russia 2018) and the upcoming one in Qatar.

In [5]:
# Read in the raw football dataset contain all international matches
raw_df = pd.read_csv("Data/kaggle_international_games_scores.csv")
raw_df.shape
print("The raw dataset has {0} rows and {1} columns.".format(raw_df.shape[0],raw_df.shape[1]))

The raw dataset has 44060 rows and 9 columns.


In [6]:
# Print columns
raw_df.columns

Index(['date', 'home_team', 'away_team', 'home_score', 'away_score',
       'tournament', 'city', 'country', 'neutral'],
      dtype='object')

In [7]:
# Print first 5 rows
raw_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0.0,0.0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4.0,2.0,Friendly,London,England,False
2,1874-03-07,Scotland,England,2.0,1.0,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2.0,2.0,Friendly,London,England,False
4,1876-03-04,Scotland,England,3.0,0.0,Friendly,Glasgow,Scotland,False


In [8]:
# Firstly we remove friendly games from the dataset
raw_df['tournament'].value_counts()

Friendly                                         17425
FIFA World Cup qualification                      7774
UEFA Euro qualification                           2593
African Cup of Nations qualification              1932
FIFA World Cup                                     900
Copa América                                       841
AFC Asian Cup qualification                        764
African Cup of Nations                             742
CECAFA Cup                                         620
CFU Caribbean Cup qualification                    606
Merdeka Tournament                                 595
British Championship                               505
UEFA Nations League                                468
Gulf Cup                                           380
AFC Asian Cup                                      370
Gold Cup                                           358
Island Games                                       350
UEFA Euro                                          337
COSAFA Cup

In [9]:
# Exclude friendlies
raw_df = raw_df.loc[raw_df.loc[:,'tournament']!='Friendly']
print(len(raw_df))

26635


In [10]:
# Then we would like to remove all games played before 1916 as we think that football matches before the 20th century are not an accurate reflection of modern day football
# Set the date to datetime variable
raw_df['date'] = pd.to_datetime(raw_df['date'], errors= 'raise')
raw_df['year'] = raw_df['date'].dt.year
raw_df = raw_df.loc[raw_df.loc[:,'year']>=1916]
print(len(raw_df))

26419


In [11]:
# Rename United States to USA in any columns which contains it
interim_df = raw_df.copy()
interim_df['home_team'] = np.where(interim_df['home_team']=='United States', 'USA', interim_df['home_team'])
interim_df['away_team'] = np.where(interim_df['away_team']=='United States', 'USA', interim_df['away_team'])
interim_df['country'] = np.where(interim_df['country']=='United States', 'USA', interim_df['country'])

### Now we filter for the countries that only participitated in the last World Cup (Russia 2018) and in the upcoming one in Qatar. 
### This is needed as we will have to estimate parameters for all of them in order to assess the accuracy of our model on the last tournament.

In [12]:
# The list of countries is the following
world_cup_countries = ['Qatar', 'Ecuador','Senegal','Netherlands','England', 'Iran','USA', 'Wales',
'Argentina', 'Saudi Arabia', 'Mexico','Poland', 'France', 'Australia', 'Denmark', 'Tunisia','Spain',
'Costa Rica', 'Germany', 'Japan', 'Belgium', 'Canada', 'Morocco', 'Croatia', 'Brazil', 'Serbia',
'Switzerland', 'Cameroon', 'Portugal', 'Ghana', 'Uruguay', 'South Korea', 'Russia', 'Egypt', 'Peru', 
'Nigeria', 'Iceland', 'Sweden', 'Panama', 'Colombia']

# We only select rows where either the home or away team is one of the coutries is in the above list
interim_df = interim_df.loc[(interim_df['home_team'].isin(world_cup_countries)) | (interim_df['away_team'].isin(world_cup_countries))] 
print(interim_df.shape)


(12265, 10)


In [13]:
# The list of tournaments that we care about (we will ignore small random competitions)
relevant_tournaments = ['FIFA World Cup qualification', 'FIFA World Cup', 'Copa América', 'Copa América qualification','UEFA Euro qualification',
'African Cup of Nations', 'African Cup of Nations qualification', 'AFC Asian Cup', 'UEFA Euro', 'Gold Cup', 'AFC Asian Cup qualification',
'CONCACAF Championship', 'African Nations Championship', 'UEFA Nations League', 'CONCACAF Championship qualification',
'African Nations Championship qualification', 'Gold Cup qualification', 'CONCACAF Nations League', 'Nations Cup'] 

# We only select rows where the match scores are from the above tournaments
interim_df = interim_df.loc[interim_df['tournament'].isin(relevant_tournaments)] 
print(interim_df.shape)

(9823, 10)


In [14]:
interim_df.head(10)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year
438,1916-07-02,Chile,Uruguay,0.0,4.0,Copa América,Buenos Aires,Argentina,True,1916
440,1916-07-06,Argentina,Chile,6.0,1.0,Copa América,Buenos Aires,Argentina,False,1916
441,1916-07-08,Brazil,Chile,1.0,1.0,Copa América,Buenos Aires,Argentina,True,1916
442,1916-07-10,Argentina,Brazil,1.0,1.0,Copa América,Buenos Aires,Argentina,False,1916
444,1916-07-12,Brazil,Uruguay,1.0,2.0,Copa América,Buenos Aires,Argentina,True,1916
446,1916-07-17,Argentina,Uruguay,0.0,0.0,Copa América,Avellaneda,Argentina,False,1916
471,1917-09-30,Uruguay,Chile,4.0,0.0,Copa América,Montevideo,Uruguay,False,1917
472,1917-10-03,Argentina,Brazil,4.0,2.0,Copa América,Montevideo,Uruguay,True,1917
473,1917-10-06,Argentina,Chile,1.0,0.0,Copa América,Montevideo,Uruguay,True,1917
476,1917-10-07,Uruguay,Brazil,4.0,0.0,Copa América,Montevideo,Uruguay,False,1917


In [15]:
# Next step is to identify which matches were from the group phase of a tournament or in the knockout
# We read in  two additional datasets (one containing world cup matches up to and including 2010 and one showing penatly shootouts which can only occur in the knockout phase)
penalty_df = pd.read_csv('Data/kaggle_international_game_penalty_shoutouts.csv')
alternative_df = pd.read_csv('Data/world_cup_matches_1930_to_2010.csv')

In [16]:
# Create variable for flag denoting group or knockout stage
knockout_phase = ['Quarter-finals', 'Round of 16', 'Semi-finals', 'Final', 'Match for third place']

alternative_df['tournament_phase'] = np.where(alternative_df['phase'].isin(knockout_phase), 'Knockout', 'Group')
print(alternative_df['tournament_phase'].value_counts())
alternative_df['date'] = pd.to_datetime(alternative_df['date'], errors='raise')
alternative_df.rename(columns={'home':'home_team', 'away':'away_team'}, inplace = True) #Rename column names for countries to allow for left join

Group       591
Knockout    181
Name: tournament_phase, dtype: int64


  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listl

In [17]:
penalty_df['tournament_phase'] = 'Knockout'
penalty_df['date'] = pd.to_datetime(penalty_df['date'], errors='raise')

In [18]:
merged_df = interim_df.merge(alternative_df[['date', 'tournament_phase', 'home_team', 'away_team']], how='left', on=['date', 'home_team', 'away_team'])
merged_df = merged_df.merge(penalty_df[['date', 'tournament_phase', 'home_team', 'away_team']], how='left', on=['date', 'home_team', 'away_team'])
merged_df['tournament_phase_x'].isnull().sum() # Still many missing values

9456

In [19]:
# Drop tournament_phase_y
merged_df.drop(columns=['tournament_phase_y'], inplace= True)
merged_df.rename(columns={'tournament_phase_x':'tournament_phase'}, inplace= True)

In [20]:
# Export interim dataset for now (Used for Rutvi's step)
#merged_df.to_csv('Interim_modelling_dataset.csv', index=False)

In [21]:
# Import Rutvi's dataset containing the tournament phase field
tournament_df = pd.read_csv("Data/final_tournament_phase_data.csv", index_col=0)
tournament_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,tournament_phase
1,7/15/2018,France,Croatia,4,2,FIFA World Cup,Moscow,Russia,True,Knockout
2,7/14/2018,Belgium,England,2,0,FIFA World Cup,Saint Petersburg,Russia,True,Knockout
3,7/11/2018,Croatia,England,2,1,FIFA World Cup,Moscow,Russia,True,Knockout
4,7/10/2018,France,Belgium,1,0,FIFA World Cup,Saint Petersburg,Russia,True,Knockout
5,7/7/2018,Sweden,England,0,2,FIFA World Cup,Samara,Russia,True,Knockout


In [22]:
# Rename United States to USA
tournament_df['home_team'] = np.where(tournament_df['home_team']=='United States', 'USA', tournament_df['home_team'])
tournament_df['away_team'] = np.where(tournament_df['away_team']=='United States', 'USA', tournament_df['away_team'])

In [23]:
# First convert date to datetime variable
tournament_df['date'] = pd.to_datetime(tournament_df['date'], errors= 'raise')
# Then we merge with main dataset by left join on date home team and away team
scores_df = merged_df.merge(tournament_df[['date','home_team','away_team','tournament_phase']], how='left', on = ['date','home_team','away_team'], suffixes=("", "_r"))
scores_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,tournament_phase_r
0,1916-07-02,Chile,Uruguay,0.0,4.0,Copa América,Buenos Aires,Argentina,True,1916,,Knockout
1,1916-07-06,Argentina,Chile,6.0,1.0,Copa América,Buenos Aires,Argentina,False,1916,,Knockout
2,1916-07-08,Brazil,Chile,1.0,1.0,Copa América,Buenos Aires,Argentina,True,1916,,Knockout
3,1916-07-10,Argentina,Brazil,1.0,1.0,Copa América,Buenos Aires,Argentina,False,1916,,Knockout
4,1916-07-12,Brazil,Uruguay,1.0,2.0,Copa América,Buenos Aires,Argentina,True,1916,,Knockout


In [24]:
# If there is null value in the tournament phase column, then we use the secondary from the other dataset
scores_df['tournament_phase'] = np.where(scores_df['tournament_phase'].isnull(), scores_df['tournament_phase_r'], scores_df['tournament_phase'])
scores_df['tournament_phase'].isnull().sum() # check if all the missing values have been filled now

0

In [25]:
# Drop secondary tournament phase column
scores_df.drop(columns=['tournament_phase_r'], inplace= True)

In [26]:
scores_df['tournament_phase'].value_counts()

Group       8618
Knockout    1205
Name: tournament_phase, dtype: int64

## Descriptive analytics and Exploratory Data Analysis (EDA)

### Descriptive analytics

In [27]:
# Check for nulls
scores_df.isnull().sum()

date                0
home_team           0
away_team           0
home_score          0
away_score          0
tournament          0
city                0
country             0
neutral             0
year                0
tournament_phase    0
dtype: int64

In [28]:
scores_df.describe()

Unnamed: 0,home_score,away_score,year
count,9823.0,9823.0,9823.0
mean,1.716176,1.082663,1995.584852
std,1.718676,1.304601,20.521785
min,0.0,0.0,1916.0
25%,1.0,0.0,1984.0
50%,1.0,1.0,2000.0
75%,2.0,2.0,2012.0
max,31.0,17.0,2022.0


In [29]:
# Convert float scores to integer
scores_df[['home_score', 'away_score']] = scores_df[['home_score', 'away_score']].astype(int)

In [30]:
# Majority of matches in the dataset are played in non-neutral grounds
scores_df['neutral'].value_counts()

False    7222
True     2601
Name: neutral, dtype: int64

In [31]:
scores_df['tournament'].value_counts()

FIFA World Cup qualification                  4196
UEFA Euro qualification                       1468
FIFA World Cup                                 817
Copa América                                   769
African Cup of Nations qualification           593
African Cup of Nations                         504
UEFA Euro                                      310
Gold Cup                                       284
AFC Asian Cup                                  243
AFC Asian Cup qualification                    184
UEFA Nations League                            172
CONCACAF Championship                           98
African Nations Championship                    70
CONCACAF Championship qualification             64
CONCACAF Nations League                         30
African Nations Championship qualification      12
Gold Cup qualification                           5
Nations Cup                                      3
Copa América qualification                       1
Name: tournament, dtype: int64

In [32]:
profile = ProfileReport(scores_df, title="Football Scores Profiling Report")
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



### EDA

In [33]:
# Plot the average number of home and away goals over time by tournament
sns.set(rc={'figure.figsize':(12,10)})
sns.set_style('white')

scores_df['year'] = scores_df['date'].dt.year
mean_home_goals = scores_df.groupby(['year','tournament']).mean()['home_score'].unstack()
mean_home_goals.plot()
plt.title('Average number of home goals by tournament over time', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the home side', size= 15)
plt.tight_layout()
plt.show()

In [34]:
# Plot the average number of home and away goals over time by tournament
mean_away_goals = scores_df.groupby(['year','tournament']).mean()['away_score'].unstack()
mean_away_goals.plot()
plt.title('Average number of away goals by tournament over time', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the away side', size= 15)
plt.tight_layout()
plt.show()

In [35]:
# Plot the average number of home and away goals over time by neutral ground
mean_home_goals = scores_df.groupby(['year','neutral']).mean()['home_score'].unstack()
mean_home_goals.plot()
plt.title('Average number of home goals over time (Neutral vs non-neutral ground)', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the home side', size= 15)
plt.tight_layout()
plt.show()

In [36]:
mean_away_goals = scores_df.groupby(['year','neutral']).mean()['away_score'].unstack()
mean_away_goals.plot()
plt.title('Average number of away goals over time (Neutral vs non-neutral ground)', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the away side', size= 15)
plt.tight_layout()
plt.show()

In [37]:
# Plot the average number of home and away goals over time by tournament phase
mean_home_goals = scores_df.groupby(['year','tournament_phase']).mean()['home_score'].unstack()
mean_home_goals.plot()
plt.title('Average number of home goals over time (Knockout vs group stage)', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the home side', size= 15)
plt.tight_layout()
plt.show()

In [38]:

mean_away_goals = scores_df.groupby(['year','tournament_phase']).mean()['away_score'].unstack()
mean_away_goals.plot()
plt.title('Average number of away goals over time (Knockout vs group stage)', size=20)
plt.xlabel('Times', size= 15)
plt.ylabel('Average goals for the away side', size= 15)
plt.tight_layout()
plt.show()

## Data wrangling

### Variable selection & transformation

In [39]:
penalty_df['home_team'] = np.where(penalty_df['home_team']=='United States', 'USA', penalty_df['home_team'])
penalty_df['away_team'] = np.where(penalty_df['away_team']=='United States', 'USA', penalty_df['away_team'])
penalty_df['penalty_shootout'] = 1
# Merge with the main dataset
scores_df = scores_df.merge(penalty_df[['date', 'home_team', 'away_team', 'winner','penalty_shootout']], how='left', on=['date', 'home_team', 'away_team'])
# Where penalty shootout is null on the main dataset, then set it to zero
scores_df['penalty_shootout'].fillna(0, inplace=True)
scores_df['penalty_shootout'] = scores_df['penalty_shootout'].astype(int)

In [40]:
# Create variables for goal difference and football outcome
scores_df['goal_difference']= scores_df['home_score'] - scores_df['away_score']

In [41]:
# Create winner & match outcome columns
scores_df['match_outcome'] = np.where(scores_df['goal_difference']>0 | ((scores_df['penalty_shootout']==1) & (scores_df['winner'].isin(scores_df['home_team']))),
                                      'Home win', np.where(scores_df['goal_difference']<0 | ((scores_df['penalty_shootout']==1) & (scores_df['winner']==scores_df['away_team'])),
                                      'Away win','Draw'))
scores_df['winner'] = np.where(scores_df['winner'].notnull(),scores_df['winner'], np.where(scores_df['goal_difference']>0,scores_df['home_team'],
                              np.where(scores_df['goal_difference']<0,scores_df['away_team'],'Draw')))

In [42]:
# Create a continent variable to derive the continent dummies
country_continent_df = pd.read_csv("Data/countryContinent.csv") # Read in a country to continent mapping sourced from Kaggle
# In international football competitions, countries from Oceania and Asia play in the same qualifiers so we group them together
country_continent_df['continent_football'] = np.where((country_continent_df['continent']=='Asia') | 
                                                      (country_continent_df['continent']=='Oceania'), 'Asia & Oceania', country_continent_df['continent'])
# Also separating Middle & North America from South America because of distinct football federations
country_continent_df['continent_football'] = np.where((country_continent_df['sub_region']=='Central America') | 
                                                      (country_continent_df['sub_region']=='Northern America') |
                                                      (country_continent_df['sub_region']=='Caribbean'), 'North & Central America', 
                                                      np.where(country_continent_df['sub_region']=='South America','South America', country_continent_df['continent_football']))
country_continent_df['continent_football'].value_counts()


Asia & Oceania             76
Africa                     58
Europe                     51
North & Central America    41
South America              14
Name: continent_football, dtype: int64

In [43]:
country_continent_df['country'] = np.where(country_continent_df['country']=='United States of America', 'USA', country_continent_df['country'])
scores_df = scores_df.merge(country_continent_df[['country', 'continent_football']], how='left', on='country')
scores_df['continent_football'].isnull().sum()

1444

In [44]:
 # Still some missing continent due to old names of countries or countries which play separately like England, Wales, etc
# European countries
scores_df['continent_football'] = np.where((scores_df['country']=='England') | 
                                                       (scores_df['country']=='Wales') | (scores_df['country']=='Scotland') |
                                                       (scores_df['country']=='Yugoslavia') | (scores_df['country']=='Soviet Union')|
                                                       (scores_df['country']=='German DR') | (scores_df['country']=='Irish Free State') |
                                                       (scores_df['country']=='Saarland') | (scores_df['country']=='Czechoslovakia') |
                                                       (scores_df['country']=='Russia') | (scores_df['country']=='North Macedonia') |
                                                       (scores_df['country']=='Moldova') | (scores_df['country']=='Serbia and Montenegro') |
                                                       (scores_df['country']=='Kosovo') | (scores_df['country']=='Northern Ireland')|
                                                       (scores_df['country']=='Republic of Ireland'),'Europe', scores_df['continent_football'])

# Asian countries
scores_df['continent_football'] = np.where((scores_df['country']=='South Korea') | (scores_df['country']=='Iran')
                                                        | (scores_df['country']=='China PR') | (scores_df['country']=='Vietnam') | (scores_df['country']=='Syria')
                                                        | (scores_df['country']=='Palestine') | (scores_df['country']=='Laos') | (scores_df['country']=='Taiwan')
                                                        | (scores_df['country']=='Macau') | (scores_df['country']=='United Arab Republic') | (scores_df['country']=='Yemen AR')
                                                        | (scores_df['country']=='East Timor') | (scores_df['country']=='North Korea'), 'Asia & Oceania', scores_df['continent_football'])

# South America
scores_df['continent_football'] = np.where((scores_df['country']=='Bolivia') | (scores_df['country']=='Venezuela'), 'South America', scores_df['continent_football'])

# Africa
scores_df['continent_football'] = np.where((scores_df['country']=='Ivory Coast') | (scores_df['country']=='Tanzania') | (scores_df['country']=='Zaïre') | (scores_df['country']=='DR Congo')
                                                        | (scores_df['country']=='Cape Verde') | (scores_df['country']=='Eswatini') | (scores_df['country']=='Upper Volta') | (scores_df['country']=='Dahomey'), 'Africa',
                                                        scores_df['continent_football'])

# North & Central America
scores_df['continent_football'] = np.where((scores_df['country']=='Netherlands Antilles') | (scores_df['country']=='São Tomé and Príncipe'), 'North & Central America', scores_df['continent_football'])

In [45]:
print(scores_df['continent_football'].isnull().sum())
scores_df['continent_football'].value_counts()

0


Europe                     3818
Africa                     1831
South America              1601
Asia & Oceania             1447
North & Central America    1126
Name: continent_football, dtype: int64

In [46]:
# Adjust continent variable to indicate whether the match was played in an international scene or at some regional stage
scores_df['continent_football'] = np.where(scores_df['tournament']=='FIFA World Cup', 'International', scores_df['continent_football'])

### Create dummy variables for continents

In [47]:
# Here the reference continent will actually be international games
continent_dummies = pd.get_dummies(scores_df['continent_football'], prefix='Continent')
continent_dummies.drop(columns=['Continent_International'], inplace= True)
# Create dummy for home advantage
home_advantage = pd.get_dummies(scores_df['neutral'], prefix='Home_advantage', drop_first= True)
# Append the dummy columns to the main modelling dataset
scores_df = pd.concat([scores_df, continent_dummies, home_advantage], axis=1)
scores_df.head()


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True
0,1916-07-02,Chile,Uruguay,0,4,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-4,Away win,South America,0,0,0,0,1,1
1,1916-07-06,Argentina,Chile,6,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Argentina,0,5,Home win,South America,0,0,0,0,1,0
2,1916-07-08,Brazil,Chile,1,1,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,1
3,1916-07-10,Argentina,Brazil,1,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,0
4,1916-07-12,Brazil,Uruguay,1,2,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-1,Away win,South America,0,0,0,0,1,1


In [48]:
# Let's export the modelling dataset MDS
#scores_df.to_csv('Data/FIFA_modelling_dataset.csv', index=False)

In [49]:
# Check final shape
scores_df.shape

(9823, 22)

In [50]:
scores_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True
0,1916-07-02,Chile,Uruguay,0,4,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-4,Away win,South America,0,0,0,0,1,1
1,1916-07-06,Argentina,Chile,6,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Argentina,0,5,Home win,South America,0,0,0,0,1,0
2,1916-07-08,Brazil,Chile,1,1,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,1
3,1916-07-10,Argentina,Brazil,1,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,0
4,1916-07-12,Brazil,Uruguay,1,2,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-1,Away win,South America,0,0,0,0,1,1


#### Some more descriptive analytics on the final MDS

In [51]:
# Countries with most wins
scores_df['winner'].value_counts()

Draw                        2058
Brazil                       280
Argentina                    264
Germany                      262
Spain                        241
Mexico                       232
Netherlands                  226
Uruguay                      218
England                      209
France                       204
Russia                       197
Portugal                     196
Nigeria                      186
Belgium                      185
Sweden                       178
Egypt                        171
Iran                         170
USA                          169
South Korea                  166
Cameroon                     163
Denmark                      153
Morocco                      152
Ghana                        152
Poland                       141
Tunisia                      141
Costa Rica                   139
Japan                        138
Switzerland                  135
Saudi Arabia                 135
Colombia                     128
Senegal   

In [52]:
# Most home wins
scores_df.loc[scores_df.loc[:,'match_outcome']=='Home win']['winner'].value_counts()

Argentina                 203
Brazil                    197
Germany                   147
Mexico                    146
USA                       137
Netherlands               132
Spain                     131
Egypt                     125
Nigeria                   125
France                    125
Russia                    117
England                   116
Belgium                   113
Portugal                  112
Cameroon                  111
Morocco                   110
Sweden                    108
Uruguay                   106
South Korea               102
Iran                      101
Ghana                      96
Saudi Arabia               94
Costa Rica                 94
Tunisia                    88
Denmark                    85
Poland                     81
Colombia                   80
Switzerland                79
Senegal                    78
Australia                  76
Japan                      74
Peru                       70
Chile                      68
Qatar     

In [53]:
# Most away wins
scores_df.loc[scores_df.loc[:,'match_outcome']=='Away win']['winner'].value_counts()

Germany                     111
Uruguay                     109
Spain                       108
Netherlands                  93
England                      92
Portugal                     83
Mexico                       83
Russia                       79
Brazil                       77
France                       77
Belgium                      72
Sweden                       70
Iran                         68
Denmark                      68
South Korea                  62
Japan                        61
Poland                       60
Nigeria                      59
Argentina                    58
Switzerland                  56
Croatia                      56
Ghana                        55
Cameroon                     51
Tunisia                      50
Colombia                     47
Senegal                      45
Peru                         44
Costa Rica                   44
Egypt                        43
Morocco                      42
Saudi Arabia                 40
Serbia  

In [54]:
# Most home goals scored
scores_df.groupby('home_team')['home_score'].sum().sort_values(ascending= False)

home_team
Brazil                              708
Argentina                           686
Germany                             554
Mexico                              511
Spain                               484
Netherlands                         479
France                              433
England                             405
USA                                 400
Belgium                             397
Egypt                               379
Portugal                            375
South Korea                         370
Russia                              368
Sweden                              347
Nigeria                             344
Iran                                339
Australia                           333
Uruguay                             324
Saudi Arabia                        318
Cameroon                            309
Denmark                             307
Morocco                             305
Costa Rica                          297
Poland                        

In [55]:
# Most away goals scored
scores_df.groupby('away_team')['away_score'].sum().sort_values(ascending= False)

away_team
Uruguay                             413
Germany                             369
England                             327
Spain                               324
Netherlands                         313
Mexico                              279
Russia                              273
Portugal                            261
Sweden                              257
Belgium                             253
Brazil                              252
Iran                                245
France                              245
Denmark                             236
Peru                                218
Poland                              216
Switzerland                         209
South Korea                         204
Japan                               194
Nigeria                             192
Argentina                           187
Costa Rica                          181
Tunisia                             177
Ghana                               168
Colombia                      

In [56]:
# Goals conceded at home by away teams
# Most home goals scored
scores_df.groupby('home_team')['away_score'].sum().sort_values(ascending= True)

home_team
Eritrea                               0
Comoros                               1
Antigua and Barbuda                   2
Puerto Rico                           3
Saarland                              3
Bhutan                                3
Tahiti                                3
Djibouti                              3
Central African Republic              4
Fiji                                  4
French Guiana                         4
Solomon Islands                       4
Myanmar                               4
Belize                                4
Vietnam Republic                      4
Palestine                             5
Guadeloupe                            5
Seychelles                            5
Saint Kitts and Nevis                 5
Guam                                  6
Saint Vincent and the Grenadines      6
Curaçao                               6
Barbados                              6
Gambia                                7
Aruba                         

In [57]:
# Goals conceded by home teams
# Most home goals scored
scores_df.groupby('away_team')['home_score'].sum().sort_values(ascending= True)

away_team
Puerto Rico                           0
Central African Republic              2
French Guiana                         3
Dominican Republic                    3
Saarland                              3
Vanuatu                               3
Antigua and Barbuda                   3
São Tomé and Príncipe                 5
Timor-Leste                           7
Saint Lucia                           8
Saint Vincent and the Grenadines      9
Yemen DPR                             9
Seychelles                            9
Brunei                                9
Myanmar                              11
Dominica                             11
Vietnam Republic                     11
Samoa                                11
Djibouti                             12
Saint Kitts and Nevis                12
Anguilla                             13
Somalia                              13
Botswana                             14
Eswatini                             14
Guadeloupe                    

### Data split: Training & Testing

The idea here is that we want to measure our model's predictive accuracy on the most revent World Cup to date which was in 2018. The remaining data points will be used to train the model and estimate the model coefficients.

We will have two differents models: one Poisson regression for the group stage and one Logistic regression for the knockouts. This means that there will be **2** training samples and **2** tests sets.

In [58]:
# Create a copy from our initial dataset created before
split_group_df = scores_df.copy()

In [59]:
#here we filtered all the observations that are in group stages or knockout stages resulting in penalties
# Group stage subset
group_stage_df = split_group_df.loc[(split_group_df.loc[:,'tournament_phase']=='Group') | (split_group_df.loc[:,'tournament_phase']=='Knockout') & (split_group_df.loc[:,'penalty_shootout']==1)]
# Knockout stage subset
knockout_stage_df = split_group_df.loc[(split_group_df.loc[:,'tournament_phase']=='Knockout') & (split_group_df.loc[:,'penalty_shootout']==0)]

In [60]:
# Now we split the group stage subset between training and test set
test_group_df = group_stage_df.loc[(group_stage_df.loc[:,'tournament']=='FIFA World Cup') & (group_stage_df.loc[:,'year']==2018) & (split_group_df.loc[:,'tournament_phase']=='Group')]
print("The test set for the Poisson model has {} rows and {} columns.".format(test_group_df.shape[0],test_group_df.shape[1]))
train_group_df = group_stage_df.drop(group_stage_df[((group_stage_df['tournament'] == 'FIFA World Cup') & (group_stage_df['year'] == 2018))].index)
print("The training set for the Poisson model has {} rows and {} columns.".format(train_group_df.shape[0],train_group_df.shape[1]))

The test set for the Poisson model has 48 rows and 22 columns.
The training set for the Poisson model has 8705 rows and 22 columns.


In [61]:
# Next up we split the knockout stage subset between training and test set
test_knockout_df = knockout_stage_df.loc[(knockout_stage_df.loc[:,'tournament']=='FIFA World Cup') & (knockout_stage_df.loc[:,'year']==2018) & (split_group_df.loc[:,'tournament_phase']=='Knockout')]
print("The test set for the Logistic model has {} rows and {} columns.".format(test_knockout_df.shape[0],test_knockout_df.shape[1]))
train_knockout_df = knockout_stage_df.drop(knockout_stage_df[((knockout_stage_df['tournament'] == 'FIFA World Cup') & (knockout_stage_df['year'] == 2018))].index)
print("The training set for the Logistic model has {} rows and {} columns.".format(train_knockout_df.shape[0],train_knockout_df.shape[1]))

The test set for the Logistic model has 12 rows and 22 columns.
The training set for the Logistic model has 1054 rows and 22 columns.


### Creating an alternative training dataset for the Poisson and Logistic models which only include countries with at least 20 home and 20 away games (5 for the Logistic model due to data constraints)

#### The purpose of this is that the threshold removes countries that have played very few games and it ensures that the model specifcation is parsimonious.

#### Poisson

In [62]:
# home
match_volumes_home = train_group_df.groupby('home_team')['home_score'].count().reset_index()
match_volumes_home = pd.DataFrame(match_volumes_home).reset_index()
match_volumes_home.rename(columns={'home_score':'home_counts'}, inplace= True)
train_group_df = train_group_df.merge(match_volumes_home, how='left',on='home_team')

# away
match_volumes_away = train_group_df.groupby('away_team')['away_score'].count()
match_volumes_away = pd.DataFrame(match_volumes_away).reset_index()
match_volumes_away.rename(columns={'away_score':'away_counts'}, inplace= True)
train_group_df = train_group_df.merge(match_volumes_away, how='left',on='away_team')

In [63]:
# Filter on teams which have at least 20 data points
train_group_constrained_df = train_group_df.loc[(train_group_df.loc[:,'home_counts']>=20) & (train_group_df.loc[:,'away_counts']>=20)]
print("The loss in observations from this filtering is {}.".format(len(train_group_df) - len(train_group_constrained_df)))

The loss in observations from this filtering is 1395.


In [64]:
# Drop the redundant columns
train_group_constrained_df.drop(columns=['home_counts', 'away_counts', 'index'], inplace= True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_group_constrained_df.drop(columns=['home_counts', 'away_counts', 'index'], inplace= True)


#### Logistic

In [65]:
# home
match_volumes_home = train_knockout_df.groupby('home_team')['home_score'].count().reset_index()
match_volumes_home = pd.DataFrame(match_volumes_home).reset_index()
match_volumes_home.rename(columns={'home_score':'home_counts'}, inplace= True)
train_knockout_df = train_knockout_df.merge(match_volumes_home, how='left',on='home_team')

# away
match_volumes_away = train_knockout_df.groupby('away_team')['away_score'].count()
match_volumes_away = pd.DataFrame(match_volumes_away).reset_index()
match_volumes_away.rename(columns={'away_score':'away_counts'}, inplace= True)
train_knockout_df = train_knockout_df.merge(match_volumes_away, how='left',on='away_team')

In [66]:
# Filter on teams which have at least 5 data points
train_knockout_constrained_df = train_knockout_df.loc[(train_knockout_df.loc[:,'home_counts']>=5) & (train_knockout_df.loc[:,'away_counts']>=5)]
print("The loss in observations from this filtering is {}.".format(len(train_knockout_df) - len(train_knockout_constrained_df)))

The loss in observations from this filtering is 190.


In [67]:
# Drop the redundant columns
train_knockout_constrained_df.drop(columns=['home_counts', 'away_counts', 'index'], inplace= True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_knockout_constrained_df.drop(columns=['home_counts', 'away_counts', 'index'], inplace= True)


# Model methodology

## Model structure

## Model 1 for Group stage phase: Poisson regression
In order to predict the winner of the group stage, we will predict the score of the matches via two Poisson models.

$$ X_k \sim Poisson(\lambda_k) $$

$$ Y_k \sim Poisson(\mu_k) $$

$ X_k $ denotes the home team scoring rate which is a random variable that follows a Poisson distribution. Similarly, $Y_k$ denotes the away team scoring rate which has the same properties as $ X_k $.

$ X_k $ and $Y_k$ will be predicted via this model specification:

$$ X_k = intercept + alpha_i + beta_j + continent_i + home $$
$$ Y_k = intercept + alpha_j + beta_i + continent_i + home $$

The coefficients in this model can be interpreted in a relatively straightforward manner as follows:

1. $\alpha_i$ can be thought of as a coefficient that determines the attacking strength of team $i$, which should directly affect the goal scoring rate in a match. 

2. Moreover, $\beta_j$ can be viewed as a coefficient that determines the defensive ability of team $j$, which again affects the goal scoring rate. 

3. Moving to $intercept$, we can interpret this parameter as some sort of baseline for the overall rate of goals scored in general in a particular football league (similar to an intercept term in linear regression).

4. We also have parameters $continent_i$, which can be interpreted as the region/ continent dummy variables controllinng for region-specific characteristics (eg. football tournaments in South America may have a higher goal-scoring rate than other geographies). For FIFA World Cup games, these are assumed to be international and this is the goal-scoring rate of baseline category which is represented by $intercept$.

5. Finally, we can view the $home$ term as a coefficient that determines the home advantage of a team. 

In [68]:
train_group_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True,index,home_counts,away_counts
0,1922-09-17,Brazil,Chile,1,1,Copa América,Rio de Janeiro,Brazil,False,1922,Group,Draw,0,0,Draw,South America,0,0,0,0,1,0,24,204,100
1,1922-09-23,Chile,Uruguay,0,2,Copa América,Rio de Janeiro,Brazil,True,1922,Group,Uruguay,0,-2,Away win,South America,0,0,0,0,1,1,35,102,166
2,1922-09-24,Brazil,Paraguay,1,1,Copa América,Rio de Janeiro,Brazil,False,1922,Group,Draw,0,0,Draw,South America,0,0,0,0,1,0,24,204,119
3,1926-10-16,Argentina,Bolivia,5,0,Copa América,Santiago,Chile,True,1926,Group,Argentina,0,5,Home win,South America,0,0,0,0,1,1,6,204,91
4,1930-07-13,Belgium,USA,0,3,FIFA World Cup,Montevideo,Uruguay,True,1930,Group,USA,0,-3,Away win,International,0,0,0,0,0,1,16,165,110


In [69]:
# Find the countries that have played both home and only away matches and only keep these in the dataset
countries_to_keep = set(train_group_df['home_team']) & set(train_group_df['away_team'])
train_group_df = train_group_df.loc[(train_group_df.loc[:,'home_team'].isin(countries_to_keep)) & (train_group_df.loc[: ,'away_team'].isin(countries_to_keep))]
print(len(train_group_df))
print(len(countries_to_keep))

8675
197


#### Estimation of Poisson models

In [70]:
# Creating our design matrix and target vector 
goals = np.hstack([train_group_df['home_score'].values, train_group_df['away_score'].values])
X_home = pd.concat([pd.get_dummies(train_group_df['home_team'], prefix='Home'), pd.get_dummies(train_group_df['away_team'], prefix='Away')], axis=1)

# For modelling away goals, we basically duplicate the dataframe but switch the coefficient indices --> for example (1,5) will become (5,1)
X_away = []
for i in range(X_home.shape[0]):
    row_i = np.zeros_like(X_home.values[i, :])
    ind_1 , ind_2 = np.where(X_home.values[i, :] == 1)[0]
    row_i[ind_1 + 197] = 1 
    row_i[ind_2 - 197] = 1 
    X_away.append(row_i)
X_away = np.vstack(X_away)

X_away = pd.DataFrame(X_away, columns=X_home.columns)
X_home['intercept'] = 1
X_home['home'] = 1
X_home[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X_away['intercept'] = 1
X_away['home'] = -1
X_away[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X = pd.concat([X_home, X_away], axis=0)

X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']].apply(np.uint8)

## Group phase model

### Candidate model 1 (Home advantage is included)

In [71]:
# Add the identifiability restrictions as a linear system of equations
alpha_restr = np.concatenate([np.ones(shape=197), np.zeros(shape=204)])
beta_restr = np.concatenate([np.zeros(shape=197), np.ones(shape=197), np.zeros(shape=7)])
restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 
model_group = sm.GLM(goals, X, family=sm.families.Poisson()).fit_constrained((restr , np.zeros(shape=2))) # restrictions are of the form Aw=0 (where w are the model parameters)
print(f'Log-likelihood of the model is: {model_group.llf:.2f}\n\n')
model_group.summary()

Log-likelihood of the model is: -24255.92




0,1,2,3
Dep. Variable:,y,No. Observations:,17350.0
Model:,GLM,Df Residuals:,16953.0
Model Family:,Poisson,Df Model:,396.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-24256.0
Date:,"Tue, 29 Nov 2022",Deviance:,19730.0
Time:,23:37:19,Pearson chi2:,18400.0
No. Iterations:,1,Pseudo R-squ. (CS):,0.412
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Home_Afghanistan,0.0246,1068.895,2.3e-05,1.000,-2094.970,2095.019
Home_Albania,1.5451,1068.894,0.001,0.999,-2093.449,2096.539
Home_Algeria,1.9991,1068.894,0.002,0.999,-2092.995,2096.993
Home_Andorra,0.2154,1068.894,0.000,1.000,-2094.779,2095.210
Home_Angola,1.6929,1068.894,0.002,0.999,-2093.301,2096.687
Home_Antigua and Barbuda,1.6781,1068.895,0.002,0.999,-2093.317,2096.673
Home_Argentina,2.4047,1068.894,0.002,0.998,-2092.590,2097.399
Home_Armenia,1.4113,1068.894,0.001,0.999,-2093.583,2096.406
Home_Australia,1.8092,1068.894,0.002,0.999,-2093.185,2096.804


#### Some checks to see if the model has been correctly fitted

In [72]:
# Sum of all attacking and defending coefficients must be equal to or at least very close to zero
# Attacking abilities
print(model_group.params[0:197].values.sum())
# Defending abilities
print(model_group.params[197:394].values.sum())

-4.263256414560601e-14
-1.1546319456101628e-14


In [73]:
model_group.params.sort_values(ascending= False)

Home_Germany                              2.632774
Home_Spain                                2.550588
Home_Yugoslavia                           2.529407
Home_Brazil                               2.508551
Home_Czechoslovakia                       2.496375
Home_England                              2.492178
Home_Netherlands                          2.484260
Home_German DR                            2.422143
Home_France                               2.408154
Home_Argentina                            2.404687
Home_Belgium                              2.375999
Home_Portugal                             2.361622
Home_Hungary                              2.360939
Home_Russia                               2.350655
Home_Serbia                               2.323224
Home_Denmark                              2.312045
Home_Poland                               2.307658
Home_Sweden                               2.304390
Home_Italy                                2.274427
Home_Croatia                   

The first candidate model has too many independent variables and most of the coefficients are not statistically significant. This means that it is not a well-estimated model and other candidates need to be explored.

### Candidate model 2 (Home advantage is excluded)

In [74]:
goals = np.hstack([train_group_df['home_score'].values, train_group_df['away_score'].values])
X_home = pd.concat([pd.get_dummies(train_group_df['home_team'], prefix='Home'), pd.get_dummies(train_group_df['away_team'], prefix='Away')], axis=1)

# For modelling away goals, we basically duplicate the dataframe but switch the coefficient indices --> for example (1,5) will become (5,1)
X_away = []
for i in range(X_home.shape[0]):
    row_i = np.zeros_like(X_home.values[i, :])
    ind_1 , ind_2 = np.where(X_home.values[i, :] == 1)[0]
    row_i[ind_1 + 197] = 1 
    row_i[ind_2 - 197] = 1 
    X_away.append(row_i)
X_away = np.vstack(X_away)

X_away = pd.DataFrame(X_away, columns=X_home.columns)
X_home['intercept'] = 1
X_home[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X_away['intercept'] = 1
X_away[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X = pd.concat([X_home, X_away], axis=0)

X[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = X[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']].apply(np.uint8)

In [75]:
# Add the identifiability restrictions as a linear system of equations
alpha_restr = np.concatenate([np.ones(shape=197), np.zeros(shape=203)])
beta_restr = np.concatenate([np.zeros(shape=197), np.ones(shape=197), np.zeros(shape=6)])
restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 
model_group_two = sm.GLM(goals, X, family=sm.families.Poisson()).fit_constrained((restr , np.zeros(shape=2))) # restrictions are of the form Aw=0 (where w are the model parameters)
print(f'Log-likelihood of the model is: {model_group_two.llf:.2f}\n\n')
model_group_two.summary()

Log-likelihood of the model is: -24728.10




0,1,2,3
Dep. Variable:,y,No. Observations:,17350.0
Model:,GLM,Df Residuals:,16954.0
Model Family:,Poisson,Df Model:,395.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-24728.0
Date:,"Tue, 29 Nov 2022",Deviance:,20674.0
Time:,23:37:26,Pearson chi2:,19300.0
No. Iterations:,1,Pseudo R-squ. (CS):,0.3791
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Home_Afghanistan,-0.0793,1077.957,-7.36e-05,1.000,-2112.837,2112.678
Home_Albania,1.5621,1077.957,0.001,0.999,-2111.195,2114.319
Home_Algeria,2.0015,1077.957,0.002,0.999,-2110.756,2114.759
Home_Andorra,0.2297,1077.957,0.000,1.000,-2112.528,2112.987
Home_Angola,1.6912,1077.957,0.002,0.999,-2111.066,2114.449
Home_Antigua and Barbuda,1.7116,1077.957,0.002,0.999,-2111.046,2114.469
Home_Argentina,2.5057,1077.957,0.002,0.998,-2110.252,2115.263
Home_Armenia,1.4402,1077.957,0.001,0.999,-2111.317,2114.198
Home_Australia,1.8067,1077.957,0.002,0.999,-2110.951,2114.564


Excluding the home advantage dummy does not make any difference to the statistical significance of the attacking ability coefficients ($\alpha$).

We, therefore, consider another alternative model.

### Candidate model 3 (Home advantage is included but we only fit a model for countries which have played at least 20 home and 20 away games)
This is done to ensure that there are enough data points to stably estimate the parameters.

In [76]:
# Find the countries that have played both home and only away matches and only keep these in the dataset
countries_to_keep = set(train_group_constrained_df['home_team']) & set(train_group_constrained_df['away_team'])
train_group_constrained_df = train_group_constrained_df.loc[(train_group_constrained_df.loc[:,'home_team'].isin(countries_to_keep)) & (train_group_constrained_df.loc[: ,'away_team'].isin(countries_to_keep))]
print(len(train_group_constrained_df))
print(len(countries_to_keep))

7077
104


In [77]:
# Creating a different design matrix and target vector compared to model 1
goals = np.hstack([train_group_constrained_df['home_score'].values, train_group_constrained_df['away_score'].values])
X_home = pd.concat([pd.get_dummies(train_group_constrained_df['home_team'], prefix='Home'), pd.get_dummies(train_group_constrained_df['away_team'], prefix='Away')], axis=1)

# For modelling away goals, we basically duplicate the dataframe but switch the coefficient indices --> for example (1,5) will become (5,1)
X_away = []
for i in range(X_home.shape[0]):
    row_i = np.zeros_like(X_home.values[i, :])
    ind_1 , ind_2 = np.where(X_home.values[i, :] == 1)[0]
    row_i[ind_1 + 104] = 1 
    row_i[ind_2 - 104] = 1 
    X_away.append(row_i)
X_away = np.vstack(X_away)

X_away = pd.DataFrame(X_away, columns=X_home.columns)
X_home['intercept'] = 1
X_home['home'] = 1
X_home[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_constrained_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X_away['intercept'] = 1
X_away['home'] = -1
X_away[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_constrained_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X = pd.concat([X_home, X_away], axis=0)

X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']].apply(np.uint8)

In [78]:
# Add the identifiability restrictions as a linear system of equations
alpha_restr = np.concatenate([np.ones(shape=104), np.zeros(shape=111)])
beta_restr = np.concatenate([np.zeros(shape=104), np.ones(shape=104), np.zeros(shape=7)])
restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 
model_group_constrained = sm.GLM(goals, X, family=sm.families.Poisson()).fit_constrained((restr , np.zeros(shape=2))) # restrictions are of the form Aw=0 (where w are the model parameters)
print(f'Log-likelihood of the model is: {model_group_constrained.llf:.2f}\n\n')
model_group_constrained.summary()

Log-likelihood of the model is: -19981.09




0,1,2,3
Dep. Variable:,y,No. Observations:,14154.0
Model:,GLM,Df Residuals:,13943.0
Model Family:,Poisson,Df Model:,210.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-19981.0
Date:,"Tue, 29 Nov 2022",Deviance:,16343.0
Time:,23:37:28,Pearson chi2:,15200.0
No. Iterations:,1,Pseudo R-squ. (CS):,0.3319
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Home_Albania,-0.2712,0.137,-1.974,0.048,-0.540,-0.002
Home_Algeria,0.1505,0.128,1.174,0.240,-0.101,0.402
Home_Andorra,-1.6085,0.375,-4.285,0.000,-2.344,-0.873
Home_Argentina,0.5815,0.053,10.957,0.000,0.477,0.686
Home_Armenia,-0.4115,0.187,-2.205,0.027,-0.777,-0.046
Home_Australia,-0.0413,0.097,-0.424,0.671,-0.232,0.149
Home_Austria,0.2940,0.099,2.971,0.003,0.100,0.488
Home_Azerbaijan,-0.6710,0.219,-3.070,0.002,-1.099,-0.243
Home_Bahrain,-0.7238,0.193,-3.753,0.000,-1.102,-0.346


In [79]:
# Sum of all attacking and defending coefficients must be equal to or at least very close to zero
# Attacking abilities
print(model_group_constrained.params[0:104].values.sum())
# Defending abilities
print(model_group_constrained.params[104:208].values.sum())

-1.9984014443252818e-15
1.3322676295501878e-15


In [80]:
model_group_constrained.params.sort_values(ascending= False)

Away_San Marino                      1.090930
Home_Germany                         0.822861
Home_Spain                           0.739733
Away_Liechtenstein                   0.718645
Home_Yugoslavia                      0.710390
Home_Brazil                          0.689192
Home_Netherlands                     0.676682
Home_Czechoslovakia                  0.668473
Home_England                         0.664766
Away_Malta                           0.655687
Away_Andorra                         0.625132
Home_France                          0.601388
Home_German DR                       0.595577
Away_Luxembourg                      0.593405
Away_Faroe Islands                   0.583205
Home_Argentina                       0.581492
Away_Uzbekistan                      0.575933
Home_Belgium                         0.543513
Home_Portugal                        0.542287
Home_Russia                          0.541275
Home_Hungary                         0.540269
Away_Qatar                        

In [81]:
# Comparing the Bayesian Information Criterion
# a better model should have a lower BIC
print(f'Model 1: BIC is {model_group.bic_llf:.2f}');
print(f'Model 3: BIC is {model_group_constrained.bic_llf:.2f}');

Model 1: BIC is 52387.10
Model 3: BIC is 41978.86


##### Model 3 not only has highly statistically significant coefficients for most variables, it also produces a Bayesian Information Criterion (BIC) which is lower than that of model 1. BIC measures the combined explanatory power of the model accounting for the number of independent variables it has (similar to the adjusted $R^{2}$ for linear models). We, therefore, take model 3 forward.

#### Interpreting the model coefficients

We begin by stating that the coefficients starting with 'Home_' represent the $\alpha$ coefficients and the 'Away_' represent the $\beta$ coefficients. Note also that the values of these coefficients are relative, and depend on the identifiability assumptions included in this model.

We can see quite a bit of variation in the sign, magnitude and statistical significance of the model's coefficients. For example, the Home coefficient of Brazil, is positive and statistically significant, with a p-value of essentially zero. This suggests that the attacking strength of Brazil is captured by the model and can be used to explain its scoring rate. Similar interpretations can be easily deduced for the other coefficients as well. 

Let us demonstrate how one can interpret the meaning of these coefficients. Since our model only has factor variables, they can all be interpreted in the same way and so we will pick a couple variable to showcase this. For example, the **Home_Brazil** coefficient implies that Brazil is $\exp(0.6892) \approx 2$ times more likely to score a goal than the average team in the training dataset.

Let's also consider the **Continent_Africa** coefficient. The value of this coefficient implies that holding everything else constant, an African team's scoring rate playing in an African tournament  is expected to be  $\exp(0.1315) \approx 1.14$  times higher than those of teams playing in international matches.

## Knockout phase model

## Model for Knockout stage phase: Logistic regression

In the World Cup knockout phase, there can only be a winner. If the match ends in a draw, then the team have to play penalties to determine the winner. 

Because the outcome is binary (either home win or away win), we are using a **logistic regression** to predict the winner.

$$ P(W_k) = \frac{1}{1+ exp(-intercept - alpha_i - beta_j - continent_i - home)} $$

$ W_k $ denotes the random variable for the match outcome. 1 denotes that the home side will win and 0 that the home side will not win (ie. the away side will win). Therefore, $ P(W_k) $ denotes the probability of a home win in the knockout match.

The independent variables in this model are the same as those used for the Poisson regression model used in the Group phase of the tournament.

### Candidate model 1

In [82]:
# Initial number of matches for the training set of the knockout stage model
len(train_knockout_constrained_df)

864

In [83]:
# Find the countries that have played both home and only away matches and only keep these in the dataset
countries_to_keep = set(train_knockout_constrained_df['home_team']) & set(train_knockout_constrained_df['away_team'])
train_knockout_constrained_df = train_knockout_constrained_df.loc[(train_knockout_constrained_df.loc[:,'home_team'].isin(countries_to_keep)) & (train_knockout_constrained_df.loc[: ,'away_team'].isin(countries_to_keep))]
print(len(train_knockout_constrained_df))
print(len(countries_to_keep))

730
44


In [84]:
train_knockout_constrained_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True
0,1916-07-02,Chile,Uruguay,0,4,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-4,Away win,South America,0,0,0,0,1,1
1,1916-07-06,Argentina,Chile,6,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Argentina,0,5,Home win,South America,0,0,0,0,1,0
2,1916-07-08,Brazil,Chile,1,1,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,1
3,1916-07-10,Argentina,Brazil,1,1,Copa América,Buenos Aires,Argentina,False,1916,Knockout,Draw,0,0,Draw,South America,0,0,0,0,1,0
4,1916-07-12,Brazil,Uruguay,1,2,Copa América,Buenos Aires,Argentina,True,1916,Knockout,Uruguay,0,-1,Away win,South America,0,0,0,0,1,1


In [85]:
# The derivation logic of the group vs knockout stage was imperfect due to the different tournament structure
# We will exclude the knockout games that ended in draw
train_knockout_constrained_df = train_knockout_constrained_df.loc[train_knockout_constrained_df.loc[:,'match_outcome']!='Draw']
print(len(train_knockout_constrained_df))

650


In [86]:
# Create the target variable
train_knockout_constrained_df['home_winner'] = np.where(train_knockout_constrained_df['match_outcome']=='Home win', 1, 0)

In [87]:
# Creating the same design matrix as for the Poisson model but with only one target variable in place this time
match_outcome = train_knockout_constrained_df['home_winner'].values
X = pd.concat([pd.get_dummies(train_knockout_constrained_df['home_team'], prefix='Home'), pd.get_dummies(train_knockout_constrained_df['away_team'], prefix='Away')], axis=1)

X['intercept'] = 1
X['home'] = 1
X[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_constrained_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']].apply(np.uint8)

In [88]:
# Add the identifiability restrictions as a linear system of equations
alpha_restr = np.concatenate([np.ones(shape=44), np.zeros(shape=51)])
beta_restr = np.concatenate([np.zeros(shape=44), np.ones(shape=44), np.zeros(shape=7)])
restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 

model_knockout_constrained = sm.GLM(match_outcome, X, family=sm.families.Binomial()).fit_constrained((restr , np.zeros(shape=2)))
#model_knockout_constrained = sm.GLM(match_outcome, X, family=sm.families.Binomial()).fit()
print(f'Log-likelihood of the model is: {model_knockout_constrained.llf:.2f}\n\n')
model_knockout_constrained.summary()

Log-likelihood of the model is: -318.43




0,1,2,3
Dep. Variable:,y,No. Observations:,650.0
Model:,GLM,Df Residuals:,560.0
Model Family:,Binomial,Df Model:,89.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-318.43
Date:,"Tue, 29 Nov 2022",Deviance:,636.87
Time:,23:37:28,Pearson chi2:,596.0
No. Iterations:,1,Pseudo R-squ. (CS):,0.2898
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Home_Algeria,4.7824,3.49e+04,0.000,1.000,-6.84e+04,6.84e+04
Home_Argentina,6.0763,3.49e+04,0.000,1.000,-6.84e+04,6.84e+04
Home_Belgium,2.7737,3.49e+04,7.94e-05,1.000,-6.84e+04,6.84e+04
Home_Brazil,5.1576,3.49e+04,0.000,1.000,-6.84e+04,6.84e+04
Home_Cameroon,3.3616,3.49e+04,9.63e-05,1.000,-6.84e+04,6.84e+04
Home_Canada,2.8446,3.49e+04,8.15e-05,1.000,-6.84e+04,6.84e+04
Home_Chile,4.0101,3.49e+04,0.000,1.000,-6.84e+04,6.84e+04
Home_China PR,-64.9940,1.49e+05,-0.000,1.000,-2.92e+05,2.92e+05
Home_Colombia,3.8130,3.49e+04,0.000,1.000,-6.84e+04,6.84e+04


The model at its current specification has too many predictors for just 650 data points. Therefore, we need to trim down the number of predictors to only include the countries that either play in the 2022 World Cup or the ones that made it to the knockout phase in 2018.

This will ensure a much more parsimonious model specification.

In [89]:
# Add the identifiability restrictions as a linear system of equations
#alpha_restr = np.concatenate([np.ones(shape=44), np.zeros(shape=51)])
#beta_restr = np.concatenate([np.zeros(shape=44), np.ones(shape=44), np.zeros(shape=7)])
#restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 
#regression_formula = "match_outcome ~ home_score + away_score + Continent_Africa + Continent_Asia & Oceania + Continent_Europe + Continent_North + Central America + Continent_South America"
#exogenous = train_knockout_constrained_df['home_winner']
#endogenous = train_knockout_constrained_df[['home_score', 'away_score', 'Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America', 'Continent_South America', 'Home_advantage_True']]
#endogenous = sm.add_constant(endogenous)

#endogenous[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
#       'Continent_South America', 'Home_advantage_True']] = endogenous[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
#       'Continent_South America', 'Home_advantage_True']].apply(np.uint8)

#endogenous = endogenous.astype(float)
#exogenous = exogenous.astype(float)

#model_knockout_constrained = sm.Logit(exogenous, endogenous).fit()
#model_knockout_constrained = sm.GLM(goals, X, family=sm.families.Poisson()).fit_constrained((restr , np.zeros(shape=2)))
#print(f'Log-likelihood of the model is: {model_knockout_constrained.llf:.2f}\n\n')
#model_kncockout_constrained.summary()

In [90]:
# Sum of all attacking and defending coefficients must be equal to or at least very close to zero
# Attacking abilities
print(model_group_constrained.params[0:44].values.sum())
# Defending abilities
print(model_group_constrained.params[44:88].values.sum())

1.8141284274487857
-2.0284000517883367


### Candidate model 2

In [91]:
target_countries = ['Uruguay', 'France', 'Brazil', 'Belgium', 'Russia', 'Croatia', 'Sweden', 'England',
                  'Portugal', 'Mexico', 'Japan', 'Spain', 'Denmark', 'Switzerland', 'Colombia', 'Argentina',
                 'Netherlands', 'Ecuador', 'Qatar', 'Senegal', 'Iran', 'USA', 'Wales', 'Poland', 'Saudi Arabia',
                 'Australia', 'Tunisia', 'Costa Rica', 'Germany', 'Morocco', 'Canada', 'Cameroon', 'Serbia',
                 'South Korea', 'Ghana']

target_countries_alt = ['Uruguay', 'France', 'Brazil', 'Belgium', 'Russia', 'Croatia', 'Sweden', 'England',
                  'Portugal', 'Mexico', 'Japan', 'Spain', 'Denmark', 'Switzerland', 'Colombia', 'Argentina',
                 'Netherlands', 'Ecuador', 'Senegal', 'Iran', 'USA', 'Poland',
                 'Germany', 'Morocco', 'Serbia', 'South Korea']

# Filter the constrained dataset to exclude countries that the model won't even be applied for
train_knockout_constrained_df = train_knockout_constrained_df.loc[(train_knockout_constrained_df.loc[:,'home_team'].isin(target_countries)) & (train_knockout_constrained_df.loc[: ,'away_team'].isin(target_countries))]
print(len(target_countries))
print(len(train_knockout_constrained_df))

35
271


In [92]:
# Creating the design matrix
match_outcome = train_knockout_constrained_df['home_winner'].values
X = pd.concat([pd.get_dummies(train_knockout_constrained_df['home_team'], prefix='Home'), pd.get_dummies(train_knockout_constrained_df['away_team'], prefix='Away')], axis=1)

X['intercept'] = 1
X['home'] = 1
X[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = train_group_constrained_df[['Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']]

X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']] = X[['home','Continent_Africa', 'Continent_Asia & Oceania', 'Continent_Europe', 'Continent_North & Central America',
       'Continent_South America']].apply(np.uint8)

In [93]:
sm.__version__

'0.13.2'

In [94]:
#model_knockout_constrained_regularized = sm.GLM(match_outcome, X, family=sm.families.Binomial()).fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.0)
#print(f'Log-likelihood of the model is: {model_knockout_constrained_regularized.llf:.2f}\n\n')
#model_knockout_constrained_regularized.summary()

In [95]:
# Add the identifiability restrictions as a linear system of equations
alpha_restr = np.concatenate([np.ones(shape=28), np.zeros(shape=35)])
beta_restr = np.concatenate([np.zeros(shape=28), np.ones(shape=28), np.zeros(shape=7)])
restr = np.vstack([alpha_restr, beta_restr]) # create the final restriction matrix

# Fit model and see output 

model_knockout_constrained = sm.GLM(match_outcome, X, family=sm.families.Binomial()).fit_constrained((restr , np.zeros(shape=2)))
#model_knockout_constrained = sm.GLM(match_outcome, X, family=sm.families.Binomial()).fit()
print(f'Log-likelihood of the model is: {model_knockout_constrained.llf:.2f}\n\n')
model_knockout_constrained.summary()

Log-likelihood of the model is: -119.32




0,1,2,3
Dep. Variable:,y,No. Observations:,271.0
Model:,GLM,Df Residuals:,213.0
Model Family:,Binomial,Df Model:,57.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-119.32
Date:,"Tue, 29 Nov 2022",Deviance:,238.63
Time:,23:37:28,Pearson chi2:,212.0
No. Iterations:,1,Pseudo R-squ. (CS):,0.3631
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Home_Argentina,7.7183,2.7e+05,2.86e-05,1.000,-5.29e+05,5.29e+05
Home_Belgium,5.2905,2.7e+05,1.96e-05,1.000,-5.29e+05,5.29e+05
Home_Brazil,7.2955,2.7e+05,2.71e-05,1.000,-5.29e+05,5.29e+05
Home_Cameroon,6.4199,2.7e+05,2.38e-05,1.000,-5.29e+05,5.29e+05
Home_Canada,32.9738,4.99e+05,6.61e-05,1.000,-9.78e+05,9.78e+05
Home_Colombia,5.7584,2.7e+05,2.14e-05,1.000,-5.29e+05,5.29e+05
Home_Costa Rica,-20.1251,2.89e+05,-6.97e-05,1.000,-5.66e+05,5.66e+05
Home_Denmark,5.6890,2.7e+05,2.11e-05,1.000,-5.29e+05,5.29e+05
Home_Ecuador,-18.6388,2.74e+05,-6.79e-05,1.000,-5.38e+05,5.38e+05


## Results and model testing

For each match in each test set, we will compare the predicted outcome from the Poisson and Logistic regression models to the actual match result.

For our models to be better than random guessing, the Poisson model must be accurate more than **33.3%** and the Logistic model must more accurate than **50%**.

## Predictions

### Group stage (2018 World Cup)

In [96]:
# Use the group stage dataset to predict the expected number of points
test_group_df.head(100)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True
8707,2018-06-14,Russia,Saudi Arabia,5,0,FIFA World Cup,Moscow,Russia,False,2018,Group,Russia,0,5,Home win,International,0,0,0,0,0,0
8708,2018-06-15,Egypt,Uruguay,0,1,FIFA World Cup,Ekaterinburg,Russia,True,2018,Group,Uruguay,0,-1,Away win,International,0,0,0,0,0,1
8709,2018-06-15,Morocco,Iran,0,1,FIFA World Cup,Saint Petersburg,Russia,True,2018,Group,Iran,0,-1,Away win,International,0,0,0,0,0,1
8710,2018-06-15,Portugal,Spain,3,3,FIFA World Cup,Sochi,Russia,True,2018,Group,Draw,0,0,Draw,International,0,0,0,0,0,1
8711,2018-06-16,France,Australia,2,1,FIFA World Cup,Kazan,Russia,True,2018,Group,France,0,1,Home win,International,0,0,0,0,0,1
8712,2018-06-16,Argentina,Iceland,1,1,FIFA World Cup,Moscow,Russia,True,2018,Group,Draw,0,0,Draw,International,0,0,0,0,0,1
8713,2018-06-16,Peru,Denmark,0,1,FIFA World Cup,Saransk,Russia,True,2018,Group,Denmark,0,-1,Away win,International,0,0,0,0,0,1
8714,2018-06-16,Croatia,Nigeria,2,0,FIFA World Cup,Kaliningrad,Russia,True,2018,Group,Croatia,0,2,Home win,International,0,0,0,0,0,1
8715,2018-06-17,Costa Rica,Serbia,0,1,FIFA World Cup,Samara,Russia,True,2018,Group,Serbia,0,-1,Away win,International,0,0,0,0,0,1
8716,2018-06-17,Germany,Mexico,0,1,FIFA World Cup,Moscow,Russia,True,2018,Group,Mexico,0,-1,Away win,International,0,0,0,0,0,1


### We will calculate the expected table rankings for each group by calculating the probability of win, loss and draw for each team in the 2018 World Cup

In [97]:
# Define a list of countries which played in the 2018 world cup
russia_world_cup_teams = ['Senegal', 'England', 'Iran',
'Argentina', 'Saudi Arabia', 'Mexico','Poland', 'France', 'Australia', 'Denmark', 'Tunisia','Spain',
'Costa Rica', 'Germany', 'Japan', 'Belgium', 'Morocco', 'Croatia', 'Brazil', 'Serbia',
'Switzerland', 'Portugal', 'Uruguay', 'South Korea', 'Russia', 'Egypt', 'Peru', 
'Nigeria', 'Iceland', 'Sweden', 'Panama', 'Colombia']

# Define the countries in each group
group_A = ['Uruguay', 'Russia', 'Saudi Arabia', 'Egypt']
group_B = ['Spain', 'Portugal', 'Iran', 'Morocco']
group_C = ['France', 'Denmark', 'Peru', 'Australia']
group_D = ['Croatia', 'Argentina', 'Nigeria', 'Iceland']
group_E = ['Brazil', 'Switzerland', 'Serbia', 'Costa Rica']
group_F = ['Sweden', 'Mexico', 'Germany', 'South Korea']
group_G = ['Belgium', 'England', 'Tunisia', 'Panama']
group_H = ['Colombia', 'Senegal', 'Poland', 'Japan']

# Create a list of lists for the groups
world_cup_groups = [group_A, group_B, group_C, group_D, group_E, group_F, group_G, group_H]

In [98]:
# Define a dictionary to extract the attacking abilities of the countries
attacking_strength_params = {val.split('_')[1] : model_group_constrained.params.values[i] for i , val in enumerate(model_group_constrained.params.index[:104]) if val.split('_')[1] in russia_world_cup_teams}

# Similarly, define a dictionary to extract the defensive abilities of the countries
defensive_strength_params = {val.split('_')[1] : model_group_constrained.params.values[104 + i] for i , val in enumerate(model_group_constrained.params.index[:104]) if val.split('_')[1] in russia_world_cup_teams}

Since we assume that $X_k$ and $Y_k$ are independent Poisson variables, we can use the fact that the random variable $Z_k := (X_k - Y_k) \sim Skellam(\lambda_k , \mu_k)$, where $Z_k \in \mathbb{Z}$. To calculate the required probabilities we use the following facts: 

* For a home win on match $k$, we need $X_k > Y_k$ so the probability of win is given by $ \mathbb{P}(Win) = \mathbb{P}(X_k > Y_k) = \mathbb{P}(Z_k > 0) = 1 - \mathbb{P}(Z_k \leq 0) $.

* Similarly, for a loss (aka Away win) we need $X_k < Y_k$, so the probability is given by $\mathbb{P}(Loss) = 1 - \mathbb{P}(W_k \leq 0) $, where $W_k := (Y_k - X_k) \sim Skellam( \mu_k, \lambda_k)$.

* Finally, the probability of a draw is simply $\mathbb{P}(Draw) = 1 - \mathbb{P}(Win) - \mathbb{P}(Loss)$. Alternatively, it is given by $\mathbb{P}(Draw) = \mathbb{P}(W_k = 0) = \mathbb{P}(Z_k = 0)$, this can be used as a sanity check to check the correctness of the calculated values.

In [99]:
# Let us predict the number of points per group
intercept = model_group_constrained.params.values[208]
continent_dummy = 0 # since all FIFA world cup matches are considered international and this is captured by the base scenario

for group in world_cup_groups:
    #We create an empty dictionary for each group to calculate the expected number of points   
    expected_points = {team : 0 for team in group}
    
    # subset the games of each group only
    sub_group_df = test_group_df.loc[(test_group_df.loc[:,'home_team'].isin(group)) | (test_group_df.loc[:,'away_team'].isin(group))]
    for i , row in sub_group_df.iterrows():
        
        home = row['home_team'] 
        away = row['away_team']
        
        # check
        print(home, away)
        
        home_advantage = 0
        if row['home_team'] == 'Russia' or row['away_team'] == 'Russia':
            home_advantage = model_group_constrained.params.values[209]          
        
        # we apply the formulas to calculate expected home and away goals
        expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
        expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])
        
        # we calculate the probability of win and loss using the cdf of the skellam distribution
        p_home_win = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
        p_away_win = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
        # probability of draw is simply 1 minus the probability of the other two outcomes
        p_draw = 1 - p_home_win - p_away_win
        
        # calculate the expected number of points (loss is omitted as it awards zero points)
        expected_points_home = 3 * p_home_win + 1 * p_draw
        expected_points_away = 3 * p_away_win + 1 * p_draw

        # collect the points and print the outcomes
        expected_points[home] += expected_points_home
        expected_points[away] += expected_points_away
        
    # Let's see the final results by sorting the dataframe according to expected points
    group_table = pd.DataFrame({'Team' : list(expected_points.keys()), 'Expected Points' : list(expected_points.values())})
    group_table = group_table.sort_values(by='Expected Points', ascending=False)
    print("The expected table standing is ... ")
    display(group_table)

Russia Saudi Arabia
Egypt Uruguay
Russia Egypt
Uruguay Saudi Arabia
Russia Uruguay
Saudi Arabia Egypt
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
1,Russia,6.268985
0,Uruguay,5.114143
3,Egypt,4.088133
2,Saudi Arabia,1.341351


Morocco Iran
Portugal Spain
Portugal Morocco
Iran Spain
Spain Morocco
Iran Portugal
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
0,Spain,6.323488
1,Portugal,4.834451
3,Morocco,3.629951
2,Iran,1.88377


France Australia
Peru Denmark
Denmark Australia
France Peru
Australia Peru
Denmark France
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
0,France,6.205457
1,Denmark,4.846324
2,Peru,3.453237
3,Australia,2.224608


Argentina Iceland
Croatia Nigeria
Argentina Croatia
Nigeria Iceland
Nigeria Argentina
Iceland Croatia
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
1,Argentina,5.937705
0,Croatia,5.303785
2,Nigeria,3.996137
3,Iceland,1.541562


Costa Rica Serbia
Brazil Switzerland
Brazil Costa Rica
Serbia Switzerland
Serbia Brazil
Switzerland Costa Rica
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
0,Brazil,6.552092
2,Serbia,4.067825
1,Switzerland,3.703845
3,Costa Rica,2.439286


Germany Mexico
Sweden South Korea
South Korea Mexico
Germany Sweden
South Korea Germany
Mexico Sweden
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
2,Germany,6.404963
0,Sweden,4.66988
1,Mexico,3.98533
3,South Korea,1.732998


Belgium Panama
Tunisia England
Belgium Tunisia
England Panama
Panama Tunisia
England Belgium
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
1,England,6.992461
0,Belgium,5.131365
2,Tunisia,3.887686
3,Panama,0.975795


Colombia Japan
Poland Senegal
Japan Senegal
Poland Colombia
Japan Poland
Senegal Colombia
The expected table standing is ... 


Unnamed: 0,Team,Expected Points
2,Poland,5.726649
0,Colombia,4.380731
1,Senegal,3.708693
3,Japan,2.684371


### Accuracy measure on group test set
Based on the probability of the match outcomes, we will assign the predicted match outcome and compare against the actual match outcome to measure the accuracy of the group stage model.

In [100]:
test_group_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True
8707,2018-06-14,Russia,Saudi Arabia,5,0,FIFA World Cup,Moscow,Russia,False,2018,Group,Russia,0,5,Home win,International,0,0,0,0,0,0
8708,2018-06-15,Egypt,Uruguay,0,1,FIFA World Cup,Ekaterinburg,Russia,True,2018,Group,Uruguay,0,-1,Away win,International,0,0,0,0,0,1
8709,2018-06-15,Morocco,Iran,0,1,FIFA World Cup,Saint Petersburg,Russia,True,2018,Group,Iran,0,-1,Away win,International,0,0,0,0,0,1
8710,2018-06-15,Portugal,Spain,3,3,FIFA World Cup,Sochi,Russia,True,2018,Group,Draw,0,0,Draw,International,0,0,0,0,0,1
8711,2018-06-16,France,Australia,2,1,FIFA World Cup,Kazan,Russia,True,2018,Group,France,0,1,Home win,International,0,0,0,0,0,1


In [101]:
# No need to subset into the 8 groups now
#expected_points = {team : 0 for team in group}
expected_goal_difference = []
expected_match_outcome = []
home_country = []
away_country = []

for i , row in test_group_df.iterrows():

    home = row['home_team'] 
    away = row['away_team']
    
    home_country.append(home)
    away_country.append(away)

    # check
    #print(home, away)

    home_advantage = 0
    if row['home_team'] == 'Russia' or row['away_team'] == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])
    expected_goal_diff = expected_home_goals - expected_away_goals
    expected_goal_difference.append(expected_goal_diff)

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    #probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    #print(p_draw)
        
    if (p_home_win > p_away_win) and (p_home_win > p_draw):
        expected_outcome = 'Home win'
    elif (p_home_win < p_away_win) and (p_away_win > p_draw):
        expected_outcome = 'Away win'   
    else:
        expected_outcome = 'Draw'
    expected_match_outcome.append(expected_outcome)  
    
predictions = pd.DataFrame(data= [home_country, away_country, expected_goal_difference, expected_match_outcome], index=['home_team', 'away_team', 'Expected_goal_difference', 'Expected_match_outcome']).transpose()
test_group_predict_df = test_group_df.merge(predictions, how='left', on=['home_team', 'away_team'])
test_group_predict_df['Correct_prediction'] = np.where(test_group_predict_df['match_outcome'] == test_group_predict_df['Expected_match_outcome'], 1, 0)
display(test_group_predict_df)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True,Expected_goal_difference,Expected_match_outcome,Correct_prediction
0,2018-06-14,Russia,Saudi Arabia,5,0,FIFA World Cup,Moscow,Russia,False,2018,Group,Russia,0,5,Home win,International,0,0,0,0,0,0,2.292119,Home win,1
1,2018-06-15,Egypt,Uruguay,0,1,FIFA World Cup,Ekaterinburg,Russia,True,2018,Group,Uruguay,0,-1,Away win,International,0,0,0,0,0,1,-0.400035,Away win,1
2,2018-06-15,Morocco,Iran,0,1,FIFA World Cup,Saint Petersburg,Russia,True,2018,Group,Iran,0,-1,Away win,International,0,0,0,0,0,1,0.639794,Home win,0
3,2018-06-15,Portugal,Spain,3,3,FIFA World Cup,Sochi,Russia,True,2018,Group,Draw,0,0,Draw,International,0,0,0,0,0,1,-0.640894,Away win,0
4,2018-06-16,France,Australia,2,1,FIFA World Cup,Kazan,Russia,True,2018,Group,France,0,1,Home win,International,0,0,0,0,0,1,1.595455,Home win,1
5,2018-06-16,Argentina,Iceland,1,1,FIFA World Cup,Moscow,Russia,True,2018,Group,Draw,0,0,Draw,International,0,0,0,0,0,1,2.054899,Home win,0
6,2018-06-16,Peru,Denmark,0,1,FIFA World Cup,Saransk,Russia,True,2018,Group,Denmark,0,-1,Away win,International,0,0,0,0,0,1,-0.573704,Away win,1
7,2018-06-16,Croatia,Nigeria,2,0,FIFA World Cup,Kaliningrad,Russia,True,2018,Group,Croatia,0,2,Home win,International,0,0,0,0,0,1,0.500713,Home win,1
8,2018-06-17,Costa Rica,Serbia,0,1,FIFA World Cup,Samara,Russia,True,2018,Group,Serbia,0,-1,Away win,International,0,0,0,0,0,1,-0.721405,Away win,1
9,2018-06-17,Germany,Mexico,0,1,FIFA World Cup,Moscow,Russia,True,2018,Group,Mexico,0,-1,Away win,International,0,0,0,0,0,1,1.003656,Home win,0


In [102]:
test_group_predict_df['Expected_match_outcome'].value_counts()

Home win    28
Away win    20
Name: Expected_match_outcome, dtype: int64

In [103]:
# Calculate the accuracy measure
print("The accuracy score on the group test set is {}%.".format((test_group_predict_df['Correct_prediction'].sum()/len(test_group_predict_df))*100))

The accuracy score on the group test set is 58.333333333333336%.


Our model seems to be fairly accurate in predicting whether the home or away team will win but it fails to predict any draws. Therefore, most incorrect predictions are those for which yhe match ended as draws. This is a model limitation because the Poisson model failts to predict draws with sufficiently high probability.

In [104]:
test_group_predict_df[['home_team', 'away_team','goal_difference','match_outcome','Expected_goal_difference','Expected_match_outcome','Correct_prediction']].head(50)

Unnamed: 0,home_team,away_team,goal_difference,match_outcome,Expected_goal_difference,Expected_match_outcome,Correct_prediction
0,Russia,Saudi Arabia,5,Home win,2.292119,Home win,1
1,Egypt,Uruguay,-1,Away win,-0.400035,Away win,1
2,Morocco,Iran,-1,Away win,0.639794,Home win,0
3,Portugal,Spain,0,Draw,-0.640894,Away win,0
4,France,Australia,1,Home win,1.595455,Home win,1
5,Argentina,Iceland,0,Draw,2.054899,Home win,0
6,Peru,Denmark,-1,Away win,-0.573704,Away win,1
7,Croatia,Nigeria,2,Home win,0.500713,Home win,1
8,Costa Rica,Serbia,-1,Away win,-0.721405,Away win,1
9,Germany,Mexico,-1,Away win,1.003656,Home win,0


### Knockout phase
Rather than use the predicted group standings from our group-phase model, we decided to use the actual standings because otherwise it would be difficult to attribute the mismatch in results to the knockout-phase model as opposed to the group-phase model.

#### Round of 16

In [110]:
# Specify the countries that qualified to the knockout round
top_finisher = ['Uruguay','France','Brazil','Belgium','Spain', 'Croatia','Sweden','Colombia']
runner_up = ['Portugal', 'Argentina', 'Mexico', 'Japan', 'Russia', 'Denmark', 'Switzerland', 'England']
round_sixteen = list(zip(top_finisher,runner_up))


In [112]:
# Let us predict the outcomes of the round of 16
intercept = model_group_constrained.params.values[208]
continent_dummy = 0 # since all FIFA world cup matches are considered international and this is captured by the base scenario

for pair in round_sixteen:
    home, away = pair[0] , pair[1]
    print("Predicting match outcome for: " + home + " vs " + away)
    
    home_advantage = 0
    if home == 'Russia' or away == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old / (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the round of 16
    round_of_sixteen = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check','Winner']).transpose()
    display(round_of_sixteen)

Predicting match outcome for: Uruguay vs Portugal


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.403384,0.596616,1.0,Portugal


Predicting match outcome for: France vs Argentina


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.514008,0.485992,1.0,France


Predicting match outcome for: Brazil vs Mexico


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.764843,0.235157,1.0,Brazil


Predicting match outcome for: Belgium vs Japan


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.841485,0.158515,1.0,Belgium


Predicting match outcome for: Spain vs Russia


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.627988,0.372012,1.0,Spain


Predicting match outcome for: Croatia vs Denmark


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.576679,0.423321,1.0,Croatia


Predicting match outcome for: Sweden vs Switzerland


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.618543,0.381457,1.0,Sweden


Predicting match outcome for: Colombia vs England


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.153615,0.846385,1.0,England


#### Quarter finals

In [114]:
# Specify the countries that passed the round of 16
top_finisher = ['France','Brazil', 'Spain', 'Sweden']
runner_up = ['Portugal', 'Belgium', 'Croatia', 'England']
quarters = list(zip(top_finisher,runner_up))

In [116]:
for pair in quarters:
    print("Predicting match outcome for: " + pair[0] + " vs " + pair[1])
    home, away = pair[0] , pair[1]

    home_advantage = 0
    if home == 'Russia' or away == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the quarter finals
    quarter_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
    display(quarter_results)

Predicting match outcome for: France vs Portugal


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.564978,0.435022,1.0,France


Predicting match outcome for: Brazil vs Belgium


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.739001,0.260999,1.0,Brazil


Predicting match outcome for: Spain vs Croatia


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.692146,0.307854,1.0,Spain


Predicting match outcome for: Sweden vs England


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.281671,0.718329,1.0,England


#### Semi-finals

In [117]:
# Specify the countries that passed the quarters
top_finisher = ['France', 'Spain']
runner_up = ['Brazil', 'England']
semis = list(zip(top_finisher,runner_up))

In [118]:
for pair in semis:
    print("Predicting match outcome for: " + pair[0] + " vs " + pair[1])
        
    home, away = pair[0], pair[1]

    home_advantage = 0
    if home == 'Russia' or away == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the quarter finals
    semis_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
    display(semis_results)

Predicting match outcome for: France vs Brazil


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.379945,0.620055,1.0,Brazil


Predicting match outcome for: Spain vs England


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.476636,0.523364,1.0,England


#### Final and third-place play off

In [119]:
# Third place
print("Predicting match outcome for the third place.")

home = 'France'
away = 'Spain'

home_advantage = 0        

# we apply the formulas to calculate expected home and away goals
expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

# we calculate the probability of win and loss using the cdf of the skellam distribution
p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
# probability of draw is simply 1 minus the probability of the other two outcomes
p_draw = 1 - p_home_win - p_away_win

# because we cannot have draws in the knockout phase, we must ignore the prob of draw
# and rescale the other probabilities

p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
total = p_home_win + p_away_win

if p_home_win > p_away_win:
    winner = home
else:
    winner = away

# Let's print the results of the quarter finals
playoff_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
display(playoff_results)

Predicting match outcome for the third place.


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.382306,0.617694,1.0,Spain


In [120]:
# Final 
home = 'Brazil'
away = 'England'

home_advantage = 0        

# we apply the formulas to calculate expected home and away goals
expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

# we calculate the probability of win and loss using the cdf of the skellam distribution
p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
# probability of draw is simply 1 minus the probability of the other two outcomes
p_draw = 1 - p_home_win - p_away_win

# because we cannot have draws in the knockout phase, we must ignore the prob of draw
# and rescale the other probabilities

p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
total = p_home_win + p_away_win

if p_home_win > p_away_win:
    winner = home
else:
    winner = away

print("The winner of the Russia 2018 World Cup is {}!!!".format(winner))

# Let's print the results of the quarter finals
final_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
display(final_results)

The winner of the Russia 2018 World Cup is England!!!


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.481831,0.518169,1.0,England


### Accuracy measure on knockout test set

In [121]:
knockout_stage_df = split_group_df.loc[split_group_df.loc[:,'tournament_phase']=='Knockout']
test_knockout_df = knockout_stage_df.loc[(knockout_stage_df.loc[:,'tournament']=='FIFA World Cup') & (knockout_stage_df.loc[:,'year']==2018)]
print(len(test_knockout_df))

16


In [122]:
expected_goal_difference = []
expected_match_outcome = []
home_country = []
away_country = []

for i , row in test_knockout_df.iterrows():

    home = row['home_team'] 
    away = row['away_team']
    
    home_country.append(home)
    away_country.append(away)

    # check
    #print(home, away)

    home_advantage = 0
    if row['home_team'] == 'Russia' or row['away_team'] == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])
    expected_goal_diff = expected_home_goals - expected_away_goals
    expected_goal_difference.append(expected_goal_diff)

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    #probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    #print(p_draw)
        
    if (p_home_win > p_away_win) and (p_home_win > p_draw):
        expected_outcome = 'Home win'
    elif (p_home_win < p_away_win) and (p_away_win > p_draw):
        expected_outcome = 'Away win'   
    else:
        expected_outcome = 'Draw'
    expected_match_outcome.append(expected_outcome)  
    
predictions = pd.DataFrame(data= [home_country, away_country, expected_goal_difference, expected_match_outcome], index=['home_team', 'away_team', 'Expected_goal_difference', 'Expected_match_outcome']).transpose()
test_knockout_predict_df = test_knockout_df.merge(predictions, how='left', on=['home_team', 'away_team'])
test_knockout_predict_df['Correct_prediction'] = np.where(test_knockout_predict_df['match_outcome'] == test_knockout_predict_df['Expected_match_outcome'], 1, 0)
display(test_knockout_predict_df)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,tournament_phase,winner,penalty_shootout,goal_difference,match_outcome,continent_football,Continent_Africa,Continent_Asia & Oceania,Continent_Europe,Continent_North & Central America,Continent_South America,Home_advantage_True,Expected_goal_difference,Expected_match_outcome,Correct_prediction
0,2018-06-30,France,Argentina,4,3,FIFA World Cup,Kazan,Russia,True,2018,Knockout,France,0,1,Home win,International,0,0,0,0,0,1,0.047409,Home win,1
1,2018-06-30,Uruguay,Portugal,2,1,FIFA World Cup,Sochi,Russia,True,2018,Knockout,Uruguay,0,1,Home win,International,0,0,0,0,0,1,-0.329268,Away win,0
2,2018-07-01,Russia,Spain,1,1,FIFA World Cup,Moscow,Russia,False,2018,Knockout,Russia,1,0,Draw,International,0,0,0,0,0,0,-0.436306,Away win,0
3,2018-07-01,Croatia,Denmark,1,1,FIFA World Cup,Nizhny Novgorod,Russia,True,2018,Knockout,Croatia,1,0,Draw,International,0,0,0,0,0,1,0.263498,Home win,0
4,2018-07-02,Brazil,Mexico,2,0,FIFA World Cup,Samara,Russia,True,2018,Knockout,Brazil,0,2,Home win,International,0,0,0,0,0,1,0.945022,Home win,1
5,2018-07-02,Belgium,Japan,3,2,FIFA World Cup,Rostov-on-Don,Russia,True,2018,Knockout,Belgium,0,1,Home win,International,0,0,0,0,0,1,1.473135,Home win,1
6,2018-07-03,Sweden,Switzerland,1,0,FIFA World Cup,Saint Petersburg,Russia,True,2018,Knockout,Sweden,0,1,Home win,International,0,0,0,0,0,1,0.424747,Home win,1
7,2018-07-03,Colombia,England,1,1,FIFA World Cup,Moscow,Russia,True,2018,Knockout,England,1,0,Away win,International,0,0,0,0,0,1,-1.245062,Away win,1
8,2018-07-06,Uruguay,France,0,2,FIFA World Cup,Nizhny Novgorod,Russia,True,2018,Knockout,France,0,-2,Away win,International,0,0,0,0,0,1,-0.540344,Away win,1
9,2018-07-06,Brazil,Belgium,1,2,FIFA World Cup,Kazan,Russia,True,2018,Knockout,Belgium,0,-1,Away win,International,0,0,0,0,0,1,0.905532,Home win,0


In [123]:
# Calculate the accuracy measure
print("The accuracy score on the knockout test set is {}%.".format((test_knockout_predict_df['Correct_prediction'].sum()/len(test_knockout_predict_df))*100))

The accuracy score on the knockout test set is 56.25%.


In [124]:
test_knockout_predict_df['Correct_prediction'].value_counts()

1    9
0    7
Name: Correct_prediction, dtype: int64

In [125]:
test_knockout_predict_df[['home_team', 'away_team','date','goal_difference','match_outcome','Expected_goal_difference','Expected_match_outcome','Correct_prediction']].head(16)

Unnamed: 0,home_team,away_team,date,goal_difference,match_outcome,Expected_goal_difference,Expected_match_outcome,Correct_prediction
0,France,Argentina,2018-06-30,1,Home win,0.047409,Home win,1
1,Uruguay,Portugal,2018-06-30,1,Home win,-0.329268,Away win,0
2,Russia,Spain,2018-07-01,0,Draw,-0.436306,Away win,0
3,Croatia,Denmark,2018-07-01,0,Draw,0.263498,Home win,0
4,Brazil,Mexico,2018-07-02,2,Home win,0.945022,Home win,1
5,Belgium,Japan,2018-07-02,1,Home win,1.473135,Home win,1
6,Sweden,Switzerland,2018-07-03,1,Home win,0.424747,Home win,1
7,Colombia,England,2018-07-03,0,Away win,-1.245062,Away win,1
8,Uruguay,France,2018-07-06,-2,Away win,-0.540344,Away win,1
9,Brazil,Belgium,2018-07-06,-1,Away win,0.905532,Home win,0


## 2022 FIFA World Cup winner prediction

In [134]:
qatar_match_schedule_df = pd.read_csv("Data/qatar_match_schedule.csv", delimiter=';')
qatar_match_schedule_df.rename(columns={'country1':'home_team', 'coutry2':'away_team'}, inplace=True)
qatar_match_schedule_df.head()

Unnamed: 0,match,date,home_team,away_team,phase
0,1,21/11/2022,Qatar,Ecuador,group matches
1,2,21/11/2022,Senegal,Netherlands,group matches
2,3,21/11/2022,England,Iran,group matches
3,4,21/11/2022,USA,Wales,group matches
4,5,22/11/2022,France,Australia,group matches


In [145]:
# Define a dictionary to extract the attacking abilities of the countries
qatar_world_cup_teams = ['Qatar', 'Netherlands', 'Senegal', 'Ecuador', 'England', 'Wales', 'Iran', 'USA',
                        'Argentina', 'Saudi Arabia', 'Mexico', 'Poland', 'France', 'Australia', 'Denmark', 'Tunisia',
                        'Spain', 'Germany', 'Japan', 'Costa Rica', 'Belgium', 'Morocco', 'Canada', 'Croatia',
                        'Brazil', 'Serbia', 'Cameroon', 'Switzerland', 'Portugal', 'Ghana', 'Uruguay', 'South Korea']

attacking_strength_params = {val.split('_')[1] : model_group_constrained.params.values[i] for i , val in enumerate(model_group_constrained.params.index[:104]) if val.split('_')[1] in qatar_world_cup_teams}

# Similarly, define a dictionary to extract the defensive abilities of the countries
defensive_strength_params = {val.split('_')[1] : model_group_constrained.params.values[104 + i] for i , val in enumerate(model_group_constrained.params.index[:104]) if val.split('_')[1] in qatar_world_cup_teams}

In [146]:
# Define a list of countries which played in the 2022 world cup

# Define the countries in each group
group_A = ['Qatar', 'Netherlands', 'Senegal', 'Ecuador']
group_B = ['England', 'Wales', 'Iran', 'USA']
group_C = ['Argentina', 'Saudi Arabia', 'Mexico', 'Poland']
group_D = ['France', 'Australia', 'Denmark', 'Tunisia']
group_E = ['Spain', 'Germany', 'Japan', 'Costa Rica']
group_F = ['Belgium', 'Morocco', 'Canada', 'Croatia']
group_G = ['Brazil', 'Serbia', 'Cameroon', 'Switzerland']
group_H = ['Portugal', 'Ghana', 'Uruguay', 'South Korea']

# Create a list of lists for the groups
world_cup_groups = [group_A, group_B, group_C, group_D, group_E, group_F, group_G, group_H]

In [147]:
# Let us predict the number of points per group
intercept = model_group_constrained.params.values[208]
continent_dummy = 0 # since all FIFA world cup matches are considered international and this is captured by the base scenario

for group in world_cup_groups:
    #We create an empty dictionary for each group to calculate the expected number of points   
    expected_points = {team : 0 for team in group}
    
    # subset the games of each group only
    sub_group_df = qatar_match_schedule_df.loc[(qatar_match_schedule_df.loc[:,'home_team'].isin(group)) | (qatar_match_schedule_df.loc[:,'away_team'].isin(group))]
    for i , row in sub_group_df.iterrows():
        
        home = row['home_team'] 
        away = row['away_team']
        
        # check
        print(home, away)
        
        home_advantage = 0        
        
        # we apply the formulas to calculate expected home and away goals
        expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
        expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])
        
        # we calculate the probability of win and loss using the cdf of the skellam distribution
        p_home_win = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
        p_away_win = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
        # probability of draw is simply 1 minus the probability of the other two outcomes
        p_draw = 1 - p_home_win - p_away_win
        
        # calculate the expected number of points (loss is omitted as it awards zero points)
        expected_points_home = 3 * p_home_win + 1 * p_draw
        expected_points_away = 3 * p_away_win + 1 * p_draw

        # collect the points and print the outcomes
        expected_points[home] += expected_points_home
        expected_points[away] += expected_points_away
        
    # Let's see the final results by sorting the dataframe according to expected points
    group_table = pd.DataFrame({'Team' : list(expected_points.keys()), 'Expected Points' : list(expected_points.values())})
    group_table = group_table.sort_values(by='Expected Points', ascending=False)
    print("The expected table standing for 2022 World Cup is ... ")
    display(group_table)

Qatar Ecuador
Senegal Netherlands
Qatar Senegal
Netherlands Ecuador
Ecuador Senegal
Netherlands Qatar
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
1,Netherlands,7.575522
2,Senegal,4.202604
3,Ecuador,4.085887
0,Qatar,1.144191


England Iran
USA Wales
Wales Iran
England USA
Wales England
Iran USA
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,England,7.161362
1,Wales,3.795264
3,USA,3.134886
2,Iran,2.63188


Mexico Poland
Argentina Saudi Arabia
Poland Saudi Arabia
Argentina Mexico
Poland Argentina
Saudi Arabia Mexico
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,Argentina,6.105358
3,Poland,4.888039
2,Mexico,4.758207
1,Saudi Arabia,1.130761


France Australia
Denmark Tunisia
Tunisia Australia
France Denmark
Australia Denmark
Tunisia France
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,France,6.093868
2,Denmark,4.740189
3,Tunisia,3.679202
1,Australia,2.164976


Spain Costa Rica
Germany Japan
Japan Costa Rica
Spain Germany
Japan Spain
Costa Rica Germany
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,Spain,6.442523
1,Germany,6.353046
3,Costa Rica,2.623509
2,Japan,1.582412


Belgium Canada
Morocco Croatia
Belgium Morocco
Croatia Canada
Croatia Belgium
Canada Morocco
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
3,Croatia,5.342008
0,Belgium,5.081389
1,Morocco,4.279375
2,Canada,1.921481


Switzerland Cameroon
Brazil Serbia
Cameroon Serbia
Brazil Switzerland
Serbia Switzerland
Cameroon Brazil
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,Brazil,6.49569
1,Serbia,4.006692
3,Switzerland,3.645519
2,Cameroon,2.58213


Uruguay South Korea
Portugal Ghana
South Korea Ghana
Portugal Uruguay
Ghana Uruguay
South Korea Portugal
The expected table standing for 2022 World Cup is ... 


Unnamed: 0,Team,Expected Points
0,Portugal,5.744145
2,Uruguay,4.891408
1,Ghana,3.496903
3,South Korea,2.412352


#### Round of 16

In [148]:
# Specify the countries that qualified to the knockout round
top_finisher = ['Netherlands','England','Argentina','France','Spain', 'Croatia','Brazil','Portugal']
runner_up = ['Wales', 'Senegal', 'Denmark', 'Poland', 'Belgium', 'Germany', 'Uruguay', 'Serbia']
round_sixteen = list(zip(top_finisher,runner_up))

In [149]:
intercept = model_group_constrained.params.values[208]
continent_dummy = 0 # since all FIFA world cup matches are considered international and this is captured by the base scenario

for pair in round_sixteen:
    home, away = pair[0] , pair[1]
    print("Predicting match outcome for: " + home + " vs " + away)
    
    home_advantage = 0      

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old / (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the round of 16
    round_of_sixteen = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check','Winner']).transpose()
    display(round_of_sixteen)

Predicting match outcome for: Netherlands vs Wales


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.819,0.181,1.0,Netherlands


Predicting match outcome for: England vs Senegal


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.88551,0.11449,1.0,England


Predicting match outcome for: Argentina vs Denmark


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.649226,0.350774,1.0,Argentina


Predicting match outcome for: France vs Poland


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.664158,0.335842,1.0,France


Predicting match outcome for: Spain vs Belgium


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.739248,0.260752,1.0,Spain


Predicting match outcome for: Croatia vs Germany


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.322631,0.677369,1.0,Germany


Predicting match outcome for: Brazil vs Uruguay


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.755804,0.244196,1.0,Brazil


Predicting match outcome for: Portugal vs Serbia


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.613433,0.386567,1.0,Portugal


#### Quarter finals

In [152]:
# Specify the countries that qualified to the quarters
top_finisher = ['Netherlands','England','Spain','Portugal']
runner_up = ['Argentina', 'France', 'Brazil', 'Germany']
quarters = list(zip(top_finisher,runner_up))

In [153]:
for pair in quarters:
    print("Predicting match outcome for: " + pair[0] + " vs " + pair[1])
    home, away = pair[0] , pair[1]

    home_advantage = 0
    if home == 'Russia' or away == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the quarter finals
    quarter_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
    display(quarter_results)

Predicting match outcome for: Netherlands vs Argentina


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.580629,0.419371,1.0,Netherlands


Predicting match outcome for: England vs France


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.63561,0.36439,1.0,England


Predicting match outcome for: Spain vs Brazil


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.495081,0.504919,1.0,Brazil


Predicting match outcome for: Portugal vs Germany


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.33773,0.66227,1.0,Germany


#### Semi-finals

In [154]:
# Specify the countries that qualified to the quarters
top_finisher = ['Netherlands','England']
runner_up = [ 'Brazil', 'Germany']
semis = list(zip(top_finisher,runner_up))

In [155]:
for pair in semis:
    print("Predicting match outcome for: " + pair[0] + " vs " + pair[1])
        
    home, away = pair[0], pair[1]

    home_advantage = 0
    if home == 'Russia' or away == 'Russia':
        home_advantage = model_group_constrained.params.values[209]          

    # we apply the formulas to calculate expected home and away goals
    expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
    expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

    # we calculate the probability of win and loss using the cdf of the skellam distribution
    p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
    p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
    # probability of draw is simply 1 minus the probability of the other two outcomes
    p_draw = 1 - p_home_win - p_away_win
    
    # because we cannot have draws in the knockout phase, we must ignore the prob of draw
    # and rescale the other probabilities
    
    p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
    p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
    total = p_home_win + p_away_win
    
    if p_home_win > p_away_win:
        winner = home
    else:
        winner = away
              
    # Let's print the results of the quarter finals
    semis_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
    display(semis_results)

Predicting match outcome for: Netherlands vs Brazil


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.444147,0.555853,1.0,Brazil


Predicting match outcome for: England vs Germany


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.549106,0.450894,1.0,England


#### Third place

In [156]:
# Third place
print("Predicting match outcome for the third place.")

home = 'Netherlands'
away = 'Germany'

home_advantage = 0        

# we apply the formulas to calculate expected home and away goals
expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

# we calculate the probability of win and loss using the cdf of the skellam distribution
p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
# probability of draw is simply 1 minus the probability of the other two outcomes
p_draw = 1 - p_home_win - p_away_win

# because we cannot have draws in the knockout phase, we must ignore the prob of draw
# and rescale the other probabilities

p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
total = p_home_win + p_away_win

if p_home_win > p_away_win:
    winner = home
else:
    winner = away

# Let's print the results of the quarter finals
playoff_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
display(playoff_results)

Predicting match outcome for the third place.


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.471455,0.528545,1.0,Germany


#### Final

In [157]:
# Final 
home = 'Brazil'
away = 'England'

home_advantage = 0        

# we apply the formulas to calculate expected home and away goals
expected_home_goals = np.exp(intercept + continent_dummy + home_advantage + attacking_strength_params[home] + defensive_strength_params[away])
expected_away_goals = np.exp(intercept + continent_dummy + attacking_strength_params[away] + defensive_strength_params[home])

# we calculate the probability of win and loss using the cdf of the skellam distribution
p_home_win_old = 1 - skellam.cdf(0, expected_home_goals, expected_away_goals)
p_away_win_old = 1 - skellam.cdf(0, expected_away_goals, expected_home_goals)
# probability of draw is simply 1 minus the probability of the other two outcomes
p_draw = 1 - p_home_win - p_away_win

# because we cannot have draws in the knockout phase, we must ignore the prob of draw
# and rescale the other probabilities

p_home_win = p_home_win_old/ (p_home_win_old + p_away_win_old)
p_away_win = p_away_win_old/ (p_home_win_old + p_away_win_old)
total = p_home_win + p_away_win

if p_home_win > p_away_win:
    winner = home
else:
    winner = away

print("The winner of the Qatar 2022 World Cup is {}!!!".format(winner))

# Let's print the results of the quarter finals
final_results = pd.DataFrame(data= [p_home_win, p_away_win, total, winner], index=['Home win probability', 'Away win probability', 'Sense check', 'Winner']).transpose()
display(final_results)

The winner of the Qatar 2022 World Cup is England!!!


Unnamed: 0,Home win probability,Away win probability,Sense check,Winner
0,0.481831,0.518169,1.0,England


## Conclusion