* The Qatar World Cup runs from 20 November to 18 December 2022
* Read the original paper from :: https://iopscience.iop.org/article/10.1088/1742-6596/1108/1/012066/pdf
* The codebase is borrowed from Frank Andrade's implementation : https://towardsdatascience.com/predicting-the-fifa-world-cup-2022-with-a-simple-model-using-python-6b34bdd4f2a5
* [I am still in the process of finding an improvement in his code lolz :D]
* The code only works for `Python=3.8` version btw :P

# 0. Modules and Dataset Import

`N.B.` If you get a notification that the `unpickle_load` module is not present, the solution is to update Python to 3.8.15
* The paper assumes that the number of goals scored by a team follows a `univariate Poisson distribution`
* For their model they used a `Poisson-regression model`
* The covariates of this model are:
    * a) goal average per match
    * b) home-team advantage
    * c) team's offensive power
    * d) opponent team's defensive power
* Also, assume that the goals scored by two teams are two `independent events`
* Some folks have also used the `Skellam’s distribution` to model the difference in goals between home and away teams (Skellam is the distribution of the difference between two independent Poisson counts)
* To evaluate, the `de Finetti` measure is used to assess performance
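
To make the Skellam remark concrete, here is a minimal sketch using `scipy.stats.skellam`. The two scoring rates are made-up values for illustration, not estimates from the paper or from the dataset used below.

```python
from scipy.stats import poisson, skellam

# Hypothetical expected goals per match (illustrative values only)
mu_home, mu_away = 1.6, 1.1

# Goal difference D = home goals - away goals follows Skellam(mu_home, mu_away)
p_draw = skellam.pmf(0, mu_home, mu_away)       # P(D = 0)
p_home_win = skellam.sf(0, mu_home, mu_away)    # P(D > 0)
p_away_win = skellam.cdf(-1, mu_home, mu_away)  # P(D < 0)

# Cross-check the draw probability with the product of two Poisson pmfs
p_draw_check = sum(
    poisson.pmf(k, mu_home) * poisson.pmf(k, mu_away) for k in range(50))

print(round(p_draw, 4), round(p_home_win, 4), round(p_away_win, 4))
```

The cross-check mirrors the independence assumption: a draw at `k`-`k` has probability equal to the product of the two Poisson probabilities.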

In [1]:
# !pip install pandas
# !pip install scipy

* `dict_table` was scraped by the author from different websites and then saved as a pickle file
* check webscraping of football from here :: https://www.youtube.com/watch?v=XDIscigGpGI
* the author has also stored `clean` versions of the data --> these already include all the required pre-processing stages, you can review these steps at :: https://towardsdatascience.com/a-straightforward-guide-to-cleaning-and-preparing-data-in-python-8c82f209ae33
* `df_historical_data` has the data from the past ranging from years *1930* to *2018*
* the columns of `dict_table` start at zero since we have not yet run any simulations, made any predictions, or used any models
* the columns fill in as the model predicts the results of the upcoming matches in Qatar 2022

In [2]:
import pandas as pd
import pickle as pkl
from scipy.stats import poisson

# ---------------------------------------------------------------------------------- #

# pass the path directly so pandas opens and closes the file itself
dict_table = pd.read_pickle('./dataset/dict_table')
df_historical_data = pd.read_csv('./dataset/clean_fifa_worldcup_matches.csv')
df_fixture = pd.read_csv('./dataset/clean_fifa_worldcup_fixture.csv')

# 1. Team Strength Quantification

**STEP 1:**

* Split the historical data into two data frames: `df_home` and `df_away`
* `df_home` will store columns: `HomeTeam` , `HomeGoals` , `AwayGoals`
* `df_away` will store columns : `AwayTeam` , `HomeGoals` , `AwayGoals`

In [3]:
df_home = df_historical_data[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df_historical_data[['AwayTeam', 'HomeGoals', 'AwayGoals']]

**STEP 2:**

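* Rename the columns so that both data frames share the same names (`Team`, `Goals Scored`, `Goals Conceded`); note that for `df_away` the home goals become the goals conceded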

In [4]:
df_home = df_home.rename(columns={
    'HomeTeam': 'Team', 'HomeGoals': 'Goals Scored', 'AwayGoals': 'Goals Conceded'})
df_away = df_away.rename(columns={
    'AwayTeam': 'Team', 'HomeGoals': 'Goals Conceded', 'AwayGoals': 'Goals Scored'})

**STEP 3:**

* Concatenate `df_home` and `df_away` and calculate mean for goals scored and conceded
* This gives us an average measure of how good or bad a team is, its `Team Strength`
* This is going to be the base of all the predictions

In [5]:
df_team_strength = pd.concat(
    [df_home, df_away], ignore_index=True).groupby('Team').mean()
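
To see what Steps 1 to 3 produce, here is a small sketch on made-up results. The teams `A`, `B`, `C` and their scores are invented for illustration; only the column names mirror `df_historical_data`.

```python
import pandas as pd

# Made-up results for illustration; column names mirror df_historical_data
toy_matches = pd.DataFrame({
    'HomeTeam': ['A', 'B', 'A'],
    'AwayTeam': ['B', 'C', 'C'],
    'HomeGoals': [2, 1, 0],
    'AwayGoals': [1, 1, 3],
})

# Same split and rename as in Steps 1 and 2
toy_home = toy_matches[['HomeTeam', 'HomeGoals', 'AwayGoals']].rename(columns={
    'HomeTeam': 'Team', 'HomeGoals': 'Goals Scored', 'AwayGoals': 'Goals Conceded'})
toy_away = toy_matches[['AwayTeam', 'HomeGoals', 'AwayGoals']].rename(columns={
    'AwayTeam': 'Team', 'HomeGoals': 'Goals Conceded', 'AwayGoals': 'Goals Scored'})

# Stack both views and average per team, as in Step 3
toy_strength = pd.concat(
    [toy_home, toy_away], ignore_index=True).groupby('Team').mean()
print(toy_strength)
```

Team `A` played two home matches (2-1 and 0-3), so it averages 1.0 goals scored and 2.0 conceded; team `C` played two away matches and averages 2.0 scored and 0.5 conceded.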

# 2. Setup Predictive Function with Poisson

## 2.1. Description

* The idea here is to predict how many points each team is going to get
* In total we have 8 groups: groups A to H
* Each group has several matches ...
* ... When a team wins a match --> it gets `3` points
* If the match is a draw, each team gets `1` point
* And if a team loses, it gets `0` points
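
The points rule above can be written as a tiny helper. The name `match_points` is hypothetical and is not part of the original codebase.

```python
def match_points(home_goals, away_goals):
    """Return (home_points, away_points) under the 3/1/0 rule."""
    if home_goals > away_goals:
        return 3, 0
    if home_goals < away_goals:
        return 0, 3
    return 1, 1  # draw: one point each


print(match_points(2, 1))
print(match_points(0, 0))
```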

---

* When picking a distribution to model an event, check that the dynamics of the event `match` the underlying `statistical` `assumptions` of that distribution
* Assumptions made in a Poisson distribution setting:

    * The number of events can be `counted`
        > In the football example, the event is a goal, which can of course be counted: we cannot have 1.5 or 3.7 goals, so the count is discrete

    * The `occurrence` of events is `independent`
        > In the football example, one goal does not affect the probability of another goal

        > Obviously we could model this in a more complex way, e.g. a team's `motivation` raising its chance of scoring, but in the simple case we assume independence, since that factor may not always be `highly correlated`

    * The `rate` at which the event occurs is `consistent`

        > I don't agree with this point for the football example, but the author says that the probability of a goal in one match will be the same in another match. Not sure if this is true --> [needs more research]

    * Two `separate` events cannot occur at `exactly` the `same time`

        > Obviously a goal by one team cannot happen at the same instant as a goal by the opposing team. The two events cannot be `simultaneous in time`, so this assumption fits more closely than the rest

---

The Poisson distribution, mathematically, takes the form

\begin{equation}
    P(\text{X} = x) = \frac{\lambda^x e^{- \lambda}}{x!}
\end{equation}

* The main parameter is $\lambda$, which represents the expected number of events in a time interval
* In the football example, $\lambda$ represents the `mean` number of goals within the `90min` time interval
* This holds for both Team A and Team B, so we have two lambdas: $\lambda_A$ and $\lambda_B$
* The $x$ represents the `number of goals` in a `match`; these could be scored by A (home), $x_A$, or B (away), $x_B$
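
As a quick check of the formula, the sketch below evaluates it by hand and compares it with `scipy.stats.poisson.pmf`. The rate `lam = 1.5` is an arbitrary illustrative value and `poisson_pmf` is a hypothetical helper name.

```python
import math
from scipy.stats import poisson


def poisson_pmf(x, lam):
    """P(X = x) = lam**x * exp(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)


lam = 1.5  # hypothetical mean number of goals per match
for x in range(4):
    # the hand-written formula and scipy should agree
    print(x, round(poisson_pmf(x, lam), 4), round(poisson.pmf(x, lam), 4))
```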

---

## 2.2. Main Predictive Function

In [6]:
def predictiveFunc(home, away):
    """
    @desc: if both the home and away teams are in the df_team_strength index, return the
    expected points (points_home, points_away) under independent Poisson goal counts;
    otherwise return (0, 0)
    """
    if home in df_team_strength.index and away in df_team_strength.index:

        # assign lambda_a and lambda_b
        lambda_home = df_team_strength.at[home, 'Goals Scored'] * \
            df_team_strength.at[away, 'Goals Conceded']
        lambda_away = df_team_strength.at[away, 'Goals Scored'] * \
            df_team_strength.at[home, 'Goals Conceded']

        # initialise some probability variables
        prob_home, prob_away, prob_draw = 0, 0, 0

        # loop through number of goals to be made by the home team
        for x in range(0, 11):
            # loop through number of goals to be made by the away team
            for y in range(0, 11):
                # joint probability of the scoreline (x, y) from the two Poisson pmfs
                p = poisson.pmf(x, lambda_home) * poisson.pmf(y, lambda_away)

                # if the scores are equal, add p to the draw probability
                if x == y:
                    prob_draw += p

                # if the home scores more than away team, increase prob_home by p
                elif x > y:
                    prob_home += p

                # if the home scores less than away team, increase prob_away by p
                elif x < y:
                    prob_away += p

        # expected points: 3 for a win and 1 for a draw, each weighted by its probability
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw

        return points_home, points_away

    else:
        # this is for the condition when a new team joins that has no historical data (participation in World Cup) that is needed to predict their performance
        return (0, 0)
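
As an alternative view of the double loop, here is a sketch that builds the whole scoreline grid at once with NumPy. It is not the original author's implementation: the function name `expected_points`, the `max_goals` parameter, and the two-team strength table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import poisson

# Made-up strengths for two hypothetical teams, same columns as df_team_strength
toy_strength = pd.DataFrame(
    {'Goals Scored': [1.8, 1.2], 'Goals Conceded': [0.9, 1.4]},
    index=['Team A', 'Team B'])


def expected_points(home, away, strength, max_goals=10):
    """Expected points for home and away under independent Poisson scoring."""
    lambda_home = strength.at[home, 'Goals Scored'] * strength.at[away, 'Goals Conceded']
    lambda_away = strength.at[away, 'Goals Scored'] * strength.at[home, 'Goals Conceded']

    goals = np.arange(max_goals + 1)
    # score_probs[x, y] = P(home scores x) * P(away scores y)
    score_probs = np.outer(
        poisson.pmf(goals, lambda_home), poisson.pmf(goals, lambda_away))

    prob_home = np.tril(score_probs, k=-1).sum()  # x > y, below the diagonal
    prob_draw = np.trace(score_probs)             # x == y, on the diagonal
    prob_away = np.triu(score_probs, k=1).sum()   # x < y, above the diagonal

    return 3 * prob_home + prob_draw, 3 * prob_away + prob_draw


points_a, points_b = expected_points('Team A', 'Team B', toy_strength)
print(round(points_a, 2), round(points_b, 2))
```

Because a draw hands out two points in total rather than three, the two expected values always sum to slightly less than 3.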


# 3. World Cup Prediction

## 3.1. Split the Fixtures Into Stages

* The stages can be divided into
    * Group stage (48 matches)
    * Knockout (round of 16, 8 matches)
    * Quarter-finals (4 matches)
    * Semi-finals (2 matches)
    * Final (the remaining rows)

In [7]:
df_fixture_group_48 = df_fixture[:48].copy()
df_fixture_knockout = df_fixture[48:56].copy()
df_fixture_quarter = df_fixture[56:60].copy()
df_fixture_semi = df_fixture[60:62].copy()
df_fixture_final = df_fixture[62:].copy()
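
A quick way to sanity-check the slice boundaries is to count the rows per stage. The sketch below uses a placeholder table and assumes the fixture list has 64 rows; `toy_fixture`, `stages` and the stage labels are hypothetical names.

```python
import pandas as pd

# Placeholder fixture list with 64 rows, standing in for df_fixture (assumption)
toy_fixture = pd.DataFrame({'match': range(1, 65)})

# Same slice boundaries as the cell above
stages = {
    'group': toy_fixture[:48],
    'round_of_16': toy_fixture[48:56],
    'quarter': toy_fixture[56:60],
    'semi': toy_fixture[60:62],
    'final': toy_fixture[62:],
}
stage_sizes = {name: len(df) for name, df in stages.items()}
print(stage_sizes)
```

With 64 rows the last slice holds two rows, so it is worth checking what the final two rows of the real `df_fixture` contain before simulating that stage.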

## 3.2. Simulate Running of the Matches (Pre-Knockout)

* Before running the simulation, it is necessary to understand the football dynamics, i.e. how the matches are decided
* The participating teams are divided into groups of four
* Each group has six matches, so a total of 48 matches take place before the knockout phase
* You can see the teams of each group through `dict_table['Group X']['Team']`, where X is any of the groups A-H
* The algorithm is simple, it follows :
    * Divide the teams into their respective groups using the command `dict_table['Group X']['Team'].values`
    * Store this as the variable `teams_in_group`
    * Each group has 6 matches (each team plays 3); find them in `df_fixture_group_48` by keeping the rows whose home team is in `teams_in_group`
    * Store this as `df_fixture_group_6`
    * We have the matches, now loop through them and predict points with the Poisson model from section 2.2
    * Once the predictions are made, match the respective team in the `dict_table` and update the `Pts` column, which represents the number of points
    * The last few steps are mainly post-processing
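
The match counts above follow from simple round-robin arithmetic, sketched here with placeholder team names.

```python
from itertools import combinations

# Hypothetical group; the team names are placeholders
teams_in_group = ['Team 1', 'Team 2', 'Team 3', 'Team 4']

# Every pair of teams meets once in a round-robin group
group_matches = list(combinations(teams_in_group, 2))

matches_per_group = len(group_matches)       # pairs of 4 teams
matches_per_team = len(teams_in_group) - 1   # each team meets the other three
total_group_matches = 8 * matches_per_group  # 8 groups, A to H
print(matches_per_group, matches_per_team, total_group_matches)
```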

In [9]:
for group in dict_table:
    # isolate the matches based on the teams in the groups
    teams_in_group = dict_table[group]['Team'].values
    df_fixture_group_6 = df_fixture_group_48[df_fixture_group_48['home'].isin(teams_in_group)]

    # loop through the matches and predict
    for index, row in df_fixture_group_6.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predictiveFunc(home, away)
        dict_table[group].loc[dict_table[group]['Team'] == home, 'Pts'] += points_home
        dict_table[group].loc[dict_table[group]['Team'] == away, 'Pts'] += points_away

    # sort the table based on the points in descending fashion and then reset_index
    dict_table[group] = dict_table[group].sort_values('Pts', ascending=False).reset_index(drop=True)
    # select only the 'Team' and 'Pts' columns
    dict_table[group] = dict_table[group][['Team', 'Pts']]
    # round the values
    dict_table[group] = dict_table[group].round(0)

In [11]:
# display the table
dict_table['Group H']

| | Team | Pts |
|---|---|---|
| 0 | Portugal | 6.0 |
| 1 | Uruguay | 5.0 |
| 2 | Ghana | 4.0 |
| 3 | South Korea | 2.0 |


## 3.3. Simulate Running of the Matches (Knockout)