# Bayesian Inference - 2019 FIFA Women's World Cup

#### General Remarks

Predicting the World Cup is difficult because there is little historical data to learn from.

* Very little team data
  + Different teams every 4 years
  + Different players on each team

* Some player data

### 2019 FIFA Women's World Cup Data

- Data downloaded from [538 by Nate Silver](https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/)

- Number of teams participating: 24
- Per-team Soccer Power Index (SPI) score - team ranking prior to start of tournament
- Number of matches played through the end of the quarterfinal round: 48
- Match data:
    + identity of team 1
    + identity of team 2
    + goals scored by team 1
    + goals scored by team 2
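
Match records of this shape have to be turned into integer-indexed arrays before Stan can use them. The following is a minimal sketch of that step, not the code that produced `wwc_2019.data.json`, whose contents are not shown here. The field names `N`, `team_1`, `team_2` and `y` are taken from the `generated quantities` block further down; the record keys and the toy teams and scores are made up for illustration.

```python
import json

# Hypothetical match records; team names and scores are invented for illustration.
toy_matches = [
    {'team_1': 'Team A', 'team_2': 'Team B', 'score_1': 3, 'score_2': 0},
    {'team_1': 'Team B', 'team_2': 'Team C', 'score_1': 1, 'score_2': 1},
    {'team_1': 'Team C', 'team_2': 'Team A', 'score_1': 0, 'score_2': 2},
]

# Give every team an integer id; Stan arrays are indexed from 1.
teams = sorted({m[key] for m in toy_matches for key in ('team_1', 'team_2')})
team_index = {name: i + 1 for i, name in enumerate(teams)}

# Assumed data layout: one entry per match, outcome = score differential.
stan_data = {
    'N': len(toy_matches),
    'team_1': [team_index[m['team_1']] for m in toy_matches],
    'team_2': [team_index[m['team_2']] for m in toy_matches],
    'y': [m['score_1'] - m['score_2'] for m in toy_matches],
}

stan_json = json.dumps(stan_data, indent=2)
print(stan_json)
```

Modelling the score differential rather than the two scores separately keeps the outcome one-dimensional, which matches the single `y_rep` value per match used in the posterior predictive check.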


In [None]:
import numpy as np
import pandas as pd
matches = pd.read_csv('womens_world_cup_2019.csv')
matches.head(7)

In [None]:
countries = pd.read_csv('country_prior.csv')

In [None]:
print(countries.head())
print('...')
print(countries.tail())

## Stan code

In [None]:
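from cmdstanpy import CmdStanModel

model_wwc = CmdStanModel(stan_file='worldcup_pydata.stan')
# Recent cmdstanpy versions already compile when the model is instantiated,
# so this explicit call may be redundant there.
model_wwc.compile()
print(model_wwc.code())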

In [None]:
wwc_fit = model_wwc.sample(data='wwc_2019.data.json')
wwc_fit.summary().round(decimals=2)

### Estimate of per-team ability

In [None]:
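# Work with the summary table (one row per parameter)
wwc_summary = wwc_fit.summary().round(decimals=2)
rownames = wwc_summary.index.tolist()

# Keep the ability[...] rows and label them by country.
# Assumes country_prior.csv lists teams in the same order as the Stan team indices.
ability_filter = [param for param in rownames if param.startswith('ability')]
mapping = dict(zip(ability_filter, countries['country'].tolist()))
# rename() returns a new frame, which avoids modifying a slice in place
abilities = wwc_summary.loc[ability_filter].rename(index=mapping)
# Select the interval columns by name rather than by position
abilities[['5%', '50%', '95%']]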

## Posterior Predictive Check

In the `generated quantities` block, replicate the observed data `y` as `y_rep`. Each posterior draw produces one simulated score differential per match.

```
generated quantities {
  // posterior predictive check
  // replicate outcome based on the current estimate of our parameters
  vector[N] y_rep;
  for (n in 1:N) {
    y_rep[n] = normal_rng(ability[team_1[n]] - ability[team_2[n]], sigma_y);
  }
}
```
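To make the replication step concrete, here is a rough NumPy sketch of what the block above does for every posterior draw. The draws below are randomly generated stand-ins, not output from the fitted model, and the three-team setup is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy "posterior draws" (hypothetical values, NOT from the fitted model).
n_draws, n_teams = 1000, 3
ability_draws = rng.normal(loc=[1.0, 0.0, -1.0], scale=0.3, size=(n_draws, n_teams))
sigma_y_draws = np.abs(rng.normal(loc=1.0, scale=0.1, size=n_draws))

# Three toy matches, as 0-based team indices (Stan itself is 1-based).
team_1 = np.array([0, 1, 0])
team_2 = np.array([1, 2, 2])

# Expected differential per draw and match, shape (n_draws, N).
mu = ability_draws[:, team_1] - ability_draws[:, team_2]

# One replicated outcome per draw and match, mirroring normal_rng(mu, sigma_y).
y_rep = rng.normal(loc=mu, scale=sigma_y_draws[:, None])

# Per-match 5%, 50% and 95% quantiles across draws.
lower, median, upper = np.percentile(y_rep, [5, 50, 95], axis=0)
print(np.column_stack([lower, median, upper]).round(2))
```

Because each replicate uses a different draw of `ability` and `sigma_y`, the spread of `y_rep` reflects both parameter uncertainty and match-to-match noise.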


In [None]:
yrep_filter = [param for param in rownames if param.startswith('y_rep')]
yreps = wwc_summary.loc[yrep_filter]
names = yreps.index.tolist()

# Select the interval columns by name rather than by position
yreps[['5%', '50%', '95%']]

##### Plot per-match replicates, showing the 5% to 95% credible interval, median (black), and actual score differential (red)


In [None]:
# custom plotting, thanks to PyLadies crew!
# find credible intervals
import matplotlib.pyplot as plt
from coefplot import coefficient_plot
# posterior median of each replicated differential, labelled by match
yrep_ci = pd.DataFrame({'midway': yreps['50%'].values,
                        'names': matches['match_list']})


In [None]:
yrep_ci.loc[:, 'left'] = yreps['5%'].values
yrep_ci.loc[:, 'right'] = yreps['95%'].values
yrep_ci

In [None]:
# observed score differential per match (team 1 minus team 2)
ys = (matches['score_1'] - matches['score_2']).to_numpy()
ys

In [None]:
# 'midway' holds the 50% quantile, so the central marker is the median
coefficient_plot(yrep_ci['midway'], yrep_ci['left'],
                 yrep_ci['right'], ys,
                 names=yrep_ci['names'],
                 title='Match Score Differentials, 5%-95% CI, black = median, red = actual',
                 fig_size=(8, 12))
plt.tight_layout()

_Note: South Korea's SPI rank was 13 out of 24 teams, so it should have done better._
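
The plot can also be summarized numerically by counting how many observed differentials fall inside their 5% to 95% interval. The sketch below assumes arrays shaped like `yrep_ci['left']`, `yrep_ci['right']` and `ys`; the numbers are hypothetical placeholders, not results from this model.

```python
import numpy as np

# Hypothetical interval bounds and observed differentials for five matches.
left = np.array([-1.5, -0.5, -3.0, 0.2, -2.0])
right = np.array([2.5, 3.5, 1.0, 4.0, 2.0])
observed = np.array([1, 4, -1, 3, -5])

# True where the observed differential lies inside the interval.
inside = (observed >= left) & (observed <= right)

# Share of matches covered; a well-calibrated 90% interval should cover
# roughly 90% of matches when there are many of them.
coverage = inside.mean()

# Indices of matches the model did not anticipate.
misses = np.flatnonzero(~inside)

print(f'coverage: {coverage:.2f}, missed matches: {misses.tolist()}')
```

Matches listed in `misses` are the ones worth a closer look, since they are where the fitted abilities disagree most with what happened on the pitch.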