### Introduction

The goal is to estimate team ability from the results of the games played at the 2014 World Cup.

The data stops before the final was played, so the model can also be used to predict the eventual winner.

### Background 

Gelman's model for the 2014 football World Cup was written up as two posts. [Attempt 1](https://statmodeling.stat.columbia.edu/2014/07/13/stan-analyzes-world-cup-data/) built several models to estimate the difference in score between two teams at the World Cup (well, actually a square-root transform of the score difference, but we're going to ignore that as it was abandoned). This model had a problem: its predictions for the very games it was fit to matched the observed results poorly. In particular, when the fitted model was used to compute 95% intervals for the score differences, more than 5% of the actual score differences fell outside those intervals. This was due to a bug in Gelman's implementation, which he fixed and wrote about in [attempt 2](https://statmodeling.stat.columbia.edu/2014/07/15/stan-world-cup-update/).
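The interval check described above can be sketched in a few lines. This is a minimal illustration and not Gelman's code: the function name `interval_coverage` and the toy data are hypothetical, and in practice the replicated score differences would come from the fitted model's predictive draws.

```python
import random

def interval_coverage(observed, replicated, level=0.95):
    """Fraction of observed values that fall inside the central `level`
    interval of their own replicated (simulated) values."""
    tail = (1.0 - level) / 2.0
    inside = 0
    for y, draws in zip(observed, replicated):
        ordered = sorted(draws)
        last = len(ordered) - 1
        lower = ordered[int(tail * last)]                  # approx. lower quantile
        upper = ordered[int(round((1.0 - tail) * last))]   # approx. upper quantile
        if lower <= y <= upper:
            inside += 1
    return inside / len(observed)

# Toy data (hypothetical): observations and replicates are drawn from the
# same distribution, so coverage should land near the nominal 95%.
rng = random.Random(0)
observed = [rng.gauss(0.0, 1.0) for _ in range(200)]
replicated = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in observed]
coverage = interval_coverage(observed, replicated)
print(f"coverage of 95% intervals: {coverage:.2f}")
```

A well-calibrated model should give a value close to 0.95; a value well below that is the symptom seen in attempt 1.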

Gelman also has a nice [video](https://www.youtube.com/watch?v=T1gYvX5c2sM) where he introduces this model and problem.

### Data

The data consists of the scores for all games from the 2014 World Cup up to, but excluding, the final and the third-place play-off.

Gelman also uses a version of the [Soccer Power Index](https://projects.fivethirtyeight.com/global-club-soccer-rankings/) for international teams, which the model treats as a rank ordering rather than as a score.
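One way a rank ordering could be turned into a numeric predictor is sketched below. This is an assumption for illustration, not something stated in this write-up: the helper `prior_score_from_ranks` is hypothetical, as is the choice to centre the reversed ranks and divide by two standard deviations.

```python
from statistics import mean, stdev

def prior_score_from_ranks(ranks):
    """Map ranks (1 = strongest) to centred, scaled scores where a
    higher score means a stronger team. The scaling is an assumption."""
    n = len(ranks)
    reversed_ranks = [n + 1 - r for r in ranks]  # strongest team gets the largest value
    centre = mean(reversed_ranks)
    spread = stdev(reversed_ranks)
    return [(v - centre) / (2 * spread) for v in reversed_ranks]

# Four hypothetical teams ranked 1 (best) to 4 (worst).
scores = prior_score_from_ranks([1, 2, 3, 4])
print(scores)
```

Centring the predictor means a team with an average ranking gets a prior score of zero, which makes the coefficient on the score easier to read.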

### The model

The score difference $y_{ij}$ when team $i$ plays team $j$ is modelled as:

$$y_{ij} \sim t_{\nu}\, (a_i - a_j, \sigma_y)$$

which is a Student-t distribution with $\nu = 7$ degrees of freedom, location $a_i - a_j$ and shared scale $\sigma_y$. The Student-t is chosen because its heavier tails accommodate extreme results better than, say, a normal distribution would (though the value of the degrees of freedom parameter is set somewhat arbitrarily).
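To make the heavier tails concrete, the sketch below compares the Student-t log density with $\nu = 7$ against a normal log density with the same location and scale. The location and scale values are made up for illustration, and the function names are hypothetical.

```python
import math

def student_t_logpdf(y, nu, loc, scale):
    """Log density of a location-scale Student-t distribution."""
    z = (y - loc) / scale
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(scale)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def normal_logpdf(y, loc, scale):
    """Log density of a normal distribution."""
    z = (y - loc) / scale
    return -0.5 * math.log(2 * math.pi) - math.log(scale) - 0.5 * z * z

# Hypothetical values: expected score difference 0.5, scale 1.0.
loc, scale = 0.5, 1.0
for y in (0.5, 2.5, 4.5):  # 0, 2 and 4 scale units from the location
    t_lp = student_t_logpdf(y, 7, loc, scale)
    n_lp = normal_logpdf(y, loc, scale)
    print(f"y={y}: t7 logpdf={t_lp:.3f}, normal logpdf={n_lp:.3f}")
```

Far from the location, the Student-t assigns noticeably more density than the normal, so a lopsided result pulls the ability estimates around less.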

The skill-level parameter $a_i$ of the $i$th team is modelled as:

$$a_i \sim \mathcal{N}(\mu + b \, s_i, \sigma_a)$$

where $s_i$ is the Soccer Power Index for the $i$th team and is used as prior knowledge about the abilities of the teams. This is thus a hierarchical model, with each $a_i$ coming from a population-level ability distribution.

However, we only care about the *relative* abilities of the teams. Here $\mu$ is the average ability across all teams, and the score model only ever compares $a_i$ to $a_j$ for some $i$ and $j$. The $\mu$s cancel in every such difference, so we can simply set $\mu$ to 0.

We therefore think of each $a_i$ as the *relative ability* of a team (relative to the teams in the dataset), which is modelled as:

$$a_i \sim \mathcal{N}(b \, s_i, \sigma_a)$$
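The cancellation of $\mu$ can be written out explicitly. The standard-normal noise term $\varepsilon_i$ below is introduced purely for illustration and is not notation from the original posts:

$$
\begin{aligned}
a_i &= \mu + b\,s_i + \sigma_a\,\varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1) \\
a_i - a_j &= (\mu + b\,s_i + \sigma_a\,\varepsilon_i) - (\mu + b\,s_j + \sigma_a\,\varepsilon_j) \\
&= b\,(s_i - s_j) + \sigma_a\,(\varepsilon_i - \varepsilon_j)
\end{aligned}
$$

The location of $y_{ij}$ never depends on $\mu$, so the game results carry no information about it.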

**What do the parameters represent?**

* $b$: the weight of the prior information in the Soccer Power Index. If this index contains no information we would expect $b$ to be close to 0.
* $a_i$: the ability of each team relative to the population. If the prior modelled the data perfectly, we would expect the distribution of each $a_i$ to be centered at $b \, s_i$. Otherwise it will differ on a per-team basis, depending on how much the data pulls it away from the prior.
* $\sigma_a$: the residual error in the estimates of teams' abilities: even given the Soccer Power Index, we do not expect to be able to estimate ability perfectly.
* $\sigma_y$: the observational residual error: even given the relative abilities of the teams, we still expect errors in our score difference estimates.
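To see how the four parameters fit together, here is a minimal forward simulation of the model using only the Python standard library. It is a sketch, not the fitted model: the function names, the prior scores and all parameter values are hypothetical, and in the real analysis $b$, $\sigma_a$ and $\sigma_y$ are estimated from the data.

```python
import math
import random

def simulate_abilities(prior_scores, b, sigma_a, rng):
    """Draw a_i ~ Normal(b * s_i, sigma_a) for every team."""
    return [rng.gauss(b * s, sigma_a) for s in prior_scores]

def simulate_score_diff(a_i, a_j, sigma_y, rng, nu=7):
    """Draw y_ij ~ Student-t_nu(a_i - a_j, sigma_y).

    A standard t draw is built as Z / sqrt(chi2 / nu), where chi2 is a
    chi-squared draw with nu degrees of freedom, i.e. Gamma(nu / 2, scale 2).
    """
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    return (a_i - a_j) + sigma_y * z / math.sqrt(chi2 / nu)

rng = random.Random(42)
prior_scores = [0.6, 0.2, -0.2, -0.6]  # hypothetical s_i for four teams
abilities = simulate_abilities(prior_scores, b=1.0, sigma_a=0.3, rng=rng)
y_01 = simulate_score_diff(abilities[0], abilities[1], sigma_y=1.0, rng=rng)
print("abilities:", [round(a, 2) for a in abilities])
print("simulated score difference, team 0 vs team 1:", round(y_01, 2))
```

Setting `sigma_a` to zero makes every ability equal to $b \, s_i$, and setting `sigma_y` to zero makes every score difference equal to $a_i - a_j$, which mirrors the interpretation of the two error terms in the list above.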