# Model Extensions

## Improvements/Extensions

### 1. Time Dynamics

#### Extreme Time Dynamics

Given that the dataset starts in 2012 and we fit until late 2015, it is highly likely that some teams have changed substantially (in terms of manager, players, stadium) over such a long period.

It is likely that in late 2015 match results from 2012 are not as informative as match results from 2015.  We can account for this by weighting the match results by time.  We can assume that the weight of a match result is a function of the time difference between the match and the current time.  We can model this as follows:
\begin{equation}
    \text{weight} = \exp(-\lambda \times \text{time_diff})
\end{equation}
Where $\lambda$ is a hyperparameter that controls the rate at which the weight of a match result decays over time, and $\text{time_diff}$ is the time difference between the match and the current time.

We can include the above in the model by multiplying each match's log-likelihood contribution by its weight, so that older matches have less influence on the fitted parameters.
We can choose a small range of candidate values for $\lambda$ and use cross-validation to determine the best value.
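
As a rough illustration of the weighting described above, the sketch below computes exponential decay weights and a weighted Poisson negative log-likelihood. The function names, the use of days as the time unit and the toy numbers are all assumptions made for illustration; they are not taken from the fitted model.

```python
import math

def time_decay_weights(days_ago, lam):
    """weight = exp(-lam * time_diff), with time_diff measured in days (assumed unit)."""
    return [math.exp(-lam * d) for d in days_ago]

def poisson_log_pmf(k, rate):
    """Log-probability of observing k goals under a Poisson distribution with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def weighted_neg_log_likelihood(goals, rates, weights):
    """Sum of per-match negative log-likelihoods, each scaled by its weight."""
    return -sum(w * poisson_log_pmf(k, r) for k, r, w in zip(goals, rates, weights))

# Toy example: three matches played 0, 365 and 1095 days before "now".
weights = time_decay_weights([0, 365, 1095], lam=0.002)
nll = weighted_neg_log_likelihood([1, 2, 0], [1.4, 1.1, 1.6], weights)
```

In a full model the same weights would multiply the home and away goal terms of every match before the sum is passed to the optimiser.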

**Can we just pass lambda to scipy.optimize.minimize?**

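Including lambda as a free parameter in the minimization is problematic. Every log-likelihood term is non-positive, so the weighted log-likelihood can always be increased by shrinking the weights, which pushes the optimiser towards ever larger values of lambda rather than a meaningful estimate. In practice this may show up as a failure to converge or as slow convergence. We can instead use a grid search across a range of values and choose lambda by out-of-sample performance.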
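
A minimal sketch of such a grid search is shown below. `toy_validation_loss` is a hypothetical stand-in for "refit the model with this lambda and score the held-out matches"; the candidate values are likewise illustrative.

```python
def grid_search(candidates, validation_loss):
    """Return the candidate with the lowest out-of-sample loss, plus all scores."""
    scores = {lam: validation_loss(lam) for lam in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical stand-in for refitting the model and scoring held-out matches.
def toy_validation_loss(lam):
    return (lam - 0.002) ** 2 + 0.61

candidates = [0.0, 0.001, 0.002, 0.003, 0.004]
best_lam, scores = grid_search(candidates, toy_validation_loss)
```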

#### Team Underlying Skill Level

On a smaller timescale (i.e. within a season), the underlying skill of teams is likely to change over time. There are numerous possible reasons for this, for example transfers or a change of manager.

In this section we model only the change in the underlying skill level itself.  We will not model the individual factors that cause the skill level of teams to change.

We can assume that at time $t$ the skill level of a team is a function of its skill level at time $t-1$ and some random noise.  We can model this as follows:
\begin{equation}
    \text{skill}_t = \text{skill}_{t-1} + \epsilon
\end{equation}

Where $\epsilon \sim N(0, \sigma^2)$

We may wish to apply the above to both the attack and defense ratings of teams.
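
To make the random walk concrete, the sketch below simulates one rating path. The function name, the starting value, $\sigma = 0.05$ and the 34 steps are assumptions chosen purely for illustration.

```python
import random

def simulate_skill_path(initial_skill, sigma, n_steps, seed=0):
    """Simulate skill_t = skill_{t-1} + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    path = [initial_skill]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, sigma))
    return path

# Hypothetical attack rating evolving over an assumed 34 match weeks.
attack_path = simulate_skill_path(initial_skill=0.2, sigma=0.05, n_steps=34)
```

The same function could be called a second time to produce a defense rating path.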

##### Extension

We could extend the above by adding a parameter that controls the rate at which the skill level of teams changes over time, for example by letting $\sigma$ vary through the season.  This would allow for more drastic changes in skill level at the start of the season (e.g. due to transfers or a new manager) and less drastic changes as the season progresses.

### 2. Motivation

Towards the end of the regular season (especially in the MLS) there are likely to be differing motivations for teams. There is no relegation in the MLS, so teams that are unlikely or unable to make the playoffs may not be as motivated to win games. Conversely, teams that are on the cusp of making the playoffs may be more motivated to win games.  It is also possible that teams do not view winning the regular season as a priority (unlike in most other leagues) and are more focused on the playoffs.<br>
In practice this will lead to a team's effective skill level changing over time.

We can take current predictions of the probability of winning the regular season and of making the playoffs, and include these in the model.

### 3. Home Advantage

We can improve the modelling of home advantage.  We may want a separate home advantage parameter for each conference, used when two teams from the same conference play each other, and a further value for when teams from different conferences play each other.

i.e.
\begin{equation}
    \text{home_advantage} = \begin{cases} 
        \text{home_advantage_same_conference} & \text{if teams in same conference} \\
        \text{home_advantage_diff_conference} & \text{if teams in different conferences}
    \end{cases}
\end{equation}

where 
\begin{equation}
    \text{home_advantage_same_conference} = \begin{cases} 
        \text{east_home_advantage} & \text{if both in eastern conference} \\
        \text{west_home_advantage} & \text{if both in western conference}
    \end{cases}
\end{equation}
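
The case distinction above can be sketched as a small lookup. The dictionary keys and the parameter values are hypothetical and only illustrate the structure.

```python
def home_advantage(home_conf, away_conf, params):
    """Pick the home-advantage parameter from the two teams' conferences."""
    if home_conf != away_conf:
        return params["diff_conference"]
    return params["east"] if home_conf == "east" else params["west"]

# Hypothetical parameter values on the log-goal scale.
params = {"east": 0.30, "west": 0.35, "diff_conference": 0.40}
same_east = home_advantage("east", "east", params)
cross = home_advantage("west", "east", params)
```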

We may wish to expand further upon the above.  Home advantage may also differ due to:
- Altitude of stadium (Colorado Rapids play at 5,000+ feet)
- Weather conditions players are used to playing in
- Distance of travel (**It is likely that the analysis from question 2 is a proxy for differences in travel distance**)
- Time spent in stadium (i.e. familiarity with stadium and pitch dimensions)

#### Hierarchical Model

We could use a simple Hierarchical Model as a starting point for the home advantage model.  We could assume that the home advantage for each team is drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$.
\begin{equation}
    \text{home_advantage}_i \sim N(\mu, \sigma)
\end{equation}
Where $\mu$ and $\sigma$ are hyperparameters.  We could then model $\mu$ and $\sigma$ as follows:
\begin{equation}
    \mu \sim N(\mu_0, \sigma_0)
\end{equation}
\begin{equation}
    \sigma \sim \text{HalfCauchy}(\beta)
\end{equation}

We could expand upon the above by adding levels to the hierarchical model, i.e. we could have a separate $\mu$ for each conference, with each team's home advantage drawn around the $\mu$ of its conference.  We could also add additional levels to the model to account for the other factors that may impact home advantage.
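
To show what the simple two-level prior implies, the sketch below draws one set of team home advantages from it. The hyperparameter values and the number of teams are hypothetical, and the half-Cauchy draw uses the inverse CDF rather than any particular library.

```python
import math
import random

def sample_home_advantages(n_teams, mu0, sigma0, beta, seed=0):
    """Draw mu ~ N(mu0, sigma0), sigma ~ HalfCauchy(beta), then one value per team."""
    rng = random.Random(seed)
    mu = rng.gauss(mu0, sigma0)
    # Half-Cauchy draw via the inverse CDF: beta * tan(pi * u / 2), u ~ U(0, 1).
    sigma = beta * math.tan(math.pi * rng.random() / 2)
    effects = [rng.gauss(mu, sigma) for _ in range(n_teams)]
    return mu, sigma, effects

# Hypothetical hyperparameters on the log-goal scale.
mu, sigma, effects = sample_home_advantages(n_teams=20, mu0=0.3, sigma0=0.1, beta=0.1)
```

Fitting such a model would normally be done with a probabilistic programming library; this sketch only simulates from the prior.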


### 4. Underlying Model

In the basic model we assumed that home and away goals were independent. In practice the two teams' goals are inherently correlated, and psychological factors can play a role in the outcome of games. <br>
Most teams will be employing the "strategy" of winning games. However, teams may sometimes be more focused on not losing (due to significant differences in skill, conventional wisdom on the best result for a game etc). This can lead to different correlations between home and away goals depending on the fixture, and also on the current score (game-state) of the game. <br>
Expanding upon the above, once a team goes ahead or behind in a game there will likely be changes (both conscious and subconscious) in the way the team plays.  This can lead to differing correlations between home and away goals depending on the current score of the game. <br>
Conscious changes might include:
- Substitutions
- Formation changes
- Tactical changes
- Time wasting
- Pressing
- etc.

Subconscious changes might include:
- Players taking fewer risks
- Players being more cautious
- Players being more aggressive
- Players feeling more or less pressure

#### Possible Model Improvements

##### Remove Constant Baseline Goal Expectation

In the previous model we used the $\gamma$ parameter to model the baseline goal expectation.  In practice, we would not expect the baseline goal expectation to be constant, since different skill levels and tactical approaches will lead to different baseline goal expectations. The model is simplified below:

\begin{equation}
    \lambda_{i, j} = \exp(\text{attack}_i + \text{defense}_j + \text{home_advantage})
\end{equation}

Where $\text{attack}_i$ is the attack rating of team i, $\text{defense}_j$ is the defense rating of team j, and $\text{home_advantage}$ is the home advantage.

We still include sum-to-zero constraints on the attack and defense ratings.

\begin{equation}
    \sum_{i=1}^{n} \text{attack}_i = 0
\end{equation}
\begin{equation}
    \sum_{j=1}^{n} \text{defense}_j = 0
\end{equation}
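
One common way to impose a sum-to-zero constraint in an optimiser is to fit $n-1$ free ratings and derive the last one. The sketch below shows this together with the simplified rate formula; the helper names and the toy ratings are assumptions for illustration.

```python
import math

def expand_sum_to_zero(free_params):
    """Append the final rating so that the full vector sums to zero."""
    return list(free_params) + [-sum(free_params)]

def goal_rate(attack_i, defense_j, home_advantage=0.0):
    """lambda_{i,j} = exp(attack_i + defense_j + home_advantage)."""
    return math.exp(attack_i + defense_j + home_advantage)

# Toy league of 4 teams: 3 free parameters per rating type.
attack = expand_sum_to_zero([0.2, -0.1, 0.05])
defense = expand_sum_to_zero([-0.15, 0.1, 0.0])
home_rate = goal_rate(attack[0], defense[1], home_advantage=0.3)
```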


We can also introduce models that allow for correlation between home and away goals.  Some possible options are:

##### Dixon Coles Model

It is likely the previous model will underestimate low score lines and overestimate high score lines.  This is likely due to some of the previously listed reasons.

The Dixon Coles model introduces a parameter $\rho$ that "corrects" the model for this by adjusting the probabilities of the four lowest score lines (0-0, 1-0, 0-1 and 1-1), acting as a dependence parameter between home and away goals.
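
The correction can be sketched as below, using the adjustment factor as it is commonly stated for the Dixon Coles model. Here `lam` and `mu` denote the home and away goal rates; the function names and the example values are assumptions for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Poisson probability of k goals given a rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Adjustment factor: differs from 1 only for 0-0, 0-1, 1-0 and 1-1."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def dc_score_prob(x, y, lam, mu, rho):
    """Probability of a home x, away y score line under the adjusted model."""
    return dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)

# Hypothetical rates; a negative rho raises the 0-0 and 1-1 probabilities.
p_nil_nil = dc_score_prob(0, 0, lam=1.4, mu=1.1, rho=-0.1)
```

The four adjustments cancel out, so the adjusted probabilities still sum to one.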

##### Bivariate Poisson

We can use a Bivariate Poisson distribution to model the number of goals scored by each team.  The Bivariate Poisson distribution is a generalization of the Poisson distribution to two dimensions.  In its standard form it has three parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$: we write $X = A + C$ and $Y = B + C$, where $A$, $B$ and $C$ are independent Poisson variables with means $\lambda_1$, $\lambda_2$ and $\lambda_3$.  The marginal means are then $\lambda_1 + \lambda_3$ and $\lambda_2 + \lambda_3$, and $\lambda_3$ is the covariance between $X$ and $Y$.  The probability mass function of the Bivariate Poisson distribution is given by:
\begin{equation}
    P(X = x, Y = y) = \frac{e^{-(\lambda_1 + \lambda_2 + \lambda_3)}(\lambda_1^{x} \lambda_2^{y})}{x!y!} \sum_{k=0}^{\min(x, y)} \frac{x!y!}{k!(x-k)!(y-k)!} \left(\frac{\lambda_3}{\lambda_1 \lambda_2}\right)^k
\end{equation}

Where $X$ and $Y$ are the numbers of goals scored by the home and away teams, and $x$ and $y$ are their observed values.  The correlation between $X$ and $Y$ is $\rho = \lambda_3 / \sqrt{(\lambda_1 + \lambda_3)(\lambda_2 + \lambda_3)}$.
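
As a sketch, the function below evaluates the probability mass function in its standard trivariate-reduction form, where $X = A + C$ and $Y = B + C$ for independent Poisson components. The shared rate `lam3`, the function name and the example rates are this sketch's own notation and assumptions.

```python
import math

def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X=x, Y=y) for X = A + C, Y = B + C with independent Poisson A, B, C."""
    total = 0.0
    for k in range(min(x, y) + 1):
        # k goals come from the shared component C, the rest from A and B.
        total += (lam1 ** (x - k) / math.factorial(x - k)
                  * lam2 ** (y - k) / math.factorial(y - k)
                  * lam3 ** k / math.factorial(k))
    return math.exp(-(lam1 + lam2 + lam3)) * total

# Hypothetical rates: marginal means are lam1 + lam3 and lam2 + lam3.
p_one_one = bivariate_poisson_pmf(1, 1, lam1=1.2, lam2=0.9, lam3=0.2)
```

Setting `lam3` to zero recovers the product of two independent Poisson probabilities.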

Advantages:
- Captures dependency between home and away goals
- Can easily be implemented

Disadvantages:
- Assumes that the number of goals scored by each team is Poisson distributed
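- Only allows for non-negative dependency between home and away goals: in the standard construction the covariance is the mean of a shared Poisson component, so it cannot be negative.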

## Additional Data
- Manager information
- Lineup information for each game
- Stadium location and altitude
- Weather conditions for each game
- Travel information for each game (i.e. distance travelled, time spent travelling)
- Cup/international fixtures (for fatigue, travelling, rotation etc)
- Event Data/Tracking Data (we can improve upon using observed goals to fit ratings models, e.g. with an xG model)
    - Even goal information (i.e. penalty/non penalty) could improve the model
- Referee information (Penalty tendencies, card tendencies etc)
- League rules or rules changes
    - (i.e. introduction of VAR for red card, penalty decisions etc)
    - (i.e. change in number of substitutes allowed)
    - (i.e. drastic changes in stoppage time played)
- Market information (betting market information)

## Given Improvement

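We now use the seasonal pattern detected in total goals scored.  We modify the model to allow for this by extending from one gamma parameter to 12 gamma parameters, one for each month of the year, i.e.,
$$ X_k \sim \text{Poisson}(\lambda_k) $$
$$ Y_k \sim \text{Poisson}(\mu_k) $$

where

$$ \ln \lambda_k = \alpha_{i(k)} + \beta_{j(k)} + \gamma_{m(k)} + \frac{\eta}{2} $$
$$ \ln \mu_k = \alpha_{j(k)} + \beta_{i(k)} + \gamma_{m(k)} - \frac{\eta}{2} $$

Here $i(k)$ and $j(k)$ are the home and away teams in match $k$, $m(k)$ is the month in which match $k$ is played, and $\eta$ is the home advantage, which enters with opposite signs so that it can be identified separately from $\gamma_{m(k)}$.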
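
A minimal sketch of how the two rates could be computed for a single match is given below. It is not the `ExtendedModel` implementation: the function name, the dictionary layout and the parameter values are hypothetical, and the away rate uses $-\eta/2$, the usual convention that keeps $\eta$ identifiable separately from the monthly baseline.

```python
import math

def match_rates(alpha, beta, gamma_by_month, eta, home, away, month):
    """Return (lambda_k, mu_k) for one match played in the given month (1-12)."""
    gamma_m = gamma_by_month[month - 1]
    log_lambda = alpha[home] + beta[away] + gamma_m + eta / 2
    log_mu = alpha[away] + beta[home] + gamma_m - eta / 2
    return math.exp(log_lambda), math.exp(log_mu)

# Hypothetical two-team example with a higher baseline in July.
alpha = {"A": 0.1, "B": -0.1}
beta = {"A": -0.05, "B": 0.05}
gamma_by_month = [0.0] * 12
gamma_by_month[6] = 0.1
lam_k, mu_k = match_rates(alpha, beta, gamma_by_month, eta=0.3,
                          home="A", away="B", month=7)
```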



In [1]:
from smartodds.data_fetching.prepare_data import ExtendedModel, SimpleModel
from tabulate import tabulate

# Fit the extended model, add its predictions and evaluate them
extended_model = ExtendedModel('merged_data.csv', 'data_mls_simset_predictions.csv')
extended_model.fit()
extended_model.add_predictions()
extended_eval = extended_model.evaluate()
# Evaluate the supplied comparison predictions
comparison_eval = extended_model.evaluate_comparison_data()

Removed 0 matches from the test set.


In [2]:
res = extended_model.result
print(f"Minimisation outcome: {res.message}, is successful: {res.success}, "
      f"log likelihood: {-res.fun}, number of iterations: {res.nit}")

Minimisation outcome: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH, is successful: True, log likelihood: -3923.0644668025875, number of iterations: 26


## Model Comparison

In [3]:
simple_model = SimpleModel('merged_data.csv', 'data_mls_simset_predictions.csv')
simple_model.fit()
simple_model.add_predictions()
simple_eval = simple_model.evaluate()

Removed 0 matches from the test set.


### Fit Comparison

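We cannot just compare the two likelihood values of the models since the models have different numbers of parameters.  To assess fit we could use criteria that penalise the number of parameters (for example AIC or BIC), or a likelihood-ratio test, since the simple model is the extended model with all monthly gamma parameters set equal, **BUT** we will prioritize evaluation of the models vs unseen data.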
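
As a sketch of how fit could be compared while accounting for the number of parameters, the helpers below compute AIC, BIC and a likelihood-ratio statistic. The function names are assumptions, and the numbers in the usage lines are toy values, not results from the fitted models.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def likelihood_ratio_statistic(loglik_small, loglik_big):
    """2 * (l_big - l_small), to be compared with a chi-square distribution
    whose degrees of freedom equal the number of extra parameters."""
    return 2 * (loglik_big - loglik_small)

# Toy values only, chosen for illustration.
toy_aic = aic(log_likelihood=-100.0, n_params=3)
toy_lr = likelihood_ratio_statistic(loglik_small=-100.0, loglik_big=-98.0)
```

Going from one gamma parameter to 12 adds 11 parameters, so the likelihood-ratio statistic for these two models would be compared with a chi-square distribution with 11 degrees of freedom.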


### Evaluation vs Unseen Data

The approach used below is unlikely to mirror a production model, where the model would be refit whenever new data is received.  We will compare the two models by taking each fitted model and using it to predict the outcomes of the test set.  We will then compare the predicted outcomes to the actual outcomes, using a set of evaluation metrics.  We will use the following metrics:
- Log Loss: This is a common metric used in classification problems, it is the mean negative log likelihood of the observed outcomes under the predicted probabilities.  It is a measure of how well the model predicts the actual outcomes.  The lower the log loss the better the model.
- Brier Score: This is a common metric used in classification problems, it is the mean squared difference between the predicted probabilities and the actual outcomes.  The lower the Brier Score the better the model.
- Bias: This is the mean difference between the predicted probabilities and the actual outcomes.  The closer the bias is to zero the better the model.  Because the three outcome probabilities sum to one, the home, draw and away biases sum to zero, so the mean bias is zero by construction.
- RMSE: This is the root mean squared error between predicted and observed values.  We use the RMSE to compare to observed event counts, such as total goals.  The lower the RMSE the better the model.

We use probabilistic scores for evaluating the predicted outcomes, and we will use RMSE to evaluate the models against 'expected goals'. If we had probabilistic predictions for goals markets we could use the same metrics as above to evaluate the models against the goals markets.
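
The metrics listed above can be sketched for a single binary outcome (for example "home win") as follows. These are not the `evaluate` implementation: the function names are assumptions, and the bias is taken as predicted minus actual, which is an assumed sign convention.

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = event happened)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def bias(probs, outcomes):
    """Mean of (predicted probability - outcome); assumed sign convention."""
    return sum(p - y for p, y in zip(probs, outcomes)) / len(probs)

def rmse(predicted, observed):
    """Root mean squared error, e.g. predicted vs observed total goals."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

# Toy example: two matches, each given a 50% home-win probability.
toy_probs, toy_outcomes = [0.5, 0.5], [1, 0]
toy_log_loss = log_loss(toy_probs, toy_outcomes)
```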

In [8]:
print("Simple Model:")
simple_eval

Simple Model:


{'home_log_loss': np.float64(0.6979880473806059),
 'away_log_loss': np.float64(0.5083130615917607),
 'draw_log_loss': np.float64(0.6357780326785211),
 'mean_log_loss': np.float64(0.6140263805502958),
 'home_brier_score': np.float64(0.2523681504106389),
 'away_brier_score': np.float64(0.16299369352002988),
 'draw_brier_score': np.float64(0.22062311998451123),
 'mean_brier_score': np.float64(0.21199498797172667),
 'home_bias': np.float64(-0.0008390367891279646),
 'away_bias': np.float64(0.07170518515265985),
 'draw_bias': np.float64(-0.07086614836353189),
 'mean_bias': np.float64(0.0),
 'home_rmse': np.float64(0.5023625686798718),
 'away_rmse': np.float64(0.40372477446898136),
 'draw_rmse': np.float64(0.4697053544345766),
 'mean_rmse': np.float64(0.45859756586114325),
 'total_goals_rmse': np.float64(1.702947437469505)}

In [10]:
print("Extended Model:")
extended_eval

Extended Model:


{'home_log_loss': np.float64(0.6992760108210303),
 'away_log_loss': np.float64(0.5079846666185494),
 'draw_log_loss': np.float64(0.6382042229896407),
 'mean_log_loss': np.float64(0.6151549668097401),
 'home_brier_score': np.float64(0.2529960835743615),
 'away_brier_score': np.float64(0.16289828994945452),
 'draw_brier_score': np.float64(0.2214445959223487),
 'mean_brier_score': np.float64(0.21244632314872156),
 'home_bias': np.float64(-0.0007311687939184144),
 'away_bias': np.float64(0.07151714506620224),
 'draw_bias': np.float64(-0.07078597627228382),
 'mean_bias': np.float64(4.625929269271485e-18),
 'home_rmse': np.float64(0.5029871604468265),
 'away_rmse': np.float64(0.40360660295571793),
 'draw_rmse': np.float64(0.4705790007239472),
 'mean_rmse': np.float64(0.4590575880421639),
 'total_goals_rmse': np.float64(1.7112895588062458)}

In [11]:
print("Comparison Model:")
comparison_eval

Comparison Model:


{'home_log_loss': np.float64(0.6810377829938471),
 'away_log_loss': np.float64(0.479863620387127),
 'draw_log_loss': np.float64(0.6257773180681283),
 'mean_log_loss': np.float64(0.5955595738163675),
 'home_brier_score': np.float64(0.24401575054001173),
 'away_brier_score': np.float64(0.15115041887392827),
 'draw_brier_score': np.float64(0.2165265939016754),
 'mean_brier_score': np.float64(0.2038975877718718),
 'home_bias': np.float64(-0.001179942851744113),
 'away_bias': np.float64(0.06428779474892794),
 'draw_bias': np.float64(-0.06310785189718383),
 'mean_bias': np.float64(0.0),
 'home_rmse': np.float64(0.4939795041699723),
 'away_rmse': np.float64(0.3887806822283333),
 'draw_rmse': np.float64(0.46532418151400146),
 'mean_rmse': np.float64(0.44936145597076904),
 'total_goals_rmse': np.float64(1.699939464112186)}

#### Caveats
- We cannot simply compare the models' values for, say, mean log loss and conclude that one model is better than the other. We may choose to use a bootstrap approach to determine whether the difference in the evaluation metrics is statistically significant.
- We may encounter cases where one model performs better for one purpose (e.g. predicting match results) and the other model performs better for another purpose (e.g. predicting total goals).  We may need to choose which model to use based on its intended purpose.
- If the end use of the model is to bet on the outcomes of games, we may wish to evaluate against available implied probabilities. However, care needs to be taken since markets are always evolving and market accuracy may be improving over time.
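
One possible form of the bootstrap mentioned above is a paired bootstrap over matches, sketched below. The function name and the per-match loss values are hypothetical; in practice the inputs would be each model's per-match losses on the same test set.

```python
import random

def bootstrap_mean_diff(losses_a, losses_b, n_resamples=2000, seed=0):
    """Percentile interval for mean(losses_a - losses_b), resampling matches."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample matches with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical per-match log losses for two models on the same four matches.
toy_a = [0.7, 0.5, 0.9, 0.6]
toy_b = [0.6, 0.4, 0.8, 0.5]
mean_diff, lower, upper = bootstrap_mean_diff(toy_a, toy_b, n_resamples=200)
```

If the resulting interval excludes zero, that suggests the difference between the two models is unlikely to be due to the particular matches in the test set.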
