## Week 4 Lecture 2 - Model comparison

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/statrethinking_winter2019

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

An excellent port to Python/PyMC Code is available here: https://github.com/dustinstansbury/statistical-rethinking-2023

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pymc as pm
from matplotlib import pyplot as plt


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)


def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

## From information theory to model comparison

The advances of Claude Shannon are huge, on par with those of Alan Turing or, some would argue, even Einstein, in that they came out of nowhere and were singular in their clear contribution to the modern world. A great article outlining his remarkable life can be found [here](https://spectrum.ieee.org/tech-history/cyberspace/claude-shannon-tinkerer-prankster-and-father-of-information-theory). You can grab the original paper [here](https://dl.acm.org/doi/pdf/10.1145/584091.584093?casa_token=0oyegsiCAF0AAAAA:xgmhPoVWitChIaYrP6ez36alK6jQpUCIBrScdLLSEk9brBveBYGl0B-pB8MByPz-DZSqM10Kw-_K). There is also a new documentary film: https://thebitplayer.com/

![](shannon.jpg)


However, the path from information theory (how to encode things and send them over a wire) to model comparison is a difficult one, so let's work carefully through how these things relate.

# Information

Information is a slippery concept to define, but here we'll look at Shannon's definition and see how it plays out. First, the big insight is to ask *how much is our uncertainty reduced once we learn an outcome*. To figure this out we need to unpack this statement a bit:

1. By *outcome* we mean some measurable thing that occurs as a result of some phenomenon

2. By *uncertainty* we mean here the range of guesses or possibilities as to what the next outcome will be when it appears

3. By *how much* we are asking for a measure or metric that quantifies a net change in uncertainty before and after an outcome is observed.

Confused? Ok, well let's start with item `3`, which is Shannon's measure of *information entropy*, which states that **the uncertainty (H) contained in a probability distribution is the negative of the average log-probability (p) of an event,** which can be expressed as

$$
H(p) = -\sum^{n}_{i=1}p_i log(p_i).
$$

In [None]:
# Information entropy function
def IE(p):
    return -sum(p*np.log(p))

So to make this concrete, let's say we have a model for the weather, where the true probabilities are 0.3 for rain ($p_1$) and 0.7 for sun ($p_2$). The total entropy (or uncertainty) in this situation is:

In [None]:
# True probabilities of rain and sun
p = np.array([0.3, 0.7])

IE(p)

We can imagine, however, a place where it mostly rains (Glasgow):

In [None]:
# True probabilities of rain and sun
p = np.array([0.9, 0.1])

IE(p)

Or where it is mostly sunny (LA):

In [None]:
# True probabilities of rain and sun
p = np.array([0.1, 0.9])

IE(p)

Which leads to the question of what kind of place has the greatest uncertainty?

In [None]:
# True probabilities of rain and sun
p = np.array([0.5, 0.5])

IE(p)

This makes total sense: if it mostly rains or is mostly sunny, there is less uncertainty; if it is 50/50, then who knows? Probability is amazing.
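To check that 50/50 really is the worst case, one option is to sweep the probability of rain from near 0 to near 1 and see where the entropy peaks. This is a minimal sketch: the `entropy` helper and the `p_rain` grid are illustrative names, not part of the lecture code, and the helper assumes the usual convention that outcomes with zero probability contribute nothing to the sum.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes are skipped."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))


# Grid of rain probabilities from 0.01 to 0.99 in steps of 0.01
p_rain = np.linspace(0.01, 0.99, 99)

# Entropy of the two-outcome (rain, sun) distribution at each grid point
H = np.array([entropy([pr, 1 - pr]) for pr in p_rain])

# Probability of rain at which uncertainty is greatest
p_max = p_rain[np.argmax(H)]
print(p_max, H.max())
```

The peak should land at a rain probability of 0.5, where the entropy equals $log(2)$, or about 0.69.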

# Divergence

Now that we can measure the level of uncertainty inherent in a known probability distribution, the question is how we can use this to measure how far a model we might propose is from this truth (don't worry, we'll address the fact that we don't know the truth shortly). We can express the distance (in uncertainty units) between our model (q) and the true model (p) as *divergence*: the average difference in log-probability between the truth and the model, weighted by the true probabilities

$$
D_{KL}(p,q) = \sum^{n}_{i=1}p_i(log(p_i)-log(q_i)).
$$


In [None]:
# Kullback-Leibler divergence
def KLD(p,q):
    return sum(p*(np.log(p)-np.log(q)))

This measure is called the Kullback-Leibler divergence after [Solomon Kullback](https://en.wikipedia.org/wiki/Solomon_Kullback) and Richard Leibler, two cryptanalysts at the US National Security Agency who developed the measure in 1951. With the true model and our proposed models in hand, we can calculate the KL divergence for each model:

In [None]:
# True probabilities of rain and sun
p = np.array([0.3, 0.7])

# Our ignorant guess at the true probabilities of rain and sun
q1 = np.array([0.5, 0.5])

# KL divergence
KLD(p,q1)

In [None]:
# Our best guess at the true probabilities of rain and sun
q2 = np.array([0.2, 0.8])

# KL divergence
KLD(p,q2)

So if we propose a model that's closer to the truth than a coin flip, the KL divergence gets smaller. A key nuance in this calculation is that divergence is not symmetric (this is the Mars-Earth example in the book). If we reverse the true and proposed models:

In [None]:
KLD(q2,p)

You can see that the level of surprise is lower. Why? Because $p=[0.3, 0.7]$ is the less-certain (higher entropy) set of conditions, like Earth in the book's example, and $q2=[0.2, 0.8]$ is the more-certain one, like Mars. With 0.2/0.8 as the 'true' model, using the less-certain $p$ to predict it gives less 'surprise' ($D_{KL}=0.026$) than using the more-certain $q2$ to predict a truth of 0.3/0.7 ($D_{KL}=0.028$). A model that expects a wider spread of outcomes is less surprised by what it sees. Nutty.
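One way to see where the asymmetry comes from is to write the divergence as the *cross-entropy* (the uncertainty we experience when using $q$ to predict events that actually follow $p$) minus the entropy of $p$ itself. The sketch below checks this numerically; `entropy`, `cross_entropy` and `kl` are illustrative helper names, not functions from the lecture code.

```python
import numpy as np


def entropy(p):
    """Uncertainty in the distribution p."""
    return -np.sum(p * np.log(p))


def cross_entropy(p, q):
    """Uncertainty from using q to predict events that follow p."""
    return -np.sum(p * np.log(q))


def kl(p, q):
    """Extra uncertainty from using q when the truth is p."""
    return np.sum(p * (np.log(p) - np.log(q)))


p = np.array([0.3, 0.7])
q2 = np.array([0.2, 0.8])

# Divergence is cross-entropy minus entropy, in either direction
forward = cross_entropy(p, q2) - entropy(p)    # truth p, model q2
reverse = cross_entropy(q2, p) - entropy(q2)   # truth q2, model p

print(round(forward, 3), round(reverse, 3))
```

Swapping the roles of the two distributions changes both the weights in the sum and the entropy being subtracted, so there is no reason for the two directions to agree.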

# Relative log-probability

Ok, we now have a measure of the information distance between our model and the truth. Well big deal - we'll never know the truth ($p$), so what use is this? While we'll never know the truth, we can still compare the KL divergence of a bunch of models. How? When comparing two models the truth is the same for both, so its contribution drops out of the difference. To see this, we can just sub in something arbitrary ($x_i$) for the $p_i$ values

$$
D_{KL}(x,q) = \sum^{n}_{i=1}x_i(log(x_i)-log(q_i)).
$$

Let's give this a try with our $q1$ and $q2$ models:

In [None]:
x = 1
(KLD(x,q1)-KLD(x,q2))/x

In [None]:
x = 2
(KLD(x,q1)-KLD(x,q2))/x

In [None]:
x = 6
(KLD(x,q1)-KLD(x,q2))/x

In [None]:
# Absolute difference in log-probabilities
abs(sum(np.log(q1))-sum(np.log(q2)))

In [None]:
# Absolute difference in log-probabilities (models in the other order)
abs(sum(np.log(q2))-sum(np.log(q1)))

So it really doesn't matter what the truth is, we can still compare the relative KL divergence of two models through their log-probabilities. So the log-probability score, or some variant of it, is the basis of *information criteria* used to compare model fits.
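The cancellation can also be checked with proper probability distributions standing in for the truth. In the sketch below (an illustration under assumed values, not part of the lecture code) several candidate truths are drawn at random, and for each one the difference in KL divergence between `q1` and `q2` is compared with the difference in their expected log-probabilities. The $log(p_i)$ term appears in both divergences and cancels, so the two differences should match every time.

```python
import numpy as np


def kl(p, q):
    """Kullback-Leibler divergence of model q from truth p."""
    return np.sum(p * (np.log(p) - np.log(q)))


q1 = np.array([0.5, 0.5])
q2 = np.array([0.2, 0.8])

rng = np.random.default_rng(42)
results = []
for _ in range(5):
    # A randomly chosen 'truth' for rain and sun (assumed, for illustration)
    p_rain = rng.uniform(0.05, 0.95)
    p = np.array([p_rain, 1 - p_rain])

    # Difference in divergence between the two models
    diff_kl = kl(p, q1) - kl(p, q2)
    # Difference in expected log-probability (log score), no log(p) needed
    diff_score = np.sum(p * np.log(q2)) - np.sum(p * np.log(q1))

    results.append((diff_kl, diff_score))
    print(round(p_rain, 2), round(diff_kl, 4), round(diff_score, 4))
```

Note that the truth still supplies the weights in the average; in practice the observed data play that role, which is why the log score is computed by summing log-probabilities over observations.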

# Deviance

In a Bayesian context, with more complex models, things take more effort to keep track of because rather than a single point estimate of probability for each observation, we have a distribution of probability. But that aside, calculating the log-probabilities for each point can be done through the *log-pointwise predictive density*:

$$
lppd(y|\Theta) = \sum^{}_{i}log\frac{1}{S}\sum_{s}p(y_i|\Theta_s)
$$

which is simply calculating the probability of each data point ($y_i$) given the set of parameters ($\Theta_s$) at each iteration ($s$) of the sampler, averaging over the $S$ iterations, taking the log, and summing over the data points. We can do this for a set of data:

In [None]:
# Import data
ddata = pd.read_csv('WaffleDivorce.csv', sep=";")
# Display top 5 rows
ddata.head()

In [None]:
# Standardize variables
A = stdize(ddata.MedianAgeMarriage.values)
M = stdize(ddata.Marriage.values)
D = stdize(ddata.Divorce.values)
W = stdize(ddata.WaffleHouses.values)
S = stdize(ddata.South.values)

In [None]:
# Conditioning on Southerness
with pm.Model() as Smod:
    # Baseline intercept
    β0 = pm.Normal('Intercept', 0, 0.2)
    # Waffle House effect
    β1 = pm.Normal('WH', 0, 0.5)
    # Southern
    β2 = pm.Normal('S', 0, 0.5)
    # Linear model
    μ = pm.Deterministic('mu',β0+β1*W+β2*S)
    # Error
    σ = pm.Uniform('SD_obs', 0, 10)
    # Likelihood
    Yi = pm.Normal('Yi', μ, σ, observed=D)

In [None]:
# Using A and M
with pm.Model() as AMmod:
    # Baseline intercept
    β0 = pm.Normal('Intercept', 0, 0.2)
    # Waffle House effect
    β1 = pm.Normal('WH', 0, 0.5)
    # Marriage age
    β2 = pm.Normal('A', 0, 0.5)
    # Marriage rate
    β3 = pm.Normal('M', 0, 0.5)
    # Linear model
    μ = pm.Deterministic('mu',β0+β1*W+β2*A+β3*M)
    # Error
    σ = pm.Uniform('SD_obs', 0, 10)
    # Likelihood
    Yi = pm.Normal('Yi', μ, σ, observed=D)

In [None]:
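with Smod:
    # Store pointwise log-likelihoods, which pm.waic() and pm.loo() need later
    trace_s = pm.sample(1000, idata_kwargs={"log_likelihood": True})
with AMmod:
    trace_am = pm.sample(1000, idata_kwargs={"log_likelihood": True})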

By adding the `Deterministic` node, we're keeping track of mu for each observation, at each iteration:

In [None]:
D

In [None]:
# Expected value for mu at step 1
trace_s.posterior['mu'][0][0].values

Along with the current value for sigma:

In [None]:
# Expected value for sigma at step 1
trace_s.posterior['SD_obs'][0].T[0].values

we can then substitute these into the data likelihood

In [None]:
sp.stats.norm.pdf(D, trace_s.posterior['mu'][0][0].values, trace_s.posterior['SD_obs'][0].T[0].values)

to get the probability of each observation, given a normal distribution and the current expected values and sigma. In other words, the likelihood.

Now the trick is to keep track of all these so that we can do the lppd calculation for each data point. First we iterate over the full trace to get these pointwise probabilities at each iteration:

In [None]:
trace_s.posterior['SD_obs'].shape[1]

In [None]:
# Number of iterations
nsamp = trace_s.posterior['SD_obs'].shape[1]
# Number of datapoints
nobs = len(D)
# Empty matrix to hold values
Slppd_ = np.zeros(shape=(nsamp,nobs))

# Loop over iterations to grab pointwise likelihoods
for i in range(nsamp):
    Slppd_[i] = sp.stats.norm.pdf(D, trace_s.posterior['mu'][0][i], trace_s.posterior['SD_obs'][0].T[i])


We can now do the calculation 


$$
lppd(y|\Theta) = \sum^{}_{i}log\frac{1}{S}\sum_{s}p(y_i|\Theta_s)
$$

using the values for each datapoint, and getting the log of their average for the Southern model:

In [None]:
Slppd = np.log(Slppd_.T.mean(1)).sum()
Slppd

This notation transposes `Slppd_`, so that each row corresponds to the trace for one datapoint, takes the mean of each row (the `1`), logs those means, and sums them together. We can do the same process for the Marriage age/rate model:
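The same calculation can be done without a loop, and on the log scale, which avoids underflow when individual likelihoods are very small. The sketch below is an illustration only: it uses made-up draws (`mu_draws`, `sigma_draws`) in place of a real trace, and swaps in `scipy.special.logsumexp` for the log-of-the-mean step.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n_draws, n_obs = 500, 20

# Made-up data and 'posterior draws' standing in for a real trace
y = rng.normal(0, 1, size=n_obs)
mu_draws = rng.normal(0, 0.1, size=(n_draws, n_obs))    # mu per draw and datapoint
sigma_draws = np.abs(rng.normal(1, 0.05, size=n_draws))  # sigma per draw

# Pointwise log-likelihoods, shape (n_draws, n_obs), via broadcasting
log_lik = stats.norm.logpdf(y, mu_draws, sigma_draws[:, None])

# log of the average likelihood per datapoint, computed on the log scale
lppd_pointwise = logsumexp(log_lik, axis=0) - np.log(n_draws)
lppd = lppd_pointwise.sum()

# Same quantity the direct way, as in the loop-based calculation
lppd_naive = np.log(np.exp(log_lik).mean(axis=0)).sum()

print(lppd, lppd_naive)
```

With well-behaved data the two versions agree; the log-scale version is simply the safer habit for larger models.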

In [None]:
# Empty matrix to hold values
AMlppd_ = np.zeros(shape=(nsamp,nobs))

# Loop over iterations to grab pointwise likelihoods
for i in range(nsamp):
    AMlppd_[i] = sp.stats.norm.pdf(D, trace_am.posterior['mu'][0][i], trace_am.posterior['SD_obs'][0].T[i])


In [None]:
AMlppd_

In [None]:
AMlppd = np.log(AMlppd_.T.mean(1)).sum()
AMlppd

That's the lppd for the marriage age and marriage rate model. Higher values here are better (they're more accurate), so the AM model has more support. Multiplying these values by -2 gives us model **deviance**, with smaller values (less deviant) being better:

In [None]:
Slppd*-2

In [None]:
AMlppd*-2

# AIC

AIC - that is Akaike's Information Criterion - has a long history in model comparison. It is defined as two times the number of parameters in the model plus the deviance:

$$
AIC = 2k-2log(\hat{L})
$$

with $\hat{L}$ being the maximum likelihood. In our case we can sub in the deviance value for the $-2log(\hat{L})$.

In [None]:
# AIC for Southern model
2*4-2*Slppd

In [None]:
# AIC for marriage model
2*5-2*AMlppd

So what's happened here? We need some sort of penalty for having added those extra parameters. The $2k$ penalty does this, telling us that the Marriage model has more support, given the data and considering the numbers of parameters used.

All the various information criteria do some version of this, with various improvements over time. AIC was cutting edge 20 years ago, but in Bayesian work it has largely been replaced by WAIC (the widely applicable information criterion), so called because it is more general. WAIC swaps the $2k$ penalty for a penalty term given by the variance of the log-probabilities across the posterior, summed over observations:


$$
WAIC = -2(lppd-\sum_{i}var_\theta log\,p(y_i|\theta))
$$


To do this by hand, we need to go back to the lppd matrix and store the variance of the posterior probabilities

In [None]:
# Variance in log-probabilities for each observation, summed over observations
Slppd_sd = sum(np.log(Slppd_.T).std(1)**2)
AMlppd_sd = sum(np.log(AMlppd_.T).std(1)**2)
Slppd_sd,AMlppd_sd

Then we can do the other bits

In [None]:
# Southern model WAIC
WAICs = -2*(Slppd-Slppd_sd)
# Marriage model WAIC
WAICam = -2*(AMlppd-AMlppd_sd)

WAICs,WAICam

Which shows again that the marriage model has more support, given the data. 
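The by-hand steps can be gathered into one small function that takes a matrix of pointwise log-likelihoods (rows are sampler iterations, columns are datapoints). This is a sketch under assumptions: `waic_from_loglik` is a hypothetical helper name, the input here is made-up numbers rather than a real trace, and the variance uses NumPy's default (dividing by the number of draws), matching the `std(1)**2` calculation.

```python
import numpy as np
from scipy.special import logsumexp


def waic_from_loglik(log_lik):
    """WAIC on the deviance scale from an (n_draws, n_obs) log-likelihood matrix."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # lppd: log of the average likelihood per datapoint
    lppd_i = logsumexp(log_lik, axis=0) - np.log(n_draws)
    # Penalty: variance of the log-likelihood across draws, per datapoint
    p_waic_i = log_lik.var(axis=0)
    waic = -2 * (lppd_i.sum() - p_waic_i.sum())
    return waic, p_waic_i.sum()


# Made-up log-likelihoods standing in for a real model's output
rng = np.random.default_rng(0)
fake_log_lik = rng.normal(-1.2, 0.1, size=(1000, 50))

waic, p_waic = waic_from_loglik(fake_log_lik)
print(waic, p_waic)
```

If every iteration gave identical log-likelihoods the penalty would be zero and WAIC would collapse to the deviance, which is a handy sanity check on the function.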

Incidentally, PyMC does this calculation for you, using the `pm.waic()` function:

In [None]:
pm.waic(trace_s, scale='deviance')

In [None]:
pm.waic(trace_am, scale='deviance')

# LOO-CV

The information-based criteria above are but one choice for assessing relative model fits by scoring their overfitting risk. Another is cross-validation: leave out one (or more) datapoints, fit the model to the rest, and score how well it predicts what was left out, repeating until every datapoint has had a turn. The average out-of-sample performance is, as it turns out, a good estimate of the log-score of a model. What does this look like? Similar to the lppd calculation above, it is the log-probability of the single dropped observation ($y_i$) under the parameters estimated from the data that excluded $y_i$:

$$
lppd_{CV} = \sum^{}_{i}log\frac{1}{S}\sum_{s}p(y_i|\Theta_{-i,s})
$$

However this is computationally expensive, since the model has to be refit once for every datapoint, so [Aki Vehtari](https://users.aalto.fi/~ave/) and colleagues came up with Pareto-smoothed importance sampling (PSIS) as a very good approximation. It weights each posterior sample by the inverse probability of the omitted observation, then takes the normalized weighted sum as a new value, $lppd_{IS}$. As it turns out, the largest weights calculated for each $y_i$ should follow a [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution), with each $y_i$ having an estimated Pareto $k$ parameter. For observations with $k>0.7$, the approximation becomes unreliable, which is a sign that the observation is highly influential, given the proposed model.

In [None]:
pm.loo(trace_s, scale='deviance')

In [None]:
pm.loo(trace_am, scale='deviance')
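To get a feel for the importance-sampling idea, here is a rough sketch of the *unsmoothed* version, without the Pareto smoothing step that PSIS adds. Each posterior draw gets a weight of one over the likelihood of the omitted observation, and the normalized weighted average of the likelihoods then simplifies to the harmonic mean of the likelihoods across draws. All data and draws below are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(7)
n_draws, n_obs = 2000, 15

# Made-up data and 'posterior draws' standing in for a real trace
y = rng.normal(0, 1, size=n_obs)
mu_draws = rng.normal(0, 0.15, size=n_draws)
sigma_draws = np.abs(rng.normal(1, 0.05, size=n_draws))

# Pointwise log-likelihoods, shape (n_draws, n_obs)
log_lik = stats.norm.logpdf(y[None, :], mu_draws[:, None], sigma_draws[:, None])

# In-sample lppd per datapoint: log of the average likelihood
lppd_i = logsumexp(log_lik, axis=0) - np.log(n_draws)

# Raw importance-sampling estimate of the leave-one-out log density:
# weights are 1 / p(y_i | theta_s), so the weighted average of the
# likelihoods reduces to the harmonic mean of the likelihoods
lppd_is_i = -(logsumexp(-log_lik, axis=0) - np.log(n_draws))

print(lppd_i.sum(), lppd_is_i.sum())
```

The importance-sampling score can never exceed the in-sample lppd for any datapoint, reflecting the fact that predicting a held-out observation is harder than predicting one the model has already seen. The raw weights can be very noisy, which is the problem the Pareto smoothing is designed to tame.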

# So, model comparison...

So now we understand the derivation of information criteria - what should we do with them? Model selection! Many people use information criteria for this but hold on - how you use information criteria depends on your objectives. Information criteria measure expected predictive accuracy - they have nothing to do with causal inference. Let's look back at our Waffle House models, the results of which are:



In [None]:
pm.summary(trace_s,var_names=['WH','S'])

In [None]:
pm.summary(trace_am,var_names=['WH','A','M'])

Showing that the Marriage model still sees it as plausible that Waffle Houses have a positive effect on divorce rates, which is silly, and yet if we look at WAIC:

In [None]:
WAICs

In [None]:
WAICam

It would greatly favour the Marriage model. Simple eh?


Having spent time to outline this level of detail, please watch McElreath's lecture to see these calculations in context: https://www.youtube.com/watch?v=gjrsYDJbRh0