## Week 8 Lecture 2 - Hierarchical Models II

McElreath's lectures for today: https://www.youtube.com/watch?v=n2aJYtuGu54&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=14

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/stat_rethinking_2022

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

An excellent port to Python/PyMC Code is available here: https://github.com/pymc-devs/resources/tree/master/Rethinking

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import pymc as pm
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import arviz as az
from matplotlib import pyplot as plt
import dataframe_image as dfi


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)


def indexall(L):
    # Unique values of L in order of first appearance, plus an integer index into them
    levels = []
    for p in L:
        if p not in levels:
            levels.append(p)
    Ix = np.array([levels.index(p) for p in L])
    return levels,Ix

def indexall_(L):
    Il, Ll = pd.factorize(L, sort=True)
    return Ll, Il

# Hierarchical models

We're going to look at some other features of hierarchical models by returning to the chimps data on prosocial behaviour from Week 6.

Let's import that data

In [None]:
cdata = pd.read_csv('chimpanzees.csv', sep=";")
# Actor index
Ia = cdata.actor.values-1
# Treatment variable
cdata['treatment'] = 1 + cdata.prosoc_left + 2*cdata.condition
# Grab data
# Left pull - response
L = cdata.pulled_left.values
# Individual chimps
Actor,Ia = indexall(cdata.actor.values)
Chimp = ['Chimp '+str(a) for a in Actor]
nchimps = len(Actor)
# Treatment
Treatment,It = indexall(cdata.treatment.values)
Treatment = ['R/N','L/N','R/P','L/P']
ntreat = len(Treatment)
# Block
Block,Ib = indexall(cdata.block.values.astype(str))
nblock = len(Block)

Importantly for this example, in addition to the actors (chimps) and treatments we have another covariate, `block`, that groups observations made on the same day. This is a kind of **nuisance parameter**: something we're not actually interested in but should account for, as perhaps there was that Wednesday they played sad music all day in the chimp prison. We can accommodate this additional factor by giving it its own *random effect*.
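
As a rough sketch of what the index variables are doing, the block below uses plain Python with made-up block labels and effect values (none of these numbers come from the chimp data). It shows how one effect per block is expanded to one value per observation, which is what the `β2[Ib]` indexing does in the models that follow.

```python
# Toy illustration only: the labels and effect values are invented.
block_labels = ["1", "1", "2", "3", "2", "3"]   # one label per observation

# Map each distinct label to an integer, in order of first appearance
# (the same convention as the indexall helper above).
levels = []
for b in block_labels:
    if b not in levels:
        levels.append(b)
block_index = [levels.index(b) for b in block_labels]

# One hypothetical effect per block ...
block_effect = [0.10, -0.05, 0.02]
# ... becomes one value per observation by indexing.
per_observation = [block_effect[i] for i in block_index]

print(levels)           # ['1', '2', '3']
print(block_index)      # [0, 0, 1, 2, 1, 2]
print(per_observation)  # [0.1, 0.1, -0.05, 0.02, -0.05, 0.02]
```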

In [None]:
dfi.export(cdata.head(), 'cdata.jpg')
cdata.head()

In [None]:
# Original Chimps model
with pm.Model(coords={'Chimp':Chimp, 'Treat': Treatment}) as Chimps:
    # Average chimp
    γ0 = pm.Normal('Actor_mean', 0, 1.5)
    σ_γ = pm.Exponential('sigma_actor', 1.)
    
    # Individual intercepts
    β0 = pm.Normal('Actor', γ0, σ_γ, dims='Chimp')
    
    # Treatment effects
    β1 = pm.Normal('Treatment', 0, 0.5, dims='Treat')

    # Linear model
    p = pm.invlogit(β0[Ia]+β1[It])

    # Likelihood
    Yi = pm.Binomial('Yi', 1, p,observed=L)

In [None]:
# New Chimps model with random effects for block
with pm.Model(coords={'Chimp':Chimp, 'Treat': Treatment, 'Block_':Block}) as ChimpsB:
    # Average chimp
    γ0 = pm.Normal('Actor_mean', 0, 1.5)
    σ_γ = pm.Exponential('sigma_actor', 1.)
    
    # Individual intercepts
    β0 = pm.Normal('Actor', γ0, σ_γ, dims='Chimp')
    
    # Treatment effects
    β1 = pm.Normal('Treatment', 0, 0.5, dims='Treat')
    
    # Block effects
    σ_β2 = pm.Exponential('sigma_block', 1.)
    β2 = pm.Normal('Block', 0, σ_β2, dims='Block_')

    # Linear model
    p = pm.invlogit(β0[Ia]+β1[It]+β2[Ib])

    # Likelihood
    Yi = pm.Binomial('Yi', 1, p,observed=L)

In [None]:
with Chimps:
    trace_c = pm.sample(1000)
with ChimpsB:
    trace_b = pm.sample(1000)

With these models in hand, we can take a look at plots of the posterior effects to see how they compare:

In [None]:
axes = az.plot_forest([trace_c, trace_b], model_names=["No Block", "Block"], var_names=['Actor','Treatment','Block'], figsize=(9,9))
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('chimpforest.jpg',dpi=300);

In [None]:
pm.summary(trace_b)

In [None]:
_, ax = plt.subplots(1, 1, figsize=(5, 5))
az.plot_kde(trace_b.posterior['sigma_actor'].values, ax=ax)
az.plot_kde(trace_b.posterior['sigma_block'].values, ax=ax, plot_kwargs={'color':'C1'})
ax.text(2, 0.75, "actor", color='C0')
ax.text(0.5, 2, "block", color='C1')
ax.set_xlabel('sigma')
ax.set_ylabel('density')
ax.set_xlim(-0.1, 4.1)
ax.set_ylim(-0.05, 3.5)
plt.savefig('chimpvar.jpg',dpi=300);

What's clear from the forest plot and the density plot above is that the inter-actor effects are far more influential than any day-to-day variation. Looking at WAIC suggests the two models are roughly equivalent:
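
For reference, WAIC can be computed by hand from a matrix of pointwise log-likelihoods (one row per posterior draw, one column per observation). The sketch below is plain Python with a hypothetical helper `waic_deviance` and an invented input. It uses the n-1 sample variance for the penalty term, a detail on which implementations can differ, so treat it as an illustration of the formula rather than a replacement for `az.compare`.

```python
import math
import statistics


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic_deviance(log_lik):
    """WAIC on the deviance scale from log_lik[draw][observation]."""
    n_obs = len(log_lik[0])
    columns = [[row[i] for row in log_lik] for i in range(n_obs)]
    # Log pointwise predictive density: average the likelihood over draws.
    lppd = sum(log_mean_exp(col) for col in columns)
    # Penalty: variance across draws of each observation's log-likelihood.
    p_waic = sum(statistics.variance(col) for col in columns)
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Invented example: 3 posterior draws, 2 Bernoulli observations (y = 1, y = 0).
p_draws = [0.6, 0.7, 0.8]
toy_log_lik = [[math.log(p), math.log(1 - p)] for p in p_draws]
waic, lppd, p_waic = waic_deviance(toy_log_lik)
print(round(waic, 3), round(lppd, 3), round(p_waic, 3))
```

The penalty term grows when the log-likelihood of an observation varies a lot across posterior draws, which is why adding parameters that barely move (like the block effects here) changes WAIC very little.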

In [None]:
with Chimps:
    pm.compute_log_likelihood(trace_c)
with ChimpsB:
    pm.compute_log_likelihood(trace_b)
comp_df = az.compare({'No Block':trace_c, 'Block':trace_b}, ic='waic', scale='deviance')
dfi.export(comp_df.style.background_gradient(), 'chimpwaic.jpg')
comp_df

So what should we do? Nothing. In fitting both models we've learned something about the data: block has almost no effect, and the block parameters sitting near zero tell us why. Each model makes nearly identical out-of-sample predictions. While model selection has value in looking at the conditional independencies of different causal (or mechanistic) models, these data come from an experiment, so there's nothing to select.


While the addition of block represents one kind of additional factor, we can go all in and extend the partial pooling hierarchy so that we estimate partially pooled effects for treatment as well as for actor and block:
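
To see why estimating a standard deviation for a set of effects matters, here is a minimal sketch of partial pooling in the simplest setting: a normal model with known variances, where the pooled estimate has a closed form. This is a stand-in, not the logistic model used here, and the helper `partial_pool` and all of its inputs are hypothetical.

```python
def partial_pool(group_mean, n, sigma_obs, grand_mean, sigma_group):
    """Precision-weighted compromise between a group's own mean and the
    grand mean (normal model, all variances treated as known)."""
    w_data = n / sigma_obs**2        # precision of the group's own mean
    w_prior = 1.0 / sigma_group**2   # precision of the population prior
    return (w_data * group_mean + w_prior * grand_mean) / (w_data + w_prior)


# Same invented group (mean 1.0 from 10 observations), different group-level SDs.
for sigma_group in (0.1, 1.0, 10.0):
    est = partial_pool(1.0, n=10, sigma_obs=1.0, grand_mean=0.0,
                       sigma_group=sigma_group)
    print(sigma_group, round(est, 3))
```

A small group-level standard deviation pulls the estimate towards the grand mean, while a large one leaves it close to the group's own mean. In the hierarchical model the data get to decide which of these regimes applies.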

In [None]:
with pm.Model(coords={'Chimp':Chimp, 'Treat': Treatment, 'Block_':Block}) as ChimpsT:
    # Average chimp
    γ0 = pm.Normal('Actor_mean', 0, 1.5)
    σ_γ = pm.Exponential('sigma_actor', 1.)
    # Individual intercepts
    β0 = pm.Normal('Actor', γ0, σ_γ, dims='Chimp')
    
    # Treatment effects
    σ_β1 = pm.Exponential('sigma_treat', 1.)
    β1 = pm.Normal('Treatment', 0, σ_β1, dims='Treat')
    
    # Block effects
    σ_β2 = pm.Exponential('sigma_block', 1.)
    β2 = pm.Normal('Block', 0, σ_β2, dims='Block_')

    # Linear model
    p = pm.invlogit(β0[Ia]+β1[It]+β2[Ib])

    # Likelihood
    Yi = pm.Binomial('Yi', 1, p,observed=L)

In [None]:
with ChimpsT:
    trace_t = pm.sample(1000)
    pm.compute_log_likelihood(trace_t)

In [None]:
pm.summary(trace_t)

In [None]:
axes = az.plot_forest([trace_c, trace_b, trace_t], model_names=["No Block", "Block", "Full"], var_names=['Actor','Treatment','Block'], figsize=(9,9))
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('chimpforest2.jpg',dpi=300);

These results are very similar, but now we have estimates of how variable things are among actors, treatments, and blocks.

In [None]:
_, ax = plt.subplots(1, 1, figsize=(5, 5))
az.plot_kde(trace_t.posterior['sigma_actor'].values, ax=ax)
az.plot_kde(trace_t.posterior['sigma_treat'].values, ax=ax, plot_kwargs={'color':'C2'})
az.plot_kde(trace_t.posterior['sigma_block'].values, ax=ax, plot_kwargs={'color':'C1'})
ax.text(2, 0.75, "actor", color='C0')
ax.text(0.75, 1.3, "treatment", color='C2')
ax.text(0.5, 2, "block", color='C1')
ax.set_xlabel('sigma')
ax.set_ylabel('density')
ax.set_xlim(-0.1, 4.1)
ax.set_ylim(-0.05, 3.5)
plt.savefig('chimpvar2.jpg',dpi=300);

In [None]:
comp_df = az.compare({'No Block':trace_c, 'Block':trace_b, 'FullH':trace_t}, ic='waic', scale='deviance')
dfi.export(comp_df.style.background_gradient(), 'chimpwaic2.jpg')
comp_df

# Divergent transitions

Particularly common in hierarchical models, divergent transitions mean our HMC algorithm isn't exploring the parameter space properly. The details (and what to do about them) were worked out by [Michael Betancourt and Mark Girolami](https://arxiv.org/pdf/1312.0906). To understand what's happening, let's take a look at a small example:

In [None]:
with pm.Model() as devil:
    v = pm.Normal('v',0,3)
    x = pm.Normal('x',0,pm.math.exp(v))

In [None]:
with devil:
    trace_d = pm.sample(1000)

In [None]:
pm.summary(trace_d)

What's going on here? As written, at very low values of $v$ the distribution of $x$ contracts tightly around zero (the steep walls of the funnel in the target density), and this creates a specific problem: steep surfaces are hard to simulate using the discrete leapfrog steps inherent in our HMC algorithm (remember back to Week 5, Lecture 2, where we outlined the number of leapfrog steps and the step sizes). If the steps are too big (and they will be in the narrow part of the funnel, but not elsewhere), the simulation will sometimes fly wildly off course at one of the steps, so the total energy at the start and the end of the HMC trajectory no longer match. In other words, a divergent transition.

<img src="funnell.png" alt="drawing" width="500"/>
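
The instability can be reproduced without any sampler. The sketch below runs plain leapfrog steps on a one-dimensional normal target with standard deviation `scale` and reports the change in total energy. The function name `leapfrog_energy_error` and the particular step size and scales are illustrative choices, not values taken from PyMC.

```python
def leapfrog_energy_error(scale, step_size, n_steps=20):
    """Energy change after n_steps leapfrog steps on a N(0, scale) target."""
    def grad_potential(q):
        # Potential energy is U(q) = q**2 / (2 * scale**2).
        return q / scale**2

    def energy(q, p):
        return q**2 / (2 * scale**2) + p**2 / 2

    q, p = scale, 1.0  # start one standard deviation from the mode
    start = energy(q, p)
    for _ in range(n_steps):
        p -= 0.5 * step_size * grad_potential(q)  # half step for momentum
        q += step_size * p                        # full step for position
        p -= 0.5 * step_size * grad_potential(q)  # half step for momentum
    return energy(q, p) - start


# Same step size, wide target (mouth of the funnel) vs narrow target (neck).
wide = leapfrog_energy_error(scale=1.0, step_size=0.1)
narrow = leapfrog_energy_error(scale=0.01, step_size=0.1)
print(f"wide: {wide:.2e}, narrow: {narrow:.2e}")
```

With the step size held fixed, the energy is nearly conserved on the wide target but blows up on the narrow one, which is the situation the sampler flags as a divergence.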

So what can we do? Well, what's crazy is that a small multiplicative adjustment lets us move our $v$ parameter out of the definition of $x$, by sampling a standard normal instead:

In [None]:
with pm.Model() as devilC:
    v = pm.Normal('v',0,3)
    z = pm.Normal('z',0,1)
    # Wrap in Deterministic so x is recorded in the trace
    x = pm.Deterministic('x', z*pm.math.exp(v))

In [None]:
with devilC:
    # New name so the trace_c from the original Chimps model isn't overwritten
    trace_dc = pm.sample(1000)

Et voilà - divergences solved.


<img src="zfunnel.png" alt="drawing" width="500"/>

But what the heck just happened? Well, much like the z-scoring that we've been using throughout the course, in this new **non-centred** parameterization we're now sampling from the z-score $N(0,1)$ scale, rather than the $N(0,e^{v})$ scale, and then just reversing the z-scoring to get x. Recall

$$
z = \frac{x-\bar{x}}{SD(x)}
$$

so to solve for $x$:

$$
z \cdot SD(x) = x-\bar{x} = x-0 = x
$$

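If the mean of $x$ weren't 0 we would add it back in here too. In our funnel example $SD(x) = e^{v}$, which gives $x = z e^{v}$, the same `z*pm.math.exp(v)` expression used in the model above. In essence, by z-scoring we have fattened out the funnel so that the algorithm keeps sampling properly.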
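
A quick simulation makes the equivalence concrete. Assuming a fixed, arbitrarily chosen value of $v$, the sketch below draws `x` directly from a normal with standard deviation $e^{v}$ and also builds it from a standard normal `z`. It uses only the Python standard library, and the variable names are illustrative.

```python
import math
import random
import statistics

rng = random.Random(8)   # fixed seed so the run is repeatable
v = -2.0                 # arbitrary value from the narrow part of the funnel
sd = math.exp(v)
n = 50_000

# Centred: draw x directly on its own (tiny) scale.
x_centred = [rng.gauss(0.0, sd) for _ in range(n)]

# Non-centred: draw z on the unit scale, then undo the z-scoring.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_noncentred = [zi * sd for zi in z]

print(round(sd, 4))
print(round(statistics.stdev(x_centred), 4))
print(round(statistics.stdev(x_noncentred), 4))
```

Both versions of `x` have the same spread. The difference is that in the non-centred form the sampler only ever has to explore `z`, whose scale does not change with $v$.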

Let's apply this lesson to the Chimps example above and see what happens

In [None]:
with pm.Model(coords={'Chimp':Chimp, 'Treat': Treatment, 'Block_':Block}) as ChimpsNC:
    # Average chimp
    γ0 = pm.Normal('Actor_mean', 0, 1.5)
    σ_γ = pm.Exponential('sigma_actor', 1.)
    za = pm.Normal('z_actor',0,1, dims='Chimp')
    
    # Old parameterization
    #β0 = pm.Normal('Actor', γ0, σ_γ, dims='Chimp')
    # Individual intercepts - non-centred
    β0 = pm.Deterministic('Actor', γ0+za*σ_γ, dims='Chimp')
    
    # Treatment effects
    β1 = pm.Normal('Treatment', 0, 0.5, dims='Treat')
    
    # Block effects - non-centred
    σ_β2 = pm.Exponential('sigma_block', 1.)
    zb = pm.Normal('z_block',0, 1, dims='Block_')
    β2 = pm.Deterministic('Block', zb*σ_β2, dims='Block_')

    # Linear model
    p = pm.invlogit(β0[Ia]+β1[It]+β2[Ib])

    # Likelihood
    Yi = pm.Binomial('Yi', 1, p,observed=L)

In [None]:
with ChimpsNC:
    trace_nc = pm.sample(1000)

In [None]:
def pairplot_divergence(trace, basevar, targetvar, ax=None, divergence=True, color='C3', divergence_color='C2'):
    # Scatter of targetvar against basevar, with divergent draws highlighted
    base = trace.posterior[basevar].values.flatten()
    target = trace.posterior[targetvar].values.flatten()
    if ax is None:
        _, ax = plt.subplots(1, 1, figsize=(10, 5))
    ax.plot(base, target, 'o', color=color, alpha=.5)
    if divergence:
        divergent = trace.sample_stats.diverging.values.flatten()
        ax.plot(base[divergent], target[divergent], 'o', color=divergence_color)
    ax.set_xlabel(basevar)
    ax.set_ylabel(targetvar)
    ax.set_title('scatter plot between '+targetvar+' and '+basevar);
    return ax

# A small wrapper function for displaying the MCMC sampler diagnostics as above
def report_trace(trace,basevar,targetvar,logscale=False):
    # plot the trace of the target variable
    pm.plot_trace({targetvar: trace.posterior[targetvar].values.flatten()});

    # running (cumulative) mean of the target variable, optionally on the log scale
    if logscale:
        vals = np.log(trace.posterior[targetvar].values.flatten())
        label = 'log('+targetvar+')'
    else:
        vals = trace.posterior[targetvar].values.flatten()
        label = targetvar
    mvals = np.cumsum(vals)/np.arange(1, len(vals)+1)
    plt.figure(figsize=(15, 4))
    plt.plot(mvals, lw=2.5)
    plt.ylim(0, 2)
    plt.xlabel('Iteration')
    plt.ylabel('MCMC mean of '+label)
    plt.title('MCMC estimation of '+label)
    plt.show()

    # display the total number and percentage of divergent draws
    divergent = trace.sample_stats.diverging.values.flatten()
    print('Number of Divergent %d' % divergent.sum())
    # divide by the number of draws; len(trace) is not the draw count for an InferenceData
    divperc = divergent.sum() / divergent.size * 100
    print('Percentage of Divergent %.1f' % divperc)

    # scatter plot of the target variable against the base variable
    # to identify problematic neighbourhoods in parameter space
    pairplot_divergence(trace,basevar,targetvar);

In [None]:
report_trace(trace_b,'Actor_mean','sigma_block')
plt.savefig('ppd1.jpg',dpi=300);

The details of these plotting functions aren't so important, but you can see in the plot above that the green dots (the divergences) are clustered along the bottom, showing that they happen when values of `sigma_block` are small: the classic divergence pattern. Sometimes divergences just happen too, so having a look is helpful:

In [None]:
report_trace(trace_nc,'Actor_mean','sigma_block')
plt.savefig('ppd2.jpg',dpi=300);

Here the divergences aren't concentrated anywhere in particular, so they are less of a concern.
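
A numerical companion to the scatter plots is to compare where the divergent draws sit relative to all draws. The sketch below assumes the draws of a parameter and the matching `diverging` flags have already been flattened into plain sequences. The helper `divergence_summary` and the toy numbers are invented for illustration.

```python
import statistics


def divergence_summary(draws, diverging):
    """Share of divergent draws, plus the median of the parameter among
    divergent draws and among all draws."""
    flagged = [d for d, bad in zip(draws, diverging) if bad]
    return {
        "n_divergent": len(flagged),
        "percent": 100.0 * len(flagged) / len(draws),
        "median_divergent": statistics.median(flagged) if flagged else None,
        "median_all": statistics.median(draws),
    }


# Invented draws of a scale parameter; divergences sit at the small values.
toy_sigma = [0.02, 0.05, 0.30, 0.45, 0.60, 0.80, 1.10, 1.50]
toy_diverging = [True, True, False, False, False, False, False, False]
print(divergence_summary(toy_sigma, toy_diverging))
```

With the traces above, the inputs would plausibly be `trace_b.posterior['sigma_block'].values.flatten()` and `trace_b.sample_stats.diverging.values.flatten()`. A median among divergent draws that sits far below the overall median would point to the funnel-type problem, while similar medians would suggest the divergences are scattered.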