## Week 9 Lecture 1 - Adventures in covariance

McElreath's lectures for today: https://www.youtube.com/watch?v=XDoAglqd7ss&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=15

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/stat_rethinking_2022

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

An excellent port to Python/PyMC Code is available here: https://github.com/pymc-devs/resources/tree/master/Rethinking

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import pymc as pm
import arviz as az
from matplotlib import pyplot as plt
from matplotlib.patches import Ellipse, transforms
import dataframe_image as dfi


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)


def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

def indexall_(L):
    Il, Ll = pd.factorize(L, sort=True)
    return Ll, Il

def Gauss2d(mu, cov, ci, ax=None, ec='k'):
    """Copied from statsmodel"""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))

    v_, w = np.linalg.eigh(cov)
    u = w[0] / np.linalg.norm(w[0])
    angle = np.arctan(u[1] / u[0])
    anglex = 180 + 180 * angle / np.pi  # convert to degrees
    for level in ci:
        v = 2 * np.sqrt(
            v_ * sp.stats.chi2.ppf(level, 2)
        )  # get size corresponding to level
        ell = Ellipse(mu[:2], v[0], v[1], angle=anglex, 
                      facecolor="None", edgecolor=ec, alpha=(1 - level) * 0.5,
            lw=1.5,
        )
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)

    return ax

# Varying intercepts, varying slopes

In linear models, we have intercepts, which describe the means of groups, and slopes, which characterize the effects of covariates. It's important to recognize that both of these (as well as their variances) can be built up hierarchically:

1. *Varying intercepts* - means vary by cluster
2. *Varying slopes* - effects of predictors vary by cluster
3. *Varying variances* - variability of slopes or intercepts varies by cluster

So any parameter can be a varying effect: it is simply split into a vector of parameters, one per cluster, that share a common parent distribution.

To make this concrete, let's simulate some data relating to waiting times in European cafes, where the cafes have inherent variation in waiting times (intercepts), as well as in how much that waiting time varies from morning to afternoon (slopes). What makes this an adventure in covariance is that we're going to assert, in our simulation, that the intercepts and slopes are correlated in a predictable way ($\rho=-0.7$):

In [None]:
# Average (Global) morning waiting time
g0 = 3.5
# Average (Global) difference in afternoon waiting time
g1 = -1.
# Standard deviation of cafe-level variation in intercepts
sigma_g0 = 1.2
# Standard deviation of cafe-level variation in slopes
sigma_g1 = 0.5
# Correlation between cafe intercepts and slopes
rho = -0.7

These are our global known parameters, applicable to any cafe in Europe. Now we can take a sample by building up the covariance matrix:

$$
\left(\begin{array}{cc}
\sigma^{2}_{g0} & \sigma_{g0}\sigma_{g1}\rho \\
\sigma_{g0}\sigma_{g1}\rho & \sigma^{2}_{g1}
\end{array}\right)
$$


In [None]:
# Build covariance matrix: diag(sigmas) @ Rho @ diag(sigmas)
sigmas = [sigma_g0, sigma_g1]
Rho = np.array([[1, rho], [rho, 1]])
COV = np.diag(sigmas) @ Rho @ np.diag(sigmas)
COV

and simulating from a multivariate normal

In [None]:
# Number of cafes
ncafe = 20
# Cafe-level parameters
cafe_effects = np.random.multivariate_normal([g0,g1], COV, size=ncafe)
b0 = cafe_effects[:, 0]
b1 = cafe_effects[:, 1]
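
As a quick sanity check on the simulation, here is a minimal sketch (not part of the original notebook) confirming that the matrix product reproduces the element-wise covariance formula, and that draws from the resulting multivariate normal recover $\rho$ when the sample is large. The seed, the sample size `n_big`, and the use of `np.random.default_rng` are assumptions made for illustration.

```python
import numpy as np

# Same simulation settings as above
g0, g1 = 3.5, -1.0
sigma_g0, sigma_g1, rho = 1.2, 0.5, -0.7

# Covariance built two ways: matrix product and element-wise formula
sigmas = np.array([sigma_g0, sigma_g1])
Rho = np.array([[1.0, rho], [rho, 1.0]])
cov_product = np.diag(sigmas) @ Rho @ np.diag(sigmas)
cov_formula = np.array([
    [sigma_g0**2, sigma_g0 * sigma_g1 * rho],
    [sigma_g0 * sigma_g1 * rho, sigma_g1**2],
])
same = np.allclose(cov_product, cov_formula)

# With many draws, the sample correlation should sit close to rho
rng = np.random.default_rng(42)  # arbitrary seed (assumption)
n_big = 100_000                  # hypothetical large sample size
draws = rng.multivariate_normal([g0, g1], cov_product, size=n_big)
sample_rho = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
print(same, round(sample_rho, 2))
```

With only 20 cafes, as in the simulation itself, the sample correlation can differ noticeably from $-0.7$.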

Now we have cafe-level intercepts and slopes drawn from a population in which they are negatively correlated ($\rho=-0.7$), which we can visualize in a bivariate plot:

In [None]:
# Set up plot size
_, ax = plt.subplots(1, 1, figsize=(5, 5))
# Fancy circles to plot MvNormal
Gauss2d([g0,g1] , np.asarray(COV), [0.1, 0.3, 0.5, 0.8, 0.99], ax=ax, ec='k')
# Scatterplot of values
ax.scatter(b0, b1)
ax.set_xlim(1.5, 6.1)
ax.set_ylim(-2, 0)
ax.set_xlabel('Mornings (b0)')
ax.set_ylabel('Afternoon difference (b1)')
plt.tight_layout()
plt.savefig('cafesim.jpg',dpi=300);

Ok, with this in place, we can now simulate some days of observation among our cafes:

In [None]:
# Number of observed days per cafe
nvisit = 10

# am/pm indicator
PM = np.tile([0,1], nvisit*ncafe//2)

# Keep track of cafe
Ic = np.repeat(np.arange(0, ncafe), nvisit)

# Expected value per cafe
mu_ = b0[Ic]+b1[Ic]*PM

# Variation within cafes
sigma_ = 0.5

# Observed wait times
wait = np.random.normal(mu_, sigma_, size=nvisit*ncafe) 

With data in place, we can now build our model, which, based on what we've simulated, should be

$$
\begin{align}
W_i \sim & N(\mu_i, \sigma)\\
\mu_i = & \beta_{0,\mathrm{cafe}[i]}+\beta_{1,\mathrm{cafe}[i]}PM_i \\
\left[\begin{array}{c}
\beta_{0,\mathrm{cafe}} \\
\beta_{1,\mathrm{cafe}} 
\end{array}\right] \sim & MvN \left(\left[\begin{array}{c}
\gamma_{0} \\
\gamma_{1} 
\end{array}\right], S \right) \\
S = & \left(\begin{array}{cc}
\sigma_{g0} & 0 \\
0 & \sigma_{g1} 
\end{array}\right) R \left(\begin{array}{cc}
\sigma_{g0} & 0 \\
0 & \sigma_{g1} 
\end{array}\right) \\
\gamma_0 \sim & N(5,2) \\
\gamma_1 \sim & N(-1,0.5) \\
\sigma, \sigma_{g0}, \sigma_{g1} \sim & Exp(1) \\
R \sim & LKJcorr(2)
\end{align}
$$

There's a lot here. The first part is the multivariate normal structure, as outlined when we simulated the data above. The middle part should be familiar, but the key part comes at the end: the $LKJcorr(2)$. What's that? It is the LKJ distribution, named for [Lewandowski, Kurowicka, and Joe (2009)](https://www.sciencedirect.com/science/article/pii/S0047259X09000876), who came up with it. It is a good general-purpose prior for the correlation matrix $R$ of a Bayesian multivariate normal. What does it look like? Let's take a look by simulating some draws:

In [None]:
_, ax = plt.subplots(1, 1, figsize=(5, 5))
textloc = [[0, .56], [0, .8], [.4, .9]]
for eta, loc in zip([1, 2, 4], textloc):
    R = pm.LKJCorr.dist(n=2, eta=eta, size=1000).eval()
    sns.kdeplot(data=R);
    plt.text(loc[0], loc[1], 'η = %s'%(eta), horizontalalignment='center')

ax.set_ylim(0, 1.1)
ax.set_xlim(-1,1)
ax.set_xlabel('Correlation')
ax.set_ylabel('Density')
plt.tight_layout()
plt.savefig('LKJcorr.jpg',dpi=300);

So the η parameter controls how much prior mass sits on strong correlations, which range between -1 and 1. With η = 1 the prior is flat over that range; larger η values are more skeptical of highly negative or positive correlations.
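
For the 2×2 case used here, the LKJ density on the single correlation is proportional to $(1-\rho^2)^{\eta-1}$, which is the same as saying $(\rho+1)/2 \sim \mathrm{Beta}(\eta, \eta)$. A minimal sketch using that fact shows how the prior mass on extreme correlations drops as η grows. The helper name `prob_extreme` and the 0.8 cutoff are hypothetical choices for illustration.

```python
from scipy import stats

def prob_extreme(eta, cutoff=0.8):
    """P(|rho| > cutoff) under a 2x2 LKJ(eta) prior, via (rho + 1)/2 ~ Beta(eta, eta)."""
    b = stats.beta(eta, eta)
    lower_tail = b.cdf((1 - cutoff) / 2)  # P(rho < -cutoff)
    upper_tail = b.sf((1 + cutoff) / 2)   # P(rho > cutoff)
    return lower_tail + upper_tail

# Prior probability of a correlation stronger than 0.8 in either direction
tail_probs = {eta: prob_extreme(eta) for eta in (1, 2, 4)}
for eta, p in tail_probs.items():
    print(f"eta = {eta}: P(|rho| > 0.8) = {p:.3f}")
```

The same Beta relationship could be used to draw the analytic density on top of the simulated curves.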

With all this in place, we can now build our hierarchical model, using the multivariate normal:

In [None]:
?pm.expand_packed_triangular

In [None]:
with pm.Model() as MvCafes:
    # Prior on the cafe-level standard deviations
    # (HalfCauchy(2) here; the model statement above uses Exp(1))
    sd_dist = pm.HalfCauchy.dist(2)
    chol, corr, sigmas = pm.LKJCholeskyCov('chol_cov', eta=2, n=2, sd_dist=sd_dist)
    
    # Covariance matrix (not used below; MvNormal takes the Cholesky factor directly)
    cov = pm.math.dot(chol, chol.T)
    
    # Keep track of the standard deviations
    Sigs = pm.Deterministic('sigma_cafe', sigmas)
    
    # Extract rho so we can see how clever we are later
    r = pm.Deterministic('Rho', corr[np.triu_indices(2, k=1)])
    
    # Global intercept
    γ0 = pm.Normal('global_intercept', 5, 2)
    # Global slope
    γ1 = pm.Normal('global_slope', -1, 0.5)
    
    # Cafe intercepts and slopes
    β = pm.MvNormal('ab_cafe', mu=[γ0, γ1], chol=chol, shape=(ncafe, 2))
    
    # Linear model
    μ = β[:, 0][Ic]+β[:, 1][Ic]*PM
    
    # Data likelihood
    σ = pm.HalfCauchy('sigma', beta=2)
    Yi = pm.Normal('Yi', μ, σ, observed=wait)
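
It may help to see what the Cholesky factor holds. For a 2×2 covariance matrix it can be written directly in terms of the two standard deviations and the correlation. The sketch below uses plain NumPy with the simulation values as stand-ins (it does not touch the PyMC objects) and checks the result against `np.linalg.cholesky`.

```python
import numpy as np

# Stand-in values (the ones used to simulate the cafes)
sigma_g0, sigma_g1, rho = 1.2, 0.5, -0.7

# Covariance matrix S = diag(sigmas) @ R @ diag(sigmas), written element-wise
cov = np.array([
    [sigma_g0**2, rho * sigma_g0 * sigma_g1],
    [rho * sigma_g0 * sigma_g1, sigma_g1**2],
])

# Lower-triangular Cholesky factor written out by hand
L = np.array([
    [sigma_g0, 0.0],
    [rho * sigma_g1, sigma_g1 * np.sqrt(1 - rho**2)],
])

# L @ L.T rebuilds the covariance, and L matches NumPy's factorization
rebuilds = np.allclose(L @ L.T, cov)
matches = np.allclose(L, np.linalg.cholesky(cov))

# Going back: standard deviations and correlation from the covariance
sds = np.sqrt(np.diag(cov))
rho_back = cov[0, 1] / (sds[0] * sds[1])
print(rebuilds, matches, sds, rho_back)
```

Working with the factor rather than the full covariance matrix is why the model passes `chol=chol` to `pm.MvNormal`.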

In [None]:
with MvCafes:
    trace_c = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_c, var_names=['global_intercept','global_slope','chol_cov','sigma', 'Rho'])
dfi.export(tmp.style.background_gradient(), 'cafemodel.png')
tmp

First, let's take a look at how we did in estimating ρ

In [None]:
az.plot_posterior(trace_c, var_names=['Rho'])
# Prior with eta=2, matching the model (not the leftover loop variable from the LKJ plot)
sns.kdeplot(data=pm.LKJCorr.dist(n=2, eta=2, size=1000).eval(),linestyle=':',label='Prior')
plt.axvline(rho,c='red')
plt.title('')
plt.legend()
plt.xlim(-1,1)
plt.tight_layout()
plt.savefig('rho.jpg',dpi=300);

Not terrible, off by a bit.

So what else have we gained through use of our bivariate normal? Well shrinkage of course, but now in two dimensions:

In [None]:
# Create dataframe to use pandas
df = pd.DataFrame(dict(cafe=Ic , pm=PM , wait=wait))
# Unpooled estimates
obs_mean = (df.groupby(['pm', 'cafe']).agg('mean').unstack(level=0).values)
b0obs = obs_mean[:, 0]
b1obs = obs_mean[:, 1] - b0obs

In [None]:
# Partially pooled estimates (posterior means from the first chain)
b0_ = trace_c.posterior['ab_cafe'].values[0].mean(0).T[0]
b1_ = trace_c.posterior['ab_cafe'].values[0].mean(0).T[1]

In [None]:
b0_

In [None]:
b1_

In [None]:
# Posterior covariance of the bivariate Normal, rebuilt from the mean Cholesky factor
chol_model = pm.expand_packed_triangular(2, trace_c.posterior['chol_cov'].values[0].mean(0),lower=True).eval()
Sigma_est = np.dot(chol_model, chol_model.T)

# Global posterior bivariate Normal intercept and slope (as plain floats)
Mu_est = [float(trace_c.posterior['global_intercept'].mean()), float(trace_c.posterior['global_slope'].mean())]

In [None]:
# Draw contours and show shrinkage
_, ax = plt.subplots(1, 1, figsize=(8, 8))
Gauss2d(Mu_est, np.asarray(Sigma_est), [0.1, 0.3, 0.5, 0.8, 0.99], ax=ax)
ax.scatter(b0obs, b1obs, label='Obs')
ax.scatter(b0_, b1_, facecolors='none', edgecolors='k', lw=1, label='Est')
ax.plot([b0obs, b0_], [b1obs, b1_], 'k-', alpha=.5)
ax.set_xlabel('Intercept', fontsize=14)
ax.set_ylabel('Slope', fontsize=14)
ax.set_xlim(1.5, 6.1)
ax.set_ylim(-2.5, .5)
plt.legend()
plt.tight_layout()
plt.savefig('shrink.jpg',dpi=300);
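
One way to see why the correlation matters for shrinkage: in a bivariate normal, knowing a cafe's intercept shifts the expected slope. The sketch below applies the standard conditional-normal formulas to the simulation's true values (in practice one would plug in posterior estimates). The function name `slope_given_intercept` is hypothetical.

```python
import numpy as np

def slope_given_intercept(b0, g0, g1, sigma_g0, sigma_g1, rho):
    """Conditional mean and sd of the slope given the intercept (bivariate normal)."""
    cond_mean = g1 + rho * (sigma_g1 / sigma_g0) * (b0 - g0)
    cond_sd = sigma_g1 * np.sqrt(1 - rho**2)
    return cond_mean, cond_sd

# A cafe whose morning wait is one standard deviation above the average
m, s = slope_given_intercept(b0=3.5 + 1.2, g0=3.5, g1=-1.0,
                             sigma_g0=1.2, sigma_g1=0.5, rho=-0.7)
print(f"expected slope: {m:.2f}, conditional sd: {s:.2f}")
```

Here the expected slope moves from -1 to about -1.35, and the conditional standard deviation is smaller than the marginal 0.5, so a cafe with long morning waits is pulled toward a larger afternoon drop.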

Now, having learned how to do the multivariate normal stuff, I'm going to sneak in some personal experience: in general, the Hamiltonian algorithms are so good with highly correlated variables that we often don't need the correlation matrix (unless something important is in there). So let's take a look at just the hierarchical model version on its own:

In [None]:
with pm.Model() as Cafes:
    # Global intercept
    γ0 = pm.Normal('global_intercept', 5, 2)
    # Global slope
    γ1 = pm.Normal('global_slope', -1, 0.5)
    # Standard deviations
    σγ0 = pm.Exponential('σγ0',1)
    σγ1 = pm.Exponential('σγ1',1)
    
    # Cafe intercepts and slopes
    β0 = pm.Normal('cafe_intercept', γ0, σγ0, shape=ncafe)
    β1 = pm.Normal('cafe_slope', γ1, σγ1, shape=ncafe)
    
    # Linear model
    μ = β0[Ic]+β1[Ic]*PM
    
    # Data likelihood
    σ = pm.HalfCauchy('sigma', beta=2)
    Yi = pm.Normal('Yi', μ, σ, observed=wait)

In [None]:
with Cafes:
    trace_c2 = pm.sample(1000)

In [None]:
pm.summary(trace_c2)

And with these results we can take a look at how they compare to our multivariate normal results above

In [None]:
# Multivariate intercepts and slopes
mv_int = trace_c.posterior['ab_cafe'].values[0].mean(0).T[0]
mv_slope = trace_c.posterior['ab_cafe'].values[0].mean(0).T[1]
# Hierarchical intercepts and slopes
h_int = trace_c2.posterior['cafe_intercept'].values[0].mean(0)
h_slope = trace_c2.posterior['cafe_slope'].values[0].mean(0)

In [None]:
_, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].scatter(b0,mv_int, label='MvNormal')
ax[0].scatter(b0,h_int,c='red',label='Not')
# x-axis: true simulated values; y-axis: posterior mean estimates
ax[0].set_xlabel('True intercept', fontsize=15)
ax[0].set_ylabel('Estimated intercept', fontsize=15)
ax[0].legend()
ax[1].scatter(b1,mv_slope)
ax[1].scatter(b1,h_slope,c='red')
ax[1].set_xlabel('True slope', fontsize=15)
ax[1].set_ylabel('Estimated slope', fontsize=15)
plt.tight_layout()
plt.savefig('hlm.jpg',dpi=300);

Which is crazy: when we look at the posteriors for one of our cafe intercepts and slopes, we can see the correlation in both kinds of posterior:

In [None]:
_, ax = plt.subplots(1, 2, figsize=(10, 5))
# Plot a multivariate normal slope vs intercept
ax[0].scatter(trace_c.posterior['ab_cafe'].values[0].T[0][0],trace_c.posterior['ab_cafe'].values[0].T[1][0])
ax[0].set_xlabel('Intercept', fontsize=15)
ax[0].set_ylabel('Slope', fontsize=15)
ax[0].set_title('MvNormal', fontsize=15)

# Plot a hierarchical normal slope vs intercept
ax[1].scatter(trace_c2.posterior['cafe_intercept'].values[0].T[0],trace_c2.posterior['cafe_slope'].values[0].T[0])
ax[1].set_xlabel('Intercept', fontsize=15)
ax[1].set_ylabel('Slope', fontsize=15)
ax[1].set_title('Not', fontsize=15);
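
To put a number on what the scatterplots show, one could compute the correlation between the intercept and slope draws for each cafe. A minimal sketch follows. The helper `draw_correlations` is hypothetical, and synthetic draws with a made-up correlation of -0.5 stand in for the posterior arrays so that the block runs on its own.

```python
import numpy as np

def draw_correlations(intercepts, slopes):
    """Per-cafe correlation between intercept and slope draws.

    Both inputs have shape (n_draws, n_cafes).
    """
    a = intercepts - intercepts.mean(axis=0)
    b = slopes - slopes.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))

# Synthetic stand-in for posterior draws: 3 cafes, 5000 draws each,
# generated with a known within-cafe correlation of -0.5 (assumption)
rng = np.random.default_rng(1)
cov = np.array([[1.0, -0.5], [-0.5, 1.0]])
fake = rng.multivariate_normal([0, 0], cov, size=(5000, 3))
cors = draw_correlations(fake[..., 0], fake[..., 1])
print(cors.round(2))
```

In the notebook, the inputs would be arrays such as `trace_c2.posterior['cafe_intercept'].values[0]` and `trace_c2.posterior['cafe_slope'].values[0]`, which have one row per draw and one column per cafe.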

So, what happens to our beloved shrinkage? 

In [None]:
# Partially pooled estimates (model without the correlation matrix)
b0_ = trace_c2.posterior['cafe_intercept'].values[0].mean(0)
b1_ = trace_c2.posterior['cafe_slope'].values[0].mean(0)

In [None]:
# Draw contours and show shrinkage
_, ax = plt.subplots(1, 1, figsize=(5, 5))
Gauss2d(Mu_est, np.asarray(Sigma_est), [0.1, 0.3, 0.5, 0.8, 0.99], ax=ax, ec='lightgrey')
ax.scatter(b0obs, b1obs)
ax.scatter(b0_, b1_, facecolors='none', edgecolors='k', lw=1)
ax.plot([b0obs, b0_], [b1obs, b1_], 'k-', alpha=.5)
ax.set_xlabel('Intercept', fontsize=14)
ax.set_ylabel('Slope', fontsize=14)
ax.set_xlim(1.5, 6.1)
ax.set_ylim(-2.5, 0);

It still happens, despite the lack of a correlation matrix, by following the gradients.

For a bit more insight into Hamiltonian MC, it's worth watching this great talk by Michael Betancourt: https://www.youtube.com/watch?v=VnNdhsm0rJQ