## Week 11 Lecture 2 - It's instrumental 

McElreath's lectures for today: https://www.youtube.com/watch?v=oMiSb8GKR0o&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=19

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/stat_rethinking_2022

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

Dustin Stansbury has some lovely PyMC Code available here: https://github.com/dustinstansbury/statistical-rethinking-2023

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import pymc as pm
import arviz as az
from matplotlib import pyplot as plt
import networkx as nx
import dataframe_image as dfi


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)

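def stdizeNA(x):
    # Standardize the non-missing entries; work on a float copy so the input is not modified
    xnew = np.array(x, dtype=float)
    mask = ~np.isnan(xnew)
    xnew[mask] = (xnew[mask]-np.mean(xnew[mask]))/np.std(xnew[mask])
    return xnew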

def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

def logit(p):
    return np.log(p) - np.log(1 - p)

def invlogit(p):
    return np.exp(p) / (1 + np.exp(p))


from matplotlib.patches import Ellipse
from scipy.stats import chi2


# Causal instruments

Unmeasured variables can ruin your day. It may be that you have the greatest dataset ever, with all the variables you care about collected without error. Except for that one common cause that, for one reason or another, no one bothered to think about. Unfortunately in many cases such unmeasured variables generate an unknowable level of bias that will put all your causal conclusions in doubt. It's just the way of the world.

Yet not all is lost - in some cases, if you understand the problem really well, there lies hope. Given some very specific conditions, there may be another measured variable that will save you - the instrument, a variable that bypasses the confounding backdoor and delivers causal inference on a plate.

The special conditions are only three, and are often hard to meet, but they can save the day if met. For a simple problem where Education ($E$) affects wages ($W$), we might have some unmeasured confound ($U$):

In [None]:
fig = plt.figure(figsize=(5,5))
G = nx.DiGraph(directed=True)
G.add_edges_from(
    [('U', 'W'), ('U', 'E'), ('E', 'W')], width=3)
cdict = {'U':'red', 'W':'black', 'E':'black'}
options = {
    'node_color': 'white',
    'node_size': 400,
    'width': 2,
    'arrowstyle': '->',
    'arrowsize': 15,
    "edgecolors": [cdict[i] for i in np.array(G.nodes)],
    "alpha": 0.7
}
nx.draw_networkx(G, arrows=True, **options)
plt.savefig('basicDAG.jpg',dpi=300)

If we can find some sort of instrument, we can overcome this problem. Here an instrument might be the quarter of the year ($Q$) that someone is born in. This quarter variable $Q$ is instrumental if:

1. $Q$ is independent of $U$ ($Q \perp\!\!\!\perp U$)
2. $Q$ is not independent of $E$ ($Q \not\!\perp\!\!\!\perp E$)
3. $Q$ cannot influence $W$ except through $E$

The first condition is obvious - U must not confound the instrument Q. The second is also straightforward - Q needs to be associated with E, so that changes in Q show up as changes in E. The third condition is there because we need the effect of Q on W to work only through E. Importantly, there may be cases where this can be achieved by blocking other backdoor paths (and dagitty has settings for this), but for clarity here it holds because there are no pathways from Q other than through E.
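As a quick numerical illustration of the first two conditions, here is a minimal sketch using simulated data with the same structure as the DAG. The seed, sample size and `_chk` variable names are assumptions made for this example. Note that condition 1 can only be checked here because we simulate U ourselves, and condition 3 cannot be verified from data at all - it has to come from understanding the problem.

```python
import numpy as np

# Assumed setup: U -> E, U -> W, Q -> E, and no effect of E on W
rng_chk = np.random.default_rng(42)   # arbitrary seed
n_chk = 5000                          # large sample to reduce noise

U_chk = rng_chk.normal(size=n_chk)
Q_chk = rng_chk.choice([1, 2, 3, 4], size=n_chk)
E_chk = rng_chk.normal(loc=U_chk + Q_chk)

# Condition 1: Q independent of U, so their correlation should be near zero
corr_QU = np.corrcoef(Q_chk, U_chk)[0, 1]
# Condition 2: Q associated with E, so their correlation should be clearly non-zero
corr_QE = np.corrcoef(Q_chk, E_chk)[0, 1]

print(f"corr(Q, U) = {corr_QU:.3f}")
print(f"corr(Q, E) = {corr_QE:.3f}")
```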

For the DAG above, it might look like this

In [None]:
fig = plt.figure(figsize=(5,5))
G = nx.DiGraph(directed=True)
G.add_edges_from(
    [('U', 'W'), ('U', 'E'), ('Q', 'E'), ('E', 'W')], width=3)
cdict = {'U':'red', 'Q':'green', 'W':'black', 'E':'black'}
options = {
    'node_color': 'white',
    'node_size': 400,
    'width': 2,
    'arrowstyle': '->',
    'arrowsize': 15,
    "edgecolors": [cdict[i] for i in np.array(G.nodes)],
    "alpha": 0.7
}
nx.draw_networkx(G, arrows=True, **options)
plt.savefig('InstrumentDAG.jpg',dpi=300)

To make this all concrete, let's simulate some data for the situation above

In [None]:
# Number of cases
N = 500

# Create U
U_sim = np.random.normal(size=N)
# Create Q
Q_sim = np.random.choice([1, 2, 3, 4], size=N, replace=True)
# Create E
E_sim = np.random.normal(loc=U_sim + Q_sim)
# Create W
W_sim = np.random.normal(loc=U_sim + 0*E_sim)

# Standardize the data
dat_sim = {
    "W": sp.stats.zscore(W_sim),
    "E": sp.stats.zscore(E_sim),
    "Q": sp.stats.zscore(Q_sim)
}

So here, where the effect of $E$ is set to be zero, let's see what the consequences of U are on our inference

In [None]:
with pm.Model() as m14_4:
    # Priors
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Prior for aW
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Prior for bEW
    sigma = pm.Exponential("sigma", lam=1)  # Prior for sigma

    # Linear model
    mu = aW + bEW * dat_sim["E"]

    # Likelihood
    W = pm.Normal("W", mu=mu, sigma=sigma, observed=dat_sim["W"])

In [None]:
with m14_4:
    # Sampling
    trace_basic = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.summary(trace_basic)

In [None]:
pm.plot_forest(trace_basic, var_names=['aW','bEW','sigma'],figsize=(5,5),combined=True)
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('basic.jpg',dpi=300);

So when we run an ignorant analysis, where we don't have a DAG to tell us that U is an important confounder, the results show that E has a clear effect on W - something around 0.4 [0.3, 0.5]. Clearly we have a problem.
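The size of this bias can be worked out by hand. The sketch below assumes the values used in the simulation (unit variance for U, unit noise in E and W, Q uniform on 1 to 4) and uses the fact that, with both variables standardized, the regression slope equals the correlation. The variable names are made up for this example.

```python
import numpy as np

# Assumed from the simulation: Var(U) = 1, unit noise, Q uniform on {1, 2, 3, 4}
var_U = 1.0
var_Q = np.var([1, 2, 3, 4])       # 1.25
var_E = var_U + var_Q + 1.0        # E = U + Q + noise
var_W = var_U + 1.0                # W = U + noise (E has no effect)
cov_WE = var_U                     # U is the only term W and E share

# Expected slope of standardized W on standardized E = corr(W, E)
expected_bEW = cov_WE / np.sqrt(var_W * var_E)
print(round(expected_bEW, 3))      # about 0.39
```

So a slope near 0.4 is what the confound alone produces, even though the true effect of E is zero.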

Now how should we apply our instrument? Simply add it to the model?

In [None]:
with pm.Model() as m14_5:
    # Priors
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Prior for aW
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Prior for bEW
    bQW = pm.Normal("bQW", mu=0, sigma=0.5)  # Prior for bQW
    sigma = pm.Exponential("sigma", lam=1)  # Prior for sigma

    # Linear model
    mu = aW + bEW * dat_sim["E"] + bQW * dat_sim["Q"]

    # Likelihood
    W = pm.Normal("W", mu=mu, sigma=sigma, observed=dat_sim["W"])

In [None]:
with m14_5:
    # Sampling
    trace_nutty = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.summary(trace_nutty)

In [None]:
pm.plot_forest(
    [trace_basic,trace_nutty], 
    model_names = ['Basic','Amplified'],
    var_names=['aW','bEW','bQW','sigma'],
    figsize=(5,5),
    combined=True

)
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('nutty.jpg',dpi=300);

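Ok, now things are going really nuts - the estimated effect of E has become even larger, and Q picks up a negative coefficient of its own. Conditioning on Q strips out the part of E's variation that is free of U, so the variation left in E is even more dominated by the confound. Clearly that's not how to introduce the instrument, as it simply amplifies the bias from U.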
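This amplification is easy to reproduce outside of PyMC. The sketch below uses ordinary least squares on a fresh, larger simulation as a stand-in for the Bayesian regressions above; the seed, sample size, the `zs` helper and the `_amp` names are assumptions for this example only.

```python
import numpy as np

def zs(x):
    """Standardize to mean 0, sd 1 (hypothetical helper for this sketch)."""
    return (x - x.mean()) / x.std()

rng_amp = np.random.default_rng(1)   # arbitrary seed
n_amp = 20000
U_amp = rng_amp.normal(size=n_amp)
Q_amp = rng_amp.choice([1, 2, 3, 4], size=n_amp)
E_amp = rng_amp.normal(loc=U_amp + Q_amp)
W_amp = rng_amp.normal(loc=U_amp)    # true effect of E on W is zero

Ws, Es, Qs = zs(W_amp), zs(E_amp), zs(Q_amp)
ones = np.ones(n_amp)

# W ~ E (the basic model)
b_naive = np.linalg.lstsq(np.column_stack([ones, Es]), Ws, rcond=None)[0][1]
# W ~ E + Q (the instrument added as an ordinary predictor)
coef = np.linalg.lstsq(np.column_stack([ones, Es, Qs]), Ws, rcond=None)[0]
b_with_Q, b_Q = coef[1], coef[2]

print(f"slope for E alone: {b_naive:.2f}")
print(f"slope for E given Q: {b_with_Q:.2f}, slope for Q: {b_Q:.2f}")
```

The slope for E moves further from zero once Q is in the regression, and Q itself gets a negative slope, matching the pattern in the forest plot.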

Inclusion of Q needs to happen in a different way - through what is known as the *generative model*. A generative model is a way of thinking about modelling whereby you create a model based on how you think the data arose in the first place. In the case where you simulate data it's easy, in that we can simply translate each step in the simulation into a node in the model, so

`W_sim = np.random.normal(loc=U_sim + 0*E_sim)`

becomes

$$
W_i \sim \text{Normal}(\mu_{w,i}, \sigma_w)
$$
$$
\mu_{w,i} = \alpha_w + \beta_{ew} E_i + U_i
$$

`E_sim = np.random.normal(loc=U_sim + Q_sim)`

becomes

$$
E_i \sim \text{Normal}(\mu_{e,i}, \sigma_e)
$$
$$
\mu_{e,i} = \alpha_e + \beta_{qe} Q_i + U_i
$$

`Q_sim = np.random.choice([1, 2, 3, 4], size=N, replace=True)`

becomes

$$
Q_i \sim \text{Categorical}([0.25, 0.25, 0.25, 0.25])
$$

and 

`U_sim = np.random.normal(size=N)`

becomes

$$
U_i \sim \text{Normal}(0, 1)
$$

With all these elements in place, we can then translate them into a full statistical model that averages over $U$ and instead estimates the residual covariance between $W$ and $E$ - essentially dropping $U$ from both linear models and capturing its effect through the correlation parameter $\rho$:

$$
\begin{bmatrix}
W_i \\
E_i
\end{bmatrix}
\sim \text{MVNormal}
\left(
\begin{bmatrix}
\mu_{w,i} \\
\mu_{e,i}
\end{bmatrix},
\mathbf{S}
\right)
$$

$$
\mu_{w,i} = \alpha_w + \beta_{ew} E_i
$$
$$
\mu_{e,i} = \alpha_e + \beta_{qe} Q_i
$$
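
To make $\mathbf{S}$ explicit, here is a sketch of what it contains, assuming $U$ enters both linear models with coefficient 1 as in the simulation. The symbols $s_w$, $s_e$ and $\epsilon$ are notation introduced for this aside. Once $U$ is dropped from the linear models, the leftover terms are $U_i + \epsilon_{w,i}$ and $U_i + \epsilon_{e,i}$, which share $U_i$ and therefore covary:

$$
\mathbf{S} =
\begin{bmatrix}
s_w^2 & \rho s_w s_e \\
\rho s_w s_e & s_e^2
\end{bmatrix}
$$
$$
s_w^2 = \text{Var}(U) + \sigma_w^2, \quad
s_e^2 = \text{Var}(U) + \sigma_e^2, \quad
\rho s_w s_e = \text{Var}(U)
$$

With the simulated values ($\text{Var}(U) = \sigma_w = \sigma_e = 1$) this gives $\rho = 1/\sqrt{2 \cdot 2} = 0.5$ on the scale of the raw simulation. In the model itself, $\mathbf{S}$ is simply estimated through its standard deviations and $\rho$.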

In PyMC this becomes

In [None]:
with pm.Model() as m14_6:
    # Priors for intercepts
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Intercept for W
    aE = pm.Normal("aE", mu=0, sigma=0.2)  # Intercept for E

    # Priors for slopes
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Slope for E in muW
    bQE = pm.Normal("bQE", mu=0, sigma=0.5)  # Slope for Q in muE

    # Linear models
    muW = aW + bEW * dat_sim["E"]
    muE = aE + bQE * dat_sim["Q"]

    # Priors for correlation matrix and standard deviations
    RhoCO = pm.LKJCholeskyCov("RhoCO", n=2, eta=2, sd_dist=pm.Exponential.dist(1))
    chol, corr, stds = RhoCO

    # Sigma
    Sigma = pm.Deterministic('Sigma', stds)

    # Rho
    Rho = pm.Deterministic('Rho', corr)

    # Multivariate normal distribution for (W, E)
    cov = pm.Deterministic("cov", chol @ chol.T)
    mu = pm.Deterministic("mu", pm.math.stack([muW, muE], axis=1))
    observed = np.stack([dat_sim["W"], dat_sim["E"]], axis=1)

    # Likelihood
    W_E = pm.MvNormal("W_E", mu=mu, chol=chol, observed=observed)

In [None]:
with m14_6:
    # Sampling
    trace_boss = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.summary(trace_boss, var_names=['aE','aW','bQE','bEW','Rho','Sigma'])

In [None]:
pm.plot_forest(
    [trace_basic,trace_nutty,trace_boss], 
    model_names = ['Basic','Amplified','MvN'],
    var_names=['aE','aW','bQE','bEW','bQW','sigma','Rho','Sigma'],
    figsize=(5,5),
    combined=True
)
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('full.jpg',dpi=300);

Now we get what we want - $E$ no longer has an effect (we set it to zero after all) and we make the correct inference. What would have happened if we had not had $Q$ in here? The model allows for $U$ after all...

In [None]:
with pm.Model() as m14_6_2:
    # Priors for intercepts
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Intercept for W
    aE = pm.Normal("aE", mu=0, sigma=0.2)  # Intercept for E

    # Priors for slopes
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Slope for E in muW
    bQEx = pm.Normal("bQEx", mu=0, sigma=0.5)  # Slope for Q in muE

    # Linear models
    muW = aW + bEW * dat_sim["E"]
    muE = aE + (bQEx * dat_sim["Q"])*0

    # Priors for correlation matrix and standard deviations
    RhoCO = pm.LKJCholeskyCov("RhoCO", n=2, eta=2, sd_dist=pm.Exponential.dist(1))
    chol, corr, stds = RhoCO

    # Sigma
    Sigma = pm.Deterministic('Sigma', stds)

    # Rho
    Rho = pm.Deterministic('Rho', corr)

    # Multivariate normal distribution for (W, E)
    cov = pm.Deterministic("cov", chol @ chol.T)
    mu = pm.Deterministic("mu", pm.math.stack([muW, muE], axis=1))
    observed = np.stack([dat_sim["W"], dat_sim["E"]], axis=1)

    # Likelihood
    W_E = pm.MvNormal("W_E", mu=mu, chol=chol, observed=observed)

In [None]:
with m14_6_2:
    # Sampling
    trace_boss2 = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.summary(trace_boss2, var_names=['aE','aW','bQEx','bEW','Rho','Sigma'])

In [None]:
pm.plot_forest(
    [trace_basic,trace_nutty,trace_boss,trace_boss2], 
    model_names = ['Basic','Amplified','MvN','Ignorant MvN'],
    var_names=['aE','aW','bQE','bEW','bQW','sigma','Rho','Sigma'],
    figsize=(5,5),
    combined=True
)
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('overfull.jpg',dpi=300);

So the uncertainty in the effect blows up - it's better in the sense that the interval now overlaps zero, but it also spans the estimates from all the other models. Without Q there is nothing to separate bEW from Rho. So awful.
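One way to see why the model cannot pin down the effect without Q: different parameter values imply exactly the same covariance between W and E. The sketch below ignores the intercepts and uses a hypothetical helper, `implied_cov`, with illustrative parameter values that do not come from the fitted models.

```python
import numpy as np

def implied_cov(bEW, sd_w, sd_e, rho):
    """Covariance of (W, E) implied by W = bEW*E + r_w and E = r_e, where the
    residuals (r_w, r_e) have sds (sd_w, sd_e) and correlation rho."""
    var_E = sd_e**2
    cov_WE = bEW * var_E + rho * sd_w * sd_e
    var_W = bEW**2 * var_E + 2 * bEW * rho * sd_w * sd_e + sd_w**2
    return np.array([[var_W, cov_WE], [cov_WE, var_E]])

# No causal effect, all association carried by the residual correlation
cov_A = implied_cov(bEW=0.0, sd_w=1.0, sd_e=1.0, rho=0.4)
# A real causal effect and no residual correlation at all
cov_B = implied_cov(bEW=0.4, sd_w=np.sqrt(0.84), sd_e=1.0, rho=0.0)

print(np.allclose(cov_A, cov_B))   # True
```

Both settings fit the data equally well, so the posterior for bEW is free to wander. Adding Q to the model for E supplies the extra information needed to tell them apart.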

For hellery let's also run the generative model, where we estimate $U$

In [None]:
with pm.Model(coords={'group':['W','E']}) as m14_6_x:
    # Priors for intercepts
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Intercept for W
    aE = pm.Normal("aE", mu=0, sigma=0.2)  # Intercept for E

    # Priors for slopes
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Slope for E in muW
    bQEx = pm.Normal("bQEx", mu=0, sigma=0.5)  # Slope for Q in muE

    # Unmeasured confound: a single scalar term with the default Normal(0, 1) prior
    U = pm.Normal('U')

    # Linear models
    muW = aW + bEW * dat_sim["E"] + U
    muE = aE + (bQEx * dat_sim["Q"])*0 + U

    # Sigma
    Sigma = pm.Exponential('Sigma', 1, dims='group')

    # Likelihood
    W_ = pm.Normal("W_", muW, Sigma[0], observed=dat_sim["W"])
    E_ = pm.Normal("E_", muE, Sigma[1], observed=dat_sim["E"])

In [None]:
with m14_6_x:
    # Sampling
    trace_boss3 = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.plot_forest(
    [trace_basic,trace_nutty,trace_boss,trace_boss2,trace_boss3], 
    model_names = ['Basic','Amplified','MvN','Ignorant MvN','Linear Gen mod.'],
    var_names=['aE','aW','bQE','bEW','bQW','sigma','Rho','Sigma'],
    figsize=(5,5),
    combined=True
)
plt.axvline(0,linestyle=":",c='black')
plt.tight_layout()
plt.savefig('stuffed.jpg',dpi=300);

It doesn't really work at all - there's not enough information in the independent U to estimate its effect.


Let's try this again where we vary the parameters so that $E$ now has an effect

In [None]:
# Number of cases
N = 500

# Create U
U_sim = np.random.normal(size=N)
# Create Q
Q_sim = np.random.choice([1, 2, 3, 4], size=N, replace=True)
# Create E
E_sim = np.random.normal(loc=U_sim + Q_sim)
# Effect of E
bE_sim = -1.
# Create W
W_sim = np.random.normal(loc=U_sim + bE_sim*E_sim)

# Standardize the data
dat_sim = {
    "W": sp.stats.zscore(W_sim),
    "E": sp.stats.zscore(E_sim),
    "Q": sp.stats.zscore(Q_sim)
}

And re-running our model above

In [None]:
with pm.Model() as m14_6_2:
    # Priors for intercepts
    aW = pm.Normal("aW", mu=0, sigma=0.2)  # Intercept for W
    aE = pm.Normal("aE", mu=0, sigma=0.2)  # Intercept for E

    # Priors for slopes
    bEW = pm.Normal("bEW", mu=0, sigma=0.5)  # Slope for E in muW
    bQE = pm.Normal("bQE", mu=0, sigma=0.5)  # Slope for Q in muE

    # Linear models
    muW = aW + bEW * dat_sim["E"]
    muE = aE + bQE * dat_sim["Q"]

    # Priors for correlation matrix and standard deviations
    RhoCO = pm.LKJCholeskyCov("RhoCO", n=2, eta=2, sd_dist=pm.Exponential.dist(1))
    chol, corr, stds = RhoCO

    # Sigma
    Sigma = pm.Deterministic('Sigma', stds)

    # Rho
    Rho = pm.Deterministic('Rho', corr)

    # Multivariate normal distribution for (W, E)
    cov = pm.Deterministic("cov", chol @ chol.T)
    mu = pm.Deterministic("mu", pm.math.stack([muW, muE], axis=1))
    observed = np.stack([dat_sim["W"], dat_sim["E"]], axis=1)

    # Likelihood
    W_E = pm.MvNormal("W_E", mu=mu, chol=chol, observed=observed)

In [None]:
with m14_6_2:
    # Sampling
    trace_bossy = pm.sample(chains=4, cores=4, return_inferencedata=True)

In [None]:
pm.summary(trace_bossy, var_names=['aE','aW','bQE','bEW','Rho','Sigma'])

In [None]:
pm.plot_forest(trace_bossy, var_names=['aE','aW','bQE','bEW','Rho','Sigma'],figsize=(5,5),combined=True)
plt.axvline(0,linestyle=":",c='black')
plt.axvline(bE_sim,c='red',zorder=0)
plt.tight_layout()
plt.savefig('covar.jpg',dpi=300);

Now we're cooking - we get back the $-1$ish effect of $E$ on $W$. 
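
As a rough cross-check that does not rely on PyMC, the classical ratio (Wald) instrumental-variable estimator, cov(Q, W) / cov(Q, E), should land near the simulated effect in both scenarios. This is a different technique from the multivariate normal model used in this lecture, and the function names, seed and sample size below are assumptions for the sketch. It works on the raw, unstandardized scale.

```python
import numpy as np

def simulate(bE, n, rng):
    """Simulate (Q, E, W) with the same structure as the lecture's simulation."""
    u = rng.normal(size=n)
    q = rng.choice([1, 2, 3, 4], size=n)
    e = rng.normal(loc=u + q)
    w = rng.normal(loc=u + bE * e)
    return q, e, w

def wald_ratio(q, e, w):
    """Instrumental-variable ratio estimate: cov(Q, W) / cov(Q, E)."""
    return np.cov(q, w)[0, 1] / np.cov(q, e)[0, 1]

rng_iv = np.random.default_rng(7)   # arbitrary seed
iv_estimates = {}
for bE_true in (0.0, -1.0):
    q, e, w = simulate(bE_true, n=5000, rng=rng_iv)
    naive = np.cov(e, w)[0, 1] / np.var(e, ddof=1)   # plain slope of W on E
    iv_estimates[bE_true] = wald_ratio(q, e, w)
    print(f"true effect = {bE_true:+.1f} | naive slope = {naive:+.2f} | "
          f"ratio estimate = {iv_estimates[bE_true]:+.2f}")
```

The naive slope is pulled away from the truth by U in both cases, while the ratio estimate stays close to the simulated value, which is the same story the multivariate normal model tells.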