## Week 11 Lecture 1 - Missing Data and other Opportunities

McElreath's lectures for today: https://www.youtube.com/watch?v=oMiSb8GKR0o&list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN&index=19

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/stat_rethinking_2022

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

Dustin Stansbury has some lovely PyMC Code available here: https://github.com/dustinstansbury/statistical-rethinking-2023

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import pymc as pm
import arviz as az
from matplotlib import pyplot as plt
import networkx as nx
import dataframe_image as dfi


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)

def stdizeNA(x):
    # Work on a float copy so the caller's array is not modified in place
    xnew = np.array(x, dtype=float)
    mask = ~np.isnan(xnew)
    xnew[mask] = (xnew[mask]-np.mean(xnew[mask]))/np.std(xnew[mask])
    return xnew

def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

def logit(p):
    return np.log(p) - np.log(1 - p)

def invlogit(p):
    return np.exp(p) / (1 + np.exp(p))


from matplotlib.patches import Ellipse
from scipy.stats import chi2


# Latent states

With many observational datasets, we will have both means **and associated variances** (often called observation error) to work with. This could be a measure of variability in a mass-spectrometer for some sample of plant tissue, or it could be summary data from a whole lot of studies (i.e. a meta-analysis). So far we've ignored this, but it can be crucial for accurately representing uncertainty and for making better (unbiased) estimates.

By way of example, let's look back at the divorce rates data from the US

In [None]:
ddata = pd.read_csv('WaffleDivorce.csv', sep=';')
ddata['log_population'] = np.log(ddata['Population'])
# Grab variables
A = stdize(ddata.MedianAgeMarriage.values)
M = stdize(ddata.Marriage.values)
Mse = ddata['Marriage SE'].values/np.std(ddata.Marriage.values)
D = stdize(ddata.Divorce.values)
Dse = ddata['Divorce SE'].values/np.std(ddata.Divorce.values)
nstates = len(D)
State = ddata.Loc.values
dfi.export(ddata.head(5), 'divdata.jpg')
ddata.head()

If you look closely there are SE (standard error) columns, which tell us something about the level of uncertainty in the estimated marriage and divorce rates in each state. These are larger in less populous states because sample sizes there are much smaller. Let's focus on divorce rates and plot them, with their standard errors, against median age at marriage and log population.

In [None]:
_, ax = plt.subplots(1, 2, figsize=(12, 5))

# points
ax[0].scatter(ddata['MedianAgeMarriage'], ddata['Divorce'], marker='o', facecolor='white', edgecolors='k', linewidth=1)
# standard errors
ax[0].errorbar(ddata['MedianAgeMarriage'], ddata['Divorce'], ddata['Divorce SE'].values, ls='none', color='k', linewidth=1, zorder=0)
ax[0].set_xlabel('Median age marriage', fontsize=15)
ax[0].set_ylabel('Divorce rate', fontsize=15)
ax[0].set_ylim(4, 15)

# points
ax[1].scatter(ddata['log_population'], ddata['Divorce'], marker='o', facecolor='white', edgecolors='k', linewidth=1)
# standard errors
ax[1].errorbar(ddata['log_population'], ddata['Divorce'], ddata['Divorce SE'].values, ls='none', color='k', linewidth=1, zorder=0)
ax[1].set_xlabel('log(population)', fontsize=15)
ax[1].set_ylabel('Divorce rate', fontsize=15)
ax[1].set_ylim(4, 15)
plt.savefig('waffles.jpg',dpi=300);

These error bars represent a really important source of variability, one that should make us more skeptical (or at least less sure) about values on the left-hand side of the right-hand panel. Indeed, from that plot it looks like the most extreme (highest and lowest) mean divorce rates tend to occur where populations are smaller. So how should we add in this information? Well, you're in luck - it's remarkably easy in a Bayesian model (WAY easier than in a frequentist one), where we just add an additional layer to incorporate this source of uncertainty.

So let's take a look at our original model

In [None]:
# Standardize variables
A = stdize(ddata.MedianAgeMarriage.values)
Msd = ddata['Marriage SE'].values/np.std(ddata.Marriage.values)
M = stdize(ddata.Marriage.values)
Dsd = ddata['Divorce SE'].values/np.std(ddata.Divorce.values)
D = stdize(ddata.Divorce.values)

In [None]:
with pm.Model() as AM:
    # Baseline intercept
    β0 = pm.Normal('Intercept', 0, 0.2)
    # Marriage rate
    β1 = pm.Normal('M', 0, 0.5)
    # Marriage age
    β2 = pm.Normal('A', 0, 0.5)
    # Linear model
    μ = β0+β1*M+β2*A
    # Error
    σ = pm.Uniform('Sigma', 0, 10)
    # Likelihood
    Yi = pm.Normal('Yi', μ, σ, observed=D)

And now a second model that incorporates this uncertainty in divorce rates

In [None]:
with pm.Model(coords={'States':State}) as AM_latent:
    # Baseline intercept
    β0 = pm.Normal('Intercept', 0, 0.2)
    # Marriage rate
    β1 = pm.Normal('M', 0, 0.5)
    # Marriage age
    β2 = pm.Normal('A', 0, 0.5)
    # Linear model
    μ = β0+β1*M+β2*A
    # Error
    σ = pm.Uniform('Sigma', 0, 10)
    # Latent state
    Dμ = pm.Normal('Latent_divorce', μ, σ, dims='States')
    # Likelihood
    Yi = pm.Normal('Yi', Dμ, Dsd, observed=D)

Looking carefully, you can see that there is now another layer between the observed divorce rates and the rest of the model. This layer represents the true underlying divorce rates - called a latent state - which we can estimate based on both the observation error and the divorce rate estimates in other states. This kind of latent state (or measurement error) model is very powerful and we should use it wherever we are able.

Let's fire these up and see what happens

In [None]:
with AM:
    trace_am = pm.sample(1000)
with AM_latent:
    trace_aml = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_am)
dfi.export(tmp, 'trace_am.jpg')
tmp

In [None]:
tmp = pm.summary(trace_aml, var_names=['Intercept', 'M', 'A', 'Sigma'])
dfi.export(tmp, 'trace_aml.jpg')
tmp

Kinda similar, but take a look at Sigma - it's *gotten smaller*. Why? Because we've apportioned some of the variation to observation error, making us more sure about the variation in underlying divorce rates. This is why we model this stuff - to make better estimates.
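To see why Sigma shrinks, here is a minimal simulation sketch. The values `true_sd` and `obs_se` are made up for illustration and are not taken from the divorce data. When observation error sits on top of the latent variation, the spread of the observed values is roughly `sqrt(true_sd**2 + obs_se**2)`, so a model that ignores the error has to assign all of that spread to Sigma.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical values for illustration only (not estimated from the divorce data)
n = 100_000
true_sd = 0.8   # spread of the latent (true) values
obs_se = 0.6    # observation error added on top

latent = rng.normal(0.0, true_sd, n)
observed = latent + rng.normal(0.0, obs_se, n)

# The spread of the observed values mixes both sources of variation
sd_observed = observed.std()
sd_expected = np.sqrt(true_sd**2 + obs_se**2)
print(round(sd_observed, 2), round(sd_expected, 2))
```

Splitting the two sources apart, as the latent model does, leaves only the `true_sd` part for Sigma to explain.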

Let's take a look at what's happened to our latent state estimates relative to their observed means

In [None]:
# Latent estimates
Dl = trace_aml.posterior['Latent_divorce'].stack(sample=("chain", "draw")).values.mean(1)

In [None]:
_, ax = plt.subplots(1, 2, figsize=(12, 5))

# Plot difference between observed mean and latent mean against observed standard error
ax[0].scatter(Dsd, D-Dl, marker='o', facecolor='white', edgecolors='k', linewidth=1)
ax[0].axhline(0, linestyle='--', c='grey', zorder=0)
# Label outlying states
indx = abs(D-Dl)>0.5
oset=0.03
[ax[0].text(x+oset,y+oset,d) for x,y,d in zip(Dsd[indx], (D-Dl)[indx], State[indx])]
# Labels
ax[0].set_xlabel('D_SD', fontsize=15)
ax[0].set_ylabel('D_obs-D_est', fontsize=15)

# Regression line
xnew = np.linspace(min(A)-0.5,max(A)+1,100)
ynew = trace_aml.posterior['Intercept'].values.mean()+trace_aml.posterior['A'].values.mean()*xnew
ax[1].plot(xnew, ynew, c='dodgerblue',zorder=0)
# Size points proportional to D SE
pdx = Dsd*70
# Plot regressed divorce rate vs marriage age
ax[1].scatter(A, D, marker='o', facecolor='white', edgecolors='k', linewidth=1, label='Observed')
ax[1].scatter(A, Dl, marker='o', facecolor='black', edgecolors='k', linewidth=1, s=pdx, label='Latent')
[ax[1].plot((a,a),(d,dl), c='grey', zorder=0) for a,d,dl in zip(A,D,Dl)]
# Label outlying states
[ax[1].text(x+oset,y+oset,d) for x,y,d in zip(A[indx], (D)[indx], State[indx])]
# Labels
ax[1].set_ylabel('Divorce rate (std)', fontsize=15)
ax[1].set_xlabel('Median marriage age (std)', fontsize=15)
plt.legend()
plt.savefig('shrinkage.jpg',dpi=300);

What you can see from these panels is that the states that have the greatest observation error and are farthest from the regression expectation tend to show the largest shrinkage toward the regression line. Automagically.
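The shrinkage pattern above can be sketched with the standard normal-normal result. This is a simplification that holds the regression expectation `mu` and the residual spread `sigma` fixed, whereas the PyMC model estimates them jointly with the latent states. The function name and the numbers are hypothetical.

```python
import numpy as np

def latent_posterior_mean(d_obs, se, mu, sigma):
    """Posterior mean of a latent value with prior Normal(mu, sigma)
    and observation d_obs ~ Normal(latent, se), holding mu and sigma fixed."""
    w_obs = 1.0 / se**2       # precision of the observation
    w_reg = 1.0 / sigma**2    # precision of the regression expectation
    return (w_obs * d_obs + w_reg * mu) / (w_obs + w_reg)

# Hypothetical numbers: same observed value, different standard errors
mu, sigma, d_obs = 0.0, 1.0, 2.0
precise = latent_posterior_mean(d_obs, se=0.2, mu=mu, sigma=sigma)
noisy = latent_posterior_mean(d_obs, se=1.0, mu=mu, sigma=sigma)
print(precise, noisy)
```

The estimate is a precision-weighted average, so an observation with a large standard error gets pulled further toward the regression expectation than a precise one at the same observed value.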

Aside from observation errors in divorce rate, the dataset also has observation error for the marriage rates, which is a covariate. No problem! In Bayesland we can incorporate this in a similar way

In [None]:
with pm.Model(coords={'States':State}) as AM_latent2:
    # Baseline intercept
    β0 = pm.Normal('Intercept', 0, 0.2)
    # Marriage rate
    β1 = pm.Normal('M', 0, 0.5)
    # Latent marriage rate
    Mμ = pm.Normal('Latent_marriage', 0, 1, dims='States')
    # Observed marriage rate
    Zi = pm.Normal('Zi', Mμ, Msd, observed=M)
    # Marriage age
    β2 = pm.Normal('A', 0, 0.5)
    # Linear model
    μ = β0+β1*Mμ+β2*A
    # Error
    σ = pm.Uniform('Sigma', 0, 10)
    # Latent state
    Dμ = pm.Normal('Latent_divorce', μ, σ, dims='States')
    # Likelihood
    Yi = pm.Normal('Yi', Dμ, Dsd, observed=D)

Here we've got a new hierarchical model for marriage rates nested within our larger model, including a data likelihood for the observed marriage rates ($M$), from which we estimate the latent marriage rates ($M\mu$) that are subsequently used to estimate $\mu$ in the model. Let's see how this goes

In [None]:
with AM_latent2:
    trace_aml2 = pm.sample(1000)

In [None]:
# Latent estimates take 2
Dl2 = trace_aml2.posterior['Latent_divorce'].stack(sample=("chain", "draw")).values.mean(1)
# Latent marriage rate estimates
Ml = trace_aml2.posterior['Latent_marriage'].stack(sample=("chain", "draw")).values.mean(1)

In [None]:
_, ax = plt.subplots(1, 2, figsize=(12, 5))

# Plot difference between observed mean and latent mean against observed standard error
ax[0].scatter(Dsd, D-Dl, marker='o', facecolor='white', edgecolors='k', linewidth=1)
ax[0].scatter(Dsd, D-Dl2, marker='o', facecolor='white', edgecolors='red', linewidth=1)
ax[0].axhline(0, linestyle='--', c='grey', zorder=0)
# Label outlying states
indx = abs(D-Dl)>0.5
oset=0.03
[ax[0].text(x+oset,y+oset,d) for x,y,d in zip(Dsd[indx], (D-Dl)[indx], State[indx])]
# Labels
ax[0].set_xlabel('D_SD', fontsize=15)
ax[0].set_ylabel('D_obs-D_est', fontsize=15)

# Plot divorce rate vs marriage rate
ax[1].scatter(M, D, marker='o', facecolor='white', edgecolors='k', linewidth=1, label='Observed')
ax[1].scatter(Ml, Dl2, marker='o', facecolor='red', edgecolors='red', linewidth=1, label='Latent2')
ax[1].scatter(M, Dl, marker='o', facecolor='black', edgecolors='k', linewidth=1, s=pdx, label='Latent',zorder=0)
[ax[1].plot((m,ml),(d,dl), c='grey', zorder=0) for m,ml,d,dl in zip(M,Ml,D,Dl2)]
[ax[1].plot((m,ml),(d,dl), c='grey', linestyle=':', zorder=0) for m,ml,d,dl in zip(M,Ml,Dl,Dl2)]
# Label outlying states
[ax[1].text(x+oset,y+oset,d) for x,y,d in zip(M[indx], (D)[indx], State[indx])]
# Labels
ax[1].set_ylabel('Divorce rate (std)', fontsize=15)
ax[1].set_xlabel('Marriage rate (std)', fontsize=15)
plt.legend()
plt.savefig('shrinkage2.jpg',dpi=300);

There's a bit more movement in the latent state estimates than in the previous model, with Maine (ME) in particular showing a lot of shrinkage - possibly because it has the highest divorce rate but a relatively small population, making the estimated marriage rates far less likely given the observation error and the model. But in the right panel we can see how the latent estimates now move toward the estimated regression line for **both marriage rate and divorce rate**, which we can also see by looking back at the much larger estimated relationship between M and D:

In [None]:
pm.summary(trace_aml, var_names=['Intercept', 'M', 'A', 'Sigma'])

In [None]:
tmp = pm.summary(trace_aml2, var_names=['Intercept', 'M', 'A', 'Sigma'])
dfi.export(tmp, 'trace_aml2.jpg')
tmp

# Missing data

Among the oddest realities of Bayesian statistics for the orthodox student is that the only difference between random variates and data is that the values of data are fixed. Both are conditional on some stochastic node (distribution). Given some mean $\mu$ and standard deviation $\sigma$ in a normal distribution, for example, random values can be realized from that distribution in proportion to their likelihood conditional on $\mu$ and $\sigma$. When data are present, however, it is $\mu$ and $\sigma$ that become, in a sense, random - they are adjusted toward values that make the observed data likely. The information flows in the other direction.

What's cool about this for missing data is that both processes can happen at once - if you have 1000 observations, 10 of which are missing, $\mu$ and $\sigma$ can be estimated by the other 990 (known) observations and then used to generate random values for the 10 missing observations. 
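The 1000-observation example above can be sketched in plain NumPy. This is a plug-in approximation with simulated data and hypothetical values (mean 10, standard deviation 2): it estimates $\mu$ and $\sigma$ first and draws the missing values second, whereas a Bayesian model does both at once and carries the uncertainty in $\mu$ and $\sigma$ through to the imputed values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with hypothetical mean 10 and sd 2; 10 of 1000 values go missing
y = rng.normal(10.0, 2.0, 1000)
missing = np.zeros(1000, dtype=bool)
missing[rng.choice(1000, size=10, replace=False)] = True
y_obs = np.where(missing, np.nan, y)

# Information flows from the 990 observed values to mu and sigma...
mu_hat = np.nanmean(y_obs)
sigma_hat = np.nanstd(y_obs, ddof=1)

# ...and from mu and sigma to plausible values for the 10 missing ones
y_imputed = y_obs.copy()
y_imputed[missing] = rng.normal(mu_hat, sigma_hat, missing.sum())
```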

By way of example, let's take a look at the primates milk data again

In [None]:
# Import data
mdata = pd.read_csv('milk.csv', sep=';')
# Add log(mass) column
mdata['log(mass)'] = np.log(mdata.mass.values)
dfi.export(mdata.head(), 'milk.jpg')
mdata.head()

Straight away we can see there are a lot of missing values in the brain size (`neocortex.perc`) column. But why are they missing? This seemingly unimportant question has three distinct (and terribly named) possibilities:

1. **Missing completely at random (MCAR)** - the probability that any given datapoint is missing is equal over the whole dataset.

2. **Missing at random (MAR)** - the probability that any given datapoint is missing is dependent on some other variable that is fully observed

3. **Missing not at random (MNAR)** - the probability that any given datapoint is missing is dependent on some other variable that is unobserved
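The three mechanisms can be illustrated with a small simulation. Everything here is hypothetical: `m` stands in for a fully observed covariate, `b` for the variable that goes missing, and the coefficients are arbitrary. For MNAR, the sketch lets missingness depend on `b` itself, which is unobserved wherever it is missing.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Hypothetical standardized variables: m is fully observed, b can go missing
m = rng.normal(size=n)
b = 0.7 * m + rng.normal(scale=np.sqrt(1 - 0.7**2), size=n)

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# MCAR: the same probability of missingness everywhere
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends on the fully observed m
miss_mar = rng.random(n) < inv_logit(-1.0 + 1.5 * m)
# MNAR: missingness depends on b itself, unobserved wherever it is missing
miss_mnar = rng.random(n) < inv_logit(-1.0 + 1.5 * b)

# Mean of the values of b that remain observed under each mechanism
for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    print(name, round(b[~miss].mean(), 2))
```

Under MCAR the observed values remain representative of `b`, while under MAR and MNAR their mean drifts away from zero. In the MAR case the drift can be corrected using `m`; in the MNAR case the information needed to correct it is missing too.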

We'll build two models to estimate brain size (B), the first being MCAR and the second being MAR, where values for B are dependent on body mass (M). Back with PyMC3 this was done through numpy's `masked_array` feature:

In [None]:
# Grab variables of interest
kcal = stdize(mdata['kcal.per.g'])
M = stdize(mdata['log(mass)'].values)
# Grab B with NaN's
B = stdizeNA(mdata['neocortex.perc'].values)
# Mask NaN's
B_ = np.ma.masked_array(B, mask=np.isnan(B))
B_

In [None]:
B

What we've got here is a multifaceted object: a `data` array, with numbers and dashes indicating observed values and placeholders for the missing values, as well as a `mask` array of true/false values indicating whether each value is missing. Lastly there is a `fill_value` attribute, which gives the number used as a placeholder for the missing values so that the full array can be held in memory. With PyMC v4.0, however, NAs are handled automagically and the MCAR model is very similar to our observation error model, the difference being that the latent state here is only partially so - where there is data, those values don't change:

In [None]:
# PyMC Model
with pm.Model() as milker:
    # Priors
    β0 = pm.Normal('Intercept', 0, .2)
    β1 = pm.Normal('log(mass)', 0, .5)
    β2 = pm.Normal('neocortex_perc', 0, .5)
    σ = pm.Exponential('Sigma', 1)
    
    # Missing data
    ν = pm.Normal('BS_mean', 0.5, 1)
    σν = pm.Exponential('BS_sigma', 1)
    Bμ = pm.Normal('Bμ', ν, σν, observed=B)
    
    # Linear model
    μ = β0+β1*M+β2*Bμ
    
    # Likelihood
    yi = pm.Normal('yi',μ, σ, observed=kcal)

In [None]:
with milker:
    trace_m = pm.sample(1000)

In [None]:
pm.summary(trace_m)

And so what do we get back? From the MCAR model, we get estimates for the missing brain size values conditional on the model and their group average `BS_mean`. We can pull those estimates, and the uncertainty around them, out of the posterior:

In [None]:
# Missing value estimates
tmp = trace_m.posterior['Bμ'].stack(sample=("chain", "draw")).values
Bl = tmp.mean(1)
Bll90 = np.percentile(tmp,5,axis=1)
Blu90 = np.percentile(tmp,95,axis=1)
Bll50 = np.percentile(tmp,25,axis=1)
Blu50 = np.percentile(tmp,75,axis=1)

In [None]:
len(kcal[B_.mask]),len(Bl[B_.mask])

In [None]:
_, ax = plt.subplots(1, 2, figsize=(12, 5))

# Plot observed values
ax[0].scatter(B, kcal, marker='o', facecolor='white', edgecolors='k', linewidth=1)
# Plot imputed values
ax[0].scatter(Bl[B_.mask], kcal[B_.mask], marker='o', facecolor='black', edgecolors='k', linewidth=1)
[ax[0].plot((d,dl),(a,a), c='grey', zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll90[B_.mask],Blu90[B_.mask])]
[ax[0].plot((d,dl),(a,a), c='grey', linewidth=3, zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll50[B_.mask],Blu50[B_.mask])]
# Labels
ax[0].set_xlabel('Neocortex percent (std)', fontsize=15)
ax[0].set_ylabel('kcal milk (std)', fontsize=15)


# Plot observed values
ax[1].scatter(M, B, marker='o', facecolor='white', edgecolors='k', linewidth=1, label='Observed')
# Plot imputed values
ax[1].scatter(M[B_.mask], Bl[B_.mask], marker='o', facecolor='black', edgecolors='k', linewidth=1, label='Imputed')
[ax[1].plot((a,a),(d,dl), c='grey', zorder=0) for a,d,dl in zip(M[B_.mask],Bll90[B_.mask],Blu90[B_.mask])]
[ax[1].plot((a,a),(d,dl), c='grey', linewidth=3, zorder=0) for a,d,dl in zip(M[B_.mask],Bll50[B_.mask],Blu50[B_.mask])]
# Labels
ax[1].set_ylabel('Neocortex percent (std)', fontsize=15)
ax[1].set_xlabel('log(Body mass) (std)', fontsize=15)
plt.legend()
plt.savefig('missingmilk.jpg',dpi=300);

You can see above that the missing values are highly uncertain, but they also tend toward the central relationships present in the observed data. This is excellent stuff. 

MCAR does not provide information about the missing values other than the relationships already present in the model. In contrast, the MAR model assumes there is a fully-observed covariate that provides information about what the missing values might be. In other words, we're assuming they're correlated in some unknown way. You can probably guess what happens next

In [None]:
# Mask NaN's
X = np.array([B_,M]).T
BM = np.ma.masked_array(X, mask=np.isnan(X))
BM

In [None]:
# PyMC model (MAR)
with pm.Model() as milkerMvN:
    # Priors
    β0 = pm.Normal('Intercept', 0, .2)
    β1 = pm.Normal('log(mass)', 0, .5)
    β2 = pm.Normal('neocortex_perc', 0, .5)
    σ = pm.Exponential('Sigma', 1)
    
    ## Missing data at random
    # Brain size prior
    ν0 = pm.Normal('BS_mean', 0.5, 1)
    # Body mass prior
    ν1 = pm.Normal('BM_mean', 0.5, 1)
    
    # Hyperprior for LKJ
    sd_dist = pm.HalfCauchy.dist(2)
    chol, _, _ = pm.LKJCholeskyCov('chol_cov', eta=2, n=2, sd_dist=sd_dist)
    
    # MvN
    Bμ = pm.MvNormal('Bμ', mu=[ν0, ν1], chol=chol, observed=BM)
    
    # Linear model
    μ = β0+β1*M+β2*Bμ.T[0]
    
    # Likelihood
    yi = pm.Normal('yi',μ, σ, observed=kcal)

In [None]:
with milkerMvN:
    trace_mvn = pm.sample(1000)

In [None]:
# Missing value estimates
tmp = trace_mvn.posterior['Bμ'].stack(sample=("chain", "draw")).values
# Keep the brain size column (index 0); body mass (index 1) is fully observed
tmpB = tmp[:,0,:]
# Summarize across posterior samples (last axis) for each species
Bl2 = np.percentile(tmpB,50,axis=1)
Bll902 = np.percentile(tmpB,5,axis=1)
Blu902 = np.percentile(tmpB,95,axis=1)
Bll502 = np.percentile(tmpB,25,axis=1)
Blu502 = np.percentile(tmpB,75,axis=1)

In [None]:
Bll902.shape

In [None]:
_, ax = plt.subplots(1, 2, figsize=(12, 5))

# Plot observed values
ax[0].scatter(B, kcal, marker='o', facecolor='white', edgecolors='k', linewidth=1)
# Plot MCAR imputed values
ax[0].scatter(Bl[B_.mask], kcal[B_.mask], marker='o', facecolor='black', edgecolors='k', linewidth=1)
[ax[0].plot((d,dl),(a,a), c='grey', zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll90[B_.mask],Blu90[B_.mask])]
[ax[0].plot((d,dl),(a,a), c='grey', linewidth=3, zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll50[B_.mask],Blu50[B_.mask])]
# Plot MAR imputed values
ox = 0.04
ax[0].scatter(Bl2[B_.mask], kcal[B_.mask]+ox, marker='o', facecolor='red', linewidth=1)
[ax[0].plot((d,dl),(a+ox,a+ox), c='red', alpha=0.7, zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll902[B_.mask],Blu902[B_.mask])]
[ax[0].plot((d,dl),(a+ox,a+ox), c='red', alpha=0.7, linewidth=3, zorder=0) for a,d,dl in zip(kcal[B_.mask],Bll502[B_.mask],Blu502[B_.mask])]
# Labels
ax[0].set_xlabel('Neocortex percent (std)', fontsize=15)
ax[0].set_ylabel('kcal milk (std)', fontsize=15)


# Plot observed values
ax[1].scatter(M, B, marker='o', facecolor='white', edgecolors='k', linewidth=1, label='Observed')
# Plot MCAR imputed values
ax[1].scatter(M, Bl, marker='o', facecolor='black', edgecolors='k', linewidth=1, label='MCAR')
[ax[1].plot((a,a),(d,dl), c='grey', zorder=0) for a,d,dl in zip(M,Bll90,Blu90)]
[ax[1].plot((a,a),(d,dl), c='grey', linewidth=3, zorder=0) for a,d,dl in zip(M,Bll50,Blu50)]
# Plot MAR imputed values
ox=0.06
ax[1].scatter(M[B_.mask]+ox, Bl2[B_.mask], marker='o', facecolor='red',linewidth=1, label='MAR')
[ax[1].plot((a+ox,a+ox),(d,dl), c='red', alpha=0.7, zorder=0) for a,d,dl in zip(M[B_.mask],Bll902[B_.mask],Blu902[B_.mask])]
[ax[1].plot((a+ox,a+ox),(d,dl), c='red', alpha=0.7, linewidth=3, zorder=0) for a,d,dl in zip(M[B_.mask],Bll502[B_.mask],Blu502[B_.mask])]
# Labels
ax[1].set_ylabel('Neocortex percent (std)', fontsize=15)
ax[1].set_xlabel('log(Body mass) (std)', fontsize=15)
plt.legend()
plt.savefig('missingmilk2.jpg',dpi=300);

Now we're cooking - the information coming from body mass, which is correlated with brain size (possibly due to some unmeasured variable), shrinks the MvN estimates further toward the relationship between them (away from midline). 
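The pull toward the brain-body relationship follows from the conditional distribution of a bivariate normal. The sketch below uses hypothetical standardized values and an assumed correlation `rho`; none of these numbers are estimates from the milk data.

```python
import numpy as np

def conditional_normal(m, mu_b, mu_m, sd_b, sd_m, rho):
    """Mean and sd of B given M = m under a bivariate normal."""
    cond_mean = mu_b + rho * (sd_b / sd_m) * (m - mu_m)
    cond_sd = sd_b * np.sqrt(1.0 - rho**2)
    return cond_mean, cond_sd

# Hypothetical standardized values; rho is an assumed correlation, not an estimate
mean_b, sd_b_given_m = conditional_normal(m=1.5, mu_b=0.0, mu_m=0.0,
                                          sd_b=1.0, sd_m=1.0, rho=0.6)
print(mean_b, sd_b_given_m)
```

Knowing body mass shifts the expected brain size away from the overall mean and narrows its spread, which is the extra information the MAR model has and the MCAR model lacks.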

In [None]:
tmp = pm.summary(trace_m, var_names=['Sigma'])
dfi.export(tmp, 'trace_mcar.jpg')
tmp

In [None]:
tmp = pm.summary(trace_mvn, var_names=['Sigma'])
dfi.export(tmp, 'trace_mar.jpg')
tmp

And we get that small reduction in data variance to boot.

# Discretely absent

While estimation of continuous variables follows naturally enough, handling discrete missing values poses its own particular challenges. First and foremost, discrete variables require discrete distributions, which pose real problems when they appear anywhere other than the data likelihood in a model. HMC just doesn't do discrete, because it's based on momentum from the gradients, and other algorithms have similar problems because of the knife-edge nature of counts. 

The way around this is through something called a 'weighted average', which averages the likelihood over each possible value of the missing discrete variable, weighted by its probability, so that no discrete parameter has to be sampled. The discrete values themselves can then be recovered from their posteriors outside of the MCMC bits, avoiding the problem altogether. To see how to do this, we'll put together a really simple model to simulate missingness.
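As a minimal sketch of the weighted average for a binary unknown with a Poisson outcome: the likelihood of a count is the probability-weighted sum over both states, computed on the log scale for numerical stability. The function names and numbers are hypothetical.

```python
import numpy as np
from math import lgamma

def poisson_logpmf(n, lam):
    """Log probability of count n under Poisson(lam)."""
    return n * np.log(lam) - lam - lgamma(n + 1)

def marginal_loglik(n, k, lam_1, lam_0):
    """Log-likelihood of count n when the binary state is unknown:
    log[ k * P(n | state 1) + (1 - k) * P(n | state 0) ]."""
    return np.logaddexp(np.log(k) + poisson_logpmf(n, lam_1),
                        np.log(1 - k) + poisson_logpmf(n, lam_0))

# Hypothetical values: a count of 3, P(state 1) = 0.7, rates of 2 and 5
print(marginal_loglik(3, k=0.7, lam_1=2.0, lam_0=5.0))
```

Because the unknown state has been summed out, the result depends only on continuous parameters, which is what lets a gradient-based sampler handle it.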

## Discrete cats

The example is from p517 in Rethinking

> Imagine a neighborhood in which every house contains a songbird. Suppose we survey the neighborhood and sample one minute of song from each house, recording the number of notes. You notice that some houses also have house cats, and wonder if the presence of a cat changes the amount that each bird sings. So you try to also figure out which houses have cats. You can do this easily in some cases, either by seeing the cat or by asking a human resident. But in about 20% of houses, you can’t determine whether or not a cat lives there.

Helpfully, we can turn this into a DAG

In [None]:
# Create cat DAG
mDAG = nx.DiGraph()
mDAG.add_edges_from([("Rc", "C*"), ("C", "C*"), ("C", "N")])
# Plot DAG
nx.draw_networkx(mDAG, arrows=True)
plt.tight_layout()

Here the presence of a cat (C) influences the number of sung notes (N) but because of missing values (Rc), we only observe C*. We simulate from a statistical model

$$
\begin{align}
N_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \beta_0 + \beta_1 C_i \\
C_i &\sim \text{Bernoulli}(k) \\
R_{C,i} &\sim \text{Bernoulli}(r)
\end{align}
$$

Which is

In [None]:
# Number of houses
nhouse = 100
# Intercept
b0 = 5
# Cat effect
b1 = -3
# Probability of cat in house
k = 0.7
# Probability of not knowing if there's a cat
r = 0.2


# Simulate cats
cat = np.random.binomial(1,k,nhouse)
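# Notes counted (rates are set on the natural scale here: 5 without a cat, 2 with one,
# so a log-link model should recover roughly log(5) and log(2/5))
notes = np.random.poisson(b0+b1*cat)

# Unobserved houses
Rc = np.random.binomial(1, r, nhouse)
# Observed cats (copy, so the true values in `cat` are preserved)
cat_obs = cat.copy()
# Cat mask
cat_obs[Rc==1] = -999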

In [None]:
cat_obs

With these elements in place, we can build a conditional model that removes the unknowns and calculates their likely values later


In [None]:
# PyMC model
with pm.Model() as cats:
    # Probability of cat in the house
    k = pm.Beta('k', 2, 2)
    Zi = pm.Bernoulli('Zi',k, observed=cat_obs[Rc==0])
    
    # Log of average notes absent cat
    β0 = pm.Normal('Intercept', 0, 5)
    # Effect of cat on log of average notes
    β1 = pm.Normal('Cat', 0, 5)

    # Linear model
    λ = pm.math.exp(β0+β1*cat_obs[Rc==0])
    
    # Known cat data likelihood
    Yi = pm.Poisson('Yi', λ, observed=notes[Rc==0])
    
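    ## Custom addition to the likelihood: for houses with unknown cat status,
    ## average over both possibilities, log[k*P(N|cat) + (1-k)*P(N|no cat)]
    lp_cat = pm.math.log(k)+pm.logp(pm.Poisson.dist(pm.math.exp(β0+β1)), notes[Rc==1])
    lp_nocat = pm.math.log(1-k)+pm.logp(pm.Poisson.dist(pm.math.exp(β0)), notes[Rc==1])
    δ = pm.math.logaddexp(lp_cat, lp_nocat).sum()
    like = pm.Potential('like', δ)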

In [None]:
with cats:
    trace_c = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_c)
dfi.export(tmp, 'catspost.jpg')
tmp

Et voila - the posterior for `k` includes the true probability of cat presence (0.7).
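The likely values for the unknown houses can then be calculated from the fitted parameters with Bayes' rule. The sketch below uses fixed, hypothetical parameter values in place of posterior draws. In practice the same calculation would be repeated for each draw of `k`, `Intercept` and `Cat` and then averaged. The function name is an assumption for illustration.

```python
import numpy as np
from math import lgamma

def prob_cat_given_notes(n, k, b0, b1):
    """P(cat | n notes) via Bayes' rule, with log(rate) = b0 + b1 * cat."""
    def logpmf(lam):
        return n * np.log(lam) - lam - lgamma(n + 1)
    lp_cat = np.log(k) + logpmf(np.exp(b0 + b1))
    lp_nocat = np.log(1 - k) + logpmf(np.exp(b0))
    return np.exp(lp_cat - np.logaddexp(lp_cat, lp_nocat))

# Hypothetical parameter values standing in for posterior draws
k, b0, b1 = 0.7, np.log(5.0), np.log(2.0 / 5.0)
for n in (0, 3, 8):
    print(n, round(prob_cat_given_notes(n, k, b0, b1), 3))
```

With a negative cat effect, a quiet house is more likely to contain a cat than a noisy one, so the probability falls as the note count rises.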