## Week 7 Lecture 2 - Monsters and mixtures

McElreath's lectures for today: https://www.youtube.com/watch?v=p7g-CgGCS34

McElreath's lectures for the whole book are available here: https://xcelab.net/rm/statistical-rethinking/

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

An excellent port to Python/PyMC Code is available here: https://github.com/dustinstansbury/statistical-rethinking-2023/tree/main

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import pymc as pm
import patsy
import arviz as az
import networkx as nx
from matplotlib import pyplot as plt
import dataframe_image as dfi


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)


def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

def indexall_(L):
    Il, Ll = pd.factorize(L, sort=True)
    return Ll, Il

# Survival Analysis

A key relationship between count models like the Poisson and the data they represent is that the underlying parameters are about rates - how long do we wait (or how wide an area do we search) on average for an event to take place? Therefore another option for quantifying these things lies in modeling times to events - with the trick being that we have to also pay attention to the *censoring* of data. 

1. Left-censored: when you don't know when time started relative to events
2. Right-censored: ending the observation period before the next event occurs
3. Dual-censored: both problems at once

Ignoring the censoring process will lead to inferential mistakes, so each survival analysis needs to keep track of which kinds of censoring are present and what effects they might have.
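The link between rates, counts and waiting times mentioned above can be checked with a quick simulation. This is only a sketch with made-up values: `rate`, `n_events` and the seed are arbitrary choices and have nothing to do with the cat data.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

rate = 2.0          # hypothetical event rate per unit time
n_events = 200_000

# Waiting times between events are Exponential with mean 1/rate
waits = rng.exponential(scale=1 / rate, size=n_events)
event_times = np.cumsum(waits)

# Count the events that fall in each complete unit-length interval
n_intervals = int(event_times[-1])
counts = np.bincount(event_times[event_times < n_intervals].astype(int),
                     minlength=n_intervals)

mean_wait = waits.mean()     # close to 1 / rate
mean_count = counts.mean()   # close to rate
var_count = counts.var()     # Poisson counts have variance close to their mean
```

The same rate shows up twice: as the mean count per interval and as the reciprocal of the mean waiting time, which is why we can model either counts or times to events.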

By way of example, let's have a look at some cat data from Austin, Texas.


In [None]:
# Import cat data
cdata = pd.read_csv('AustinCats.csv')
dfi.export(cdata.head(), 'catdata.jpg')

In [None]:
cdata.head()

We can see here that the data has dates in and out for individual cats, as well as things about their breed etc. Let's say we're interested in the average time to adoption, a useful metric if you're running a cat shelter. If we were modelling just the days to adoption ($D$) and ignoring the censoring, the model would simply be

$$
\begin{align}
D_i \sim & Exponential(\lambda)
\end{align}
$$

where $\lambda$ is the underlying rate of adoptions. Let's run that model and see what happens:


In [None]:
# Adoption days
D = cdata.days_to_event.values

In [None]:
with pm.Model() as adopt:
    # Adoption rate
    λ = pm.Exponential('Adoption rate', 1)
    
    # Data likelihood
    Yi = pm.Exponential('Yi', λ, observed=D)

In [None]:
with adopt:
    trace_a = pm.sample(1000)

In [None]:
pm.plot_trace(trace_a)
plt.savefig('rate.jpg',dpi=300);

This adoption rate is about 0.04 - what does that mean? Well, the mean of an exponential is $1/\lambda$, so this means that the average days to adoption is

In [None]:
az.plot_density(1/trace_a.posterior['Adoption rate'],hdi_prob=0.999)
plt.axvline(np.mean(D),c='black')
plt.xlabel('Average days to adoption')
plt.savefig('meandays.jpg',dpi=300);
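As a cross-check on the sampler, this simple model also has a closed-form posterior: an Exponential(1) prior is a Gamma(1, 1), and with an exponential likelihood the posterior for the rate is Gamma($1+n$, $1+\sum D$). The sketch below uses synthetic durations (`D_sim`, generated with an assumed `true_rate`) rather than the AustinCats data, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for days-to-event; NOT the AustinCats data
true_rate = 0.04
D_sim = rng.exponential(scale=1 / true_rate, size=5_000)

# Exponential(1) prior == Gamma(shape=1, rate=1); an exponential likelihood
# gives a Gamma(shape=1 + n, rate=1 + sum(D)) posterior for the rate
post_shape = 1 + D_sim.size
post_rate = 1 + D_sim.sum()

# NumPy's gamma sampler takes a scale, which is 1 / rate
lam_draws = rng.gamma(shape=post_shape, scale=1 / post_rate, size=4_000)
mean_days_draws = 1 / lam_draws   # posterior draws of average days to event

post_mean_rate = post_shape / post_rate
```

Swapping `D_sim` for the real `D` should give draws that line up closely with the MCMC posterior, since both target the same distribution.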

But in our dataset we stopped recording the data at some point, with newly-arrived cats that hadn't yet had a chance to be adopted. So how do we deal with this right-hand censoring? Well, for the cats that were adopted we know their number of days to adoption, so no issues there. But for the cats in the data that have yet to be adopted, all we know is how long they've been in jail. This second set of cats implies a different quantity, namely the probability of spending that many days in captivity without yet having been adopted, which is the complement of the exponential cumulative distribution (i.e. the complementary cumulative distribution function). We can add this into our model as a special component:

$$
\begin{align}
f(c=1,D) = & \lambda e^{-\lambda D} \\
f(c=0,D) = & e^{-\lambda D}
\end{align}
$$

The top line is just the Exponential PDF. The bottom line is the contribution of a censored observation, which simply has the leading $\lambda$ removed. This is because the uncensored cumulative distribution function is

$$
1-e^{-\lambda D}
$$

which is the probability of being adopted by time $D$. So to get the censored bit, we simply need the probability of *not* being adopted by time $D$, which is

$$
e^{-\lambda D}.
$$

Because of this small difference, we can get the log probability for both kinds of data (censored and not) by adding an indicator variable ($c$) that is 1 if the cat has been adopted and 0 if it has not:

$$
c\log(\lambda e^{-\lambda D}) + (1-c)\log(e^{-\lambda D})
$$

which simplifies to

$$
c\log(\lambda) - \lambda D
$$

cool huh?
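The censored log-likelihood above can be tried out in plain NumPy before handing it to a sampler. This is a minimal sketch on synthetic data: `true_rate`, the uniform censoring times and the helper `censored_exp_loglik` are all assumptions made for illustration, not part of the cat analysis.

```python
import numpy as np

rng = np.random.default_rng(7)

true_rate = 0.02
n = 20_000

# Synthetic example: every animal has a true adoption time, but we stop
# watching each one at its own (hypothetical) censoring time
t_adopt = rng.exponential(scale=1 / true_rate, size=n)
t_censor = rng.uniform(0, 100, size=n)

D_sim = np.minimum(t_adopt, t_censor)        # what we actually record
c_sim = (t_adopt <= t_censor).astype(int)    # 1 = adopted, 0 = censored


def censored_exp_loglik(rate, days, adopted):
    """Sum of c*log(rate) - rate*D over all observations."""
    return np.sum(adopted * np.log(rate) - rate * days)


# Setting the derivative to zero gives closed-form estimates
rate_censored = c_sim.sum() / D_sim.sum()   # uses the indicator
rate_naive = n / D_sim.sum()                # treats every row as an adoption

# A grid search over the log-likelihood lands on the same answer
grid = np.linspace(0.005, 0.06, 1101)
rate_grid = grid[np.argmax([censored_exp_loglik(r, D_sim, c_sim) for r in grid])]
```

In this simulation `rate_censored` sits close to `true_rate`, while `rate_naive` is far too high because it counts censored rows as adoptions, which makes the implied waiting time too short.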

In [None]:
# Adopted (1) or censored (0)
A = (cdata.out_event.values == 'Adoption').astype(int)

In [None]:
with pm.Model() as adoptC:
    # Adoption rate
    λ = pm.Exponential('Adoption rate', 1)
    
    # Censored/non log-probability
    def CE_logp(value, λ, cens):
        return pm.math.sum(cens*pm.math.log(λ)-λ*value)

    # Censored likelihood
    ExSurv = pm.DensityDist('ExSurv', λ, A, logp=CE_logp, observed=D)

In [None]:
with adoptC:
    trace_aC = pm.sample(1000)

In [None]:
pm.summary(trace_aC)

In [None]:
plt.hist(1/trace_aC.posterior['Adoption rate'].values[0].T,label='Censored',density=True)
plt.hist(1/trace_a.posterior['Adoption rate'].values[0].T,label='Naive',density=True)
plt.xlabel('Average days to adoption')
plt.legend()
plt.savefig('censor.jpg',dpi=300);

This has made a massive difference: by treating censored cats as if they had been adopted, the previous model made the apparent average time to adoption about 50% shorter.
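One way to see where a gap of this size comes from is to compare the maximum-likelihood rates of the two models, which should be close to the posterior means when the data swamp the prior (an assumption here, not something checked above). Setting the derivative of $\sum_i [c_i\log(\lambda) - \lambda D_i]$ to zero, and doing the same for the naive model, gives

$$
\begin{align}
\hat{\lambda}_{censored} = & \frac{\sum_i c_i}{\sum_i D_i} \\
\hat{\lambda}_{naive} = & \frac{N}{\sum_i D_i}
\end{align}
$$

so the ratio of the implied average days to adoption is

$$
\frac{1/\hat{\lambda}_{naive}}{1/\hat{\lambda}_{censored}} = \frac{\sum_i c_i}{N},
$$

the fraction of cats in the data that were adopted. A naive estimate that is about half the censored one would therefore correspond to roughly half of the cats having been adopted by the time recording stopped.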


The next question is, are black cats adopted at a lower rate relative to other colours?

In [None]:
# Index 0 = Other, 1 = Black, so it always matches the Colour labels below
Ic = (cdata.color.values=='Black').astype(int)
Colour = ['Other','Black']

In [None]:
with pm.Model(coords={'Colour':Colour}) as adoptCb:
    # Adoption rate
    λ = pm.Exponential('Adoption rate', 1, dims='Colour')
    
    # Censored/non log-probability
    def CE_logp(value, λ, cens):
        return pm.math.sum(cens*pm.math.log(λ[Ic])-λ[Ic]*value)

    # Censored likelihood
    ExSurv = pm.DensityDist('ExSurv', λ, A, logp=CE_logp, observed=D)

In [None]:
with adoptCb:
    trace_aCb = pm.sample(1000)

In [None]:
pm.plot_forest(trace_aCb);

In [None]:
other = 1/trace_aCb.posterior['Adoption rate'].values[0].T[0]
black = 1/trace_aCb.posterior['Adoption rate'].values[0].T[1]

plt.hist(other, label='Other',density=True)
plt.hist(black, color='black', label='Black',density=True)
plt.legend()
plt.xlabel('Days to adoption')
plt.savefig('black.jpg',dpi=300);

Clearly yes, with the average difference being

In [None]:
plt.hist(black-other, color='purple',density=True)
plt.xlabel('Black cat extra days')
plt.savefig('blackother.jpg',dpi=300);

about 9 additional days.

## Drunken monks

Mixture models have some incredible properties and the ability to distinguish between different sources of zeros is one of them. What do we mean here? Well, often processes aren't distinct in how data can arise - many times multiple things happen that can lead to the same observation. Zeros are a classic example - in fisheries, for example, zeros can arise because fish weren't there, or because a boat failed to catch fish that were. These kinds of situations arise all the time and as a result a series of **zero-inflated** models have been developed to address them. 

By way of example, imagine a monastery where monks are busy copying manuscripts (important work for a monk) and each day a large number of monks finish copying a small number of manuscripts. This kind of thing could easily be represented as a Poisson process, whereby manuscripts are churned out at a constant underlying rate ($\lambda$), leading to a Poisson distribution that has some proportion of zeros.

But this being a Catholic monastery, the monks on occasion turn to drink, and on those days they produce nothing - an additional source of zeros that is unrelated to the typical rate of production. How should we handle this? Well, we can model both processes and allow the model to sort out which zeros are likely to arise from which process. How? Well, let's simulate this process for a year

In [None]:
# Number of days
N = 365

# Probability of drinking
p_drink = 0.2
# Rate of manuscript production
work_rate = 1.3

# Simulate work or drink for each day of the year
work = np.random.binomial(1,1-p_drink,N)

# Simulate production of manuscripts
M_ = np.random.poisson(work_rate,N)
M = work*M_

In [None]:
# Manuscripts if every day were a work day
plt.hist(M_, bins=30)
plt.ylim(0,170)
plt.savefig('work.jpg',dpi=300);

In [None]:
# Observed manuscript production
plt.hist(M, bins=30)
plt.ylim(0,170)
plt.savefig('manuscripts.jpg',dpi=300);

In [None]:
# ZI manuscript production
z_work = (M_==0).sum()
z_total = sum(M==0)

plt.hist(M, bins=30)
plt.plot((0.09,0.09),(z_work, z_total),linewidth=10, c='red')
plt.ylim(0,170)
plt.savefig('zimonk.jpg',dpi=300);

Ok, with our simulated data in place, we need to develop a likelihood that reflects both processes, a zero-inflated Poisson. There is a ZIP model built into both Stan and PyMC, but it's fairly simple to define so let's do that for our model here as well, just to get a bit of practice and the knowledge that **we're allowed** to build whatever model we like - there are no rules beyond coherence.

From our simulation we have the elements we'll need, a binomial for the workin' vs drinkin' days, and a Poisson for the manuscript production rate. First, the binomial pmf (probability **mass** function) for 1/0 data is a Bernoulli

$$
p^{k}(1-p)^{1-k}
$$

and the pmf for a Poisson is

$$
\frac{\lambda^{k}e^{-\lambda}}{k!}
$$

So the probability of a zero is

$$
p + (1-p)\frac{\lambda^{0}e^{-\lambda}}{0!} = p + (1-p)e^{-\lambda}
$$

And the probability of a count (i.e. when $k\neq0$) is

$$
(1-p)\frac{\lambda^{k}e^{-\lambda}}{k!}
$$

the Poisson multiplied by the probability of working. Fortunately for us this is an available likelihood in PyMC - note however that its zero-inflation parameter is the probability of the count (Poisson) process, not the probability of an extra zero, so we pass in $1-p$:

In [None]:
with pm.Model() as Monks:
    # Manuscript log-scale rate
    γ = pm.Normal('logMrate', 2, .5)
    
    # log-odds of drinking
    μ = pm.Normal('loDrink',-1.5,1)
    
    # link functions
    λ = pm.Deterministic('Mrate', pm.math.exp(γ))
    p = pm.Deterministic('pDrink', pm.invlogit(μ))
    
    # Likelihood
    Yi = pm.ZeroInflatedPoisson('Yi', 1-p, λ, observed=M)
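The zero and count probabilities derived above are simple enough to code by hand, which is a useful sanity check on the built-in likelihood. The helper `zip_pmf` below is a hypothetical name and a minimal NumPy sketch; it is parameterised by the probability of an extra zero (`p_zero`), i.e. one minus the value passed to the PyMC likelihood.

```python
import math

import numpy as np


def zip_pmf(k, p_zero, lam):
    """Zero-inflated Poisson pmf; p_zero is the probability of an extra zero."""
    k = np.asarray(k)
    log_fact = np.array([math.lgamma(int(x) + 1) for x in k.ravel()]).reshape(k.shape)
    poisson = np.exp(k * np.log(lam) - lam - log_fact)
    return np.where(k == 0, p_zero + (1 - p_zero) * poisson, (1 - p_zero) * poisson)


p_drink, work_rate = 0.2, 1.3   # same values as the simulation
ks = np.arange(0, 60)
probs = zip_pmf(ks, p_drink, work_rate)

prob_zero = probs[0]            # p + (1 - p) * exp(-lambda)
total_prob = probs.sum()        # should be 1 up to truncation error
zip_mean = (ks * probs).sum()   # (1 - p) * lambda
```

With these values a little over 40% of days are expected to produce zero manuscripts, compared with about 27% under a plain Poisson with the same working rate.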

In [None]:
with Monks:
    trace_m = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_m)
dfi.export(tmp, 'ziptable.jpg')
tmp

In [None]:
plt.hist(trace_m.posterior['pDrink'].values.flatten(), density=True)
plt.axvline(p_drink,c='red',lw=4)
plt.xlabel('P(drinking)')
plt.savefig('Pdrinking.jpg',dpi=300);

So we were able to recover the simulated value and see that monks drink about 20% of the time. A question is - what would we have estimated if we'd just used the Poisson?

In [None]:
with pm.Model() as PMonks:
    # Manuscript log-scale rate
    γ = pm.Normal('logMrate', 1, 0.5)

    # link function
    λ = pm.Deterministic('Mrate', pm.math.exp(γ))
    
    # Likelihood
    Yi = pm.Poisson('Yi', λ, observed=M)

In [None]:
with PMonks:
    trace_pm = pm.sample(1000)

In [None]:
pm.summary(trace_pm)

In [None]:
axes = az.plot_forest([trace_m, trace_pm], model_names=["ZIP", "Poisson"], var_names=["Mrate"])
plt.tight_layout()
plt.savefig('monksmodels.jpg',dpi=300);

As you might expect, all those drinking zeros get sunk into the Poisson density and what we estimate is the monks' overall (both drinking and working days) rate of manuscript production. What would WAIC say comparing these two models?

In [None]:
# WAIC-based model comparison
with Monks:
    pm.compute_log_likelihood(trace_m)
with PMonks:
    pm.compute_log_likelihood(trace_pm)
comp_WAIC = pm.compare({'Poisson': trace_pm, 'ZIP': trace_m}, ic='waic')
dfi.export(comp_WAIC, 'comp_WAIC.jpg')
comp_WAIC

Most of the model weight favours zero-inflation. Note that depending on your simulation above, you may randomly get a dataset that doesn't conform quite as well. Re-run the simulation and see what happens. None of this stuff is easy...
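To see why the plain Poisson rate comes out lower than the working rate, note that the Poisson maximum-likelihood rate is just the sample mean, which for zero-inflated data is about $(1-p)\lambda$. The sketch below re-runs the simulation with many more days to cut down the noise; the `_sim` names, `n_days` and the seed are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)

p_drink_sim, work_rate_sim = 0.2, 1.3   # same values as the simulation above
n_days = 200_000                        # many more days than a year, to cut noise

work_sim = rng.binomial(1, 1 - p_drink_sim, n_days)
M_sim = work_sim * rng.poisson(work_rate_sim, n_days)

# The Poisson maximum-likelihood rate is just the sample mean
poisson_rate_hat = M_sim.mean()
expected_overall = (1 - p_drink_sim) * work_rate_sim   # 0.8 * 1.3 = 1.04

# A zero-inflated sample is overdispersed relative to a Poisson
dispersion = M_sim.var() / M_sim.mean()
```

A variance-to-mean ratio noticeably above one is the signature that a single Poisson is missing something, which is what the model comparison is picking up. With only 365 days both quantities are noisier, so individual simulated years can look less clear-cut.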

# Multinomial and categorical models

While binomials are commonplace for 0/1 data (successes in trials), categorical models apply when there are multiple possible outcomes and each observation ends up in one of several distinct bins. 

Let's look at this kind of data through a simulated example, whereby we simulate various career choices based on expected income. We can port over McElreath's code from Chapter 11 to do so:

In [None]:
# Simulate career choices for 500 people
N = 500
# Expected income for each career
income = np.array([1,2,5])
# Score for each career based on income
bx = 0.4
score = bx*income

# Convert scores to probabilities
p = sp.special.softmax(score)

In [None]:
p
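For reference, the softmax used here is easy to write out by hand. The sketch below (with a hypothetical `softmax` helper and `_ex` variable names so nothing above gets overwritten) also shows that shifting every score by the same constant leaves the probabilities unchanged, which is why a model can fix one category's score at zero without losing anything.

```python
import numpy as np


def softmax(scores):
    """Convert a 1-D array of scores to probabilities that sum to one."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow
    expd = np.exp(shifted)
    return expd / expd.sum()


income_ex = np.array([1, 2, 5])
score_ex = 0.4 * income_ex

p_ex = softmax(score_ex)
# Shifting every score by the same constant leaves the probabilities unchanged
p_shifted = softmax(score_ex - score_ex[2])
```

Both calls give probabilities of roughly 0.13, 0.20 and 0.67 for the three careers.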

In [None]:
# Simulate choices based on underlying income
career = np.random.multinomial(1,p,N)
# Put choice into categorical array
career = np.where(career==1)[1]
career

So with these in place, we can develop a regression model to figure out what the effect of expected income is on career choice:

In [None]:
with pm.Model() as Careerz:
    # Intercepts
    a = pm.Normal('a', 0, 1, shape=2)
    # Income slope
    b = pm.Lognormal('b',0, 0.5)
    
    # Linear model
    s0 = a[0]+b*income[0]
    s1 = a[1]+b*income[1]
    s = pm.Deterministic('s', pm.math.stack([s0,s1,0]))
    
    # Data likelihood
    Yi = pm.Categorical('Yi', pm.math.softmax(s), observed=career)

In [None]:
with Careerz:
    trace_c = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_c)
dfi.export(tmp, 'multinom.jpg')
tmp

Great - we have results, but unfortunately the nature of multinomial regression is that effect sizes are really difficult to interpret, because they're relative to the baseline category used (the 'pivot'). So instead let's look at the posterior predicted probabilities for each category, relative to our known values:

In [None]:
trace_c.posterior['s']

In [None]:
# Estimated probabilities
smax  = np.array([sp.special.softmax(x) for x in trace_c.posterior['s'].values[0]])
sp.special.softmax(trace_c.posterior['s'].values[0].mean(0))

In [None]:
[plt.hist(s, density=True) for s in smax.T]
[plt.axvline(x,c='black',lw=1,ls=':') for x in p]
plt.xlabel('P(career)')
plt.savefig('Pcareer.jpg',dpi=300);

These look great - we've captured the probabilities of choosing each career. But that's a bit unsatisfying, as what we really want to know is the relationship between incomes and choosing each career. To gain some insight into this, the thing to do is simulate what would happen to these probabilities if, say, the income for career=1 doubles:

In [None]:
# Grab posteriors
a_post = trace_c.posterior['a'].values[0].T
s_post = trace_c.posterior['s'].values[0].T
b_post = trace_c.posterior['b'].values[0]

In [None]:
plt.hist(b_post)
plt.axvline(bx,c='red',lw=4)
plt.xlabel('Income effect on career')
plt.savefig('Pcareerb.jpg',dpi=300);

In [None]:
# Set up logit scores
s0x = a_post[0]+b_post*income[0]
s1x = a_post[1]+b_post*income[1]
s1x_new = a_post[1]+b_post*income[1]*2

In [None]:
# Calculate probabilities from original income and new income
p_orig = np.array([sp.special.softmax(x) for x in np.array([s0x,s1x,np.zeros(len(s1x))]).T])
p_new = np.array([sp.special.softmax(x) for x in np.array([s0x,s1x_new,np.zeros(len(s1x))]).T])
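The same counterfactual calculation can be done without Python loops by applying a row-wise softmax to all draws at once. The sketch below uses synthetic stand-ins for the posterior draws (`a_fake`, `b_fake`, generated from arbitrary distributions), so the numbers mean nothing; only the mechanics carry over.

```python
import numpy as np

rng = np.random.default_rng(11)
n_draws = 2_000

# Synthetic stand-ins for posterior draws; NOT output from the model above
a_fake = rng.normal(loc=[-2.0, -1.5], scale=0.2, size=(n_draws, 2))
b_fake = rng.lognormal(mean=-1.0, sigma=0.3, size=n_draws)
income_ex = np.array([1, 2, 5])


def softmax_rows(scores):
    """Row-wise softmax for an array of shape (draws, categories)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=1, keepdims=True)


def career_probs(income_1):
    """Probabilities per draw, with the third career as the zero-score pivot."""
    s0 = a_fake[:, 0] + b_fake * income_ex[0]
    s1 = a_fake[:, 1] + b_fake * income_1
    return softmax_rows(np.column_stack([s0, s1, np.zeros(n_draws)]))


p_before = career_probs(income_ex[1])
p_after = career_probs(income_ex[1] * 2)

# Per-draw change in the probability of choosing the second career
delta = p_after[:, 1] - p_before[:, 1]
```

Working with the per-draw difference `delta`, rather than comparing two separate histograms, keeps the posterior correlation between the parameters intact when summarising the effect of the income change.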

So plotting our model posterior probabilities under the original incomes, and then under the doubled income:

In [None]:
[plt.hist(p_orig.T[i], density=True, label=str(i)) for i in range(len(a_post)+1)]
plt.legend()
plt.savefig('Pcareer1.jpg',dpi=300);

In [None]:
[plt.hist(s, density=True) for s in p_new.T]
plt.savefig('Pcareer2.jpg',dpi=300);