## Problem 7.2: MCMC with Boolean data (40 pts)

In [Homework 6](hw6.html), you investigated a data set on reversals of optogenetic worms upon exposure to blue light. As a reminder, here are the data.

|Strain|Year|Trials|Reversals|
|:---:|:---:|:---:|:---:|
|WT|2017|55|7|
|ASH|2017|54|18|
|AVA|2017|52|28|
|WT|2016|36|6|
|ASH|2016|35|12|
|AVA|2016|36|30|
|WT|2015|35|0|
|ASH|2015|35|9|
|AVA|2015|36|33|

Again, for the purposes of this problem, assume that we can pool the results from the three years to have 13/126 reversals for wild type, 39/124 reversals for ASH, and 91/124 reversals for AVA.
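The pooled counts can be reproduced directly from the table. The following is a small sketch using only the standard library; the names `counts` and `pool` are introduced here for illustration.

```python
# Per-year (trials, reversals) pairs copied from the table above,
# ordered 2017, 2016, 2015.
counts = {
    "WT": [(55, 7), (36, 6), (35, 0)],
    "ASH": [(54, 18), (35, 12), (35, 9)],
    "AVA": [(52, 28), (36, 30), (36, 33)],
}


def pool(rows):
    """Sum over years and return (reversals, trials)."""
    trials = sum(t for t, _ in rows)
    reversals = sum(r for _, r in rows)
    return reversals, trials


pooled = {strain: pool(rows) for strain, rows in counts.items()}
for strain, (n, N) in pooled.items():
    print(f"{strain}: {n}/{N} = {n / N:.3f}")
```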

The pertinent parameter is $\theta$, the probability of reversal of a worm upon illumination.

**a)** Use Stan to get samples of $\theta$ for each of the three strains. Plot either histograms or ECDFs of your samples.

In [1]:
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
import pandas

## Prior predictive check
I wrote a Stan model for the prior predictive check, shown below. As before, the number of reversals is modeled as binomially distributed, with a beta-distributed prior on the probability of reversal. The parameters of the beta distribution differ for each strain of worm.
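Written out, the generative model described above is the following, where the strain index $i = 1, 2, 3$ (WT, ASH, AVA) and the symbols $n_i$ (reversals) and $N_i$ (trials) are notation assumed here to match part (c):

\begin{align}
\theta_i &\sim \text{Beta}(\alpha_i, \beta_i), \\
n_i \mid N_i, \theta_i &\sim \text{Binom}(N_i, \theta_i).
\end{align}

The prior parameters used below are $(\alpha_i, \beta_i) = (1, 8)$, $(2, 7)$, and $(5, 6)$ for WT, ASH, and AVA, respectively.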

In [7]:
sm = bebi103.stan.StanModel(file='./7.2_prior_pred.stan')
print(sm.model_code)

Using cached StanModel.
data{
    int N;
    real alpha[N];
    real beta_[N];
    int trials[N];
}

generated quantities{
    // Parameters
    real theta[N];
    real rev_prob[N];
    for (i in 1:N) {
        theta[i] = beta_rng(alpha[i], beta_[i]);
    }

    // Data
    for (i in 1:N) {
        rev_prob[i] = binomial_rng(trials[i], theta[i]) / trials[i];
    }
}


I will pass this generative model the prior parameters and the pooled trial counts, and use it to generate data.

In [4]:
# arrays for alpha and beta values for our three strains
a_array = [1, 2, 5]
b_array = [8, 7, 6]

# arrays for the datapoints we have for number of reversals for our three strains
n_array = [13, 39, 91]
N_array = [126, 124, 124]

# Store input parameters in a dictionary so Stan can access them
data = dict(N=3,
            alpha = a_array,
            beta_ = b_array,
            trials = N_array)

# Generate samples
samples_gen = sm.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=1000)
# Store samples in a dataframe
df_gen = bebi103.stan.to_dataframe(samples_gen, diagnostics=False)
df_gen.head()

| |chain|chain_idx|warmup|theta[1]|theta[2]|theta[3]|rev_prob[1]|rev_prob[2]|rev_prob[3]|lp__|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|1|1|0|0.135798|0.38162|0.620381|0.0|0.0|0.0|0.0|
|1|1|2|0|0.017348|0.182009|0.307988|0.0|0.0|0.0|0.0|
|2|1|3|0|0.049301|0.281659|0.341386|0.0|0.0|0.0|0.0|
|3|1|4|0|0.039983|0.038833|0.525712|0.0|0.0|0.0|0.0|
|4|1|5|0|0.067805|0.207825|0.493003|0.0|0.0|0.0|0.0|


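The `theta` draws look reasonable. The `rev_prob` columns, however, are all zero: Stan performs integer division when both operands of `/` are integers, so `binomial_rng(trials[i], theta[i]) / trials[i]` truncates to 0 unless every trial is a reversal. Multiplying the numerator by `1.0` before dividing would give the intended fraction. Only the `theta` draws are used below, so let's plot the distribution for each `theta`.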

In [5]:
def show_ecdf(n, name, color, p):
    """Add an ECDF of the samples `n` to figure `p`, labeled `name`."""
    return bebi103.viz.ecdf(n,
                            color=color,
                            legend=name,
                            alpha=1,
                            line_alpha=0,
                            p=p)
p = bokeh.plotting.figure(x_axis_label='probability of reversal')
p = show_ecdf(df_gen["theta[1]"].values, "WT", "blue", p)
p = show_ecdf(df_gen["theta[2]"].values, "ASH", "red", p)
p = show_ecdf(df_gen["theta[3]"].values, "AVA", "black", p)
p.legend.location = "bottom_right" 
bokeh.io.show(p)
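
As a cross-check on the Stan model, the same generative process can be sketched with the standard library alone. This is not the notebook's code: the function `prior_predictive`, the seed, and the number of draws are assumptions made for illustration. The reversal fraction is computed with true division, so it is a float in [0, 1].

```python
import random

# Prior parameters and pooled trial counts, copied from the cell above.
a_array = [1, 2, 5]
b_array = [8, 7, 6]
N_array = [126, 124, 124]


def prior_predictive(alpha, beta, trials, n_draws, rng):
    """Draw theta ~ Beta(alpha, beta), then a binomial count given theta.

    Returns a list of (theta, reversal_fraction) pairs.
    """
    draws = []
    for _ in range(n_draws):
        theta = rng.betavariate(alpha, beta)
        # Binomial draw as a sum of Bernoulli trials (stdlib only).
        n_rev = sum(rng.random() < theta for _ in range(trials))
        # True division keeps the fraction as a float.
        draws.append((theta, n_rev / trials))
    return draws


rng = random.Random(0)  # arbitrary seed, for reproducibility
for name, a, b, N in zip(["WT", "ASH", "AVA"], a_array, b_array, N_array):
    draws = prior_predictive(a, b, N, 2000, rng)
    mean_theta = sum(t for t, _ in draws) / len(draws)
    mean_frac = sum(f for _, f in draws) / len(draws)
    print(f"{name}: prior mean a/(a+b) = {a / (a + b):.3f}, "
          f"sampled theta mean = {mean_theta:.3f}, "
          f"mean reversal fraction = {mean_frac:.3f}")
```

The analytic prior means $\alpha/(\alpha+\beta)$ are 1/9, 2/9, and 5/11 for WT, ASH, and AVA, so the sampled means should increase in that order.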

This prior predictive check is consistent with our expectations: most wild type reversal probabilities are very low, ASH probabilities are somewhat higher, and AVA probabilities are higher still. I will now write a model to perform sampling from the posterior.

In [9]:
sm_mcmc = bebi103.stan.StanModel(file='./7.2_mcmc.stan')
print(sm_mcmc.model_code)

Using cached StanModel.
data{
    int N;
    int WT;
    int ASH;
    int AVA;
    int trials[N];
}

parameters{
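    // Probabilities must lie in [0, 1], the support of the beta priors
    real<lower=0, upper=1> theta_1;
    real<lower=0, upper=1> theta_2;
    real<lower=0, upper=1> theta_3;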
}

model{
    // Priors
    theta_1 ~ beta(1, 8);
    theta_2 ~ beta(2, 7);
    theta_3 ~ beta(5, 6);

    // Likelihood
    WT ~ binomial(trials[1], theta_1);
    ASH ~ binomial(trials[2], theta_2);
    AVA ~ binomial(trials[3], theta_3);
}


In [10]:
data = dict(N = 3,
            WT = 13,
            ASH = 39, 
            AVA = 91,
            trials = N_array)
samples = sm_mcmc.sampling(data=data)
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False)
df_mcmc.head()

| |chain|chain_idx|warmup|theta_1|theta_2|theta_3|lp__|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|1|1|0|0.20204|0.263351|0.716086|-208.629597|
|1|1|2|0|0.186647|0.289871|0.769447|-207.949015|
|2|1|3|0|0.063302|0.314174|0.657284|-204.914658|
|3|1|4|0|0.084025|0.393763|0.686412|-205.36658|
|4|1|5|0|0.157414|0.419136|0.668485|-209.021598|


Now I can plot my samples from the posterior!

In [11]:
p = bokeh.plotting.figure(x_axis_label='probability of reversal')
p = show_ecdf(df_mcmc["theta_1"].values, "WT", "blue", p)
p = show_ecdf(df_mcmc["theta_2"].values, "ASH", "red", p)
p = show_ecdf(df_mcmc["theta_3"].values, "AVA", "black", p)
p.legend.location = "bottom_right" 
bokeh.io.show(p)
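
Because a beta prior is conjugate to the binomial likelihood, the posterior for each strain is available in closed form as $\text{Beta}(\alpha + n, \beta + N - n)$. This gives an independent check on the MCMC samples. The sketch below uses the priors and pooled data from above; the helper names `beta_posterior` and `beta_mean_sd` are introduced here for illustration.

```python
# Prior parameters (alpha, beta) and pooled data (n, N), copied from above.
priors = {"WT": (1, 8), "ASH": (2, 7), "AVA": (5, 6)}
pooled = {"WT": (13, 126), "ASH": (39, 124), "AVA": (91, 124)}


def beta_posterior(alpha, beta, n, N):
    """Conjugate update for n successes in N trials under a Beta prior."""
    return alpha + n, beta + N - n


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    return mean, sd


posteriors = {s: beta_posterior(*priors[s], *pooled[s]) for s in priors}
for strain, (a, b) in posteriors.items():
    mean, sd = beta_mean_sd(a, b)
    print(f"{strain}: Beta({a}, {b}), mean = {mean:.3f}, sd = {sd:.3f}")
```

The sample means of `df_mcmc["theta_1"]`, `df_mcmc["theta_2"]`, and `df_mcmc["theta_3"]` should agree with these analytic means to within Monte Carlo error.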

**b)** Use your Metropolis-Hastings sampler from the previous problem to do the same.
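
The sampler from the previous problem is not reproduced here. As a stand-in, the following is a minimal random-walk Metropolis sketch for the wild type strain, assuming the same Beta(1, 8) prior as in part (a). The function names, proposal width, burn-in length, and seed are all assumptions chosen for illustration, not tuned values.

```python
import math
import random


def log_post(theta, n, N, alpha, beta):
    """Unnormalized log posterior: Beta(alpha, beta) prior, binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return ((alpha + n - 1) * math.log(theta)
            + (beta + N - n - 1) * math.log(1.0 - theta))


def metropolis(log_target, theta0, n_steps, step, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    theta, lp = theta0, log_target(theta0)
    samples = []
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_target(prop)
        # Accept with probability min(1, target(prop) / target(theta)).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples


rng = random.Random(1)  # arbitrary seed, for reproducibility
# Wild type: 13 reversals in 126 trials.
samples = metropolis(lambda t: log_post(t, 13, 126, 1, 8),
                     theta0=0.5, n_steps=20000, step=0.05, rng=rng)
kept = samples[2000:]  # discard an assumed burn-in period
post_mean = sum(kept) / len(kept)
print(f"WT posterior mean (Metropolis): {post_mean:.3f}")
```

Because the proposal is symmetric, the Hastings correction cancels and only the ratio of target densities enters the acceptance step. The same call with the ASH or AVA counts and priors gives samples for the other strains.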

**c)** The posterior plots of $\theta$ are illuminating, but suppose we want to quantify *the difference* in reversal probability between two strains, say strain 1 and strain 2. That is, we want to compute $g(\delta_{12}\mid n_1, N_1, n_2, N_2)$, where $\delta_{12} \equiv \theta_2 - \theta_1$. Note that computing this distribution by hand is quite difficult.
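
With samples in hand, this becomes simple: each joint draw of $(\theta_1, \theta_2)$ yields one draw of $\delta_{12}$ by subtraction. The sketch below illustrates the idea with stand-in posterior draws for WT and ASH, assuming the beta priors from part (a), so that the posteriors are Beta(14, 121) and Beta(41, 92) by conjugacy. The seed and sample count are arbitrary.

```python
import random

rng = random.Random(2)  # arbitrary seed, for reproducibility
n_samples = 10000

# Stand-in posterior draws (assumption: priors Beta(1, 8) and Beta(2, 7),
# data 13/126 for WT and 39/124 for ASH).
theta_1 = [rng.betavariate(14, 121) for _ in range(n_samples)]
theta_2 = [rng.betavariate(41, 92) for _ in range(n_samples)]

# Each pairwise difference is one sample of delta_12 = theta_2 - theta_1.
delta_12 = sorted(t2 - t1 for t1, t2 in zip(theta_1, theta_2))

mean_delta = sum(delta_12) / n_samples
# Central 95% interval read off the sorted samples.
lo = delta_12[int(0.025 * n_samples)]
hi = delta_12[int(0.975 * n_samples)]
frac_positive = sum(d > 0 for d in delta_12) / n_samples

print(f"mean delta_12 = {mean_delta:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
print(f"fraction of samples with delta_12 > 0: {frac_positive:.4f}")
```

With the Stan samples from part (a), the equivalent computation is `df_mcmc["theta_2"] - df_mcmc["theta_1"]`, since each row of the data frame is a joint draw.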