# Credits
Maria did almost all of the work; Jared helped with defining priors, displaying ECDFs, and explanations.

In [2]:
import numpy as np
import scipy.stats as st
import scipy.special

import bebi103

import pandas as pd
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
import pystan

Features requiring DataShader will not work and you will get exceptions.


## Problem 6.1: Modeling and parameter estimation for Boolean data (40 pts)

In this problem, we will work with data of the True/False type. Lots of data sets in the biological sciences are like this. For example, we might look at a certain mutation in *Drosophila* that affects development and we might check whether or not eggs hatch.

The data we will use comes from an experiment we have done for the last few years in [Bi 1x](http://bi1x.caltech.edu/) here at Caltech. The experiment was developed by Meaghan Sullivan. We studied a neural circuit in *C. elegans* using optogenetics.

A neural circuit is a series of interconnected neurons that create a pathway to transmit a signal from where it is received to where it causes a behavioral response in an animal.  An example is the neural circuit involved in reversals in *C. elegans*.  This circuit consists of three types of neurons: sensory neurons receive stimuli from the environment, command interneurons integrate information from many sensory neurons and pass a signal to the motor neurons, and motor neurons control worm behavior, such as reversals.

There are six non-motor neurons acting in a circuit that responds to environmental cues and triggers a reversal, as shown in the figure below (based on [Schultheis et al. 2011](https://doi.org/10.1371/journal.pone.0018766)).  These include four sensory neurons (ALM, AVM, ASH, and PLM) and two interneurons (AVD and AVA).  Each sensory neuron is sensitive to a different type of stimulus.  For example, the sensory neuron we are studying (ASH) is sensitive to chemosensory stimuli such as toxins, while another neuron (PLM) is sensitive to mechanical stimuli (touch) in the posterior part of the worm's body.  The sensory neurons send signals that are integrated by the two command interneurons (AVA and AVD).  Each sensory neuron can provide an impulse to the command interneurons at any time.  In order for the command interneuron to fire and activate motor neurons, the sum of the stimuli at any point in time must exceed a certain threshold.  Once the stimuli from one or more sensory neurons have induced an action potential in a command interneuron, that signal is passed to motor neurons, which will modulate worm behavior.

![Reversal neural network](reversal_neural_network.png)

In the experiment, we used optogenetics to dissect the function of individual neurons in this circuit.  We worked with two optogenetic worm strains.  The ASH strain has channelrhodopsin (ChR2, represented by a red barrel in the figure above) expressed only in the ASH sensory neuron.  When we shine blue light on this strain, we should activate the ChR2, which will allow sodium and calcium cations to flow into the neuron, simulating an action potential. We want to quantify how robustly this stimulation will cause the worm to exhibit aversion behavior and reverse.  

We also studied an AVA strain that has channelrhodopsin expressed only in the AVA command interneuron.  Our goal is to quantify the effects of stimulating this neuron in terms of reversals compared to the ASH neuron and to wild type.

The True/False data here are whether or not the worms undergo a reversal. Here is what the students observed.

|Strain|Year|Trials|Reversals|
|:---:|:---:|:---:|:---:|
|WT|2017|55|7|
|ASH|2017|54|18|
|AVA|2017|52|28|
|WT|2016|36|6|
|ASH|2016|35|12|
|AVA|2016|36|30|
|WT|2015|35|0|
|ASH|2015|35|9|
|AVA|2015|36|33|

For the purposes of this problem, assume that we can pool the results from the three years to have 13/126 reversals for wild type, 39/124 reversals for ASH, and 91/124 reversals for AVA.
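The pooled counts can be checked directly from the table. The snippet below is a minimal sketch that sums trials and reversals per strain; the names `rows` and `pooled` are ours and are not part of the original problem.

```python
# Rows copied from the table above: (strain, year, trials, reversals)
rows = [
    ("WT", 2017, 55, 7),
    ("ASH", 2017, 54, 18),
    ("AVA", 2017, 52, 28),
    ("WT", 2016, 36, 6),
    ("ASH", 2016, 35, 12),
    ("AVA", 2016, 36, 30),
    ("WT", 2015, 35, 0),
    ("ASH", 2015, 35, 9),
    ("AVA", 2015, 36, 33),
]

# Accumulate (total trials N, total reversals n) for each strain
pooled = {}
for strain, _year, trials, reversals in rows:
    N, n = pooled.get(strain, (0, 0))
    pooled[strain] = (N + trials, n + reversals)

for strain, (N, n) in pooled.items():
    print(f"{strain}: {n}/{N} reversals")
```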

Our goal is to estimate $\theta$, the probability of reversal for each strain. That is to say, we want to compute $g(\theta\mid n, N)$, where $n$ is the number of reversals in $N$ trials.


**a)** Develop a generative model (that is, specify the joint distribution $\pi(n, \theta\mid N) = f(n\mid \theta, N)\,g(\theta)$) for the observed reversals. Be sure to do prior predictive checks and justify why you chose the model you did. *Biological hint*: *C. elegans* have no mode of sensing light at all. So, a wild type worm without any channelrhodopsin has no means of detecting light. *Modeling hint*: The Beta distribution is very useful for modeling probabilities of probabilities, like *θ* in this problem.

<br />


We need a prior for the parameter θ, the probability of reversal, for each strain. We choose the Beta distribution because it is defined on the interval [0, 1] and is useful for modeling probabilities of probabilities. We will vary the a and b parameters of the Beta distribution based on our estimates of how likely it is for each strain to reverse.


We estimate that the AVA strain is twice as likely to reverse as the ASH strain, because AVA is sensitive to the input of ASH and PLM, whereas ASH is only sensitive to chemosensory stimuli such as toxins.

We estimate that wild type will reverse very rarely, since it has no means of detecting light, so $\theta_{WT} \sim \text{Beta}(1, 8)$, shown graphically as:

In [3]:
sigma = np.linspace(0, 1, 200) # generate points
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='θ', 
                          y_axis_label='g(θ)')
p.line(sigma, st.beta.pdf(sigma, 1, 8, loc=0, scale=1), line_width=2)
bokeh.io.show(p)

<p style="background-color:#D9EDF7">
Good! 5/5
</p>

The ASH strain will still reverse with probability < 0.5, so $\theta_{ASH} \sim \text{Beta}(2, 7)$, shown graphically as:

In [4]:
sigma = np.linspace(0, 1, 200) # generate points
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='θ', 
                          y_axis_label='g(θ)')
p.line(sigma, st.beta.pdf(sigma, 2, 7, loc=0, scale=1), line_width=2)
bokeh.io.show(p)

<p style="background-color:#D9EDF7">
Nice! 5/5
</p>

The AVA strain is twice as likely to reverse, so we predict the distribution will be shifted to the right and broader, so $\theta_{AVA} \sim \text{Beta}(5, 6)$, shown graphically as:

In [5]:
sigma = np.linspace(0, 1, 200) # generate points
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='θ', 
                          y_axis_label='g(θ)')
p.line(sigma, st.beta.pdf(sigma, 5, 6, loc=0, scale=1), line_width=2)
bokeh.io.show(p)

<p style="background-color:#D9EDF7">
Looks reasonable! 5/5
</p>
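As a quick numerical check on the three priors, we can compare their means and central 95% intervals. This is only a sketch: the dictionary `priors` and the variable `summary` are our own names, and it relies on the fact that a Beta(a, b) distribution has mean a / (a + b).

```python
import scipy.stats as st

# Prior parameters (a, b) chosen above for each strain
priors = {"WT": (1, 8), "ASH": (2, 7), "AVA": (5, 6)}

summary = {}
for strain, (a, b) in priors.items():
    mean = st.beta.mean(a, b)
    # Central interval containing 95% of the prior probability
    low, high = st.beta.interval(0.95, a, b)
    summary[strain] = (mean, low, high)
    print(f"{strain}: mean = {mean:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")

# Ratio of prior means, to compare with our "twice as likely" estimate
ratio = summary["AVA"][0] / summary["ASH"][0]
print(f"AVA / ASH prior mean ratio = {ratio:.2f}")
```

The ratio of the AVA and ASH prior means is (5/11) / (2/9) = 45/22, which is close to the factor of two we had in mind.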

Before moving on to the posterior probability density functions, we put our prior parameters a and b, along with the data we have for the number of reversals n in N trials, into arrays to make performing calculations for all three strains easier.

In [6]:
# arrays for alpha and beta values for our three strains
a_array = [1, 2, 5]
b_array = [8, 7, 6]

# arrays for the datapoints we have for number of reversals for our three strains
n_array = [13, 39, 91]
N_array = [126, 124, 124]

Now we can do our prior predictive check.

In [7]:
n_ppc_samples = 1000

for i in range(3):
    # Draw parameters out of the prior
    theta = np.random.beta(a_array[i], b_array[i], size=n_ppc_samples)
    # Draw data sets out of the likelihood for each theta we drew out of the
    # prior, expressed as the fraction of trials with a reversal
    n = [np.random.binomial(N_array[i], t)/N_array[i] for t in theta]
    
    # graph all three 
    if i == 0:
        prior = bebi103.viz.ecdf(n, 
                             x_axis_label='probability of reversal',
                             color = "#f46d43",
                             legend= "WT",
                             alpha=1, 
                             line_alpha=0)
    elif i == 1:
        prior = bebi103.viz.ecdf(n, 
                             x_axis_label='probability of reversal',
                             p = prior,
                             color = "#249d43", 
                             legend= "ASH",
                             alpha=1,
                             line_alpha=0)
    else:
        prior = bebi103.viz.ecdf(n, 
                             x_axis_label='probability of reversal',
                             p = prior,
                             legend= "AVA",
                             alpha=1,
                             line_alpha=0)
bokeh.io.show(prior)

This looks how we expected, with the WT having very low probabilities of reversal, the ASH having higher probabilities, and the AVA being about twice as likely as the ASH to reverse, so we can now move on to our analysis!

<p style="background-color:#D9EDF7">
This looks perfect! 5/5
</p>

<p style="background-color:#D9EDF7">
I see from your code that you picked binomial for the reversal model, but you should have discussed this explicitly. For missing that I'll just take off 1 point. (4/5 for choosing the model).
    Overall for (a): 24/25
</p>

**b)** Plot the posterior probability density function for each of the three strains. What can you conclude from this?

In [8]:
# we will sample values for 200 thetas from 0.01 to 0.99
theta_array = np.linspace(0.01, 0.99, num=200)

output = [] # will contain three arrays, one for each of the three strains

for i in range(3):
    pp = []

    for t in theta_array:
        # Prior on each theta
        log_prior = st.beta.logpdf(t, a_array[i], b_array[i])

        # The log_likelihood is just the binomial distribution
        log_like = np.sum(st.binom.logpmf(n_array[i], N_array[i], t))

        pp.append(log_prior + log_like)
    
    # append the unnormalized log posterior for one strain to the output array
    output.append(pp)  # still in log space; we exponentiate when normalizing

We can compute the normalization constant (conceptually, the area under the curve of our line). Since we want to plot the probability density, we want the area under the curve to be equal to 1, so we divide our trial values by the normalization constant.

In [9]:
# function that takes in the raw log_prior + log_like data and 
# normalizes it using a normalization constant.
def normalize(data):
    # Subtract the maximum before exponentiating to avoid underflow
    data = np.asarray(data)
    trial = np.exp(data - data.max())
    
    # Compute normalization constant (area under the unnormalized curve)
    norm_constant = np.trapz(trial, x=theta_array)
    # Normalize so that the density integrates to 1
    trial_norm = trial / norm_constant
    return trial_norm

<p style="background-color:#D9EDF7">
Please use markdown cells between every code cell!
</p>

In [10]:
# call the above function for each strain to get the normalized output
trial_wt_norm = normalize(output[0])
trial_ash_norm = normalize(output[1])
trial_ava_norm = normalize(output[2])
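Because a Beta prior is conjugate to a binomial likelihood, the posterior for a Beta(a, b) prior with n reversals in N trials is Beta(a + n, b + N - n). The sketch below uses this to check a grid-based normalization like ours against the exact density. It is self-contained and recomputes the grid posterior; the names `max_abs_diff` and `exact` are ours, and we use `scipy.integrate.trapezoid` for the numerical integration.

```python
import numpy as np
import scipy.stats as st
from scipy.integrate import trapezoid

# Prior parameters and pooled data for WT, ASH, AVA (same order throughout)
a_array = [1, 2, 5]
b_array = [8, 7, 6]
n_array = [13, 39, 91]
N_array = [126, 124, 124]

theta = np.linspace(0.01, 0.99, num=200)

max_abs_diff = []
for a, b, n, N in zip(a_array, b_array, n_array, N_array):
    # Unnormalized log posterior on the grid: log prior + log likelihood
    log_post = st.beta.logpdf(theta, a, b) + st.binom.logpmf(n, N, theta)

    # Exponentiate stably, then normalize numerically
    post = np.exp(log_post - log_post.max())
    post /= trapezoid(post, x=theta)

    # Exact posterior from Beta-binomial conjugacy
    exact = st.beta.pdf(theta, a + n, b + N - n)
    max_abs_diff.append(np.max(np.abs(post - exact)))

print(max_abs_diff)
```

If the grid is fine enough and covers essentially all of the posterior mass, the two curves should agree closely, which gives us a sanity check on the numerical normalization.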

We can now plot our completed posterior probability density functions for each strain.

In [11]:
p = bokeh.plotting.figure(width=600, height=400, 
                          x_axis_label='probability of reversal (theta)', 
                          y_axis_label='p(theta)')

p.line(theta_array, trial_wt_norm, line_color="#f46d43", legend= "WT", line_width=2)
p.line(theta_array, trial_ash_norm, line_color="#249d43", legend= "ASH", line_width=2)
p.line(theta_array, trial_ava_norm, legend= "AVA", line_width=2)

bokeh.io.show(p)

From this plot we can conclude that our priors combined with our likelihood give posteriors for θ that are ordered as we expected from the three pooled data points we were given: WT < ASH < AVA in terms of probability of reversal.
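To put numbers on this ordering, one option is to summarize each posterior with a 95% credible interval and to estimate how probable it is that one strain reverses more often than another. The sketch below assumes the Beta-binomial conjugate posterior Beta(a + n, b + N - n); the sample size, the seed, and the names `intervals`, `samples`, `p_ash_gt_wt`, and `p_ava_gt_ash` are our own choices for illustration.

```python
import numpy as np
import scipy.stats as st

# Prior parameters (a, b) and pooled data (n reversals, N trials) per strain
priors = {"WT": (1, 8), "ASH": (2, 7), "AVA": (5, 6)}
data = {"WT": (13, 126), "ASH": (39, 124), "AVA": (91, 124)}

rng = np.random.default_rng(3252)  # arbitrary seed for reproducibility

intervals = {}
samples = {}
for strain, (a, b) in priors.items():
    n, N = data[strain]
    # Parameters of the conjugate posterior
    a_post, b_post = a + n, b + N - n

    # Central 95% credible interval from the posterior quantiles
    intervals[strain] = st.beta.ppf([0.025, 0.975], a_post, b_post)

    # Independent posterior draws for comparing strains
    samples[strain] = rng.beta(a_post, b_post, size=100_000)
    low, high = intervals[strain]
    print(f"{strain}: 95% credible interval = [{low:.3f}, {high:.3f}]")

# Fraction of draws in which one strain has the larger reversal probability
p_ash_gt_wt = np.mean(samples["ASH"] > samples["WT"])
p_ava_gt_ash = np.mean(samples["AVA"] > samples["ASH"])
print(f"P(theta_ASH > theta_WT)  ~ {p_ash_gt_wt:.4f}")
print(f"P(theta_AVA > theta_ASH) ~ {p_ava_gt_ash:.4f}")
```

Treating the strains as independent is itself an assumption of this sketch; it follows from modeling each strain with its own prior and its own pooled data.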

<p style="background-color:#D9EDF7">
You have some good discussion of the biology earlier in the problem, but should expand your description of the results here at the end. 10/10 for the plot, 4/5 for the biological explanation.
    
</p>

<p style="background-color:#D9EDF7">
Overall score: 38/40
</p>