# Credits: 
Jared thought out most of the priors, and Jared and John discussed the approach to the problem.

John did the rest of the assignment. 

## Problem 6.3: Analysis of FRAP data (40 pts)

In [homework set 4](hw4.html#Problem-4.1:-Analysis-of-FRAP-data-(40-pts)), we began analyzing a FRAP experiment by Nate Goehring and corworkers. You performed image analysis to obtain the mean fluorescence of the bleach spot versus time. In this problem, you will use those data to obtain estimates for the diffusion coefficient $D$ and chemical rate constant $k_\mathrm{off}$ for the PH-PLCd1/PIP2 complex.

As a reminder, we are taking a simplified approach, but there is more sophisticated analysis we can do to get better estimates for the phenomenological coefficients. These are discussed in the [Goehring, et al. paper](http://dx.doi.org/10.1016/j.bpj.2010.08.033). Instead, we will use the the mean fluorescence of the bleached region, $I(t)$ to perform our analysis. As derived in their paper, 

\begin{align}
I_\mathrm{norm}(t) \equiv I(t)/I_0 &= 
f_f\left(1 - f_b\,\frac{4 \mathrm{e}^{-k_\mathrm{off}t}}{d_x d_y}\,\psi_x(t)\,\psi_y(t)\right),\\[1mm]
\text{where } \psi_i(t) &= \frac{d_i}{2}\,\mathrm{erf}\left(\frac{d_i}{\sqrt{4Dt}}\right)
-\sqrt{\frac{D t}{\pi}}\left(1 - \mathrm{e}^{-d_i^2/4Dt}\right),
\end{align}

where $d_x$ and $d_y$ are the extent of the photobleached box in the $x$- and $y$-directions, $f_b$ is the fraction of fluorophores that were bleached, $f_f$ is the fraction of total fluorescent species left after photobleaching, and $\mathrm{erf}(x)$ is the [error function](http://en.wikipedia.org/wiki/Error_function). Here, $I_0$ is the mean fluorescence in the bleach spot before bleaching. Note that this function is defined such that the photobleaching event occurs at time $t = 0$.

Your task in this problem is to develop a generative model and then to find estimates for the parameters of the model. We will revisit this problem again later in the course and build a hierarchical model. For this problem, consider each of the eight trials separately and use optimization to find the MAP parameter values for each trial and make a Gaussian approximation of the posterior to give an approximate 95% credible region.

You should have acquired a data set of mean fluorescence versus time, and you should use that for your analysis. If you do not have that data set, you can download those generated in the [solutions](http://bebi103.caltech.edu/2018_protected/hw_solutions/hw4_solutions.html) [here](http://bebi103.caltech.edu/2018_protected/hw_solutions/hw_4.1_frap_image_processing_results.csv).

In [1]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import bebi103

import altair as alt
import bokeh.plotting
import bokeh.io
from bokeh.palettes import all_palettes 
from bokeh.models import Legend, LegendItem
bokeh.io.output_notebook()

I will start by importing the data from assignment 4.1. 

In [2]:
df = pd.read_csv("./FRAP_intensities.csv").drop(columns = ["Unnamed: 0"])
df.head()

Unnamed: 0,time_(s),FRAP_exp_0,FRAP_exp_1,FRAP_exp_2,FRAP_exp_3,FRAP_exp_4,FRAP_exp_5,FRAP_exp_6,FRAP_exp_7
0,0.0,1258351.0,1313857.0,1298173.0,1130039.0,1422950.0,1426729.0,1315717.0,1339639.0
1,0.188,1226491.0,1319980.0,1327087.0,1111820.0,1400194.0,1412738.0,1346119.0,1346634.0
2,0.376,1243423.0,1321047.0,1285086.0,1149348.0,1405642.0,1417820.0,1331615.0,1339013.0
3,0.564,1260316.0,1324294.0,1267383.0,1106456.0,1413989.0,1374510.0,1325700.0,1368417.0
4,0.752,1231358.0,1305456.0,1252818.0,1105251.0,1396514.0,1417696.0,1318447.0,1367902.0


It has not been normalized, and I'd like to make the total intensity values in the viscinity of 1. The data came from a 10-bit camera, so the max pixel intensity is given by $2^{10}-1 = 1023$. Thus, I will divide all intensities by the area of the roi ($40^2$) and 1000. 

In [3]:
for exp in range(0, 8):
    name = "FRAP_exp_%i"%exp
    df[name] = df[name].values / (1000 * 40 * 40)
df.head()

Unnamed: 0,time_(s),FRAP_exp_0,FRAP_exp_1,FRAP_exp_2,FRAP_exp_3,FRAP_exp_4,FRAP_exp_5,FRAP_exp_6,FRAP_exp_7
0,0.0,0.786469,0.821161,0.811358,0.706274,0.889344,0.891706,0.822323,0.837274
1,0.188,0.766557,0.824987,0.829429,0.694887,0.875121,0.882961,0.841324,0.841646
2,0.376,0.777139,0.825654,0.803179,0.718342,0.878526,0.886138,0.832259,0.836883
3,0.564,0.787698,0.827684,0.792114,0.691535,0.883743,0.859069,0.828562,0.855261
4,0.752,0.769599,0.81591,0.783011,0.690782,0.872821,0.88606,0.824029,0.854939


Photobleaching occurs at $t = 3.760$, but we want this time to be $t=0$. Thus, I will subtract 3.76 from all times. 

In [4]:
df["time_(s)"] = df["time_(s)"].values - 3.76
df.head()

Unnamed: 0,time_(s),FRAP_exp_0,FRAP_exp_1,FRAP_exp_2,FRAP_exp_3,FRAP_exp_4,FRAP_exp_5,FRAP_exp_6,FRAP_exp_7
0,-3.76,0.786469,0.821161,0.811358,0.706274,0.889344,0.891706,0.822323,0.837274
1,-3.572,0.766557,0.824987,0.829429,0.694887,0.875121,0.882961,0.841324,0.841646
2,-3.384,0.777139,0.825654,0.803179,0.718342,0.878526,0.886138,0.832259,0.836883
3,-3.196,0.787698,0.827684,0.792114,0.691535,0.883743,0.859069,0.828562,0.855261
4,-3.008,0.769599,0.81591,0.783011,0.690782,0.872821,0.88606,0.824029,0.854939


The data looks good! 

Let's think carefully about our model. 

The likelihood of each datapoint is given by the theoretical prediction, plus some guassian error. Thus, we have the following:
$$\text{Likelihood} = \text{Norm}(I(t), \sigma).$$ 

In total, we now have six parameters to estimate: $I_0$, $f_b$, $f_f$, $k_\mathrm{off}$, $D$, and $\sigma$.

$I_0$ = Norm(1, .25): This is the value of the function prior to time point 0. It should be ~1, so accounting for the variation in measurement we will represent it as a Gaussian about 1, being careful to keep the distribution tight enough to not include negative values. 


$f_f$ = Beta(5.5, 1.4): This is the fraction of total fluorophores left after photobleaching, so it is the value of $I(0)/I_0$. Since is is between 0 and 1, this should be easily modeled as a beta distribution that leans heavily towards 1 (we expect that the bleached region is small in comparison to the cell). 


$f_b$ = Beta(10, 1.4): This is the fraction of fluorphores within the bleached region that were bleached. It must be between 0 and 1, so a beta distribution would make sense. Additionally, we expect the bleached region to be almost entirely bleached, so this should lean very heavily towards 1. 

$D$ = LogNorm(ln(1.7), .75): From the paper we expect the diffusion coefficient to be in the viscinity of 1.7 $\frac{\mu \text{m}^2}{\text{s}}$, but we wand a lot of variability, and also to keep it greater than zero. Thus, log-normal seemed like a reasionable choice. 

$k_\mathrm{off}$ = LogNorm(ln(.12), .75): Also from the paper, we expect $k_\mathrm{off}$ to be around .12 $\text{s}^{-1}$, and we made it lognormal for the same reasons as we made $D$ lognormal. 

$\sigma$ = Norm(.1, .1): We expect a variability in measurement of about .1, and to vary from that by about .1. 

Thankfully we know $d_x$ and $d_y$ exactly. 
$$d_x = d_y = 40 \text{ pixels} * \frac{.138 \mu \mathrm{m}}{1\text{ pixel}} = 5.52 \,\mu\text{m}$$

Now I will code up the likelihood function. 

In [5]:
d = 5.52 #μm

# Since dx = dy, I can make one psi function and re-use it
def psi(params, t):
    I_0, f_b, f_f, k_off, D, sigma = params
    temp = d / 2 * scipy.special.erf(d / np.sqrt(4 * D * t)) 
    temp -= np.sqrt(D * t / np.pi) * (1 - np.exp(-d * d / (4 * D * t)))
    return temp

def I(params, t):
    # Unpack parameters
    I_0, f_b, f_f, k_off, D, sigma = params
    # If the time is not yet at photobleaching, the model predicts intensity I_0
    if t<0:
        return I_0
    # Otherwise return the post-photobleaching result
    return I_0 * f_f * (1 - f_b * (4 * np.exp(-k_off * t) / (d * d) * np.power(psi(params, t), 2)))

I'd like to do a prior predictive check on our model before working with any data. This should be simple enough. 

In [6]:
num_samples = 100
time = df["time_(s)"].values
p = bokeh.plotting.Figure(x_axis_label = "Time (s)",
                          y_axis_label = "Fluorescence Intensity")
for sample in range(0, num_samples):
    # Draw each parameter from its distribution
    I_0 = 0.25 * np.random.randn() + 1
    f_b = np.random.beta(10, 1.4)
    f_f = np.random.beta(5.5, 1.4)
    D = np.random.lognormal(mean = 1.7, sigma = .75)
    k_off = np.random.lognormal(mean = .12, sigma = .75)
    sigma = 0.1 * np.random.randn() + 0.1
    
    # Combine parameters into an array
    params = [I_0, f_b, f_f, k_off, D, sigma]
    
    # Plot each model
    model = np.zeros(len(time))
    for i, t in zip(range(0, len(time)), time):
        model[i] = I(params, t)
    p.line(time, model, alpha = 0.3)

bokeh.io.show(p)

  
  import sys


Our prior predictive check looks great! Let's turn our focus to finding MAP parameters. The first step is to define the negtive log posterior so that we can minimize it. 

In [7]:
def neg_log_post(params, intensities, time):
    """Negative log posterior for FRAP analysis."""
    # Make sure parameters are physical
    if (params < 0).any():
        return np.inf
    
    # Unpack params
    I_0, f_b, f_f, k_off, D, sigma = params
    
    # Calculate the log likelihood
    post = 0
    for t, intensity in zip(time, intensities):
        post += st.norm.logpdf(intensity, loc=I(params, t), scale=sigma)
    
    # Add the log priors for each parameter
    # I_0
    post += st.norm.logpdf(I_0, loc=1, scale=0.25)
    # f_b
    post += st.beta.logpdf(f_b, 10, 1.4)
    # f_f
    post += st.beta.logpdf(f_f, 5.5, 1.4)
    # k_off
    post += st.lognorm.logpdf(k_off, .75, scale = np.exp(.12))
    # D
    post += st.lognorm.logpdf(D, .75, scale = np.exp(1.7))
    # sigma
    post += st.norm.logpdf(sigma, loc = 0.1, scale = 0.1)
    
    return -1 * post

Now I just need to define initial parameters and run `scipy.optimize.minimize()` to find the MAP parameters that minimize the negative log posterior. 

In [8]:
# Define initial parameters
params_0 = [1, .9, .9, .12, 1.7, .1]

# Create Dataframe for storing MAP parameters
cols = ["Exp_num", "I_0", "f_b", "f_f", "k_off", "D", "sigma"]
df_params = pd.DataFrame(columns = cols)

for exp_num in range(0, 8):
    # Establish arguments for each experiment
    args = (df["FRAP_exp_%i"%exp_num].values, df["time_(s)"].values)

    # Compute the MAP
    res = scipy.optimize.minimize(neg_log_post, 
                                  params_0,
                                  args=args,
                                  method='powell')
    
    parameters = np.append(np.array([exp_num], dtype = int), res.x)
    df_params = df_params.append(pd.DataFrame(data=[parameters],columns = cols))
df_params.index = np.ndarray.astype(df_params["Exp_num"].values, int)
df_params = df_params.drop(columns = ["Exp_num"])
df_params

  
  import sys
  tmp2 = (x - v) * (fx - fw)


Unnamed: 0,I_0,f_b,f_f,k_off,D,sigma
0,0.774695,0.852111,0.940232,0.237972,0.464192,0.010252
1,0.821877,0.86659,0.966741,0.131217,0.846614,0.01079
2,0.791092,0.862135,0.968034,0.174463,0.632369,0.01028
3,0.692781,0.797253,0.907521,0.344823,0.310754,0.009305
4,0.868938,0.83467,0.906512,0.257519,0.185162,0.009466
5,0.872772,0.820365,0.910307,0.291399,0.258721,0.009308
6,0.81876,0.822883,0.955317,0.251312,0.313006,0.010234
7,0.836397,0.811341,0.891754,0.386163,0.1491,0.009635


Now I will find and report 95% confidence intervals. 

In [9]:
# Define a log_posterior function because I need it for the hessian
def log_post(params, intensities, time):
    return -1 * neg_log_post(params, intensities, time)

# Store covariance matrices
covariances = [0]*8

# Iterate through experiments and calculate hessians, covariances, and report
for row, exp_num in zip(df_params.iterrows(), range(0, 8)):
    # Extract parameters from row
    row = row[1].values
    
    # Establish arguments for each experiment
    args = (df["FRAP_exp_%i"%exp_num].values, df["time_(s)"].values)
    
    # Compute hessian
    hess = smnd.approx_hess(row, log_post, args=args)
    
    # Compute the covariance matrix
    cov = -np.linalg.inv(hess)
    covariances[exp_num] = cov
    
    # Report parameters within 95% confidence
    print("Experiment #%i" % exp_num)
    for name, param, i in zip(cols[1:], row, range(0, len(row))):
        print(name + ": %.4f ± %.4f"%(param, 2 * np.sqrt(cov[i,i])))
    print("")

  
  import sys


Experiment #0
I_0: 0.7747 ± 0.0046
f_b: 0.8521 ± 0.0253
f_f: 0.9402 ± 0.0068
k_off: 0.2380 ± 0.0411
D: 0.4642 ± 0.2052
sigma: 0.0103 ± 0.0012

Experiment #1
I_0: 0.8219 ± 0.0048
f_b: 0.8666 ± 0.0225
f_f: 0.9667 ± 0.0086
k_off: 0.1312 ± 0.0303
D: 0.8466 ± 0.1887
sigma: 0.0108 ± 0.0013

Experiment #2
I_0: 0.7911 ± 0.0046
f_b: 0.8621 ± 0.0229
f_f: 0.9680 ± 0.0074
k_off: 0.1745 ± 0.0324
D: 0.6324 ± 0.1857
sigma: 0.0103 ± 0.0012

Experiment #3
I_0: 0.6928 ± 0.0042
f_b: 0.7973 ± 0.0249
f_f: 0.9075 ± 0.0063
k_off: 0.3448 ± 0.0474
D: 0.3108 ± 0.1884
sigma: 0.0093 ± 0.0011

Experiment #4
I_0: 0.8689 ± 0.0042
f_b: 0.8347 ± 0.0193
f_f: 0.9065 ± 0.0053
k_off: 0.2575 ± 0.0244
D: 0.1852 ± 0.0899
sigma: 0.0095 ± 0.0011

Experiment #5
I_0: 0.8728 ± 0.0042
f_b: 0.8204 ± 0.0197
f_f: 0.9103 ± 0.0051
k_off: 0.2914 ± 0.0308
D: 0.2587 ± 0.1218
sigma: 0.0093 ± 0.0011

Experiment #6
I_0: 0.8188 ± 0.0046
f_b: 0.8229 ± 0.0223
f_f: 0.9553 ± 0.0065
k_off: 0.2513 ± 0.0346
D: 0.3130 ± 0.1495
sigma: 0.0102 ± 0.0012


In general, printed data is not as good as graphed data. I will report the credible regions for the diffusion coefficients and chemical rate constants. I will start this by making dataframes. 

In [10]:
int_D = [0]*len(covariances)
int_k_off = [0]*len(covariances)
for cov, i in zip(covariances, range(0, len(covariances))):
    k_off = df_params["k_off"].values[i]
    D = df_params["D"].values[i]
    int_D[i] = np.array([D - np.sqrt(cov[4,4]), D + np.sqrt(cov[4,4])])
    int_k_off[i] = np.array([k_off - np.sqrt(cov[3,3]), k_off + np.sqrt(cov[3,3])])
    
k = bokeh.plotting.Figure(height = 200,
                          width = 400,
                          x_axis_label = "k_off",
                          y_axis_label = "Experiment #",
                          title = "95% Confidence Intervals for k_off")
for i in range(0, 8):
    k.line(int_k_off[i], [i, i], line_width = 3)
bokeh.io.show(k)

In [11]:
D = bokeh.plotting.Figure(height = 200,
                          width = 400,
                          x_axis_label = "D",
                          y_axis_label = "Experiment #",
                          title = "95% Confidence Intervals for diffusion coefficient")
for i in range(0, 8):
    D.line(int_D[i], [i, i], line_width = 3)
bokeh.io.show(D)

It is especially striking that is so little overlap between diffusion coeficients and k_off.

To ensure the validity of these results, I want to make one last plot that compares the actual data with its respective MAP approximation. 

In [12]:
# Deals with coloring of different lines
colors = all_palettes['Viridis'][8]

first = True # used to only show the first plot. 

p = bokeh.plotting.Figure(width = 800, 
                          height = 500,
                          title = "Actual vs. Modeled Fluorescence Intensity",
                          x_axis_label = "Time (s)")


for row, exp_num in zip(df_params.iterrows(), range(0, 8)):
    # Extract parameters from row
    row = row[1].values
    name = "FRAP_exp_%i"%exp_num
    
    # Calculate values for theoretical model with MAP parameters
    model = np.zeros(len(df["time_(s)"].values))
    for t , i in zip(df["time_(s)"].values, range(0, len(model))):
        model[i] = I(row, t)
        
    p.line(df['time_(s)'].values,
           df[name].values, 
           color=colors[exp_num], 
           visible = first, 
           legend = "Experiment %i"%exp_num)
    
    p.line(df['time_(s)'].values,
           model, 
           color='black', 
           visible = first, 
           legend = "Experiment %i"%exp_num)
    
    p.legend.click_policy = 'hide'
    p.legend.location = "bottom_right"
    first = False # used to only show the first plot. 

bokeh.io.show(p)

  
  import sys


Looks good to me!