## Notes on problem 9.1: hierarchical models and model selection.

**Hierarchical models** and **model selection** are useful for two reasons:
1. We use them to make more coherent arguments about what our data mean (relative to other analysis approaches).
2. They can be easily expressed with Bayes' Theorem.

Specifically, hierarchical models help us think about experimental repeats. We avoid the two extreme claims that experimental repeats are perfectly identical or that they are totally unrelated. Instead, we describe exactly how we expect different experiments to be related. However, there can be good reasons to go with either extreme claim.

Similarly, model selection quantifies the relative "goodness" of one model over another, considering all the relevant pieces of information. These include how well the model fits the data and the complexity of the model. If two mathematical models result from derivations for different mechanisms, this is an exciting way to use quantitative measurements to infer which mechanism is more probable!

### (Nearly) all datasets are hierarchical.

I spend most of my time quantifying cell responses to different BMPs, a family of important developmental signaling proteins. My experiments often look something like this:

![BMP2 data structure](9.1notes_fig1.png)

In Justin's post-doc, he and his labmates visited slaughterhouses to collect cow brains from which they could harvest tubulin. If they measured the kinetic binding properties of each tubulin sample, they would generate a dataset with the following structure.

![Tubulin data structure](9.1notes_fig2.png)

I imagine Frosty the Snowman considers the following data each year:

![Frosty data structure](9.1notes_fig3.png)

### But you need not use a hierarchical model (if you have a good reason not to).

In each of these cases, we're after some "true" parameter, such as the affinity of a cell receptor for BMP2, some kinetic property of tubulin, or when Frosty will fully disappear this spring. Great! We're excellent at parameter fitting. But which data do we fit? How do I expect to observe that parameter's true value: in the first cell I measure? In all the cells I measure? On average between cells I measured this week and last week? The answer isn't always straightforward.

Even though most datasets are hierarchical, we need not model them that way. In fact, we can choose one of three approaches: assume all repeats are identical, totally independent, or somewhere in between. Below, I consider each of these three approaches, why you might choose that approach, and how that affects the resulting model selection problem. (**Note that you can compare any two models: both hierarchical, both not, or one hierarchical and the other not!**) I like to think of this as a three-step process: specify what the data and parameters are, write Bayes' theorem for the parameter estimation problem, and write Bayes' theorem for the model comparison problem.

For simplicity, the following is written as though we are just interested in one fit parameter, $\alpha$ for one model ($M_i$) or $\gamma$ for another model ($M_j$).

## Extreme case 1: Make one pooled dataset.

### Part 1: Specify the data and parameters. 
For this approach, we just have $D$, a list of every datapoint we've ever measured. **This assumes that all the observations of this experiment (perhaps done on different days, or with different aliquots of the same reagent) are generated from the same parameter value**, some $\alpha$ for $M_i$ or $\gamma$ for $M_j$. 

![Pooled dataset](9.1notes_pooled_data.png)

In the case of the BMP example, this would be a list of all concentrations of protein and the associated cell responses for each concentration. 

### Part 2: Estimate the true parameter.  
Having specified the data, we know how to write Bayes' Theorem. (We'll write explicitly that this assumes that one model, $M_i$, is true.)

\begin{align}
\\ P(\alpha \mid D, M_i, I) = \frac{P(D \mid \alpha, M_i, I) P( \alpha \mid M_i, I)}{P(D \mid M_i, I)}
\end{align}

We'll complete this analysis as before. We neglect the evidence ($P(D \mid M_i, I)$), which does not change which value of $\alpha$ maximizes the posterior (i.e. is most probable). We specify our prior and likelihood, and then maximize the posterior.

This returns one most probable value of $\alpha$. We take this to be the "true value."
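As a minimal sketch of this pooled estimate, assume (purely for illustration) that each measurement is Gaussian-distributed about $\alpha$ with a known noise level and that $\alpha$ has a uniform prior. The data values, `SIGMA`, and the prior bounds below are all made up, and the function names are mine.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical pooled dataset D: every measurement, regardless of day.
D = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.2, 4.7, 5.4])
SIGMA = 0.5                        # assumed known measurement noise
ALPHA_MIN, ALPHA_MAX = 0.0, 10.0   # assumed bounds of a uniform prior on alpha


def log_prior(alpha):
    """ln P(alpha | M_i, I): uniform between the assumed bounds."""
    if ALPHA_MIN <= alpha <= ALPHA_MAX:
        return -np.log(ALPHA_MAX - ALPHA_MIN)
    return -np.inf


def log_likelihood(alpha, data):
    """ln P(D | alpha, M_i, I): independent Gaussian measurements."""
    resid = (data - alpha) / SIGMA
    return np.sum(-0.5 * resid**2 - np.log(SIGMA * np.sqrt(2 * np.pi)))


def neg_log_posterior(alpha, data):
    """Negative log posterior, dropping the evidence (constant in alpha)."""
    return -(log_likelihood(alpha, data) + log_prior(alpha))


# Maximize the posterior by minimizing its negative log within the prior bounds.
result = minimize_scalar(
    neg_log_posterior, bounds=(ALPHA_MIN, ALPHA_MAX), args=(D,), method="bounded"
)
alpha_map = result.x
print("MAP estimate of alpha:", alpha_map)
```

For this toy model the MAP coincides with the sample mean of the pooled data, which is a handy sanity check.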

### Part 3: Compare $M_i$ to $M_j$.  
We start with the definition of the odds ratio.

\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid D, I)}{P(M_j \mid D, I)}
\\ &= \frac{P( D \mid M_i, I) P(M_i \mid I)}{P( D \mid M_j, I) P(M_j \mid I)}
\end{align}

Note that we nearly always set $\frac{P(M_i \mid I)}{P(M_j \mid I)}$ equal to 1. This is usually because we have no good reason to prefer one model to another, but also because, unless we have very strong beliefs, it is hard to quantify the relative *a priori* probability of either model.

In the [model selection](http://bebi103.caltech.edu/2016/lecture_notes/l04_model_selection.pdf) and [PTMCMC](http://bebi103.caltech.edu/2016/lecture_notes/l05_ptmcmc.pdf) notes, Justin showed how we can find the missing term, $P( D \mid M_i, I)$. Note how it matches terms from the parameter estimation in Part 2.

\begin{align}
\\ P( D \mid M_i, I) &= \int d \alpha P( D \mid \alpha, M_i, I) P(\alpha \mid M_i, I)
\\ &= \mathrm{exp} \left( \int_0^1 d\beta \langle \mathrm{ln} P(D \mid \alpha, M_i, I) \rangle _{\beta} \right)
\\ &= Z_i(1)
\end{align}

These relationships are not immediately obvious and require careful derivation, but the key point is that PTMCMC returns the value $Z_i(1)$. We can use this to directly compute the odds ratio.
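To make the identity above less mysterious, here is a small numerical check on a one-parameter toy problem. It is not PTMCMC: the averages $\langle \ln P(D \mid \alpha, M_i, I) \rangle_\beta$ are computed by brute force on a grid, which is only feasible in very low dimensions. The Gaussian likelihood, uniform prior, data values, and grid sizes are all assumptions for illustration.

```python
import numpy as np

# Hypothetical pooled data and model assumptions.
D = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.2, 4.7, 5.4])
SIGMA = 0.5                        # assumed known measurement noise
ALPHA_MIN, ALPHA_MAX = 0.0, 10.0   # assumed uniform prior bounds


def trapezoid(y, x):
    """Trapezoid-rule integral of samples y over (possibly uneven) points x."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))


# Grid over the prior support, with the log likelihood at every grid point.
alpha = np.linspace(ALPHA_MIN, ALPHA_MAX, 4001)
prior = np.full_like(alpha, 1.0 / (ALPHA_MAX - ALPHA_MIN))
resid = (D[:, None] - alpha[None, :]) / SIGMA
log_like = np.sum(-0.5 * resid**2 - np.log(SIGMA * np.sqrt(2 * np.pi)), axis=0)

# Route 1: integrate likelihood * prior over alpha directly.
log_Z_direct = np.log(trapezoid(np.exp(log_like) * prior, alpha))


def mean_log_like(beta):
    """<ln likelihood> under the tempered distribution likelihood**beta * prior."""
    log_w = beta * log_like
    w = np.exp(log_w - log_w.max()) * prior   # shift by the max for stability
    return trapezoid(w * log_like, alpha) / trapezoid(w, alpha)


# Route 2: thermodynamic integration over beta from 0 to 1.
# The integrand changes fastest near beta = 0, so cluster the points there.
betas = np.linspace(0.0, 1.0, 2001) ** 4
log_Z_ti = trapezoid(np.array([mean_log_like(b) for b in betas]), betas)

print("ln Z, direct integration:       ", log_Z_direct)
print("ln Z, thermodynamic integration:", log_Z_ti)
```

The two routes should agree to within the discretization error. PTMCMC plays the role of `mean_log_like` when the parameter space is too large to grid.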

\begin{align}
\\ O_{ij} &= \frac{P( D \mid M_i, I) P(M_i \mid I)}{P( D \mid M_j, I) P(M_j \mid I)}
\\ &= \frac{Z_i(1) P(M_i \mid I)}{Z_j(1) P(M_j \mid I)}
\end{align}

#### Problems with assuming identical repeats:
If the variability between different "versions" of this experiment is significant, outlying versions may have disproportionate effects on your estimate of the "true" parameter. Moreover, you probably expect significant variation, particularly between experiments done on different days! The parameter $\alpha$ that generates the observed data _is_ likely different between days, but this approach does not capture that.

## Extreme case 2: Make _k_ separate datasets.

### Part 1: Specify the data and parameters. 
For this approach, we have _k_ datasets, each denoted $D_k$. **This assumes that all versions of this experiment (perhaps done on different days, or with different aliquots of the same reagent) are completely independent, and generated from completely independent values of $\alpha$ or $\gamma$.** 

![Independent datasets](9.1notes_ind_data.png)

In the case of the BMP example, this would be a separate list of protein concentrations and the associated cell responses for each concentration for each day. 

### Part 2: Estimate the true parameter.  
Having specified the data, we know how to write Bayes' Theorem. (Again, we'll write explicitly that this assumes that one model, $M_i$, is true.) This is the same procedure as for the pooled datasets, except that we repeat the analysis $k$ times, rather than doing it only once, as we have $k$ datasets, not just one.

\begin{align}
\\ P(\alpha_k \mid D_k, M_i, I) = \frac{P(D_k \mid \alpha_k, M_i, I) P( \alpha_k \mid M_i, I)}{P(D_k \mid M_i, I)}
\end{align}

We'll complete this analysis as before. We neglect the evidence ($P(D_k \mid M_i, I)$), which does not change which value of $\alpha_k$ maximizes the posterior (i.e. is most probable). We specify our prior and likelihood, and then maximize the posterior.

The issue with reporting a *true* value of $\alpha$ is that we actually generate a set of values, $\alpha_1, \alpha_2, ..., \alpha_k$. Because we assumed the datasets are independent, we cannot say how these different parameters are related to each other. Therefore, it's unclear how to report the true value, though we might start by taking the median or the mean.

### Part 3: Compare $M_i$ to $M_j$.  

Here, independence of the datasets can simplify our model selection problem! We update our previous definition of the odds ratio by writing the data not as $D$ but as the separate datasets $D_1, D_2, ..., D_k$, so that the evidence becomes their joint probability.

\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid D_1, \cdots, D_k, I)}{P(M_j \mid D_1, \cdots, D_k, I)}
\\ &= \frac{P( D_1, \cdots, D_k \mid M_i, I) P(M_i \mid I)}{P( D_1, \cdots, D_k \mid M_j, I) P(M_j \mid I)}
\end{align}

Now, we use a rule of probability. Stated with words, if $A$ and $B$ are independent events, their joint probability (the probability that both occur) is the product of the probabilities that each event occurs. Stated with math:

\begin{align}
\\ P(A, B \mid C) = P(A \mid C)P(B \mid C)
\end{align}

Our odds ratio simplifies to:

\begin{align}
\\ O_{ij} &= \frac{P(D_1 \mid M_i, I)}{P(D_1 \mid M_j, I)} \cdots \frac{P(D_k \mid M_i, I)}{P(D_k \mid M_j, I)}\frac{P(M_i \mid I)}{P(M_j \mid I)}
\\ &= \frac{P(M_i \mid I)}{P(M_j \mid I)} \prod_k \frac{P(D_k \mid M_i, I)}{P(D_k \mid M_j, I)}
\end{align}

Again, we can modify definitions we wrote in **Extreme case 1**, replacing $D$ with $D_k$, and $\alpha$ with $\alpha_k$. 

\begin{align}
\\ P( D_k \mid M_i, I) &= \int d \alpha_k P( D_k \mid \alpha_k, M_i, I) P(\alpha_k \mid M_i, I)
\\ &= \mathrm{exp} \left( \int_0^1 d\beta \langle \mathrm{ln} P(D_k \mid \alpha_k, M_i, I) \rangle _{\beta} \right)
\\ &= Z_i^k(1)
\end{align}

Therefore, computing the odds ratio for many independent datasets is just a small extension of computing the odds ratio for a single dataset. The full odds ratio is the prior odds, $\frac{P(M_i \mid I)}{P(M_j \mid I)}$, multiplied by the product of the evidence ratios for each independent dataset.

\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid I)}{P(M_j \mid I)} \prod_k \frac{P( D_k \mid M_i, I)}{P( D_k \mid M_j, I)}
\\ &= \frac{P(M_i \mid I)}{P(M_j \mid I)} \prod_k \frac{Z_i^k(1) }{Z_j^k(1)}
\end{align}
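In practice, evidences are tiny numbers, so it is safer to work with their logarithms and turn the product over datasets into a sum. The sketch below assumes you already have $\ln Z_i^k(1)$ and $\ln Z_j^k(1)$ for each dataset (for example from PTMCMC). The numbers are made up, and `log_odds_ratio` is a hypothetical helper name.

```python
import math


def log_odds_ratio(log_Z_i, log_Z_j, prior_ratio=1.0):
    """ln O_ij for independent datasets.

    log_Z_i, log_Z_j: per-dataset log evidences under M_i and M_j.
    prior_ratio: P(M_i | I) / P(M_j | I), usually set to 1.
    """
    if len(log_Z_i) != len(log_Z_j):
        raise ValueError("Need one log evidence per dataset for each model.")
    # The product of evidence ratios becomes a sum of log differences.
    return math.log(prior_ratio) + sum(zi - zj for zi, zj in zip(log_Z_i, log_Z_j))


# Hypothetical log evidences for k = 3 independent datasets.
log_Z_i = [-12.1, -10.4, -11.8]
log_Z_j = [-13.0, -10.9, -12.5]

log_O_ij = log_odds_ratio(log_Z_i, log_Z_j)
O_ij = math.exp(log_O_ij)
print("ln O_ij =", log_O_ij, " O_ij =", O_ij)
```

With equal prior probabilities for the two models, the prior term contributes nothing and each dataset simply adds its own log evidence difference.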

#### Problems with assuming independence:
Assuming no relationship between the datasets (i.e. total independence) makes it unclear how to relate the different parameter estimates, $\alpha_1, \alpha_2, ..., \alpha_k$, to the true value $\alpha$. We are clearly overlooking some true fact of nature, that the parameters generating these various versions of the experiment are in fact related. We can still easily solve the model selection problem though, which is useful.

## Intermediate case 3: Specify the relationships between the datasets

### Part 1: Specify the data and parameters. 
Here we take the most nuanced approach, where we explicitly state how the different experiments are related. Usually, we say that there is some hyperparameter, from which parameters are drawn, and those drawn parameters generate the observed data. **This assumes that all versions of this experiment (perhaps done on different days, or with different aliquots of the same reagent) are somewhat related, and generated from related, but not identical values of $\alpha$ or $\gamma$.** In this way, results from different "versions" of the experiment influence each other, in that their individual parameter estimates must make sense in light of some hyperparameter. But the model still explicitly states that we do not expect experiments to be identical.

![Hierarchical datasets](9.1notes_hier_data.png)

Note that the specific relationships can be specified in **many ways**, depending on how similar you deem different repeats to be. In the cartoon above, I show examples of 2 and 3 layer hierarchical models. **If you use a hierarchical model in your solution, state explicitly what relationships between the data you are capturing. Which points are independent, identical, or related (i.e. neither independent nor identical)?**

For a concrete example, let's consider the cow brain example. Maybe you don't expect different preparations on different days to be that dissimilar. So, you would say that there is a *brain-specific* parameter, that depends on the "true" hyperparameter. Or maybe you have a different SURF student preparing the protein on different days of the week. Then you might consider the differences between days more significant. So you could have a *day-specific* parameter, which in turn depends on a *brain-specific* parameter, which depends on the "true" hyperparameter.

### Part 2: Estimate the true parameter.  
For a hierarchical model, Bayes' Theorem becomes significantly more complex. For one thing, we're no longer estimating a single parameter ($\alpha$) or a set of parameters in analogy to single parameters ($\alpha_k$). Instead, we're estimating a set of parameters (signified here as a vector, $\textbf{a}$) that depend on some hyperparameter (here $\alpha$). I'll also refer to the data as a vector now ($\textbf{D}$), as there are still subsets of the data (like $D_k$) that correspond to individual parameters (like element $a_k$ of vector $\textbf{a}$).

As Justin derived in [lecture](http://bebi103.caltech.edu/2016/lecture_notes/l08_hierarchical_models.pdf), we can simplify and rewrite the full posterior to get a version of Bayes' Theorem where we know how to mathematize each term.

\begin{align}
\\ P(\textbf{a}, \alpha \mid \textbf{D}, M_i, I) = \frac{P(\textbf{D} \mid \textbf{a}, M_i, I) P( \textbf{a} \mid \alpha, M_i, I) P(\alpha \mid M_i, I)}{P(\textbf{D} \mid M_i, I)}
\end{align}

$P(\textbf{D} \mid \textbf{a}, M_i, I)$ is where we fit each "version" of the experiment independently.  
$P( \textbf{a} \mid \alpha, M_i, I)$ is where we specify that each fit parameter depends on some hyperparameter.  
$P(\alpha \mid M_i, I)$ is where we specify our prior information about the parameter we wish to estimate.


We'll complete this analysis as before. We neglect the evidence ($P(\textbf{D} \mid M_i, I)$), which does not change which values of $\textbf{a}$ and $\alpha$ maximize the posterior (i.e. are most probable). We specify all the terms, and then maximize the posterior.

We now have a true value $\alpha$ that we can report, without making problematic assumptions about different experiments being identical, but without completely separating the datasets from each other.
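Here is one way the three terms of the hierarchical posterior might look in code, for a two-layer model. Everything specific is an assumption for illustration: Gaussian measurement noise `SIGMA` within a day, Gaussian day-to-day spread `TAU` of each $a_k$ around $\alpha$ (both treated as known), a uniform prior on $\alpha$, and invented data.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: one array of repeated measurements per day (D_1, D_2, D_3).
datasets = [
    np.array([4.8, 5.1, 5.3, 4.9]),
    np.array([5.6, 5.2, 5.7, 5.4]),
    np.array([4.5, 4.7, 4.4, 4.9]),
]
SIGMA = 0.5                        # assumed known noise within a day
TAU = 0.3                          # assumed known spread of a_k around alpha
ALPHA_MIN, ALPHA_MAX = 0.0, 10.0   # assumed uniform prior bounds on alpha


def log_normal(x, mu, sd):
    """Log of a Gaussian density, elementwise."""
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))


def log_posterior(params, datasets):
    """Unnormalized ln P(a, alpha | D, M_i, I); params = [a_1, ..., a_k, alpha]."""
    a, alpha = params[:-1], params[-1]
    if not ALPHA_MIN <= alpha <= ALPHA_MAX:
        return -np.inf
    lp = -np.log(ALPHA_MAX - ALPHA_MIN)        # ln P(alpha | M_i, I)
    lp += np.sum(log_normal(a, alpha, TAU))    # ln P(a | alpha, M_i, I)
    for a_k, D_k in zip(a, datasets):          # ln P(D | a, M_i, I)
        lp += np.sum(log_normal(D_k, a_k, SIGMA))
    return lp


def neg_log_posterior(params, datasets):
    return -log_posterior(params, datasets)


# Start each a_k at its own day's mean and alpha at the mean of those means.
day_means = np.array([D_k.mean() for D_k in datasets])
x0 = np.append(day_means, day_means.mean())
result = minimize(neg_log_posterior, x0, args=(datasets,))
a_map, alpha_map = result.x[:-1], result.x[-1]

print("day means:", day_means)
print("MAP a_k:  ", a_map)
print("MAP alpha:", alpha_map)
```

Each MAP $a_k$ lands between its own day's mean and $\alpha$. That "shrinkage" is the sense in which the different versions of the experiment influence each other.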

### Part 3: Compare $M_i$ to $M_j$.  

Here, our model selection problem does not simplify as easily. Without independence of the datasets, our definition of the odds ratio is unchanged.

\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid \textbf{D}, I)}{P(M_j \mid \textbf{D}, I)}
\\ &= \frac{P( \textbf{D} \mid M_i, I) P(M_i \mid I)}{P( \textbf{D} \mid M_j, I) P(M_j \mid I)}
\end{align}

As we have done before, we see how $P( \textbf{D} \mid M_i, I)$ is the fully marginalized numerator of the parameter estimation problem. We specified this above.

\begin{align}
\\ P( \textbf{D} \mid M_i, I) &= \iint\limits_{k+1} d\textbf{a} \ d\alpha \ P(\textbf{D} \mid \textbf{a}, M_i, I) P( \textbf{a} \mid \alpha, M_i, I) P(\alpha \mid M_i, I)
\\ &= Z_i(1)
\end{align}

Though more mathematically complicated, we can still get the desired quantity ($ P( \textbf{D} \mid M_i, I)$) from PTMCMC (as $Z_i(1)$). We can still compute the odds ratio exactly as:

\begin{align}
\\ O_{ij} &= \frac{P( \textbf{D} \mid M_i, I) P(M_i \mid I)}{P( \textbf{D} \mid M_j, I) P(M_j \mid I)}
\\ &= \frac{Z_i(1) }{Z_j(1)}\frac{P(M_i \mid I)}{P(M_j \mid I)}
\end{align}

#### Problems with the full treatment:
While this approach doesn't have the logical inconsistencies of calling related experiments either identical or independent, it can be computationally expensive. Specifying a hierarchical model increases the number of parameters, and is difficult to code up. So be thoughtful when considering how much RAM and time it will take to solve problems this way.

## Part 4: Remember there are two ways (one precise, one approximate) to compute the odds ratio.

**Precise odds ratio:**
\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid I)}{P(M_j \mid I)}\frac{\int d \alpha \ P( D \mid \alpha, M_i, I) P(\alpha \mid M_i, I)}{\int d \gamma \ P( D \mid \gamma, M_j, I) P(\gamma \mid M_j, I)}
\\ &= \frac{P(M_i \mid I)}{P(M_j \mid I)} \frac{Z_i(1) }{Z_j(1)}
\end{align}

To compute $Z_i(1)$ and $Z_j(1)$, we must use PTMCMC.

**Approximate odds ratio:**  
For sharply peaked (for example, Gaussian) posteriors, we can approximate the integral with the value of the integrand at the MAP (here, $\alpha^*$) multiplied by its width.

\begin{align}
\\ O_{ij} &= \frac{P(M_i \mid I)}{P(M_j \mid I)}\frac{\int d \alpha \ P( D \mid \alpha, M_i, I) P(\alpha \mid M_i, I)}{\int d \gamma \ P( D \mid \gamma, M_j, I) P(\gamma \mid M_j, I)}
\\ &\approx \frac{P(M_i \mid I)}{P(M_j \mid I)} \frac{P(D \mid \alpha^*, M_i, I) P(\alpha^* \mid M_i, I) \sqrt{2 \pi \sigma_{\alpha}^2} }{P(D \mid \gamma^*, M_j, I) P(\gamma^* \mid M_j, I) \sqrt{2 \pi \sigma_{\gamma}^2}}
\end{align}

(See [Justin's lecture notes](http://bebi103.caltech.edu/2016/lecture_notes/l04_model_selection.pdf), equation 4.10, for the multivariate form of this approximate integral.)

To compute $O_{ij}$ this way, we just need the MAP and the covariance matrix of the parameters for each model. These are usually easy to find, either with optimization or MCMC.
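A rough sketch of the approximate odds ratio for one-parameter models follows. As an illustration, the two hypothetical models share the same Gaussian likelihood and differ only in how wide their uniform prior is, so the odds ratio isolates the penalty for the less constrained model. The data, `SIGMA`, the prior bounds, and the helper names are all assumptions, and the width $\sigma^2$ is estimated from a finite-difference second derivative at the MAP.

```python
import numpy as np

# Hypothetical pooled data; the noise level is assumed known.
D = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.2, 4.7, 5.4])
SIGMA = 0.5


def make_log_integrand(prior_min, prior_max):
    """Build ln[P(D | theta, M, I) P(theta | M, I)] for a uniform prior."""
    def log_integrand(theta):
        if not prior_min <= theta <= prior_max:
            return -np.inf
        resid = (D - theta) / SIGMA
        log_like = np.sum(-0.5 * resid**2 - np.log(SIGMA * np.sqrt(2 * np.pi)))
        return log_like - np.log(prior_max - prior_min)
    return log_integrand


def laplace_log_evidence(log_integrand, theta_map, h=1e-4):
    """ln of (integrand at the MAP) * sqrt(2 pi sigma^2), with
    sigma^2 = -1 / (second derivative of the log integrand at the MAP)."""
    f0 = log_integrand(theta_map)
    curvature = (
        log_integrand(theta_map + h) - 2.0 * f0 + log_integrand(theta_map - h)
    ) / h**2
    sigma2 = -1.0 / curvature
    return f0 + 0.5 * np.log(2.0 * np.pi * sigma2)


log_f_i = make_log_integrand(0.0, 10.0)    # M_i: prior width 10
log_f_j = make_log_integrand(0.0, 100.0)   # M_j: same likelihood, prior width 100

# For this toy likelihood with a flat prior, the MAP is the sample mean.
theta_map = D.mean()

log_Z_i = laplace_log_evidence(log_f_i, theta_map)
log_Z_j = laplace_log_evidence(log_f_j, theta_map)
odds_ij = np.exp(log_Z_i - log_Z_j)   # taking P(M_i | I) / P(M_j | I) = 1
print("Approximate odds ratio O_ij:", odds_ij)
```

Both models fit the data equally well at the MAP, so the odds ratio comes entirely from the prior widths: the model with the ten-fold wider prior is ten-fold less probable.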

**Combination approach:** Also note that you can improve the performance of your PTMCMC by generating better initial guesses for walker positions. You can do this by first solving for all the parameters (whether your model is hierarchical or not) with optimization, starting MCMC at the optimization solution, and then starting the PTMCMC at the MCMC solution.