# Bayesian Inference: Posterior Predictive

```python
# Import some helper functions (please ignore this!)
from utils import *
```

**Context:** For safety-critical applications of ML, it's important that our model captures two notions of uncertainty. Aleatoric uncertainty captures inherent stochasticity in the system. In contrast, epistemic uncertainty is uncertainty over possible *models* that could have fit the data. Multiple models can fit the data when we have a lack of data and/or a lack of mechanistic understanding of the system. We realized that fitting models using the MLE only captured aleatoric uncertainty. To additionally capture epistemic uncertainty, we therefore had to rethink our modeling paradigm. Using Bayes' rule, we were able to obtain a *distribution* over model parameters given the data, $p(\theta | \mathcal{D})$ (the posterior). Using `NumPyro`, we sampled from the posterior to obtain a diversity of models that fit the data. We interpreted a greater diversity of models as indicating higher epistemic uncertainty.

**Challenge:** Now that we have a posterior over model parameters, we can capture *epistemic* uncertainty. But how do we use this diverse set of models to (1) make predictions, and (2) compute the log-likelihood (for evaluation)? To do this, we will derive the *posterior predictive*, a distribution that translates a distribution over parameters to a distribution over data. This distribution can then be used to make predictions and evaluate the log-likelihood.

**Outline:** 
* Provide intuition for the posterior predictive
* Derive the posterior predictive
* Introduce laws of conditional independence
* Evaluate the posterior predictive

## Intuition: Model Averaging

**Bayesian Modeling as Ensembling.** Recall that in the previous chapter we introduced *ensembling* as a way to capture epistemic uncertainty. In ensembling, we train a collection of models independently and hope that, due to quirks in optimization, we end up with a diverse collection of models. In a sense, doesn't our Bayesian approach provide us with an ensemble as well? After all, each set of parameters $\theta$ sampled from the posterior $p(\theta | \mathcal{D})$ represents a different model. Based on this analogy, we can create a "Bayesian" ensemble as follows:
1. We draw $S$ samples from the posterior. Each sample $\theta_s$ now represents a member of our ensemble.
2. Each posterior sample represents a different model, $p(\mathcal{D} | \theta_s)$.
    > For regression, we have $p_{Y | X}(y | x, \theta_s)$.

**Predicting.** Using this ensemble, we can predict by *averaging* the predictions of the ensemble members:
1. We draw $\mathcal{D}_s \sim p(\cdot | \theta_s)$ for each $\theta_s$.
    > For regression, we draw $y_s \sim p_{Y | X}(\cdot | x, \theta_s)$.
2. We average: $\frac{1}{S} \sum\limits_{s=1}^S \mathcal{D}_s$.
    > For regression, we average $\frac{1}{S} \sum\limits_{s=1}^S y_s$.
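
The two prediction steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's implementation: the linear model, the known noise level `noise_std`, and the Gaussian "posterior samples" are all assumptions standing in for samples you would obtain from an inference library.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: stand-in posterior samples for a linear model
# y = slope * x + intercept + noise. In practice these would be
# samples of theta drawn from p(theta | D) by an inference library.
S = 5000
slope_samples = rng.normal(loc=2.0, scale=0.1, size=S)
intercept_samples = rng.normal(loc=-1.0, scale=0.2, size=S)
noise_std = 0.5  # ASSUMPTION: known observation noise

x_new = 3.0  # new input for which we want a prediction

# Step 1: for each theta_s, draw y_s ~ p(y | x_new, theta_s).
mean_per_sample = slope_samples * x_new + intercept_samples
y_samples = rng.normal(loc=mean_per_sample, scale=noise_std)

# Step 2: average the draws to get a point prediction.
y_pred = y_samples.mean()

# The spread of the draws mixes aleatoric noise with epistemic
# uncertainty (the disagreement between ensemble members).
y_spread = y_samples.std()

print(f"prediction: {y_pred:.3f}, spread: {y_spread:.3f}")
```

Keeping the individual draws `y_samples`, rather than only their average, is what lets us report uncertainty alongside the prediction.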

**Evaluating Log-Likelihood.** Given test data, $\mathcal{D}^*$, we can use the ensemble to evaluate the model's log-likelihood:
1. We evaluate $p(\mathcal{D}^* | \theta_s)$ for each $\theta_s$.
    > For regression, we evaluate $p_{Y | X}(y^* | x^*, \theta_s)$ for each $\theta_s$.
2. We average and take the log: $\log \frac{1}{S} \sum\limits_{s=1}^S p(\mathcal{D}^* | \theta_s)$.
    > For regression, we average and take the log: $\log \frac{1}{S} \sum\limits_{s=1}^S p_{Y | X}(y^* | x^*, \theta_s)$.
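
A minimal sketch of this evaluation for a single test point, under the same kind of assumptions as before: a linear model with Gaussian noise, and Gaussian stand-ins for the posterior samples. The helper names `gaussian_log_pdf` and `log_mean_exp` are hypothetical. Working in log space and subtracting the maximum before exponentiating avoids numerical underflow when the likelihoods are tiny.

```python
import numpy as np

def gaussian_log_pdf(y, mean, std):
    """Log-density of a Normal(mean, std) evaluated at y."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((y - mean) / std) ** 2

def log_mean_exp(log_values):
    """Numerically stable log( mean( exp(log_values) ) )."""
    m = np.max(log_values)
    return m + np.log(np.mean(np.exp(log_values - m)))

rng = np.random.default_rng(0)

# ASSUMPTION: stand-in posterior samples for y = slope * x + intercept + noise.
S = 5000
slope_samples = rng.normal(loc=2.0, scale=0.1, size=S)
intercept_samples = rng.normal(loc=-1.0, scale=0.2, size=S)
noise_std = 0.5  # ASSUMPTION: known observation noise

# A single (hypothetical) test point.
x_star, y_star = 3.0, 5.2

# Step 1: evaluate log p(y* | x*, theta_s) for each posterior sample.
log_lik_per_sample = gaussian_log_pdf(
    y_star, slope_samples * x_star + intercept_samples, noise_std
)

# Step 2: average the likelihoods and take the log.
log_post_pred = log_mean_exp(log_lik_per_sample)

print(f"log posterior predictive at (x*, y*): {log_post_pred:.3f}")
```

When $\mathcal{D}^*$ contains several test points evaluated jointly, the log-likelihoods of the points would be summed within each sample $\theta_s$ before applying `log_mean_exp` across samples.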

**Formalizing Intuition.** As we will show next, this intuition is actually correct. 

## Derivation of the Posterior Predictive

**Goal.** We want to derive a formula for $p(\mathcal{D}^* | \mathcal{D})$, which represents the distribution of new data $\mathcal{D}^*$ given the observed data, $\mathcal{D}$. 
> For a regression model, this distribution is:
> \begin{align}
>     p_{Y^* | X^*, \mathcal{D}}(y^* | x^*, \mathcal{D}) &= p_{Y^* | X^*, \mathcal{D}}(y^* | x^*, x_1, \dots, x_N, y_1, \dots, y_N),
> \end{align}
> where $x^*$ is a *new* input for which we'd like to make a prediction, $y^*$.

**A Graphical Model for the Training *and* Test Data.** Notice that our posterior predictive includes a new random variable, $\mathcal{D}^*$. Let's incorporate it into our graphical model so we can better reason about it. 

TODO depict DGM for abstract model and regression

**Derivation.** Now we have all we need in order to derive a formula for $p(\mathcal{D}^* | \mathcal{D})$. Our first step is to multiply and divide $p(\mathcal{D}^* | \mathcal{D})$ by $p(\mathcal{D})$:
\begin{align}
p(\mathcal{D}^* | \mathcal{D}) &= \frac{p(\mathcal{D}^* | \mathcal{D}) \cdot p(\mathcal{D})}{p(\mathcal{D})} 
\end{align}
We do this so that we can write the numerator as the *joint* distribution of $\mathcal{D}^*$ and $\mathcal{D}$:
\begin{align}
&= \frac{p(\mathcal{D}^*, \mathcal{D})}{p(\mathcal{D})} 
\end{align}
Next, we use the law of total probability to re-write the above as a joint distribution over $\mathcal{D}^*$, $\mathcal{D}$, and $\theta$. We do this to introduce $\theta$ into the equation---since our model's prior, likelihood, and posterior all depend on $\theta$, it would be weird if the formula for $p(\mathcal{D}^* | \mathcal{D})$ didn't depend on it. This gives us:
\begin{align}
&= \frac{\int p(\mathcal{D}^*, \mathcal{D}, \theta) \cdot d\theta}{p(\mathcal{D})} 
\end{align}
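Now, we can factorize this joint distribution using the chain rule into three terms: the likelihood of the new data given the parameters and the observed data, $p(\mathcal{D}^* | \theta, \mathcal{D})$; the posterior, $p(\theta | \mathcal{D})$; and the marginal, $p(\mathcal{D})$: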
\begin{align}
&= \frac{\int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot p(\mathcal{D}) \cdot d\theta}{p(\mathcal{D})} 
\end{align}
Since $p(\mathcal{D})$ doesn't depend on $\theta$, we can take it out of the integral, thereby canceling it with the $p(\mathcal{D})$ in the denominator:
\begin{align}
&= \int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot d\theta
\end{align}
Finally, using the laws of conditional independence, we know that $p(\mathcal{D}^* | \theta, \mathcal{D}) = p(\mathcal{D}^* | \theta)$. This is because, by conditioning on $\theta$, we remove all paths connecting $\mathcal{D}$ to $\mathcal{D}^*$. In other words, $\theta$ summarizes all information from $\mathcal{D}$ needed to predict $\mathcal{D}^*$. This gives us the following equation:
\begin{align}
&= \int \underbrace{p(\mathcal{D}^* | \theta)}_{\text{likelihood of new data}} \cdot \underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} \cdot d\theta
\end{align}
As you can see, $p(\mathcal{D}^* | \mathcal{D})$ depends only on the posterior and the likelihood of the new data. Adding some syntactic sugar, we can write the above equation as:
\begin{align}
&= \mathbb{E}_{p(\theta | \mathcal{D})} \left[ p(\mathcal{D}^* | \theta) \right]
\end{align}
This shows that to evaluate $p(\mathcal{D}^* | \mathcal{D})$, we need to:
1. Draw posterior samples $\theta \sim p(\theta | \mathcal{D})$.
2. Average the likelihood $p(\mathcal{D}^* | \theta)$ across these samples.

As you can see, this matches our intuition exactly!
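
One way to sanity-check the sample-then-average recipe is on a model where the posterior predictive is known in closed form. The sketch below assumes a toy model that is not part of this chapter: observations $y_n \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$ and a prior $\theta \sim \mathcal{N}(\mu_0, \tau_0^2)$. For this conjugate model the posterior is $\mathcal{N}(\mu_N, \tau_N^2)$ and the posterior predictive is $\mathcal{N}(\mu_N, \tau_N^2 + \sigma^2)$, so the Monte Carlo average can be compared against the exact density.

```python
import numpy as np

def gaussian_pdf(y, mean, std):
    """Density of a Normal(mean, std) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

# ASSUMPTION: toy conjugate model, y_n ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2).
sigma = 1.0            # known observation noise
mu0, tau0 = 0.0, 2.0   # prior mean and standard deviation
data = rng.normal(loc=1.5, scale=sigma, size=10)  # synthetic observations

# Exact posterior p(theta | D) = N(post_mean, post_var).
post_precision = 1.0 / tau0**2 + len(data) / sigma**2
post_var = 1.0 / post_precision
post_mean = post_var * (mu0 / tau0**2 + data.sum() / sigma**2)

# Step 1: draw posterior samples theta_s ~ p(theta | D).
S = 20000
theta_samples = rng.normal(loc=post_mean, scale=np.sqrt(post_var), size=S)

# Step 2: average the likelihood p(y* | theta_s) across the samples.
y_star = 2.0
mc_estimate = gaussian_pdf(y_star, theta_samples, sigma).mean()

# Closed-form posterior predictive density for comparison.
exact = gaussian_pdf(y_star, post_mean, np.sqrt(post_var + sigma**2))

print(f"Monte Carlo: {mc_estimate:.4f}, exact: {exact:.4f}")
```

The two numbers should agree up to Monte Carlo error, which shrinks as $S$ grows.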

## Laws of Conditional Independence

Recall that the slope and intercept of a model may have been sampled independently under the prior, but as soon as we condition on data, they are no longer independent. 
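
This effect can be seen numerically in a small sketch. It assumes a Bayesian linear regression with known noise and an independent Gaussian prior on the slope and intercept, a setting in which the posterior covariance has the standard closed form $\left(\Sigma_0^{-1} + X^\top X / \sigma^2\right)^{-1}$. The specific inputs, prior scale, and noise level are illustrative choices.

```python
import numpy as np

# ASSUMPTION: illustrative inputs, all positive so the effect is easy to see.
x = np.linspace(1.0, 3.0, 20)

# Design matrix with one column for the slope and one for the intercept.
X = np.column_stack([x, np.ones_like(x)])

prior_std = 10.0  # ASSUMPTION: independent N(0, prior_std^2) priors
noise_std = 0.5   # ASSUMPTION: known observation noise

# Under the prior, slope and intercept are independent (diagonal covariance).
prior_cov = prior_std**2 * np.eye(2)

# Posterior covariance for Gaussian linear regression with known noise.
# In this conjugate setting it depends on the inputs x but not on the outputs y.
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + X.T @ X / noise_std**2)

def correlation(cov):
    """Correlation between the two parameters of a 2x2 covariance matrix."""
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

prior_corr = correlation(prior_cov)
post_corr = correlation(post_cov)

print(f"prior correlation: {prior_corr:.3f}")
print(f"posterior correlation: {post_corr:.3f}")
```

The posterior correlation is strongly negative here: with all inputs positive, a steeper slope has to be compensated by a lower intercept for the line to keep passing through the data.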

## Posterior Predictive of Different Models

````{admonition} Exercise: Derive the Posterior Predictive Distribution

For each of the models below, draw the directed graphical model, and then derive the posterior predictive formula.

**Part 1:** Bayesian predictive model. 
\begin{align}
\theta &\sim p_\theta(\cdot) \\
y_n | x_n, \theta &\sim p_{Y | X}(\cdot | x_n, \theta)
\end{align}

**Part 2:** Bayesian Concept-Bottleneck model (CBM). CBMs aim to make it easier to interpret model predictions. They do this by combining two models:
* CBMs first learn to predict "concepts" $c_n$ associated with the input $x_n$. In a CBM, a concept is just a discrete attribute associated with the input; for example, if $x_n$ is an image of wildlife, a concept could be rain, grass, dog, etc. You can think of $p_{C | X}$ as a classifier. 
* After having predicted the concept $c_n$ from the input $x_n$, CBMs attempt to predict the final output $y_n$ from the concept only. In this way, predictions of $y_n$ can be analyzed in terms of the concepts, which are easier to understand, instead of with respect to the inputs, which could be high dimensional and difficult to reason about.

A Bayesian CBM has the following generative process:
\begin{align}
\theta &\sim p_\theta(\cdot) \\
\phi &\sim p_\phi(\cdot) \\
c_n | x_n, \theta &\sim p_{C | X}(\cdot | x_n, \theta) = \mathrm{Cat}(\pi(x_n; \theta)) \\
y_n | c_n, \phi &\sim p_{Y | C}(\cdot | c_n, \phi),
\end{align}
where $\pi(\cdot; \theta)$ is a function that maps $x_n$ to the parameters of a categorical distribution. 

**Part 3:** Bayesian Factor Analysis.
\begin{align}
\theta &\sim p_\theta(\cdot) \\
z_n &\sim p_Z(\cdot) \\
x_n | z_n, \theta &\sim p_{X | Z, \theta}(\cdot | z_n, \theta) 
\end{align}

**Part 4:** Bayesian predictive model with input noise.
\begin{align}
\theta &\sim p_\theta(\cdot) \\
z_n &\sim p_Z(\cdot) \\
y_n | x_n, z_n, \theta &\sim p_{Y | X, Z, \theta}(\cdot | x_n, z_n, \theta) 
\end{align}
````

## Model Evaluation using the Posterior Predictive