# Bayesian Inference: Posterior Predictive

In [1]:
# Import some helper functions (please ignore this!)
from utils import *

**Context:** For safety-critical applications of ML, it's important that our model captures two notions of uncertainty. Aleatoric uncertainty captures inherent stochasticity in the system. In contrast, epistemic uncertainty is uncertainty over possible *models* that could have fit the data. Multiple models can fit the data when we have a lack of data and/or a lack of mechanistic understanding of the system. We realized that fitting models using the MLE only captured aleatoric uncertainty. To additionally capture epistemic uncertainty, we therefore had to rethink our modeling paradigm. Using Bayes' rule, we were able to obtain a *distribution* over model parameters given the data, $p(\theta | \mathcal{D})$ (the posterior). Using `NumPyro`, we sampled from the posterior to obtain a diversity of models that fit the data. We interpreted a greater diversity of models as indicating higher epistemic uncertainty.

**Challenge:** Now that we have a posterior over model parameters, we can capture *epistemic* uncertainty. But how do we use this diverse set of models to (1) make predictions, and (2) compute the log-likelihood (for evaluation)? To do this, we will derive the *posterior predictive*, a distribution that translates a distribution over parameters to a distribution over data. This distribution can then be used to make predictions and evaluate the log-likelihood.

**Outline:** 
* Provide intuition for the posterior predictive
* Derive the posterior predictive
* Introduce two probability laws used in the derivation
* Evaluate the posterior predictive

## Intuition: Model Averaging



## Derivation of the Posterior Predictive

**A Graphical Model for the Training *and* Test Data.** TODO

**General Derivation.** Now we have all we need in order to derive a formula for $p(\mathcal{D}^* | \mathcal{D})$. Our first step is to multiply and divide $p(\mathcal{D}^* | \mathcal{D})$ by $p(\mathcal{D})$. 
\begin{align}
p(\mathcal{D}^* | \mathcal{D}) &= \frac{p(\mathcal{D}^* | \mathcal{D}) \cdot p(\mathcal{D})}{p(\mathcal{D})} 
\end{align}
We do this so that we can write the numerator as the *joint* distribution of $\mathcal{D}^*$ and $\mathcal{D}$:
\begin{align}
&= \frac{p(\mathcal{D}^*, \mathcal{D})}{p(\mathcal{D})} 
\end{align}
Next, we use the law of total probability to re-write the above as a joint distribution over $\mathcal{D}^*$, $\mathcal{D}$, and $\theta$. We do this to introduce $\theta$ into the equation: since our model's prior, likelihood, and posterior all depend on $\theta$, it would be weird if the formula for $p(\mathcal{D}^* | \mathcal{D})$ did not depend on it. This gives us:
\begin{align}
&= \frac{\int p(\mathcal{D}^*, \mathcal{D}, \theta) \cdot d\theta}{p(\mathcal{D})} 
\end{align}
Now, we can factorize this joint distribution using the chain rule into three terms: the conditional distribution of the new data, $p(\mathcal{D}^* | \theta, \mathcal{D})$, the posterior, $p(\theta | \mathcal{D})$, and the marginal, $p(\mathcal{D})$:
\begin{align}
&= \frac{\int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot p(\mathcal{D}) \cdot d\theta}{p(\mathcal{D})} 
\end{align}
Since $p(\mathcal{D})$ doesn't depend on $\theta$, we can take it out of the integral, thereby canceling it with the $p(\mathcal{D})$ in the denominator:
\begin{align}
&= \int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot d\theta
\end{align}
Finally, using the laws of conditional independence, we know that $p(\mathcal{D}^* | \theta, \mathcal{D}) = p(\mathcal{D}^* | \theta)$. This is because, by conditioning on $\theta$, we remove all paths connecting $\mathcal{D}$ to $\mathcal{D}^*$. In other words, $\theta$ summarizes all information from $\mathcal{D}$ needed to predict $\mathcal{D}^*$. This gives us the following equation:
\begin{align}
&= \int \underbrace{p(\mathcal{D}^* | \theta)}_{\text{likelihood of new data}} \cdot \underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} \cdot d\theta
\end{align}
As you can see, $p(\mathcal{D}^* | \mathcal{D})$ is a function of the posterior and the joint data likelihood of the new data. Adding some syntactic sugar, we can write the above equation as:
\begin{align}
&= \mathbb{E}_{p(\theta | \mathcal{D})} \left[ p(\mathcal{D}^* | \theta) \right]
\end{align}
This shows that to evaluate $p(\mathcal{D}^* | \mathcal{D})$, we need to:
1. Draw posterior samples $\theta \sim p(\theta | \mathcal{D})$.
2. Average the likelihood $p(\mathcal{D}^* | \theta)$ across these samples.

Interpretation: the posterior predictive *averages* the diversity of models given to us by the posterior.
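
In practice, the expectation above is approximated with a finite number of posterior samples, $p(\mathcal{D}^* | \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^S p(\mathcal{D}^* | \theta_s)$ with $\theta_s \sim p(\theta | \mathcal{D})$. Since we usually work with log-likelihoods, the average is best computed with a log-mean-exp for numerical stability. The two steps can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes a toy model that is not part of this chapter ($\theta \sim \mathcal{N}(0, 2^2)$, $y_n | \theta \sim \mathcal{N}(\theta, 1)$), chosen because its posterior and posterior predictive are available in closed form, so the Monte Carlo estimate can be checked. The helper names (`normal_logpdf`, `log_mean_exp`, `log_posterior_predictive`) and the data values are hypothetical. With `NumPyro`, the samples would instead come from the sampler.

```python
import math
import random


def normal_logpdf(y, mean, std):
    """Log density of a Normal(mean, std**2) distribution evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * std**2) - 0.5 * ((y - mean) / std) ** 2


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def log_posterior_predictive(y_new, theta_samples, noise_std):
    """Monte Carlo estimate of log p(y_new | D) from posterior samples of theta."""
    # Step 2: evaluate the likelihood of the new data under each sampled model,
    # then average the likelihoods (in log space).
    log_liks = [normal_logpdf(y_new, theta, noise_std) for theta in theta_samples]
    return log_mean_exp(log_liks)


# Toy model (assumed for illustration): theta ~ N(0, prior_std^2),
# y_n | theta ~ N(theta, noise_std^2). The data values are made up.
prior_std, noise_std = 2.0, 1.0
data = [0.8, 1.3, 0.4, 1.9, 1.1]

# For this conjugate model, the posterior over theta is Normal in closed form.
post_var = 1.0 / (1.0 / prior_std**2 + len(data) / noise_std**2)
post_mean = post_var * sum(data) / noise_std**2

# Step 1: draw posterior samples (here, directly from the known posterior).
rng = random.Random(0)
theta_samples = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(20000)]

y_new = 1.5
mc_estimate = log_posterior_predictive(y_new, theta_samples, noise_std)

# Closed-form posterior predictive for this model: N(post_mean, post_var + noise_std^2).
exact = normal_logpdf(y_new, post_mean, math.sqrt(post_var + noise_std**2))
print(f"Monte Carlo: {mc_estimate:.4f}, exact: {exact:.4f}")
```

Note that the average is taken over likelihoods, not log-likelihoods: averaging $\log p(\mathcal{D}^* | \theta_s)$ directly would estimate a different quantity, which by Jensen's inequality is a lower bound on the log posterior predictive.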

## The Geometry of Joint Distributions

**Bayes' Rule $\rightarrow$ Cross-Sections.** TODO

**Law of Total Probability $\rightarrow$ Sum of Cross-Sections.** TODO

## Laws of Conditional Independence

Recall that the slope and intercept of a model may have been sampled independently under the prior, but as soon as we condition on data, they are no longer independent. 
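
One way to see this concretely is with a model where the posterior covariance has a closed form. The sketch below assumes a linear-Gaussian regression, $y_n = \text{intercept} + \text{slope} \cdot x_n + \epsilon_n$, with independent zero-mean Normal priors on the intercept and slope and known noise scale. This specific model, the function name `posterior_correlation`, and the input values are assumptions made for illustration. For this model, the posterior precision matrix is the prior precision plus $X^\top X / \sigma^2$, and its off-diagonal entry is what couples the two parameters.

```python
import math


def posterior_correlation(xs, prior_std, noise_std):
    """Posterior correlation between intercept and slope in Bayesian linear regression.

    Assumes y = intercept + slope * x + Normal(0, noise_std^2) noise, with
    independent Normal(0, prior_std^2) priors on the intercept and the slope.
    """
    # Entries of the 2x2 posterior precision matrix [[a, b], [b, d]]:
    # prior precision (diagonal) plus X^T X / noise_std^2.
    a = 1.0 / prior_std**2 + len(xs) / noise_std**2
    b = sum(xs) / noise_std**2
    d = 1.0 / prior_std**2 + sum(x * x for x in xs) / noise_std**2

    # Invert the 2x2 precision matrix to obtain the posterior covariance.
    det = a * d - b * b
    var_intercept, var_slope, cov = d / det, a / det, -b / det
    return cov / math.sqrt(var_intercept * var_slope)


# With no data, the posterior equals the prior: intercept and slope are independent.
print("prior correlation:    ", posterior_correlation([], prior_std=10.0, noise_std=1.0))

# After conditioning on observations at these (made-up) inputs, they become correlated.
print("posterior correlation:", posterior_correlation([1.0, 2.0, 3.0, 4.0, 5.0], prior_std=10.0, noise_std=1.0))
```

With all inputs positive, the posterior correlation is negative: to keep fitting the same data, a larger intercept has to be compensated by a smaller slope. In this Gaussian setting, the covariance depends only on the inputs $x_n$, not on the observed $y_n$.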

## Deriving the Posterior Predictive

````{admonition} Exercise: Derive the Posterior Predictive Distribution

For each of the models below, draw the directed graphical model, and then derive the posterior predictive formula.

**Part 1:** Bayesian predictive model. 
\begin{align}
\theta &\sim p_\theta(\cdot) \\
y_n | x_n, \theta &\sim p_{Y | X}(\cdot | x_n, \theta)
\end{align}

**Part 2:** Bayesian Concept-Bottleneck model (CBM).
\begin{align}
\theta &\sim p_\theta(\cdot) \\
\phi &\sim p_\phi(\cdot) \\
c_n | x_n, \theta &\sim p_{C | X}(\cdot | x_n, \theta) = \mathrm{Cat}(\pi(x_n; \theta)) \\
y_n | c_n, \phi &\sim p_{Y | C}(\cdot | c_n, \phi),
\end{align}
where $\pi(\cdot; \theta)$ is a function that maps $x_n$ to the parameters of a categorical distribution. CBMs aim to make it easier to interpret model predictions. They do this by combining two models:
1. CBMs first learn to predict "concepts" $c_n$ associated with the input $x_n$. In a CBM, a concept is just a discrete attribute associated with the input; for example, if $x_n$ is an image of wildlife, a concept could be rain, grass, dog, etc. You can think of $p_{C | X}$ as a classifier.
2. After having predicted the concept $c_n$ from the input $x_n$, CBMs attempt to predict the final output $y_n$ from the concept only. In this way, predictions of $y_n$ can be analyzed in terms of the concepts, which are easier to understand, instead of with respect to the inputs, which could be high dimensional and difficult to reason about.
````

## Model Evaluation using the Posterior Predictive