# Bayesian Inference: Prior and Posterior

In [1]:
# Import some helper functions (please ignore this!)
from utils import *

**Context:** If there's one thing we learned from the chapter on model selection and evaluation, it's that we should not blindly trust our models. Models are complicated and require a robust and diverse toolkit for responsible evaluation in their intended context. For safety-critical applications of ML, like the ones from the IHH, we must take additional precautions to ensure responsible use. We therefore adopt the following philosophy:
1. **Finite information $\rightarrow$ uncertainty.** We're often asked to make decisions without all the information necessary for certainty. We ask the same of our models: given a finite data set, we ask models to make predictions for patients they have never encountered. Therefore, for responsible use in safety-critical contexts, our models must have some way of quantifying the limits of their "knowledge."
2. **Not making choices $\rightarrow$ a choice will be made for you.** If we avoid making explicit choices in the design of our model, a choice will still be made for us---and it might not be the choice we want. For example, without explicitly choosing what's important to us, we might get a model with the highest accuracy for a task for which minimizing false negatives is most important. *It's therefore better to make your choices explicitly.*

**Challenge:** To satisfy our new modeling philosophy, we need (1) a way to quantify uncertainty, and (2) a way to understand how uncertainty depends on our modeling choices. How can we do that with the tools we have? As we show here, we can't. We will then introduce a new way of fitting ML models called Bayesian inference.

**Outline:** 
* Motivate the need for uncertainty
* Introduce a new modeling paradigm based on Bayes' rule
* Provide intuition for this modeling paradigm
* Implement this modeling paradigm in `NumPyro` 

## Why We Need Uncertainty

**The MLE is Over-Confident.** In safety-critical contexts, like those from the IHH, it's important that our ML models don't just fit the observed data well; they should also communicate with us the limits of their "knowledge." Let's illustrate what we mean. Consider the regression data below:

TODO: figure with in-between and OOD uncertainty. 

As you see in the figure above, we don't have data for some segments of the population of interest. 

TODO: explain why this is bad in the context of the task. 

TODO: give example with cats and COVID

Ideally, our learning algorithm would give us options; it would give us several models that all fit the data reasonably well, but behave differently away from the data. Given these options, perhaps we could devise some algorithm to select the one we would like to use on our downstream task, or find some way to *combine* them. Unfortunately, the learning algorithm we've used so far, the MLE, doesn't provide us with a way to do this. The MLE gives a *single* model. 

**Ensembling.** One way to solve this issue is by relying on the imperfections of the optimizer. Remember that, especially for more expressive models, optimization tends to get stuck in local optima. What if we were to collect an *ensemble* of models, all fit with the MLE to data, but each optimized from a different random initialization? Because each model may get stuck in a different local optimum, each *might* behave differently than the others away from the data. What's nice about this approach is that it's easy to implement: we already have all the tools we need! Let's see what ensembling a neural network regression model looks like:

TODO figure of NN ensemble on above data
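
As a rough sketch of the idea, the snippet below uses a one-parameter model, $\mu(x; w) = \sin(w \cdot x)$, as a stand-in for a neural network. Its squared-error loss (which matches the MLE under Gaussian noise with fixed variance) has several local optima. The data, the model, the initialization range, and the step-size rule are all assumptions made for illustration; they are not taken from the chapter's experiments.

```python
import math
import random

# Hypothetical toy data: a few noiseless observations on [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.sin(2.0 * x) for x in xs]


def loss(w):
    """Mean squared error of the model mu(x; w) = sin(w * x)."""
    return sum((math.sin(w * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(w):
    """Derivative of the loss with respect to w."""
    return sum(
        2.0 * (math.sin(w * x) - y) * math.cos(w * x) * x
        for x, y in zip(xs, ys)
    ) / len(xs)


def fit(w_init, steps=500, lr=0.5):
    """Gradient descent that only accepts steps that lower the loss."""
    w = w_init
    for _ in range(steps):
        candidate = w - lr * grad(w)
        if loss(candidate) < loss(w):
            w = candidate
        else:
            lr *= 0.5  # step was too large; shrink it and try again
    return w


# Same model, same data, different random initializations.
rng = random.Random(0)
inits = [rng.uniform(0.0, 20.0) for _ in range(5)]
ensemble = [fit(w0) for w0 in inits]

for w0, w in zip(inits, ensemble):
    print(
        f"init={w0:6.2f}  fitted w={w:6.2f}  loss={loss(w):.4f}  "
        f"prediction at x=3: {math.sin(3.0 * w):+.2f}"
    )
```

Each member ends up wherever its initialization happens to lead, and the prediction at $x = 3$ (outside the observed range) is printed so the members can be compared away from the data. Nothing in the code states which of these fits we would actually prefer.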

While effective in practice, ensembling also has one big problem when it comes to safety-critical contexts. It makes implicit assumptions that are difficult to understand. Specifically, we relied on the imperfections of our black-box optimizer to find us a diverse set of models. What kind of models will the optimizer give us, however? Do these models have an *inductive bias* that is appropriate for our task? 

The need for explicit assumptions motivates us to find an alternative way of fitting our models to data, leading us to the *Bayesian approach*. 

## Fitting Models via Bayes' Rule

**The Bayesian Paradigm.** So let's go back to the drawing board and rethink how we've been fitting models this whole time. So far, our approach has been finding the *single* model that maximizes the probability of our observed data: $\theta^\text{MLE} = \mathrm{argmax}_\theta \log p(\mathcal{D}; \theta)$. But isn't what we're actually interested in the *distribution* of models given the data, $p(\theta | \mathcal{D})$? In other words, conditioned on the data we've observed so far, we want to know which models (represented by their parameters, $\theta$) are likely to fit the data well. In this new paradigm, we hope that:
1. $p(\theta | \mathcal{D})$ will capture a diversity of models with different inductive biases.
2. We can make our assumptions clear, and we can specify what type of inductive biases are appropriate for our task.

**Bayes' Rule.** But what is $p(\theta | \mathcal{D})$, exactly? How can we possibly write down a distribution of models that fit the data well by hand? Isn't the whole point that the machine will do the learning for us? To get around this problem, we will use *Bayes' rule* to write down $p(\theta | \mathcal{D})$ in terms of what we already know how to specify: the joint data likelihood, $p(\mathcal{D} | \theta)$. 

Let's derive Bayes' rule in general before applying it to our problem. Recall from the chapter on joint probability that a joint distribution over two random variables, $A$ and $B$, can be factorized as follows:
\begin{align}
p_{A, B}(a, b) &= p_{B | A}(b | a) \cdot p_A(a) \quad (\text{Option 1}) \\
&= p_{A | B}(a | b) \cdot p_B(b) \quad (\text{Option 2})
\end{align}
This means we can also equate the two factorizations:
\begin{align}
p_{B | A}(b | a) \cdot p_A(a) &= p_{A | B}(a | b) \cdot p_B(b)
\end{align}
Dividing both sides by $p_A(a)$, we get:
\begin{align}
p_{B | A}(b | a) &= \frac{p_{A | B}(a | b) \cdot p_B(b)}{p_A(a)} \quad \text{(Bayes' Rule)}
\end{align}
This is Bayes' rule. What's cool about it is that it relates $p_{B | A}(b | a)$ to $p_{A | B}(a | b)$. 
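
To make the derivation concrete, here is a small numerical check on a made-up joint distribution over two binary variables. The table values are hypothetical; the point is only that both sides of Bayes' rule agree when computed from the same joint.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(A=a, B=b), keyed by (a, b).
joint = {
    (0, 0): F(3, 10), (0, 1): F(1, 10),
    (1, 0): F(2, 10), (1, 1): F(4, 10),
}


def p_a(a):
    """Marginal p_A(a): sum the joint over b."""
    return sum(p for (ai, _), p in joint.items() if ai == a)


def p_b(b):
    """Marginal p_B(b): sum the joint over a."""
    return sum(p for (_, bi), p in joint.items() if bi == b)


def p_b_given_a(b, a):
    """Conditional from Option 1: joint divided by p_A(a)."""
    return joint[(a, b)] / p_a(a)


def p_a_given_b(a, b):
    """Conditional from Option 2: joint divided by p_B(b)."""
    return joint[(a, b)] / p_b(b)


a, b = 1, 1
lhs = p_b_given_a(b, a)                      # left-hand side of Bayes' rule
rhs = p_a_given_b(a, b) * p_b(b) / p_a(a)    # right-hand side of Bayes' rule
print(lhs, rhs)  # prints "2/3 2/3"
```

Exact fractions are used instead of floats so that the equality holds exactly rather than up to rounding.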

**Bayesian Inference.** Using Bayes' rule in the context of our problem, let's treat $\mathcal{D}$ *and* $\theta$ as random variables. We can now relate $p(\theta | \mathcal{D})$, which we don't know how to specify, to $p(\mathcal{D} | \theta)$, which we do know how to specify:
\begin{align}
\underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} &= \frac{\overbrace{p(\mathcal{D} | \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{normalizing const.}}}
\end{align}
When used as a model-fitting paradigm, each term in Bayes' rule has a special name. We'll now define each:
* **Likelihood:** This is the data joint likelihood, which we've previously maximized as part of the MLE.
    > For example, suppose we're fitting a linear regression model to predict an intergalactic being's glow given age. Our model is then:
    > \begin{align}
    p(\mathcal{D} | \theta) &= \prod\limits_{n=1}^N p(\mathcal{D}_n | \theta) \\
    &= \prod\limits_{n=1}^N p_{Y | X}(y_n | x_n, \theta) \\
    &= \prod\limits_{n=1}^N \mathcal{N}(y_n | \underbrace{\theta_0 + \theta_1 \cdot x_n}_{\mu(x_n; \theta)}, \sigma^2)
    \end{align}
    > where $\theta = \{ \theta_0, \theta_1 \}$ are the intercept and slope, and $\sigma^2$ is the observation noise variance (which we fix as a constant for now). 
* **Prior:** This is the distribution of models we're willing to consider *before having observed any data*. As we will show visually in a bit, the prior allows us to specify our model's *inductive bias*.
    > Continuing with the above example, we know that in general, glow decreases with age. We can encode this belief into the inductive bias of the model by selecting an appropriate prior distribution: one for which the slope, $\theta_1$, is likely negative. As an example, we could select $\mathcal{N}(-1, 0.1)$. In this way, $\theta_1$ is most likely to be near $-1$. We can similarly encode our belief about the intercept, $\theta_0$, saying we believe it should be positive: $\mathcal{N}(1, 0.1)$. 
    > Putting these together, we get the following prior distribution over our model parameters:
    > \begin{align}
    p_\theta(\cdot) = p_{\theta_1}(\cdot) \cdot p_{\theta_0}(\cdot) = \mathcal{N}(-1, 0.1) \cdot \mathcal{N}(1, 0.1)
    \end{align}
    As we will show later, in contrast to the ensembling approach, prior specification makes our assumptions about uncertainty explicit and easier to interrogate. 
* **Posterior:** This is the distribution of interest. It's called a posterior because it determines the distribution of likely models, $\theta$, *after having observed data*. As we will see in a bit, the posterior balances information from both the prior and the likelihood. 
* **Normalizing Constant:** This is a constant that turns the whole fraction into a valid probability density function (i.e. a function that integrates to 1). To compute $p(\mathcal{D})$, we integrate the numerator of Bayes' rule over the support of $\theta$:
    \begin{align}
    p(\mathcal{D}) &= \int\limits \underbrace{p(\mathcal{D} | \theta) \cdot p(\theta)}_{\text{numerator of Bayes' rule}} d\theta
    \end{align}
    In this way, when we divide by it, the whole fraction integrates to $1$. The formula for this normalizing constant is derived from the *[law of total probability](https://en.wikipedia.org/wiki/Law_of_total_probability)*; more on this in a later chapter. For now, we won't worry about how to actually compute this integral.
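
For a model with only two parameters, the four terms above can be made tangible by brute force: evaluate likelihood times prior on a grid of $(\theta_0, \theta_1)$ values and approximate the normalizing constant with a Riemann sum. The sketch below does this for the linear regression example. The data points, the noise variance, the grid bounds, and the reading of $\mathcal{N}(1, 0.1)$ and $\mathcal{N}(-1, 0.1)$ as (mean, variance) are all assumptions made for illustration. This approach does not scale beyond a handful of parameters.

```python
import math


def log_normal_pdf(v, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var


# Hypothetical (age, glow) observations and a fixed noise variance sigma^2.
xs = [0.1, 0.4, 0.7, 0.9]
ys = [0.95, 0.55, 0.35, 0.05]
noise_var = 0.05


def log_likelihood(theta_0, theta_1):
    """log p(D | theta): sum of per-point Gaussian log densities."""
    return sum(
        log_normal_pdf(y, theta_0 + theta_1 * x, noise_var)
        for x, y in zip(xs, ys)
    )


def log_prior(theta_0, theta_1):
    """log p(theta): independent normals, intercept near 1 and slope near -1."""
    return log_normal_pdf(theta_0, 1.0, 0.1) + log_normal_pdf(theta_1, -1.0, 0.1)


# Grid over the region where the prior and likelihood place their mass.
n = 81
theta_0_grid = [-1.0 + 4.0 * i / (n - 1) for i in range(n)]  # [-1, 3]
theta_1_grid = [-3.0 + 4.0 * j / (n - 1) for j in range(n)]  # [-3, 1]
cell = (theta_0_grid[1] - theta_0_grid[0]) * (theta_1_grid[1] - theta_1_grid[0])

# Numerator of Bayes' rule at every grid point.
numerator = [
    [math.exp(log_likelihood(t0, t1) + log_prior(t0, t1)) for t1 in theta_1_grid]
    for t0 in theta_0_grid
]

# Normalizing constant p(D), approximated by a Riemann sum over the grid.
evidence = sum(sum(row) for row in numerator) * cell

# Posterior density on the grid.
posterior = [[v / evidence for v in row] for row in numerator]

# Posterior mean of the slope, as one summary of the result.
post_mean_slope = sum(
    posterior[i][j] * theta_1_grid[j] for i in range(n) for j in range(n)
) * cell
print(f"approximate p(D) = {evidence:.4g}")
print(f"approximate posterior mean of slope = {post_mean_slope:.3f}")
```

Dividing by `evidence` is what makes the gridded posterior sum (times the cell area) to $1$, mirroring the role of $p(\mathcal{D})$ in the formula.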

Next, let's put everything together to get our new modeling paradigm.

## The Bayesian Modeling Paradigm

Thanks to Bayes' rule, we now have a way to formally answer the question: which models are likely given the observed data? The answer is the posterior, $p(\theta | \mathcal{D})$. What's exciting about this is that the posterior depends on the likelihood, which we have experience with from previous chapters. Let's put everything together to obtain our new modeling paradigm.
1. **Define a Bayesian Model.** First, we have to update our generative process. Whereas before, our generative process consisted of all the steps necessary to sample from $p(\mathcal{D} | \theta)$, our process will now sample from the joint distribution, $p(\mathcal{D}, \theta)$:
    \begin{align}
       p(\mathcal{D}, \theta) &= \underbrace{p(\mathcal{D} | \theta)}_{\text{like before}} \cdot \underbrace{p(\theta)}_{\text{new}}.
    \end{align}
    As you can see, our Bayesian generative process is the same as the one before, but it includes one additional step: sampling from the prior. 
    > Continuing with the Bayesian linear regression example, our generative process is:
    > \begin{align}
    \theta_0 &\sim p_{\theta_0}(\cdot) = \mathcal{N}(1, 0.1) \quad (\text{prior}) \\
    \theta_1 &\sim p_{\theta_1}(\cdot) = \mathcal{N}(-1, 0.1) \quad (\text{prior}) \\
    y_n | x_n, \theta_0, \theta_1 &\sim p_{Y | X, \theta_0, \theta_1}(\cdot | x_n, \theta_0, \theta_1) = \mathcal{N}(y_n | \underbrace{\theta_0 + \theta_1 \cdot x_n}_{\mu(x_n; \theta)}, \sigma^2) \quad (\text{likelihood})
    \end{align}
2. **Perform Bayesian Inference (i.e. model-fitting).** Having specified our Bayesian model, we can use Bayes' rule to obtain a formula for the posterior, $p(\theta | \mathcal{D})$. But what do we do with it? Since it's a distribution, we can *sample* from it. That is, we can draw $\theta \sim p(\theta | \mathcal{D})$ and visualize the models corresponding to each draw to plot our modeling uncertainty. The process of sampling from $p(\theta | \mathcal{D})$ is called *Bayesian inference.* In the next chapter, we will talk about how to quantitatively evaluate a Bayesian model's fit. 
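
The generative process from step 1 can be sketched in plain Python: first draw the parameters from the prior, then draw the observations from the likelihood given those parameters. The input ages, the noise standard deviation, and the reading of $\mathcal{N}(1, 0.1)$ and $\mathcal{N}(-1, 0.1)$ as (mean, variance) are assumptions made for illustration. Note that this samples from the joint, $p(\mathcal{D}, \theta)$, not from the posterior.

```python
import random

PRIOR_VAR = 0.1   # assumption: the priors are written as N(mean, variance)
NOISE_STD = 0.2   # hypothetical value for the fixed observation noise sigma


def sample_from_joint(xs, rng):
    """Draw one (theta, y_1..y_N) pair from p(D, theta) = p(D | theta) * p(theta)."""
    # Step 1 (new): sample the parameters from the prior.
    theta_0 = rng.gauss(1.0, PRIOR_VAR ** 0.5)    # intercept
    theta_1 = rng.gauss(-1.0, PRIOR_VAR ** 0.5)   # slope
    # Step 2 (like before): sample each observation from the likelihood.
    ys = [rng.gauss(theta_0 + theta_1 * x, NOISE_STD) for x in xs]
    return (theta_0, theta_1), ys


xs = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical ages
rng = random.Random(0)
for _ in range(3):
    (theta_0, theta_1), ys = sample_from_joint(xs, rng)
    print(f"intercept={theta_0:+.2f}  slope={theta_1:+.2f}  "
          f"glow={[round(y, 2) for y in ys]}")
```

Each call produces a different line and a different synthetic data set, which is one way to inspect what the prior implies before any real data is observed.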

**Illustration: Bayesian Regression.** Let's see what the Bayesian modeling paradigm looks like for regression, visually. Our generative process is:
\begin{align}
\theta &\sim p_\theta(\cdot) \quad (\text{prior}) \\
y_n | x_n, \theta &\sim p_{Y | X}(\cdot | x_n, \theta) = \mathcal{N}(\mu(x_n; \theta), \sigma^2) \quad (\text{likelihood})
\end{align}
Here, $\sigma$ is a constant (so we can ignore it). We've picked an expressive function, $\mu(x_n; \theta)$, that will be fun to visualize---its details aren't important. 

Given our generative process, our goal is to sample the posterior,
\begin{align}
p(\theta | \mathcal{D}) &= p(\theta | x_1, \dots, x_N, y_1, \dots, y_N).
\end{align}
For intuition, we can visualize posterior samples $\theta \sim p(\theta | \mathcal{D})$ by plotting the *functions* they represent, $\mu(x_n; \theta)$. The plot below shows samples from the posterior as the number of points, $N$, increases. 

```{figure} _static/figs/example_online_bayesian_regression.png
---
width: 100%
name: bayesian-update-example
align: center
---

Samples from the posterior of a Bayesian regression model.
```

In the above plot, $N = 0$ represents our *prior*. The functions drawn from our prior illustrate our beliefs about which functions are appropriate for the data. In this specific case, our prior functions don't exhibit any strong trends; overall, the functions don't increase/decrease as age increases---they just wiggle about. However, the functions are incredibly smooth---another prior may have drawn more jagged functions. Whether this prior is appropriate for our task is up to you to decide. 

Next, we see what happens as we start observing data. As $N$ increases, you can see our prior distribution getting "filtered out" by the likelihood. By this, we mean that our posterior will sample functions that are plausible under the prior *and* assign high likelihood to the data. It therefore keeps samples from the prior that also go *through the data* to ensure the likelihood is high. As you can see, in regions of the input space near our observed data, the posterior is quite certain about the trend; it knows the function must pass close to the observed data. But as we move away from the observed data, the posterior maintains a diversity of possible functions. 
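
The same qualitative behavior can be checked numerically in a much simpler setting than the one plotted above. For the linear model $\mu(x; \theta) = \theta_0 + \theta_1 \cdot x$ with independent normal priors and a fixed noise variance, the posterior over $\theta$ is Gaussian with covariance $(\Sigma_0^{-1} + \sigma^{-2} \Phi^\top \Phi)^{-1}$, where row $n$ of $\Phi$ is $(1, x_n)$. The sketch below uses this standard closed form to track the posterior variance of $\mu(x; \theta)$ near and far from the data as $N$ grows. The input locations, the noise variance, and the prior variance are assumptions made for illustration, and this is not the expressive model used in the figure.

```python
def posterior_cov(xs, noise_var=0.01, prior_var=0.1):
    """Posterior covariance of (theta_0, theta_1) for Bayesian linear regression.

    Precision = prior precision + data precision; the 2x2 matrix is inverted
    by hand. With no data (xs empty) this reduces to the prior covariance.
    """
    a = 1.0 / prior_var + len(xs) / noise_var              # entry (0, 0)
    b = sum(xs) / noise_var                                # entries (0, 1), (1, 0)
    d = 1.0 / prior_var + sum(x * x for x in xs) / noise_var  # entry (1, 1)
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]


def mean_fn_var(x, cov):
    """Variance of mu(x; theta) = theta_0 + theta_1 * x under the given covariance."""
    return cov[0][0] + 2.0 * x * cov[0][1] + x * x * cov[1][1]


all_xs = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical inputs, all inside [0, 1]
x_near, x_far = 0.5, 5.0              # one query inside the data range, one far outside

near_vars, far_vars = [], []
for n_obs in [0, 1, 3, 5]:
    cov = posterior_cov(all_xs[:n_obs])
    near_vars.append(mean_fn_var(x_near, cov))
    far_vars.append(mean_fn_var(x_far, cov))
    print(f"N={n_obs}: var at x={x_near} is {near_vars[-1]:.4f}, "
          f"var at x={x_far} is {far_vars[-1]:.4f}")
```

In this linear-Gaussian case the covariance does not depend on the observed $y_n$ values, only on where the inputs are, which is why the code never needs them.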

**Computational Efficiency.** Unfortunately for us, for most models, exact Bayesian inference is *intractable*, meaning there is no known efficient algorithm for sampling exactly from the posterior. As a result, we will have to resort to approximations. This is the main drawback of Bayesian inference. Approximate Bayesian inference is fascinating, but unfortunately, we will not get to study it here. We will, however, learn how to use some approximate inference algorithms. 

## Bayesian Inference in `NumPyro`

**Convergence of Posterior as $N \rightarrow \infty$.**

**Dependence of Uncertainty on Prior Assumptions.**