# Bayesian Inference: Theory

In [1]:
# Import some helper functions (please ignore this!)
from utils import *

**Context:** 

**Challenge:**

**Outline:**

## Why We Need Uncertainty

**The MLE is Over-Confident.** In safety-critical contexts, like those from the IHH, it's important that our ML models don't just fit the observed data well; they should also communicate to us the limits of their "knowledge." Let's illustrate what we mean. Consider the regression data below:

TODO: figure with in-between and OOD uncertainty. 

As you see in the figure above, we don't have data for some segments of the population of interest. 

TODO: explain why this is bad in the context of the task. 

TODO: give example with cats and COVID

Ideally, our learning algorithm would give us options: it would give us several models that all fit the data reasonably well but behave differently away from the data. Given these options, perhaps we could devise some algorithm to select the one we would like to use on our downstream task, or find some way to *combine* them. Unfortunately, the learning algorithm we've used so far, the MLE, doesn't provide us with a way to do this. The MLE gives a *single* model. 

**Ensembling.** One way to solve this issue is by relying on the imperfections of the optimizer. Remember that, especially for more expressive models, optimization tends to get stuck in local optima. What if we were to collect an *ensemble* of models, all fit with the MLE to the data, but each optimized from a different random initialization? Because each model would get stuck in a different local optimum, each *might* behave differently from the others away from the data. What's nice about this approach is that it's easy to implement: we already have all the tools we need! Let's see what ensembling a neural network regression model looks like:

TODO figure of NN ensemble on above data
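
To make the recipe concrete, here is a minimal sketch of an ensemble in plain NumPy. Everything specific in it is an assumption made for illustration: the synthetic data (with a gap around $x = 0$), the one-hidden-layer `tanh` network, the hypothetical helper names `init_params`, `predict`, and `fit_mle`, and the optimizer settings. Each member minimizes squared error, which is the MLE under a Gaussian likelihood with fixed noise, and members differ only in their random seed.

```python
import numpy as np

def init_params(rng, hidden=16):
    # Random initialization: the only thing that differs between ensemble members.
    return {
        "W1": rng.normal(0.0, 1.0, size=(1, hidden)),
        "b1": rng.normal(0.0, 1.0, size=hidden),
        "W2": rng.normal(0.0, 1.0, size=(hidden, 1)) / np.sqrt(hidden),
        "b2": np.zeros(1),
    }

def predict(p, x):
    # One hidden tanh layer followed by a linear output.
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h @ p["W2"] + p["b2"]

def fit_mle(x, y, seed, steps=3000, lr=0.01):
    # Full-batch gradient descent on mean squared error
    # (equivalent to the MLE under Gaussian noise with fixed variance).
    rng = np.random.default_rng(seed)
    p = init_params(rng)
    n = x.shape[0]
    for _ in range(steps):
        h = np.tanh(x @ p["W1"] + p["b1"])
        g_out = 2.0 * (h @ p["W2"] + p["b2"] - y) / n      # d(loss)/d(output)
        g_h = (g_out @ p["W2"].T) * (1.0 - h**2)           # backprop through tanh
        grads = {"W1": x.T @ g_h, "b1": g_h.sum(axis=0),
                 "W2": h.T @ g_out, "b2": g_out.sum(axis=0)}
        for k in p:
            p[k] -= lr * grads[k]
    return p

# Hypothetical data: observations on [-3, -1] and [1, 3], nothing in between.
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.uniform(-3, -1, 20), rng.uniform(1, 3, 20)])[:, None]
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, size=x_train.shape)

# Same data, same model, same objective; only the seed changes.
ensemble = [fit_mle(x_train, y_train, seed=s) for s in range(5)]

# Query a point inside the data, one in the gap, and one far outside.
x_query = np.array([[2.0], [0.0], [6.0]])
preds = np.stack([predict(p, x_query)[:, 0] for p in ensemble])  # shape (5, 3)
print("ensemble mean:", preds.mean(axis=0))
print("ensemble std: ", preds.std(axis=0))
```

The standard deviation across members is a rough measure of how much the models disagree at each query point. Nothing in the code controls *how* they disagree, which is exactly the concern with this approach.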

While effective in practice, ensembling also has one big problem when it comes to safety-critical contexts. It makes implicit assumptions that are difficult to understand. Specifically, we relied on the imperfections of our black-box optimizer to find us a diverse set of models. What kind of models will the optimizer give us, however? Do these models have an *inductive bias* that is appropriate for our task? 

The need for explicit assumptions motivates us to find an alternative way of fitting our models to data, leading us to the *Bayesian approach*. 

## Fitting Models via Bayes' Rule

**The Bayesian Paradigm.** So let's go back to the drawing board and rethink how we've been fitting models this whole time. So far, our approach has been finding the *single* model that maximizes the probability of our observed data: $\theta^\text{MLE} = \mathrm{argmax}_\theta \log p(\mathcal{D}; \theta)$. But aren't we actually interested in the *distribution* of models given the data, $p(\theta | \mathcal{D})$? In other words, conditioned on the data we've observed so far, we want to know which models (represented by their parameters, $\theta$) are likely to fit the data well. In this new paradigm, we hope that:
1. $p(\theta | \mathcal{D})$ will capture a diversity of models with different inductive biases.
2. We can make our assumptions clear, and we can specify what type of inductive biases are appropriate for our task.

**Bayes' Rule.** But what is $p(\theta | \mathcal{D})$, exactly? How can we possibly write down a distribution of models that fit the data well by hand? Isn't the whole point that the machine will do the learning for us? To get around this problem, we will use *Bayes' rule* to write down $p(\theta | \mathcal{D})$ in terms of what we already know how to specify: $p(\mathcal{D} | \theta)$.

Recall from the chapter on joint probability that a joint distribution over two random variables, $A$ and $B$, can be factorized as follows:
\begin{align}
p_{A, B}(a, b) &= p_{B | A}(b | a) \cdot p_A(a) \quad (\text{Option 1}) \\
&= p_{A | B}(a | b) \cdot p_B(b) \quad (\text{Option 2})
\end{align}
This means we can also equate the two factorizations:
\begin{align}
p_{B | A}(b | a) \cdot p_A(a) &= p_{A | B}(a | b) \cdot p_B(b)
\end{align}
Dividing both sides by $p_A(a)$ (assuming $p_A(a) > 0$), we get:
\begin{align}
p_{B | A}(b | a) &= \frac{p_{A | B}(a | b) \cdot p_B(b)}{p_A(a)} \quad \text{(Bayes' Rule)}
\end{align}
This is Bayes' rule. What's cool about it is that it relates $p_{B | A}(b | a)$ to $p_{A | B}(a | b)$. 
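
As a quick sanity check, we can verify the identity numerically on a small discrete example. The joint table below is made up for illustration; rows index values of $A$ and columns index values of $B$.

```python
import numpy as np

# Hypothetical joint distribution p_{A,B}(a, b); rows index a, columns index b.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_a = joint.sum(axis=1)  # marginal p_A(a)
p_b = joint.sum(axis=0)  # marginal p_B(b)

# Conditionals read off directly from the joint; entry [a, b] in both tables.
p_b_given_a = joint / p_a[:, None]  # p_{B|A}(b | a)
p_a_given_b = joint / p_b[None, :]  # p_{A|B}(a | b)

# Right-hand side of Bayes' rule: p(a | b) * p(b) / p(a).
bayes_rhs = p_a_given_b * p_b[None, :] / p_a[:, None]

print(np.allclose(p_b_given_a, bayes_rhs))  # True
```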

**Bayesian Inference.** Using Bayes' rule in the context of our problem, let's treat $\mathcal{D}$ *and* $\theta$ as random variables. We can now relate $p(\theta | \mathcal{D})$, which we don't know how to specify, to $p(\mathcal{D} | \theta)$, which we do know how to specify:
\begin{align}
\underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} &= \frac{\overbrace{p(\mathcal{D} | \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{Z}_{\text{normalizing const.}}}
\end{align}
When used as a model-fitting paradigm, each term in Bayes' rule has a special name. We'll now define each:
* **Likelihood:** This is the joint likelihood of the data, which we've previously maximized as part of the MLE.
* **Prior:** This is the distribution of models we're willing to consider *before having observed any data*. As we will show visually in a bit, this distribution determines our *inductive bias*.
* **Posterior:** This is the distribution of interest. It's called a posterior because it determines the distribution of likely models, $\theta$, *after having observed data*.
* **Normalizing Constant**: This is a constant that turns the whole fraction into a valid probability density function (i.e. a function that integrates to 1). To compute $Z$, we integrate the numerator of Bayes' rule over the support of $\theta$:
    \begin{align}
    Z &= \int\limits \underbrace{p(\mathcal{D} | \theta) \cdot p(\theta)}_{\text{numerator of Bayes' rule}} d\theta
    \end{align}
    In this way, when we divide by it, the whole fraction integrates to $1$.
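
For a model with a single parameter, we can approximate every term above on a grid. The sketch below is an illustration under assumed choices that are not part of this chapter: three made-up observations, a Gaussian likelihood with mean $\theta$ and unit noise, and a Gaussian prior with standard deviation 2. The integral for $Z$ is replaced by a Riemann sum over the grid.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

data = np.array([1.2, 0.7, 1.9])          # hypothetical observations
theta = np.linspace(-10.0, 10.0, 4001)    # grid over the support of theta
dtheta = theta[1] - theta[0]

# Prior p(theta): assumed N(0, 2^2).
prior = gaussian_pdf(theta, 0.0, 2.0)

# Likelihood p(D | theta): product over i.i.d. observations, assumed N(theta, 1).
likelihood = np.prod(gaussian_pdf(data[:, None], theta[None, :], 1.0), axis=0)

# Numerator of Bayes' rule, then Z as a Riemann-sum approximation of the integral.
numerator = likelihood * prior
Z = np.sum(numerator) * dtheta

posterior = numerator / Z
print("Z =", Z)
print("posterior integrates to", np.sum(posterior) * dtheta)
```

Grids like this only work when $\theta$ has very few dimensions, since the number of grid points grows exponentially with the dimension of $\theta$.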

Now that we have a formula for the posterior, $p(\theta | \mathcal{D})$, what do we do with it? Since it's a distribution, we can *sample* from it. That is, we can draw $\theta \sim p(\theta | \mathcal{D})$ and visualize the models corresponding to each draw to plot our modeling uncertainty. The process of sampling from $p(\theta | \mathcal{D})$ is called *Bayesian inference.* In a bit, we'll also talk about how to quantitatively evaluate the fit of Bayesian models.
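
Here is a minimal sketch of what drawing $\theta \sim p(\theta | \mathcal{D})$ can look like. It assumes a setting in which the posterior happens to be available in closed form: linear regression with $\theta = (\text{intercept}, \text{slope})$, Gaussian noise with known standard deviation, and a zero-mean Gaussian prior. The data, the noise and prior scales, and the helper name `posterior_params` are all made up for illustration.

```python
import numpy as np

def posterior_params(x, y, noise_std=0.3, prior_std=1.0):
    """Gaussian posterior over theta = (intercept, slope) for linear regression,
    assuming known noise_std and a N(0, prior_std^2 I) prior."""
    Phi = np.column_stack([np.ones_like(x), x])                      # (n, 2)
    precision = Phi.T @ Phi / noise_std**2 + np.eye(2) / prior_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_std**2
    return mean, cov

# Hypothetical data observed only on [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)

mean, cov = posterior_params(x, y)

# Each row is one draw theta ~ p(theta | D), i.e. one plausible line.
thetas = rng.multivariate_normal(mean, cov, size=200)                # (200, 2)

# Evaluate every sampled line near the data (x = 0) and far from it (x = 5).
x_query = np.array([0.0, 5.0])
Phi_query = np.column_stack([np.ones_like(x_query), x_query])
lines = thetas @ Phi_query.T                                         # (200, 2)
spread = lines.std(axis=0)
print("spread of sampled models at x=0 and x=5:", spread)
```

In this setup the sampled lines agree closely where we have data and fan out away from it, which is the kind of modeling uncertainty we would then plot.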

**Choice of Prior.** In comparison to our first model-fitting paradigm, the MLE, the Bayesian approach requires us to specify one more thing: the prior, $p(\theta)$. How do we pick a good prior? Generally, we have to think about the model's inductive bias. What types of models do we consider *realistic* for our setting? We will next illustrate the whole Bayesian paradigm visually to gain a better sense of what the prior means, and how it interacts with the likelihood to get the posterior. 

## Bayesian Modeling: Intuition

**Example: Bayesian Regression**. We'll now instantiate the Bayesian modeling paradigm for regression in one dimension to gain some intuition. 

TODO picture of prior

TODO picture of posterior after 1 data, after 2 data, etc.
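
Until the figures are in place, the following sketch gives a numerical version of the same progression. It assumes a linear model with a zero-mean Gaussian prior and Gaussian noise of known scale, so the posterior stays Gaussian and can be updated one observation at a time: the posterior after $n$ points acts as the prior for point $n + 1$. The data and the scales `noise_std` and `prior_std` are hypothetical.

```python
import numpy as np

noise_std, prior_std = 0.3, 1.0            # assumed known
precision = np.eye(2) / prior_std**2       # prior precision over (intercept, slope)
shift = np.zeros(2)                        # precision @ mean; zero for a zero-mean prior

# Hypothetical observations, revealed one at a time.
rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 1.0, size=5)
ys = 0.5 + 2.0 * xs + rng.normal(0.0, noise_std, size=5)

# Track the posterior standard deviation of the slope, starting from the prior.
slope_stds = [np.sqrt(np.linalg.inv(precision)[1, 1])]
for n, (x_n, y_n) in enumerate(zip(xs, ys), start=1):
    phi = np.array([1.0, x_n])                           # features of the new point
    precision = precision + np.outer(phi, phi) / noise_std**2
    shift = shift + phi * y_n / noise_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ shift                                   # posterior mean after n points
    slope_stds.append(np.sqrt(cov[1, 1]))
    print(f"after {n} data: mean = {mean}, slope std = {slope_stds[-1]:.3f}")
```

Under these assumptions each new observation can only add precision, so the uncertainty about the slope never grows as data arrives.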

**Relating Modeling Uncertainty to Everyday Life.**

## Deriving the Predictive Distribution

**Overview.** TODO
* Our goal isn't just to learn the parameters; our goal is to determine, given observed data points, what's the probability of new data? State goal mathematically.
* Bayesian inference (sampling from the posterior) will actually help us accomplish that.
* To derive the distribution of new data given old data, we'll have to introduce a few more facts about directed graphical models.
* We'll instantiate everything here for regression, but the general principles apply

**Representing Unobserved Variables in DGMs.**
* TODO show for regression, circle instead of dot

**Representing the Joint Distribution of Training and Test Data in a DGM.**

**Laws of Conditional Independence.**

**Derivation.**

````{admonition} Exercise: Deriving the Posterior Predictive Distribution
TODO
````