# Probabilistic Machine Learning

By: Chengyi (Jeff) Chen

In [3]:
%load_ext autotime
%load_ext nb_black

import torch
import pyro
import pyro.distributions as dist


---
## Introduction

The purpose of these notes is to connect ideas across the frequentist, Bayesian, and probabilistic machine learning vernaculars, e.g. how (1) frequentist maximum likelihood estimation relates to (2) partially Bayesian maximum a posteriori estimation and (3) full Bayesian inference. I'm in no way an expert on the philosophical and practical differences between the frequentist and Bayesian perspectives, nor am I close to being good at mathematics -- here's just what I've gathered from my readings, subject to my own interpretation. Throughout, I'll be drawing ideas from computer programming as well, specifically notes on [Uber's Pyro PPL](http://pyro.ai/examples/intro_long.html). Starting from first principles, we ask: **"What are we even trying to do in machine learning?"**

### Setup

Before we distinguish between supervised, unsupervised, semi-supervised learning, here's the general probabilistic machine learning setting:

We are given a set of observed training data $\mathbf{X}_{\text{train}} = \{ \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_N \}$, treated as independent samples generated from a true data distribution $f(\mathcal{X})$, where $\mathbf{x} \in \mathcal{X}$ (the set of observable data values).

### Objective

We specify a probabilistic model of the form / factorization structure

\begin{align}
    p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta) &= p(\mathcal{X} \vert \mathcal{Z} ; \Theta = \theta) p(\mathcal{Z} ; \Theta = \theta) \\
\end{align}

which we fit to $\mathbf{X}_{\text{train}}$ so that it approximates $f(\mathcal{X})$. Here $\mathbf{z} \in \mathcal{Z}$ are latent / unobserved random variables, included because we make no assumption that the observable dataset $\mathbf{X}_{\text{train}}$ contains all information about the system. This probabilistic model is often called the **complete data likelihood**.

\begin{align}
    p(\mathcal{X} ; \Theta = \theta) \\
\end{align}

is then called the **incomplete data likelihood** / **evidence** / **marginal likelihood** (because we marginalized out $\mathcal{Z}$ to keep only $\mathcal{X}$). $\Theta = \theta$ are fixed parameters ("$;$" is used instead of "$\vert$" when conditioning on $\theta$ to indicate that it is a "frequentist" fixed parameter and not a "Bayesian" random variable). Furthermore, it's called a likelihood function because it is viewed as a function of $\Theta = \theta$, the thing we're conditioning on, rather than of $\mathcal{X}$ (fixed because it's the data provided) or $\mathcal{Z}$ (unobserved).
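To make the two likelihoods concrete, here is a minimal sketch for a two-component Gaussian mixture with a discrete latent $z \in \{0, 1\}$. The mixture itself, the parameter values in `theta`, and the helper names are assumptions chosen for illustration; they are not part of the notes above.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical fixed parameters theta of a two-component Gaussian mixture.
theta = {"weights": [0.3, 0.7], "means": [-2.0, 1.0], "stds": [1.0, 0.5]}


def complete_data_likelihood(x, z, theta):
    """p(X = x, Z = z; theta) = p(X = x | Z = z; theta) * p(Z = z; theta)."""
    return normal_pdf(x, theta["means"][z], theta["stds"][z]) * theta["weights"][z]


def incomplete_data_likelihood(x, theta):
    """p(X = x; theta): the joint summed over every value of the latent z."""
    n_components = len(theta["weights"])
    return sum(complete_data_likelihood(x, z, theta) for z in range(n_components))


x = 0.5
joint_values = [complete_data_likelihood(x, z, theta) for z in (0, 1)]
evidence = incomplete_data_likelihood(x, theta)
print("joint per z:", joint_values)
print("evidence   :", evidence)
```

Because $z$ is discrete here, the marginalization is a two-term sum; with a continuous latent it would be an integral instead.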

Learning such a probabilistic model has [2 primary objectives](https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation):

#### Objective 1

Draw conclusions about the posterior distribution of our latent variables $\mathcal{Z}$:

\begin{align}
    p(\mathcal{Z} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta) 
    &= \frac{p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z}; \Theta = \theta)}{\int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z}} \\
\end{align}

#### Objective 2

Make predictions for new data, which we can do with the posterior predictive distribution:

\begin{align}
    p(\mathcal{X} = \mathbf{X}_{\text{test}} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta) &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{test}} \vert \mathcal{Z} = \mathbf{z}; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} \vert \mathcal{X} = \mathbf{X}_{\text{train}}; \Theta = \theta)  d\mathbf{z} \\
\end{align}
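Both objectives can be computed exactly when the latent variable is discrete. The sketch below assumes a hypothetical model in which a coin's unknown bias $\mathcal{Z}$ takes one of three values and the prior weights play the role of the fixed $\theta$; the data and names are made up for illustration.

```python
# Hypothetical model: a coin whose unknown bias Z takes one of three values.
bias_values = [0.2, 0.5, 0.8]            # support of the latent variable Z
prior = [1.0 / 3, 1.0 / 3, 1.0 / 3]      # p(Z = z; theta), theta = prior weights
x_train = [1, 1, 0, 1, 1, 1, 0, 1]       # observed flips (1 = heads)


def likelihood(flips, bias):
    """p(X = flips | Z = bias): product of independent Bernoulli terms."""
    result = 1.0
    for flip in flips:
        result *= bias if flip == 1 else 1.0 - bias
    return result


# Objective 1: posterior over Z = joint / evidence.
joint = [likelihood(x_train, z) * p for z, p in zip(bias_values, prior)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]

# Objective 2: posterior predictive probability that the next flip is heads,
# i.e. p(x_test = 1 | z) averaged over the posterior p(z | x_train).
prob_next_heads = sum(likelihood([1], z) * w for z, w in zip(bias_values, posterior))

print("posterior over bias:", posterior)
print("p(next flip = heads | x_train):", prob_next_heads)
```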

---
## Maximum Likelihood Estimation

### How to find the best $p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)$?

To learn $p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)$, we first need to design a **measure of success** -- how useful our model is / how accurately it models the real-life true data distribution. Because we can only observe $\mathcal{X}$, let's define a "distance" measure between our incomplete data likelihood $p(\mathcal{X} ; \Theta = \theta)$ (rather than the complete data likelihood, because we can't observe $\mathcal{Z}$) and the true data distribution $f(\mathcal{X})$. The smaller the "distance" between the 2 distributions, the better our model approximates the true data generating process. A common "distance" measure between probability distributions is the [KL Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) ("distance" in quotes because the KL Divergence is not a true metric: it is asymmetric, $D_{KL}(P \vert\vert Q) \not= D_{KL}(Q \vert\vert P)$, and does not satisfy the triangle inequality). $D_{KL}(f(\mathcal{X}) \vert \vert p(\mathcal{X};\Theta=\theta))$ measures how well [$p$ approximates $f$](https://stats.stackexchange.com/questions/111445/analysis-of-kullback-leibler-divergence):

\begin{align}
    \theta^* 
    &= \arg\underset{\theta \in \Theta}{\min} D_{KL}(f \vert \vert p) \\
    &= \arg\underset{\theta \in \Theta}{\min}\int_{\mathbf{x} \in \mathcal{X}} f(\mathcal{X}=\mathbf{x}) \log \frac{f(\mathcal{X}=\mathbf{x})}{p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)} d\mathbf{x} \\
    &= \arg\underset{\theta \in \Theta}{\min}\mathbb{E}_{\mathbf{x} \sim f} [\log f(\mathcal{X}=\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &= \arg\underset{\theta \in \Theta}{\min}-\mathbb{H}[f(\mathcal{X})] - \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &= \arg\underset{\theta \in \Theta}{\max} \mathbb{E}_{\mathbf{x} \sim f} [\log p(\mathcal{X}=\mathbf{x} ; \Theta = \theta)] \\
    &\approx \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} \log p(\mathcal{X}=\mathbf{x}_i ; \Theta = \theta) \because \text{law of large numbers (the sample average converges to the expectation as } N \rightarrow \infty \text{)} \\
    &= \arg\underset{\theta \in \Theta}{\max} \prod_{\mathbf{x}_i \in \mathbf{X}_{\text{train}}} p(\mathcal{X}=\mathbf{x}_i ; \Theta = \theta) \because \log\text{ is a monotonic increasing function} \\
    &= \arg\underset{\theta \in \Theta}{\max} p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta) \because \text{i.i.d. data assumption} \\
    &= \theta_{\text{MLE}}
\end{align}

We have thus arrived at [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) of parameters (you can read more about this derivation method [here](https://slideplayer.com/slide/9502040/) and [here](https://jaketae.github.io/study/kl-mle/)), a point estimate of the parameters that maximizes the incomplete data likelihood (or the complete data likelihood when we have no latent variables in the model).
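The equivalence between minimizing the KL divergence and maximizing the average log likelihood can be checked numerically. This is a minimal sketch assuming a Bernoulli "true" distribution $f$ with a made-up parameter `true_p`, a model with no latent variables, and a simple grid search over $\theta$; none of these choices come from the notes above.

```python
import math
import random

random.seed(0)
true_p = 0.7  # parameter of the assumed true distribution f
samples = [1 if random.random() < true_p else 0 for _ in range(5000)]


def kl_bernoulli(p_true, p_model):
    """D_KL(f || p_theta) for two Bernoulli distributions."""
    return p_true * math.log(p_true / p_model) + (1.0 - p_true) * math.log(
        (1.0 - p_true) / (1.0 - p_model)
    )


def avg_log_likelihood(data, p_model):
    """(1/N) * sum_i log p(x_i; theta) for Bernoulli data."""
    heads = sum(data)
    tails = len(data) - heads
    return (heads * math.log(p_model) + tails * math.log(1.0 - p_model)) / len(data)


grid = [i / 1000 for i in range(1, 1000)]  # candidate values of theta

# Minimizing the KL divergence requires knowing f, which we normally do not.
theta_kl = min(grid, key=lambda t: kl_bernoulli(true_p, t))
# Maximizing the average log likelihood only needs samples from f.
theta_mle = max(grid, key=lambda t: avg_log_likelihood(samples, t))
sample_mean = sum(samples) / len(samples)

print("argmin KL:", theta_kl)
print("argmax avg log likelihood:", theta_mle, "sample mean:", sample_mean)
```

The two estimates agree up to sampling noise, and the grid MLE sits next to the sample mean, which is the closed-form Bernoulli MLE.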

### Why is MLE a "frequentist" inference technique?

The primary reason this technique is called a "frequentist" method is its assumption that $\Theta = \theta$ is a fixed parameter that needs to be estimated, while Bayesians believe that $\Theta$ should be a random variable and hence have a probability distribution $p(\Theta)$ that describes its behavior, which we call our **prior**. In probabilistic programming / machine learning, however, we don't have to worry about these conflicting paradigms. To "convert" $\Theta$ into a random variable, we just need to move $\Theta$ into $\mathcal{Z}$; as long as we have a way to model $\mathcal{Z}$, more specifically $p(\mathcal{Z} \vert \mathcal{X} ; \Theta = \theta)$, the **posterior** distribution of our latent variables, we are good.

### Can we simply find the $\theta$ that maximizes $p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)$?

Unfortunately, because our model is specified with the latent variables $\mathcal{Z}$, we can't directly maximize $p(\mathcal{X}=\mathbf{X}_{\text{train}} ; \Theta = \theta)$. We'll have to marginalize out the latent variables first as follows:

\begin{align}
    p(\mathcal{X} = \mathbf{X}_{\text{train}} ; \Theta = \theta) 
    &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}}, \mathcal{Z} = \mathbf{z}; \Theta = \theta) d\mathbf{z} \\
    &= \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \mathcal{Z} = \mathbf{z} ; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} ; \Theta = \theta) d\mathbf{z} \\
\end{align}

and hence, Maximum Likelihood Estimation becomes:

\begin{align}
    \theta^* 
    &= \arg\underset{\theta \in \Theta}{\max} \int_{\mathbf{z} \in \mathcal{Z}} p(\mathcal{X} = \mathbf{X}_{\text{train}} \vert \mathcal{Z} = \mathbf{z} ; \Theta = \theta) p(\mathcal{Z} = \mathbf{z} ; \Theta = \theta) d\mathbf{z} \\
\end{align}

However, this marginalization is often intractable (e.g. if $\mathcal{Z}$ is a sequence of events, the number of possible values grows exponentially with the sequence length, so computing the integral exactly is extremely difficult). Let's instead try to find a lower bound on the log evidence.

---
## Full Bayesian Inference

### Searching for the ELBO

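Using ideas from importance sampling, assume we have a variational distribution $q(\mathcal{Z} ; \Phi = \phi)$ [an approximation to the posterior $p({\mathcal{Z}} \mid {\mathcal{X}} ; \Theta = \theta)$], where $q(\mathcal{Z} ; \Phi = \phi) > 0$ whenever $p({\mathcal{Z}} \mid {\bf \theta}) = \int_{\mathbf{x} \in \mathcal{X}} p({\mathcal{X}} = \mathbf{x}, {\mathcal{Z}} \mid {\bf \theta}) d\mathbf{x} > 0$, and we rewrite the log evidence (shown for discrete $\mathcal{Z}$; the continuous case replaces the sum with an integral, and from here on "$\mid$" is written in place of "$;$" when conditioning on $\boldsymbol{\theta}$ and ${\bf \phi}$):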

\begin{align}
    \log p(\mathcal{X} \mid \boldsymbol{\theta }) 
    &= \log \sum_{z \in \mathcal{Z}} p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \frac{q({\mathcal{Z} = z} \mid {\bf \phi})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \\
    &= \log \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \right] \\
\end{align}

By Jensen's Inequality, given a concave function $f$ (e.g. $\log$), $f(\operatorname {E}\left[X\right]) \geq \operatorname {E}\left[f(X)\right]$ {cite}`Variatio28:online`:

\begin{align}
    \log p(\mathcal{X} \mid \boldsymbol{\theta }) 
    &= \log \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})} \right] \\
    &\geq \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log\left(\frac{p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})}{q({\mathcal{Z} = z} \mid {\bf \phi})}\right)\right] \\
    &= \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) - \log q({\mathcal{Z} = z} \mid {\bf \phi})\right] \\
    &= \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] - \operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] \\
    &= \underbrace{\underbrace{\operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right]}_{\text{Expected Complete-data Log Likelihood}} + \underbrace{\operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]}_{\text{Entropy of Variational Dist.}}}_{\text{ELBO / Negative Variational Free Energy } \mathcal{L}(q({\mathcal{Z}}\mid {\bf \phi}))} \\
\end{align}

Hence, we get an ***Evidence Lower Bound (ELBO)*** (also known as the ***Negative Variational Free Energy***) on the $\log$ evidence. Instead of an inequality, we can get an exact equality by rearranging the KL Divergence from our variational distribution (approximate posterior over latent variables) $q({\mathcal{Z}} \mid {\bf \phi})$ to our actual posterior over latent variables $p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})$:

Derivation from ${\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))$:

\begin{align}
    {\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))
    &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log\left(\frac{q({\mathcal{Z} = z} \mid {\bf \phi})}{p({\mathcal{Z} = z} \mid {\mathcal{X}}, {\bf \theta})}\right)\right] \\
    &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] - \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{Z} = z} \mid {\mathcal{X}}, {\bf \theta})\right] \\
    &= \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log q({\mathcal{Z} = z} \mid {\bf \phi})\right] - \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{X}}, {\mathcal{Z} = z} \mid {\bf \theta})\right] + \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{X}} \mid {\bf \theta})\right] \\
    &= -\left[\operatorname {E}_{q({\mathcal{Z} = z} \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] + \operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]\right] + \operatorname{E}_{q({\mathcal{Z} = z} \mid {\bf \phi})}\left[\log p({\mathcal{X}} \mid {\bf \theta})\right] \\
    &= -\mathcal{L}(q({\mathcal{Z}}\mid {\bf \phi})) + \log p({\mathcal{X}} \mid {\bf \theta}) \because \log p({\mathcal{X}} \mid {\bf \theta}) \text{ does not depend on } z \text{, so its expectation under } q \text{ is itself} \\
\end{align}

\begin{align}
    \therefore \log p({\mathcal{X}} \mid {\bf \theta}) 
    &= \mathcal{L}(q({\mathcal{Z}} \mid {\bf \phi})) + {\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})) \\
\end{align}
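The decomposition of the log evidence into ELBO plus KL can be verified numerically when $\mathcal{Z}$ is discrete. The sketch below assumes a hypothetical coin model (three possible bias values as the latent variable, uniform prior weights as $\theta$) and an arbitrary variational distribution `q`; all names and numbers are illustrative.

```python
import math

# Hypothetical discrete-latent model: coin bias Z in {0.2, 0.5, 0.8}.
bias_values = [0.2, 0.5, 0.8]
prior = [1.0 / 3, 1.0 / 3, 1.0 / 3]
x_train = [1, 1, 0, 1, 1, 1, 0, 1]


def log_joint(flips, bias, prior_prob):
    """log p(X = flips, Z = bias | theta)."""
    log_lik = sum(math.log(bias if f == 1 else 1.0 - bias) for f in flips)
    return log_lik + math.log(prior_prob)


log_joints = [log_joint(x_train, z, p) for z, p in zip(bias_values, prior)]
log_evidence = math.log(sum(math.exp(v) for v in log_joints))
posterior = [math.exp(v - log_evidence) for v in log_joints]


def elbo(q):
    """Expected complete-data log likelihood under q plus the entropy of q."""
    expected_log_joint = sum(qz * lj for qz, lj in zip(q, log_joints))
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return expected_log_joint + entropy


def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)


q = [0.1, 0.3, 0.6]  # an arbitrary variational distribution
gap = log_evidence - elbo(q)
print("log evidence:", log_evidence)
print("ELBO(q):", elbo(q), "KL(q || posterior):", kl(q, posterior), "gap:", gap)
print("ELBO at the exact posterior:", elbo(posterior))
```

The gap between the log evidence and the ELBO matches the KL term, and it closes when `q` is set to the exact posterior.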

Since $\log p({\mathcal{X}} \mid {\bf \theta})$ does not depend on ${\bf \phi}$, maximizing our ELBO / Negative Variational Free Energy with respect to ${\bf \phi}$ is equivalent to minimizing ${\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta}))$ (which is 0 when $q({\mathcal{Z}} \mid {\bf \phi}) = p({\mathcal{Z}} \mid {\mathcal{X}}, {\bf \theta})$), making our variational approximation as close as possible to the actual posterior over latents. After this procedure, our 2 tasks will look like:

- 1. Find the MLE (${\bf\theta}, {\bf\phi}$ are parameters) / MAP (${\bf\theta}, {\bf\phi}$ are random variables) estimates of the model parameters ${\bf \theta_{\rm{max}}}, {\bf \phi_{\rm{max}}}$ by maximizing the ELBO:

\begin{align}
    {\bf\theta_{\rm{max}}} &= \underset{\boldsymbol {\theta}}{\operatorname{argmax}} \log p(\mathcal{X} \mid \boldsymbol{\theta }) \\
    {\bf\theta_{\rm{max}}}, {\bf\phi_{\rm{max}}} &\approx \underset{{\bf \theta}, {\bf \phi}}{\operatorname{argmax}} \mathcal{L}(q({\mathcal{Z}} \mid {\bf \phi})) \\
    &= \underset{{\bf \theta}, {\bf \phi}}{\operatorname{argmax}} \left\{ \operatorname {E}_{q({\mathcal{Z}} = z \mid {\bf \phi})} \left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }})\right] + \operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right] \right\}  \\
\end{align}

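In maximizing the ELBO, the first term, the Expected Complete-data Log Likelihood, encourages the MLE / MAP estimates of the model parameters (and the variational distribution) to favor latent configurations under which the observed data are likely, while the second term, the entropy, discourages the variational distribution from collapsing onto a single configuration.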

- 2. Find the posterior over the latent variables $\mathcal{Z}$, $p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol {\theta_{\rm{max}} })$ {cite}`SVIPartI61:online`:

\begin{align}
    p(\mathcal{Z} \mid \mathcal{X}, \boldsymbol {\theta_{\rm{max}} }) &\approx q({\mathcal{Z}} \mid {\bf \phi}) \\
\end{align}

### Finding the ELBO Part 1: Expectation-Maximization

The EM algorithm seeks the MLE of the parameters ${\boldsymbol \theta}$ by maximizing the evidence / marginal likelihood / incomplete-data likelihood, iteratively applying these two steps {cite}`Expectat45:online`:

- 1. Expectation step (E step): Set the approximate posterior / variational distribution $q({\mathcal{Z}}\mid {\bf \phi}) = p(\mathcal{Z} \mid \mathcal{X}, {\boldsymbol {\theta }}^{(t)})$, where ${\boldsymbol {\theta }}^{(t)}$ are the previous M-step estimates of ${\boldsymbol \theta}$. This way ${\rm KL}(q({\mathcal{Z}} \mid {\bf \phi}) \mid\mid p({\mathcal{Z}} \mid {\mathcal{X}}, {\boldsymbol {\theta }}^{(t)})) = 0$ and $\log p({\mathcal{X}} \mid {\boldsymbol {\theta }}^{(t)}) = \mathcal{L}(p({\mathcal{Z}} \mid {\mathcal{X}}, {\boldsymbol {\theta }}^{(t)}))$, i.e. the bound is tight at ${\boldsymbol {\theta }}^{(t)}$. Our objective is then to 

    - A. Calculate the posterior over latent variables $p(\mathcal{Z} \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})$ and 
    
    - B. Calculate $Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})$ (Expected Complete data Log Likelihood):

\begin{align}
    Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)}) &= \operatorname {E} _{p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log L({\boldsymbol {\theta }};\mathcal{X} ,\mathcal{Z} = z )\right]\, \\
    &= \operatorname {E} _{p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \right]\, \\
    &= \sum_{z \in \mathcal{Z}} p(\mathcal{Z} = z \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)}) \log p(\mathcal{X} ,\mathcal{Z} = z \mid {\boldsymbol {\theta }}) \\
\end{align}

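Notice that the only thing missing from $Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})$ compared to the ELBO is the entropy of the approximate posterior distribution $\operatorname{H}\left[q({\mathcal{Z}} \mid {\bf \phi})\right]$. With $q$ fixed to $p(\mathcal{Z} \mid \mathcal{X} ,{\boldsymbol {\theta }}^{(t)})$, this entropy does not depend on ${\boldsymbol \theta}$, so maximizing $Q$ over ${\boldsymbol \theta}$ is equivalent to maximizing the ELBO over ${\boldsymbol \theta}$.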

- 2. Maximization step (M step): Find the parameters that maximize $Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})$:

\begin{align} 
    {\boldsymbol {\theta }}^{(t+1)} &= {\underset {\boldsymbol {\theta }}{\operatorname {arg\,max} }}\ Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})\,
\end{align}
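The two steps above can be sketched for a one-dimensional, two-component Gaussian mixture, where the E step computes the posterior responsibilities and the M step has a closed-form maximizer. The synthetic data, the initial values, and the function names are assumptions made for illustration, not something prescribed by the notes.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def log_evidence(data, weights, means, stds):
    """Incomplete-data log likelihood: sum_i log sum_z p(x_i, z | theta)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)))
        for x in data
    )


def em_step(data, weights, means, stds):
    """One EM iteration: returns the updated (weights, means, stds)."""
    # E step: responsibilities resp[i][k] = p(Z_i = k | x_i, theta_t).
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: closed-form maximizer of Q(theta | theta_t) for a Gaussian mixture.
    new_weights, new_means, new_stds = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mean_k = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, data)) / nk
        new_weights.append(nk / len(data))
        new_means.append(mean_k)
        new_stds.append(math.sqrt(var_k))
    return new_weights, new_means, new_stds


# Synthetic data from an assumed mixture: 30% N(-2, 1) and 70% N(3, 1).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(300)]
data += [random.gauss(3.0, 1.0) for _ in range(700)]

weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # initial theta
history = [log_evidence(data, weights, means, stds)]
for _ in range(50):
    weights, means, stds = em_step(data, weights, means, stds)
    history.append(log_evidence(data, weights, means, stds))

print("weights:", weights)
print("means:", means)
print("stds:", stds)
```

Tracking `history` is a useful sanity check: each EM iteration should leave the incomplete-data log likelihood unchanged or higher.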

### Finding the ELBO Part 2: Markov Chain Monte Carlo

### Finding the ELBO Part 3: Mean-Field Approximate Variational Inference

### Finding the ELBO Part 4: Black-Box Stochastic Variational Inference

---
## Maximum A Posteriori

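Before continuing, recall that the product rule lets us factorize the complete data likelihood as

\begin{align}
    p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta) &= p(\mathcal{Z} \vert \mathcal{X}; \Theta = \theta) p(\mathcal{X} ; \Theta = \theta)
\end{align}

so that the posterior over the latent variables is

\begin{align}
    p(\mathcal{Z} \vert \mathcal{X}; \Theta = \theta) &= \frac{p(\mathcal{X}, \mathcal{Z} ; \Theta = \theta)}{p(\mathcal{X} ; \Theta = \theta)}
\end{align}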

\begin{align}
    &= \arg\underset{\theta \in \Theta}{\max} \frac{1}{N}\sum_{\mathbf{x} \in \mathbf{x}_{\text{train}}} \int_{\mathbf{z} \in \mathcal{Z}} \log p(\mathcal{X}=\mathbf{x}, \mathcal{Z}=\mathbf{z}; \Theta = \theta) d\mathbf{z} \\
\end{align}

```{note} Mathematical Notation

The math notation in my content, including this post, follows the conventions in Christopher M. Bishop's Pattern Recognition and Machine Learning. In addition, I use calligraphic capital Roman and capital Greek symbols like $\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \Omega, \Psi, \Xi, \ldots$ to represent **BOTH** the set of values that a random variable can take and the argument of a function in Python (e.g. `def p(Θ=θ)`).

```



https://pyro.ai/examples/intro_long.html#Background:-inference,-learning-and-evaluation

Objective:

\begin{align}
    
\end{align}

---
## Full Bayesian Inference

We're now ready to discuss how MLE is performed in probabilistic machine learning. The key difference between 

Specifically in [Pyro](https://pyro.ai/examples/mle_map.html), to get MLE estimates of $\theta$, simply declare $\theta$ as a fixed parameter using `pyro.param` in the `model` and use an empty `guide` (variational distribution). To get MAP estimates instead, declare $\theta$ like a regular latent random variable with `pyro.sample` in the `model`, and in the `guide` declare $\theta$ as being drawn from a Dirac delta distribution (`dist.Delta`) whose location is a learnable parameter.

### Parameter Uncertainty

Frequentist: Uncertainty is estimated with confidence intervals

Bayesian: Uncertainty is estimated with credible intervals
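A small numerical contrast between the two kinds of interval, as a sketch under stated assumptions: the data are a made-up 6 heads in 10 flips, the frequentist interval is the normal-approximation (Wald) interval, which is only one of several possible constructions, and the Bayesian interval assumes a uniform `Beta(1, 1)` prior and is estimated by Monte Carlo.

```python
import math
import random

heads, n = 6, 10  # hypothetical data: 6 heads in 10 flips
p_hat = heads / n  # maximum likelihood estimate of the coin bias

# Frequentist: 95% Wald confidence interval around the point estimate.
z_score = 1.96
half_width = z_score * math.sqrt(p_hat * (1.0 - p_hat) / n)
confidence_interval = (p_hat - half_width, p_hat + half_width)

# Bayesian: with a Beta(1, 1) prior the posterior is Beta(1 + heads, 1 + tails).
# The equal-tailed 95% credible interval is read off sorted posterior draws.
random.seed(0)
draws = sorted(random.betavariate(1 + heads, 1 + n - heads) for _ in range(20000))
credible_interval = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])

print("95% confidence interval:", confidence_interval)
print("95% credible interval  :", credible_interval)
```

The confidence interval is a statement about the procedure under repeated sampling, while the credible interval is a statement about the posterior distribution of the parameter given this dataset and prior.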

### Prediction Intervals

---
## Empirical Bayes

### Hierarchical Bayes