# Approximate Inference



For many interesting models the evidence  

$$
p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\mathcal{M}_i, \theta) p(\theta| \mathcal{M}_i) d\theta
$$

and hence the posterior are intractable:

- The integral has no closed form
- The dimensionality is so high that numerical integration is not feasible

We resort to stochastic or deterministic approximations

- Markov chain Monte Carlo (MCMC) is computationally demanding but asymptotically exact
- Variational inference (VI) scales better but is not exact

# Laplace Approximation


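Define the log of the unnormalized posterior as a function of $\theta \in \mathbb{R}^K$

$$
g(\theta) = \log \left[ p(\mathcal{D}| \theta) p(\theta) \right]
$$

Take a second-order Taylor expansion of $g$ around the maximum a posteriori (MAP) estimate $\hat \theta_{\text{map}} = \text{arg}\max_\theta g(\theta)$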

$$
\begin{align}
g(\theta) \approx g(\hat \theta_{\text{map}}) &+ (\theta - \hat \theta_{\text{map}})^T \frac{dg}{d\theta}\bigg \rvert_{\theta=\hat \theta_{\text{map}}} \nonumber \\
&+ \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \frac{d^2 g}{d\theta^2} \bigg \rvert_{\theta=\hat \theta_{\text{map}}} (\theta - \hat \theta_{\text{map}})
\end{align}
$$

- Because $\hat \theta_{\text{map}}$ maximizes $g$, the first derivative evaluated at $\hat \theta_{\text{map}}$ is zero
- We denote the negative Hessian evaluated at $\hat \theta_{\text{map}}$ by $\Sigma^{-1} = -\frac{d^2 g}{d\theta^2} (\hat \theta_{\text{map}})$

Plugging this approximation into the evidence $p(\mathcal{D}) = \int e^{g(\theta)} d\theta$ leaves a Gaussian integral with a closed-form solution

$$
p(\mathcal{D}) \approx  e^{g(\hat \theta_{\text{map}})} \int e^{-  \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \Sigma^{-1} (\theta - \hat \theta_{\text{map}})} d\theta = e^{g(\hat \theta_{\text{map}})} (2\pi)^{K/2} |\Sigma|^{1/2}
$$

and the posterior becomes

$$
\begin{align}
p(\theta| \mathcal{D}) &= \frac{p(\mathcal{D}|\theta) p(\theta) }{p(\mathcal{D})} \nonumber \\
&\approx \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} e^{-  \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \Sigma^{-1} (\theta - \hat \theta_{\text{map}})} 
\end{align}
$$

> The Laplace method approximates the posterior by a **multivariate Gaussian** centered at the MAP with covariance $\Sigma$

It takes two steps:

1. Find the mode (MAP)
1. Evaluate the Hessian at the mode

Note that

- We didn't assume any particular distribution for the prior or the likelihood
- We require $g$ to be twice differentiable with respect to $\theta$
- We also require the negative Hessian of $g$ at the MAP to be positive definite, so that its inverse $\Sigma$ is a proper covariance matrix
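The two steps can be sketched on a one-dimensional toy problem. The model below (a Bernoulli likelihood with 7 successes and 3 failures and a uniform prior on $\theta$) is an assumption chosen for illustration because its exact evidence is a Beta function. The helper names are hypothetical, and derivatives are taken by finite differences to keep the sketch generic.

```python
import math

def log_joint(theta, s=7, f=3):
    """g(theta): s successes, f failures, uniform prior on (0, 1) (assumed toy model)."""
    return s * math.log(theta) + f * math.log(1.0 - theta)

def first_derivative(fn, x, h=1e-5):
    # Central finite difference for dg/dtheta
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

def second_derivative(fn, x, h=1e-4):
    # Central finite difference for d^2 g / dtheta^2
    return (fn(x + h) - 2.0 * fn(x) + fn(x - h)) / h ** 2

def laplace_1d(fn, theta0, n_iter=50):
    # Step 1: find the mode (MAP) with Newton's method
    theta = theta0
    for _ in range(n_iter):
        theta -= first_derivative(fn, theta) / second_derivative(fn, theta)
    # Step 2: the negative inverse Hessian at the mode is the variance
    sigma2 = -1.0 / second_derivative(fn, theta)
    # Laplace estimate of the log evidence with K = 1
    log_evidence = fn(theta) + 0.5 * math.log(2.0 * math.pi * sigma2)
    return theta, sigma2, log_evidence

theta_map, sigma2, log_ev = laplace_1d(log_joint, theta0=0.5)
# Exact log evidence for this toy model: log Beta(s + 1, f + 1)
exact = math.lgamma(8) + math.lgamma(4) - math.lgamma(12)
print(theta_map, sigma2, log_ev, exact)
```

With only ten observations the Gaussian fit is rough, so the estimated and exact log evidence should agree only approximately.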

#### Evidence decomposition 

Using the Laplace approximation the log evidence can be decomposed as

$$
\begin{align}
\log p(\mathcal{D}|\mathcal{M}_i) &\approx g(\hat \theta_{\text{map}}) + \log \left[ (2\pi)^{K/2} |\Sigma|^{1/2} \right] \nonumber \\
&=\log p(\mathcal{D}|\mathcal{M}_i, \hat \theta_{\text{map}}) + \log p(\hat \theta_{\text{map}}| \mathcal{M}_i) + \frac{K}{2} \log(2\pi) + \frac{1}{2} \log | \Sigma | \nonumber 
\end{align}
$$

> The log evidence is approximated by the log likelihood at the MAP plus the Occam factor

The Occam factor depends on the

- log prior density evaluated at $\hat \theta_{\text{map}}$
- number of parameters $K$
- curvature (second derivative) of the log joint at the MAP, which sets the posterior width $|\Sigma|$ (model uncertainty)


If the prior is very broad and the number of observations $N$ is very large, this approximation reduces to the **Bayesian Information Criterion** (BIC)
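A sketch of the argument, assuming $N$ i.i.d. observations and a prior broad enough that its contribution to the Hessian is negligible. The negative Hessian is then a sum of $N$ per-observation terms, so it grows linearly with $N$. The symbol $\bar H$ (the average per-observation negative Hessian of the log likelihood) is notation introduced here for illustration.

$$
\begin{align}
\Sigma^{-1} &\approx N \bar H \quad \Rightarrow \quad \log |\Sigma| \approx -K \log N - \log |\bar H| \nonumber \\
\log p(\mathcal{D}|\mathcal{M}_i) &\approx \log p(\mathcal{D}|\mathcal{M}_i, \hat \theta_{\text{map}}) - \frac{K}{2} \log N + \mathcal{O}(1) \nonumber
\end{align}
$$

Dropping the terms that do not grow with $N$ and multiplying by $-2$ gives the usual form $\text{BIC} = K \log N - 2 \log p(\mathcal{D}|\mathcal{M}_i, \hat \theta)$. For large $N$ and a broad prior the MAP and maximum likelihood estimates coincide.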

# Variational Inference (VI)

We want the posterior

$$
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})}
$$

but it may be intractable

In VI a simpler (tractable) posterior distribution is proposed

$$
q_\eta(\theta)
$$

> We approximate $p(\theta|\mathcal{D})$ with $q_\eta(\theta)$

$q_\eta(\theta)$ represents a family of distributions parametrized by $\eta$

> **Optimization problem:** Find $\eta$ that makes $q$ most similar to $p$

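We can write this as the minimization of a KL divergence

$$
\min_\eta D_{\text{KL}}[q_\eta(\theta) || p(\theta|\mathcal{D})] \quad \text{with} \quad D_{\text{KL}}[q_\eta(\theta) || p(\theta|\mathcal{D})] = \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\theta|\mathcal{D})} d\theta
$$

This is still intractable, because it depends on the very posterior we are trying to approximate!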

To continue we apply Bayes' theorem to the posterior and move the evidence out of the integral

$$
D_{\text{KL}}[q_\eta(\theta) || p(\theta|\mathcal{D})] = \log p(\mathcal{D}) + \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta
$$
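Spelling out this step: substitute $p(\theta|\mathcal{D}) = p(\mathcal{D}|\theta) p(\theta) / p(\mathcal{D})$ and use $\int q_\eta(\theta) d\theta = 1$

$$
\begin{align}
D_{\text{KL}}[q_\eta(\theta) || p(\theta|\mathcal{D})] &= \int q_\eta(\theta) \log \frac{q_\eta(\theta) \, p(\mathcal{D})}{p(\mathcal{D}|\theta) p (\theta)} d\theta \nonumber \\
&= \log p(\mathcal{D}) \int q_\eta(\theta) d\theta + \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta \nonumber
\end{align}
$$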

> The evidence does not depend on $\eta$, so we can ignore it when minimizing with respect to $\eta$

The KL divergence is 
- non-negative
- zero only if $q_\eta(\theta) \equiv p(\theta|\mathcal{D})$

Using the non-negativity we find a lower bound for the evidence

$$
\log p(\mathcal{D}) \geq  \mathcal{L}(\eta) = - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta
$$

> $\mathcal{L}(\eta)$ is called the **Evidence Lower BOund** (ELBO)

- Minimizing the KL divergence between $q$ and $p$ is equivalent to maximizing the ELBO with respect to $\eta$
$$
\hat \eta = \text{arg}\max_\eta \mathcal{L}(\eta)
$$
- We can use $q_{\hat \eta}(\theta)$ as a drop-in replacement for $p(\theta|\mathcal{D})$
- The ELBO is tractable for simple, parametric $q$
- The bound is tight only if the posterior $p(\theta|\mathcal{D})$ belongs to the family $q_\eta$
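The bound can be checked numerically on a model where everything is available in closed form. The sketch below assumes a toy conjugate model that is not part of the notes: unit-variance Gaussian likelihood, a $\mathcal{N}(0, 1)$ prior on the mean $\theta$, and a Gaussian $q_\eta$ with $\eta = (m, s^2)$. The data values and function names are hypothetical.

```python
import math

data = [0.8, 1.9, 1.1, 0.3, 1.4]  # hypothetical observations
N = len(data)

def log_normal(x, mean, var):
    # Log density of a univariate Gaussian
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint(theta):
    # log p(D|theta) + log p(theta): unit-variance likelihood, N(0, 1) prior
    return sum(log_normal(x, theta, 1.0) for x in data) + log_normal(theta, 0.0, 1.0)

# Conjugate model: the exact posterior is N(post_mean, post_var)
post_var = 1.0 / (N + 1)
post_mean = sum(data) / (N + 1)
# Exact log evidence from Bayes' rule, evaluated at an arbitrary theta (here 0)
log_evidence = log_joint(0.0) - log_normal(0.0, post_mean, post_var)

def elbo(m, s2):
    # E_q[log p(D|theta)] + E_q[log p(theta)] + entropy of q = N(m, s2)
    exp_loglik = sum(log_normal(x, m, 1.0) - 0.5 * s2 for x in data)
    exp_logprior = log_normal(m, 0.0, 1.0) - 0.5 * s2
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return exp_loglik + exp_logprior + entropy

for m, s2 in [(0.0, 1.0), (1.0, 0.5), (post_mean, post_var)]:
    print(m, s2, elbo(m, s2), log_evidence)
```

Because the exact posterior is Gaussian here, it lies inside the family of $q$, so the last candidate should attain the bound while the other two stay below it.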



Why variational?

- The ELBO is a functional: a function of a function (here, of $q$). The [calculus of variations](https://en.wikipedia.org/wiki/Calculus_of_variations) studies derivatives of functionals
- In physics, $-\mathcal{L}(\eta)$ is known as the [variational free energy](https://en.wikipedia.org/wiki/Thermodynamic_free_energy)


### Another way to "obtain" the ELBO

Using Jensen's inequality on the log evidence

$$
\begin{align}
\log p(\mathcal{D}) &=  \log \mathbb{E}_{\theta\sim p(\theta)} \left[p(\mathcal{D}|\theta)\right]\nonumber \\
&=  \log \mathbb{E}_{\theta\sim q_\eta(\theta)} \left[p(\mathcal{D}|\theta)\frac{p(\theta)}{q_\eta(\theta)}\right]\nonumber \\
&\geq  \mathbb{E}_{\theta\sim q_\eta(\theta)} \left[\log \left( p(\mathcal{D}|\theta)\frac{p(\theta)}{q_\eta(\theta)} \right) \right] =- \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta \nonumber 
\end{align}
$$
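The inequality can also be seen with samples. In the sketch below the importance weights are $w = p(\mathcal{D}|\theta) p(\theta) / q_\eta(\theta)$ with $\theta \sim q_\eta$. The log of their average estimates the log evidence, while the average of their logs estimates the ELBO. The Gaussian toy model, the data values, and the choice of $q$ are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
data = [0.8, 1.9, 1.1, 0.3, 1.4]  # hypothetical observations

def log_normal(x, mean, var):
    # Log density of a univariate Gaussian
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint(theta):
    # Unit-variance Gaussian likelihood with a N(0, 1) prior (assumed toy model)
    return sum(log_normal(x, theta, 1.0) for x in data) + log_normal(theta, 0.0, 1.0)

m, s = 0.5, 0.8  # arbitrary variational parameters, q = N(m, s^2)
samples = [random.gauss(m, s) for _ in range(5000)]
log_w = [log_joint(t) - log_normal(t, m, s * s) for t in samples]

# log E_q[w], computed with the log-sum-exp trick for numerical stability
mx = max(log_w)
log_mean_w = mx + math.log(sum(math.exp(lw - mx) for lw in log_w) / len(log_w))
# E_q[log w], the Monte Carlo estimate of the ELBO
mean_log_w = sum(log_w) / len(log_w)

print(log_mean_w, mean_log_w)
```

Jensen's inequality holds for the empirical average as well, so the first number is never smaller than the second regardless of the random seed.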

### A closer look at the ELBO

$$
\begin{align}
\mathcal{L}(\eta) &= - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta \nonumber \\
&= - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{ p (\theta)} d\theta + \int q_\eta(\theta) \log p(\mathcal{D}|\theta) d\theta \nonumber \\
&= - D_{KL}[q_\eta(\theta) || p(\theta)] + \mathbb{E}_{\theta \sim q_\eta(\theta)} \left[\log p(\mathcal{D}|\theta)\right]\nonumber 
\end{align}
$$

> Maximizing the ELBO balances two goals:

- Maximizing the expected log likelihood under the approximate posterior
- Minimizing the KL divergence between the approximate posterior and the prior
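This decomposition suggests a simple estimator: compute the KL term in closed form when $q$ and the prior are both Gaussian, and estimate the expected log likelihood by sampling from $q$. The sketch below assumes a toy model with unit-variance Gaussian likelihood and a $\mathcal{N}(0, 1)$ prior. The data values and the function names (`kl_to_prior`, `elbo_mc`) are hypothetical.

```python
import math
import random

random.seed(1)
data = [0.8, 1.9, 1.1, 0.3, 1.4]  # hypothetical observations

def log_normal(x, mean, var):
    # Log density of a univariate Gaussian
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_lik(theta):
    # log p(D|theta) for a unit-variance Gaussian likelihood
    return sum(log_normal(x, theta, 1.0) for x in data)

def kl_to_prior(m, s2):
    # Closed-form KL[N(m, s2) || N(0, 1)]
    return 0.5 * (s2 + m * m - 1.0 - math.log(s2))

def elbo_mc(m, s2, n_samples=20000):
    # Monte Carlo estimate of E_q[log p(D|theta)] minus the exact KL term
    s = math.sqrt(s2)
    exp_loglik = sum(log_lik(random.gauss(m, s)) for _ in range(n_samples)) / n_samples
    return exp_loglik - kl_to_prior(m, s2)

# A broad q that stays at the prior versus a q concentrated near the data
print(elbo_mc(0.0, 1.0))
print(elbo_mc(sum(data) / (len(data) + 1), 1.0 / (len(data) + 1)))
```

The first candidate pays no KL penalty but fits the data poorly, while the second trades a nonzero KL for a much higher expected log likelihood.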

### Fully-factorized posterior

A simple (tractable) posterior

$$
q_\eta(\theta) = \prod_{i=1}^K q_{\eta_i}(\theta_i)
$$

- The factors are independent, so correlations between parameters are ignored
- This is known as mean-field VI, after mean-field theory in physics

Using this factorized posterior the ELBO can be written, for any factor $i$, as

$$
\mathcal{L}(\eta) =  \int q_{\eta_i}(\theta_i) \left [ \int \log \left( p(\mathcal{D}|\theta)p(\theta) \right) \prod_{j\neq i} q_{\eta_j}(\theta_j) d\theta_j \right ] d\theta_i - \sum_k \int q_{\eta_k}(\theta_k) \log q_{\eta_k}(\theta_k)  d\theta_k
$$

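We can iteratively keep all factors except the $i$-th fixed and update $q_{\eta_i}$, cycling over $i$ (coordinate ascent)

- Convergence is guaranteed: the ELBO is concave with respect to each factor individually, so no update can decrease it. The optimum found is in general only local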
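The coordinate updates can be sketched on a textbook case where each update has a closed form: a two-dimensional Gaussian target with mean $\mu$ and precision matrix $\Lambda$, approximated by a product of two univariate Gaussians. The target stands in for the posterior. Its numerical values and the function name `cavi` are assumptions made for illustration. For this target the optimal factor $i$ is Gaussian with variance $1/\Lambda_{ii}$ and a mean that depends on the current mean of the other factor.

```python
# Target: 2-D Gaussian with mean mu and precision matrix Lam (assumed values)
mu = (1.0, -1.0)
Lam = ((2.0, 1.2), (1.2, 2.0))  # symmetric and positive definite

def cavi(n_sweeps=50, m_init=(0.0, 0.0)):
    """Coordinate ascent for q(theta) = q_1(theta_1) q_2(theta_2)."""
    m1, m2 = m_init
    history = []
    for _ in range(n_sweeps):
        # Update q_1 while keeping q_2 fixed
        m1 = mu[0] - Lam[0][1] / Lam[0][0] * (m2 - mu[1])
        # Update q_2 while keeping q_1 fixed
        m2 = mu[1] - Lam[1][0] / Lam[1][1] * (m1 - mu[0])
        history.append((m1, m2))
    # The factor variances do not depend on the other factor for this target
    variances = (1.0 / Lam[0][0], 1.0 / Lam[1][1])
    return (m1, m2), variances, history

means, variances, history = cavi()
print(history[:3])
print(means, variances)
```

In this example the factor means should approach the target mean after a few sweeps, since each sweep shrinks the error by a constant factor.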

- https://cedar.buffalo.edu/~srihari/CSE574/Chap4/4.5.1-BayesLogistic.pdf
- https://cedar.buffalo.edu/~srihari/CSE574/Chap4/4.5.2-VarBayesLogistic.pdf
- https://cedar.buffalo.edu/~srihari/CSE574/Chap10/10.2VariationalInference.pdf