# Approximate Inference



For most interesting models the evidence (marginal likelihood) is an intractable integral over the parameters

$$
p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\mathcal{M}_i, \theta) p(\theta| \mathcal{M}_i) d\theta
$$

We therefore resort to sampling (MCMC) or to approximate inference
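To make the evidence integral concrete, it can be evaluated on a grid when there is a single parameter. The sketch below assumes a hypothetical coin-flip model (Bernoulli likelihood with a Beta(a, b) prior), chosen only because its evidence is known in closed form; the data values and function names are illustrative and not part of these notes.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_joint(theta, k, n, a, b):
    """log p(D | theta) + log p(theta) for k successes in n trials (assumed model)."""
    log_lik = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
    log_prior = ((a - 1.0) * math.log(theta)
                 + (b - 1.0) * math.log(1.0 - theta)
                 - log_beta(a, b))
    return log_lik + log_prior

def evidence_quadrature(k, n, a, b, num_points=10_000):
    """Midpoint-rule approximation of the evidence integral over theta in (0, 1)."""
    h = 1.0 / num_points
    total = 0.0
    for i in range(num_points):
        theta = (i + 0.5) * h  # midpoints never touch 0 or 1, so the logs are finite
        total += math.exp(log_joint(theta, k, n, a, b))
    return total * h

def evidence_exact(k, n, a, b):
    """Closed-form evidence of the Beta-Bernoulli model, used as a reference."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

# Illustrative data: 7 successes in 10 trials, Beta(2, 2) prior
k, n, a, b = 7, 10, 2.0, 2.0
print(evidence_quadrature(k, n, a, b), evidence_exact(k, n, a, b))
```

A grid with $G$ points per parameter needs $G^K$ evaluations for $K$ parameters, which is why this brute-force approach does not scale beyond a handful of parameters.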

## Laplace method

Define the logarithm of the unnormalized posterior as a function of $\theta$

$$
g(\theta) = \log \left[ p(\mathcal{D}|\mathcal{M}_i, \theta) p(\theta| \mathcal{M}_i) \right]
$$

We can do a second-order Taylor expansion around the maximum a posteriori (MAP) estimate $\theta= \hat \theta_{\text{map}}$

$$
g(\theta) \approx g(\hat \theta_{\text{map}}) + (\theta - \hat \theta_{\text{map}})^T \frac{dg}{d\theta}(\hat \theta_{\text{map}}) + \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \frac{d^2 g}{d\theta^2} (\hat \theta_{\text{map}}) (\theta - \hat \theta_{\text{map}})
$$

Since $\hat \theta_{\text{map}}$ maximizes $g$ the first derivative is zero there, and calling $\Sigma^{-1} = -\frac{d^2 g}{d\theta^2} (\hat \theta_{\text{map}})$ (the negative Hessian at the MAP)

$$
g(\theta) \approx g(\hat \theta_{\text{map}}) -  \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \Sigma^{-1} (\theta - \hat \theta_{\text{map}})
$$

Plugging the approximation into the evidence, which is the integral of $e^{g(\theta)}$

$$
p(\mathcal{D}|\mathcal{M}_i) \approx  e^{g(\hat \theta_{\text{map}})} \int e^{-  \frac{1}{2} (\theta - \hat \theta_{\text{map}})^T \Sigma^{-1} (\theta - \hat \theta_{\text{map}})} d\theta 
$$

The integral is the normalizing constant of a multivariate Gaussian in $K$ dimensions (one per parameter), $(2\pi)^{K/2} |\Sigma|^{1/2}$, so taking the logarithm

$$
\log p(\mathcal{D}|\mathcal{M}_i) \approx \log p(\mathcal{D}|\mathcal{M}_i, \hat \theta_{\text{map}}) + \log p(\hat \theta_{\text{map}}| \mathcal{M}_i) + \frac{K}{2} \log(2\pi) + \frac{1}{2} \log | \Sigma |
$$

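The log evidence is approximated by the log likelihood at the best fit plus the log of the Occam factor (the last three terms)

The Occam factor depends on the
- second derivative of the log posterior at the MAP, through $|\Sigma|$ (model uncertainty)
- number of parameters $K$ (complexity)
- and the prior pdf evaluated at $\hat \theta_{\text{map}}$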
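The Laplace formula can be checked numerically on a model whose evidence is known exactly. The sketch below assumes a hypothetical one-parameter ($K = 1$) coin-flip model with a Bernoulli likelihood and a Beta(a, b) prior; the model, data values and function names are illustrative assumptions, not part of these notes.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def laplace_log_evidence(k, n, a, b):
    """Laplace approximation of log p(D) for k successes in n trials, Beta(a, b) prior.

    Here g(theta) = alpha * log(theta) + beta * log(1 - theta) - log B(a, b).
    Assumes alpha > 0 and beta > 0 so that the MAP lies inside (0, 1).
    """
    alpha = k + a - 1.0
    beta = n - k + b - 1.0
    theta_map = alpha / (alpha + beta)  # zero of dg/dtheta
    g_map = (alpha * math.log(theta_map)
             + beta * math.log(1.0 - theta_map)
             - log_beta(a, b))
    # Second derivative of g at the MAP (negative at a maximum)
    hessian = -alpha / theta_map**2 - beta / (1.0 - theta_map)**2
    sigma = -1.0 / hessian  # Sigma is a scalar variance because K = 1
    return g_map + 0.5 * math.log(2.0 * math.pi) + 0.5 * math.log(sigma)

def exact_log_evidence(k, n, a, b):
    """Closed-form log evidence of the Beta-Bernoulli model."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)

for k, n in [(7, 10), (70, 100)]:
    approx = laplace_log_evidence(k, n, 2.0, 2.0)
    exact = exact_log_evidence(k, n, 2.0, 2.0)
    print(n, approx, exact, abs(approx - exact))
```

In this example the gap shrinks as the number of observations grows, because the log posterior becomes closer to a quadratic around its maximum.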

If the prior is very broad and the number of observations $N$ is very large we recover the **Bayesian Information Criterion** (BIC)
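A sketch of the usual argument, assuming the $N$ observations $x_n$ are i.i.d. so that the log likelihood is a sum of $N$ terms (the symbols $x_n$ and $\bar H$ are introduced here for illustration and do not appear elsewhere in these notes)

$$
\Sigma^{-1} = -\sum_{n=1}^N \frac{d^2 \log p(x_n|\mathcal{M}_i, \theta)}{d\theta^2} (\hat \theta_{\text{map}}) - \frac{d^2 \log p(\theta|\mathcal{M}_i)}{d\theta^2} (\hat \theta_{\text{map}}) \approx N \bar H
$$

where $\bar H$ is the average negative Hessian of the per-observation log likelihood and the prior term is negligible for a broad prior. Because $\bar H$ is a $K \times K$ matrix

$$
\frac{1}{2} \log |\Sigma| \approx -\frac{K}{2} \log N - \frac{1}{2} \log |\bar H|
$$

Keeping only the terms that grow with $N$, and noting that with a broad prior the MAP approaches the maximum likelihood estimate $\hat \theta_{\text{ml}}$

$$
\log p(\mathcal{D}|\mathcal{M}_i) \approx \log p(\mathcal{D}|\mathcal{M}_i, \hat \theta_{\text{ml}}) - \frac{K}{2} \log N
$$

which is the BIC up to the conventional factor of $-2$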

## Variational Inference

We want the posterior

$$
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})}
$$

We have the likelihood and the prior but the evidence is intractable

Let's propose an approximate posterior

$$
q_\eta(\theta)
$$

This approximate posterior is tuned by changing its variational parameters $\eta$

We turn this into an optimization problem: find the $\eta$ that makes $q_\eta(\theta)$ most similar to $p(\theta|\mathcal{D})$

We can measure this similarity with the KL divergence

$$
D_{KL}[q_\eta(\theta) || p(\theta|\mathcal{D})] = \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\theta|\mathcal{D})} d\theta
$$
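As a quick numerical illustration of this definition, the sketch below evaluates the KL integral by quadrature for two one-dimensional Gaussians and compares it with the known closed form. The choice of Gaussians, the parameter values and the function names are assumptions made for this example only.

```python
import math

def normal_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_quadrature(mq, sq, mp, sp, num_points=20_000, width=10.0):
    """Midpoint-rule estimate of KL[q || p] = int q(x) log(q(x) / p(x)) dx.

    Integrates over [mq - width * sq, mq + width * sq], where q has almost all its mass.
    """
    lo = mq - width * sq
    h = 2.0 * width * sq / num_points
    total = 0.0
    for i in range(num_points):
        x = lo + (i + 0.5) * h
        log_q = normal_logpdf(x, mq, sq)
        log_p = normal_logpdf(x, mp, sp)
        total += math.exp(log_q) * (log_q - log_p)
    return total * h

def kl_closed_form(mq, sq, mp, sp):
    """Closed-form KL[N(mq, sq^2) || N(mp, sp^2)]."""
    return math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2.0 * sp**2) - 0.5

print(kl_quadrature(0.0, 1.0, 1.0, 2.0), kl_closed_form(0.0, 1.0, 1.0, 2.0))
# The KL divergence is not symmetric: swapping q and p gives a different value
print(kl_closed_form(1.0, 2.0, 0.0, 1.0))
```

The asymmetry matters here: variational inference uses $D_{KL}[q || p]$, with the expectation taken under the approximate posterior $q$.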

We can use Bayes' theorem and, since $q_\eta(\theta)$ integrates to one, move the evidence out of the integral

$$
D_{KL}[q_\eta(\theta) || p(\theta|\mathcal{D})] = \log p(\mathcal{D}) + \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta
$$

> The evidence does not depend on $\eta$, so we can ignore it when minimizing with respect to $\eta$

Also note that because the KL divergence is non-negative

$$
\log p(\mathcal{D}) \geq  \mathcal{L}(\eta) = - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta
$$

hence $\mathcal{L}(\eta)$ is called the **Evidence Lower BOund** (ELBO)

> Minimizing the KL is equivalent to maximizing the ELBO
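The bound can be verified on a model where both sides are available in closed form. The sketch below assumes a hypothetical conjugate model, with prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x_n \sim \mathcal{N}(\theta, 1)$ and a Gaussian $q_\eta$ with $\eta = (m, s)$; the model, the data values and the function names are illustrative assumptions.

```python
import math

def elbo(m, s, data):
    """Closed-form ELBO for q = N(m, s^2), prior N(0, 1), likelihood N(x | theta, 1)."""
    log_2pi = math.log(2.0 * math.pi)
    # E_q[log p(D | theta)], using E_q[(x - theta)^2] = (x - m)^2 + s^2
    expected_log_lik = sum(-0.5 * log_2pi - 0.5 * ((x - m) ** 2 + s**2) for x in data)
    # E_q[log p(theta)], using E_q[theta^2] = m^2 + s^2
    expected_log_prior = -0.5 * log_2pi - 0.5 * (m**2 + s**2)
    # Entropy of q, i.e. -E_q[log q(theta)]
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s**2)
    return expected_log_lik + expected_log_prior + entropy

def log_evidence(data):
    """Exact log p(D): marginally D is Gaussian with covariance I + 1 1^T."""
    n = len(data)
    total = sum(data)
    sum_sq = sum(x * x for x in data)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(n + 1.0)
            - 0.5 * (sum_sq - total**2 / (n + 1.0)))

def posterior_params(data):
    """Mean and standard deviation of the exact Gaussian posterior."""
    n = len(data)
    return sum(data) / (n + 1.0), 1.0 / math.sqrt(n + 1.0)

data = [0.8, 1.3, 0.4, 1.9, 1.1]  # illustrative observations
post_mean, post_std = posterior_params(data)
print("log evidence:", log_evidence(data))
print("ELBO at the exact posterior:", elbo(post_mean, post_std, data))
print("ELBO at a poor q:", elbo(0.0, 1.0, data))
```

When $q_\eta$ equals the exact posterior the KL term is zero and the ELBO touches the log evidence; any other $\eta$ gives a strictly smaller value.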

Let's expand the ELBO

$$
\begin{align}
\mathcal{L}(\eta) &= - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta \nonumber \\
&= - \int q_\eta(\theta) \log \frac{q_\eta(\theta)}{ p (\theta)} d\theta + \int q_\eta(\theta) \log p(\mathcal{D}|\theta) d\theta \nonumber \\
&= - D_{KL}[q_\eta(\theta) || p(\theta)] + \mathbb{E}_{\theta \sim q_\eta(\theta)} \left[\log p(\mathcal{D}|\theta)\right]\nonumber 
\end{align}
$$

> Maximizing the ELBO trades off two goals:
- Maximize the expected log likelihood under the approximate posterior: data fit, as in maximum likelihood
- Minimize the KL divergence between the approximate posterior and the prior: regularization
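The two terms can be computed separately, which is how the ELBO is often estimated in practice: the KL term in closed form and the expected log likelihood by sampling from $q_\eta$. The sketch below assumes a hypothetical model with prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(x_n | \theta, 1)$ and $q_\eta = \mathcal{N}(m, s^2)$, and replaces a proper optimizer with a crude grid search; all names and values are illustrative.

```python
import math
import random

def kl_to_standard_normal(m, s):
    """Closed-form KL[N(m, s^2) || N(0, 1)]: the regularization term."""
    return 0.5 * (s**2 + m**2 - 1.0) - math.log(s)

def expected_log_lik_mc(m, s, data, eps):
    """Monte Carlo estimate of E_q[log p(D | theta)] with samples theta = m + s * eps."""
    log_2pi = math.log(2.0 * math.pi)
    total = 0.0
    for e in eps:
        theta = m + s * e
        total += sum(-0.5 * log_2pi - 0.5 * (x - theta) ** 2 for x in data)
    return total / len(eps)

def elbo_mc(m, s, data, eps):
    """ELBO = expected log likelihood - KL[q || prior]."""
    return expected_log_lik_mc(m, s, data, eps) - kl_to_standard_normal(m, s)

data = [0.8, 1.3, 0.4, 1.9, 1.1]  # illustrative observations
rng = random.Random(0)
# The same standard-normal draws are reused for every (m, s) so the estimate is smooth in eta
eps = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Crude grid search over eta = (m, s); a gradient-based optimizer would be used in practice
candidates = [(0.1 * i, 0.1 * j) for i in range(21) for j in range(2, 11)]
best_m, best_s = max(candidates, key=lambda ms: elbo_mc(ms[0], ms[1], data, eps))

n = len(data)
print("grid maximizer (m, s):", best_m, best_s)
print("exact posterior (mean, std):", sum(data) / (n + 1.0), 1.0 / math.sqrt(n + 1.0))
```

Without the KL term the maximizer would collapse toward a point mass at the maximum likelihood estimate; the KL term keeps $q_\eta$ spread out and pulled toward the prior.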
