## **Evidence Lower Bound (ELBO)**

- The **ELBO** is the key objective function in **variational inference** (VI).
- Variational inference is a machine learning method that approximates complex probability distributions, particularly posterior distributions.

### **Derivation of ELBO**

Consider the marginal likelihood $p(x)$, also known as the evidence. It is obtained by integrating the joint density $p(x, z)$ over the latent variable $z$, and it gives the density of the observed data under the model.

1) Write the log evidence as a marginal over the latent variable $z$:
$$
\log p(x) = \log \int p(x, z) \, dz
$$

2) Introduce the approximate posterior $q(z|x)$ by multiplying and dividing the integrand by it, then read the integral as an expectation under $q(z|x)$:
\begin{align*}
\log p(x) &= \log \int \frac{q(z|x)}{q(z|x)} p(x, z) \, dz \\
&= \log \, \mathbb{E}_{q(z|x)} \left [\frac{p(x,z)}{q(z|x)} \right ]
\end{align*}
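
The identity in step 2 can be checked numerically. The sketch below assumes a toy model that is not part of the derivation above: $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, for which the evidence is known to be $\mathcal{N}(0, 2)$. The helper names, the choice of $q$, and the observed value are all illustrative.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def evidence_by_importance_sampling(x, q_mean, q_var, n_samples=200_000, seed=0):
    """Estimate p(x) = E_q[p(x, z) / q(z|x)] for the assumed toy model."""
    rng = np.random.default_rng(seed)
    z = rng.normal(q_mean, np.sqrt(q_var), size=n_samples)  # z ~ q(z|x)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_q = log_normal_pdf(z, q_mean, q_var)
    return np.mean(np.exp(log_joint - log_q))

x_obs = 1.3  # arbitrary observation, chosen for illustration
estimate = evidence_by_importance_sampling(x_obs, q_mean=0.5, q_var=1.0)
exact = np.exp(log_normal_pdf(x_obs, 0.0, 2.0))  # marginal is N(0, 2) here
print(estimate, exact)
```

The two printed values should agree up to Monte Carlo error, whichever valid $q(z|x)$ is used, although a poorly matched $q$ gives a noisier estimate.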

3) The log function is concave, so **Jensen's inequality** applies. For a concave function $f$:
$$
f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]
$$
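
For a concave function such as $\log$, Jensen's inequality says that the log of an average is at least the average of the logs. A quick numerical sanity check, using an arbitrary positive random variable (a Gamma distribution picked only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical positive random variable; any positive distribution would do.
samples = rng.gamma(shape=2.0, scale=1.5, size=100_000)

log_of_mean = np.log(samples.mean())  # f(E[X]) with f = log
mean_of_log = np.log(samples).mean()  # E[f(X)]

print(log_of_mean, mean_of_log)
```

The first number should not be smaller than the second. This holds for the sample averages themselves, not only in the limit.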

4) Applying Jensen's inequality gives:

\begin{align*}
\log p(x) \geq \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x,z)}{q(z|x)} \right]
\end{align*}

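The right-hand side is known as the **evidence lower bound** (ELBO): it is a lower bound on the log marginal likelihood of the data. The gap between the two sides is exactly $\text{KL}(q(z|x) \, \| \, p(z|x))$, so the bound is tight when $q(z|x)$ equals the true posterior. **Maximising the ELBO** therefore pushes up a lower bound on the log marginal likelihood, and maximising that likelihood is the usual goal of generative modelling.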
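
The bound can be verified in closed form on a small example. The sketch below assumes a conjugate toy model that does not appear in the text: $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, with a Gaussian $q(z|x) = \mathcal{N}(m, v)$. For this model the evidence is $\mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. Function names are illustrative.

```python
import numpy as np

def log_evidence(x):
    """Exact log p(x) for the assumed toy model, where x ~ N(0, 2)."""
    return -0.5 * np.log(2.0 * np.pi * 2.0) - x ** 2 / 4.0

def elbo(x, q_mean, q_var):
    """Closed-form ELBO for q(z|x) = N(q_mean, q_var) under the toy model."""
    # E_q[log p(z)] and E_q[log p(x|z)] for unit-variance Gaussians.
    expected_log_prior = -0.5 * np.log(2.0 * np.pi) - 0.5 * (q_mean ** 2 + q_var)
    expected_log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - q_mean) ** 2 + q_var)
    # Entropy of q equals -E_q[log q(z|x)].
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * q_var)
    return expected_log_prior + expected_log_lik + entropy

x_obs = 1.3  # arbitrary observation, chosen for illustration
for q_mean, q_var in [(0.0, 1.0), (1.0, 0.2), (x_obs / 2.0, 0.5)]:
    gap = log_evidence(x_obs) - elbo(x_obs, q_mean, q_var)
    print(f"q = N({q_mean:.2f}, {q_var:.2f}): gap = {gap:.4f}")
```

The gap should be non-negative for every choice of $q$ and should vanish, up to floating-point error, for the last one, which is the exact posterior of this toy model.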

Further decomposition:

\begin{align*}
\log p(x) &\geq \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] \\
          &= \mathbb{E}_{q(z|x)} \left[ \log p(x, z) \right] - \mathbb{E}_{q(z|x)} \left[ \log q(z|x) \right] \\
          &= \mathbb{E}_{q(z|x)} \left[ \log p(x|z) \right] + \mathbb{E}_{q(z|x)} \left[ \log p(z) \right] - \mathbb{E}_{q(z|x)} \left[ \log q(z|x) \right] \\
          &= \mathbb{E}_{q(z|x)} \left[ \log p(x|z) \right] - \text{KL}(q(z|x) \, \| \, p(z)).
\end{align*}

This decomposition shows that maximising the ELBO trades off two terms. The first, $\mathbb{E}_{q(z|x)}[\log p(x|z)]$, rewards reconstruction accuracy, while the second penalises the KL divergence between the approximate posterior $q(z|x)$ and the prior $p(z)$.
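
As a rough check that the joint form and the reconstruction-minus-KL form describe the same quantity, the sketch below evaluates both on an assumed toy model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, Gaussian $q$). The model, the parameter values, and the function names are illustrative and not taken from the text. The KL term uses the standard closed form for a univariate Gaussian against a standard normal.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def elbo_joint_form(x, q_mean, q_var, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)]."""
    rng = np.random.default_rng(seed)
    z = rng.normal(q_mean, np.sqrt(q_var), size=n_samples)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    return np.mean(log_joint - log_normal_pdf(z, q_mean, q_var))

def elbo_reconstruction_minus_kl(x, q_mean, q_var, n_samples=200_000, seed=0):
    """Monte Carlo reconstruction term minus the closed-form KL(q || N(0, 1))."""
    rng = np.random.default_rng(seed)
    z = rng.normal(q_mean, np.sqrt(q_var), size=n_samples)
    reconstruction = np.mean(log_normal_pdf(x, z, 1.0))
    kl_to_prior = 0.5 * (q_var + q_mean ** 2 - 1.0 - np.log(q_var))
    return reconstruction - kl_to_prior

x_obs = 1.3  # arbitrary observation, chosen for illustration
joint_estimate = elbo_joint_form(x_obs, q_mean=0.4, q_var=0.7)
split_estimate = elbo_reconstruction_minus_kl(x_obs, q_mean=0.4, q_var=0.7)
print(joint_estimate, split_estimate)
```

The two estimates should match up to Monte Carlo error. Using the closed-form KL rather than sampling it is a common design choice when both $q(z|x)$ and $p(z)$ are Gaussian, since it removes one source of estimator variance.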