# Variational Inference

In Bayesian inference we are usually interested in the posterior distribution of the latent variables of a model, that is, the distribution over the model's unobserved variables after having seen some data. The posterior can then be used to make predictions about new data. A full distribution is also useful because it quantifies uncertainty instead of giving just a point estimate.

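For some models, or under specific assumptions such as conjugate priors, the posterior can be computed analytically. For most interesting models, however, it is intractable: normalizing the posterior requires the evidence $p(x) = \int p(z, x) dz$, an integral (or sum) over all configurations of the latent variables that usually has no closed form and whose cost can grow exponentially with the number of latent variables.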

Instead, we try different approximation techniques to find something close to the actual posterior. One such technique is variational inference.

In variational inference the inference problem is transformed into an optimization problem. We pick a family of distributions $q(z | \lambda)$, where $z$ are the latent variables and $\lambda$ are the *variational parameters*, and optimize $\lambda$ to make $q$ as close as possible to the true posterior.

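This closeness is measured by a divergence, which in variational inference is most often the *Kullback-Leibler* (KL) divergence, although other divergences can be used. The KL divergence measures the difference between two probability distributions. It is non-negative, zero only when the two distributions are equal (almost everywhere), and not symmetric in its arguments.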

\begin{align*}
KL(q(z)\ ||\ p(z|x)) &= \int q(z) \log \frac{q(z)}{p(z|x)} dz \\
\\
KL(q(z)\ ||\ p(z|x)) &\geq 0 \\
\\
KL(q(z)\ ||\ p(z|x)) &\neq KL(p(z|x)\ ||\ q(z))
\end{align*}
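As a small illustration of these properties, here is a sketch for discrete distributions, where the integral becomes a sum. The function name `kl_divergence` and the two example distributions are assumptions made for this illustration only.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as equal-length lists.

    Terms with q_i == 0 contribute nothing (0 * log 0 is taken to be 0).
    Assumes p_i > 0 wherever q_i > 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Two arbitrary distributions over a binary variable (illustrative values).
q = [0.5, 0.5]
p = [0.9, 0.1]

kl_qp = kl_divergence(q, p)  # KL(q || p)
kl_pq = kl_divergence(p, q)  # KL(p || q), generally a different number
kl_qq = kl_divergence(q, q)  # divergence of a distribution from itself

print(f"KL(q||p) = {kl_qp:.4f}, KL(p||q) = {kl_pq:.4f}, KL(q||q) = {kl_qq:.4f}")
```

Both divergences come out positive but unequal, and the divergence of a distribution from itself is zero.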

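In variational inference we use $KL(q(z)\ ||\ p(z|x))$. The opposite order, $KL(p(z|x)\ ||\ q(z))$, is used in *expectation propagation* and behaves differently. Minimizing $KL(q\ ||\ p)$ penalizes $q$ for placing mass where the posterior has little, so $q$ tends to lock onto a single mode and underestimate the variance. Minimizing $KL(p\ ||\ q)$ penalizes $q$ for missing mass of the posterior, so $q$ tends to be broader and to cover all modes.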

The KL divergence cannot be computed directly either, since it contains the posterior, which is what we wanted in the first place. Instead we maximize a lower bound on the log evidence, which is described below.

## TODO
mean field variational inference and other types

Laplace

Jensen's inequality

conjugate distributions

connection to EM

## Evidence Lower Bound (ELBO)
The evidence lower bound is the central idea in variational inference: it is the objective that turns inference into an optimization problem.

Here $z$ are the latent variables and $x$ are the observed variables. The variational parameters are left out for brevity but would otherwise be present in $q$.

\begin{align*}
KL(q(z)\ ||\ p(z|x)) &= \int q(z) \log \frac{q(z)}{p(z|x)} dz \\
&= \int q(z) \log \frac{q(z)}{\frac{p(z, x)}{p(x)}} dz \\
&= \int q(z) \log \frac{q(z)p(x)}{p(z, x)} dz \\
&= \int q(z) \left( \log \frac{q(z)}{p(z, x)} + \log p(x) \right) dz \\
&= \int q(z) \log \frac{q(z)}{p(z, x)} dz + \log p(x)\underbrace{\int q(z) dz}_{=\ 1} && \text{$\log p(x)$ does not depend on $z$} \\
&= \int q(z) \log \frac{q(z)}{p(z, x)} dz + \log p(x) \\
\\
\log p(x) &= KL(q(z)\ ||\ p(z|x)) - \int q(z) \log \frac{q(z)}{p(z, x)} dz \\
&= KL(q(z)\ ||\ p(z|x)) - \int q(z) (-1) \log \frac{p(z, x)}{q(z)} dz && \text{since $\log \frac{a}{b} = -\log \frac{b}{a}$} \\
&= KL(q(z)\ ||\ p(z|x)) + \underbrace{\int q(z) \log \frac{p(z, x)}{q(z)} dz}_{\text{lower bound $\mathcal{L}(q)$}} \\
\\
\mathcal{L}(q) &= \int q(z) \left( \log p(z, x) - \log q(z) \right) dz \\
&= \underbrace{\int q(z) \log p(z, x) dz}_{\mathbb{E}_{q(z)} \left[ \log p(z, x) \right]} \underbrace{- \int q(z) \log q(z) dz}_{\text{entropy $H(q)$}}
\end{align*}

The bound is written $\mathcal{L}(q)$ because it depends on the distribution $q$ (through its variational parameters), not on a particular value of $z$, which is integrated out. Since $KL(q(z)\ ||\ p(z|x)) \geq 0$, we get $\log p(x) \geq \mathcal{L}(q)$, which is why $\mathcal{L}(q)$ is called the evidence lower bound.

Because $\log p(x)$ does not depend on $q$, maximizing $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence. Both terms of $\mathcal{L}(q)$ matter: the expected log joint $\mathbb{E}_{q(z)} \left[ \log p(z, x) \right]$ pulls $q$ towards values of $z$ that explain the data well, while the entropy $H(q)$ keeps $q$ from collapsing onto a single point.
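The decomposition of $\log p(x)$ into a KL term plus the lower bound can be checked numerically. The sketch below uses a hypothetical model with a single binary latent variable, so the integrals become sums. The joint probabilities and the helper names `kl` and `elbo` are assumptions made for this illustration only.

```python
import math

# Hypothetical toy model: one binary latent z and one fixed observation x.
# joint[z] holds p(z, x) for the observed x (illustrative values).
joint = [0.3, 0.1]
evidence = sum(joint)                       # p(x) = sum_z p(z, x)
posterior = [p / evidence for p in joint]   # p(z | x) = p(z, x) / p(x)

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo(q, joint):
    """Lower bound: sum_z q(z) * log(p(z, x) / q(z))."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)

q = [0.5, 0.5]                              # an arbitrary variational distribution
log_evidence = math.log(evidence)
gap = log_evidence - elbo(q, joint)         # should equal KL(q || posterior)

print(f"log p(x) = {log_evidence:.4f}")
print(f"ELBO     = {elbo(q, joint):.4f}")
print(f"gap      = {gap:.4f}, KL = {kl(q, posterior):.4f}")
```

The gap between the log evidence and the bound equals the KL divergence, and it closes completely when $q$ is set to the exact posterior.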

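Thus, for a new model we first derive the ELBO and then differentiate it with respect to each variational parameter. When the derivatives can be set to zero and solved, this gives closed-form update equations. Otherwise the gradients can be used in a gradient-based optimizer.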
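To make this recipe concrete, here is a minimal sketch for a toy model that is an assumption of this example, not something derived in these notes: $z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, 1)$ with a single observation, and a Gaussian $q(z) = \mathcal{N}(m, s^2)$. For this model the ELBO has a closed form, the exact posterior is $\mathcal{N}(x/2, 1/2)$, and plain gradient ascent on $(m, \log s)$ should recover it. All function names are hypothetical.

```python
import math

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), one observation x.
# Variational family: q(z) = N(m, s^2), with s = exp(rho) so that s > 0.

def elbo(m, rho, x):
    """Closed-form ELBO: E_q[log p(x|z)] + E_q[log p(z)] + H(q)."""
    s2 = math.exp(2.0 * rho)
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return expected_log_lik + expected_log_prior + entropy

def elbo_grad(m, rho, x):
    """Derivatives of the ELBO with respect to m and rho = log s."""
    s2 = math.exp(2.0 * rho)
    d_m = x - 2.0 * m          # from -0.5 * (x - m)^2 - 0.5 * m^2
    d_rho = 1.0 - 2.0 * s2     # from -exp(2 * rho) + rho
    return d_m, d_rho

def fit(x, lr=0.1, steps=500):
    """Gradient ascent on the ELBO starting from q = N(0, 1)."""
    m, rho = 0.0, 0.0
    for _ in range(steps):
        d_m, d_rho = elbo_grad(m, rho, x)
        m += lr * d_m
        rho += lr * d_rho
    return m, rho

x_obs = 1.2
m_hat, rho_hat = fit(x_obs)
s2_hat = math.exp(2.0 * rho_hat)

# Exact log evidence for comparison: marginally x ~ N(0, 2).
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x_obs ** 2 / 4.0

print(f"m = {m_hat:.4f}, s^2 = {s2_hat:.4f}")
print(f"ELBO = {elbo(m_hat, rho_hat, x_obs):.4f}, log p(x) = {log_evidence:.4f}")
```

Because the exact posterior lies inside the chosen family here, the optimized ELBO matches $\log p(x)$. In general the family does not contain the posterior and a gap remains.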


## Choosing the variational distribution q(z)
In practice we have to choose a family of variational distributions for $q$ that is simple enough that the expectations we need can be computed, yet expressive enough to approximate the posterior well. We then maximize the ELBO by optimizing the variational parameters of $q$.

### Mean field variational inference
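In mean field variational inference we assume that $q$ factorizes so that each latent variable is independent of the others under $q$. This is a strong assumption, since in most models the latent variables are dependent under the true posterior, and a factorized $q$ cannot capture those dependencies. In return, the expectations needed for the ELBO become much easier to compute.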

TODO can group variables though

\begin{equation*}
q(z_1, \dotsc, z_n) = q_1(z_1) \dotsc q_n(z_n)
\end{equation*}

We then optimize this by TODO (coordinate ascent / natural gradient)
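One common way to optimize a mean field factorization is coordinate ascent, where each factor is updated in turn while the others are held fixed. The standard result is that the optimal factor satisfies $q_j^*(z_j) \propto \exp\left(\mathbb{E}_{-j}\left[\log p(z, x)\right]\right)$, where the expectation is over all factors except $q_j$. The sketch below applies this to an assumed target that is not taken from these notes: a bivariate Gaussian with unit variances and correlation `rho`, standing in for a posterior. For a Gaussian target each optimal factor is itself Gaussian, with fixed variance $1 / \Lambda_{jj}$ (where $\Lambda$ is the precision matrix) and a mean that depends on the other factor's current mean. The function name is hypothetical.

```python
def cavi_bivariate_gaussian(mu, rho, sweeps=200):
    """Coordinate ascent for q(z1) q(z2) against a bivariate Gaussian target.

    The target has mean `mu`, unit variances and correlation `rho`
    (an illustrative stand-in for a posterior).
    Returns the factor means and the factor variances.
    """
    # Entries of the precision matrix of the covariance [[1, rho], [rho, 1]].
    lam_11 = lam_22 = 1.0 / (1.0 - rho ** 2)
    lam_12 = -rho / (1.0 - rho ** 2)

    m1, m2 = 0.0, 0.0  # initial factor means
    for _ in range(sweeps):
        # Update q1 with q2 fixed, then q2 with the new q1 fixed.
        m1 = mu[0] - (lam_12 / lam_11) * (m2 - mu[1])
        m2 = mu[1] - (lam_12 / lam_22) * (m1 - mu[0])

    # The factor variances do not depend on the means.
    return (m1, m2), (1.0 / lam_11, 1.0 / lam_22)

means, variances = cavi_bivariate_gaussian(mu=(1.0, -2.0), rho=0.9)
print("factor means:", means)
print("factor variances:", variances, "(true marginal variances are 1.0)")
```

In this example the factor means converge to the true means, but the factor variances are smaller than the true marginal variances. This shows the cost of ignoring the dependence between the latent variables.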

## Example derivation for simple model
TODO

## Important extensions to variational inference

### Stochastic Variational Inference

### Structured Variational Inference

### Blackbox Variational Inference