# Variational Bayesian Methods 

**Definition** Let $P, Q$ be probability measures over a set $\mathcal{X}$. If $Q$ is absolutely continuous with respect to $P$, we define the Kullback-Leibler divergence from $P$ to $Q$ as:

$$ 
D_{KL}\big (Q \parallel P \big) = \int_{\mathcal{X}} \mathrm{log} \bigg ( \frac{dQ}{dP} \bigg) dQ
$$

where $\frac{dQ}{dP}$ is the Radon-Nikodym derivative of $Q$ with respect to $P$, provided the right hand side exists. 
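
For intuition, the definition can be sketched for discrete distributions, where (taking the counting measure as reference) the Radon-Nikodym derivative reduces to the ratio of probabilities $q_i / p_i$. The function name `kl_divergence` and the example numbers are illustrative assumptions, not part of the definition above.

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # a point with Q-mass zero contributes nothing to the integral
        if pi == 0.0:
            # Q puts mass where P has none: Q is not absolutely continuous w.r.t. P
            raise ValueError("Q is not absolutely continuous with respect to P")
        total += qi * math.log(qi / pi)
    return total

# Hypothetical example distributions on a three-point set.
q = [0.5, 0.5, 0.0]
p = [0.25, 0.25, 0.5]

kl_qp = kl_divergence(q, p)  # equals log(2), since q_i / p_i = 2 wherever q_i > 0
```

Swapping the arguments, `kl_divergence(p, q)`, raises an error in this sketch because `p` puts mass on the third point while `q` does not, which illustrates both the absolute continuity requirement and the asymmetry of the divergence.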

**Problem setup** 

**Remark** When we talk about, for example, a distribution $P_{X}(x)$, we mean a distribution that has density $P_{X}(x)$ with respect to some reference measure that we leave unspecified.

In variational methods we typically have an observed value $x^{obs}$ of a random variable $X$ and latent (unobserved) random variables $Z$. We know the prior distribution $P_{Z}(z)$ and the likelihood $P_{X \mid Z}(x \mid z)$, and we want to estimate the posterior distribution $P_{Z \mid X}(z \mid x)$. Bayes' rule gives us

\begin{equation}
P_{Z \mid X}(z \mid x) = \frac{P_{X \mid Z}(x \mid z) P_{Z}(z)}{\int P_{X \mid Z}(x \mid z) P_{Z}(z)dz}
\label{eq:bayes_rule}
\tag{1}
\end{equation}
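
As a small illustration of Bayes' rule, consider a latent variable that takes only two values, so the integral in the denominator becomes a sum. The prior and likelihood values below, and the helper name `posterior`, are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior(prior, likelihood_at_x):
    """Posterior P(z | x_obs) for a discrete latent z, via Bayes' rule.

    prior[i]           = P_Z(z_i)
    likelihood_at_x[i] = P_{X|Z}(x_obs | z_i)
    """
    joint = [l * p for l, p in zip(likelihood_at_x, prior)]  # numerator of Bayes' rule
    evidence = sum(joint)                                    # denominator, P_X(x_obs)
    return [j / evidence for j in joint], evidence

# Hypothetical two-state latent variable.
prior = [0.5, 0.5]
likelihood = [0.2, 0.6]

post, evidence = posterior(prior, likelihood)
# post is approximately [0.25, 0.75] and evidence is approximately 0.4
```

In this toy case the denominator is a two-term sum. With continuous or high-dimensional $Z$ the same quantity is an integral over all of $z$, which is the part that is typically expensive.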

There are several ways to compute or sample from this posterior (for example, Gibbs sampling), but the integral in the denominator is often intractable. Variational methods instead set aside direct evaluation of \eqref{eq:bayes_rule} and approximate the posterior $P_{Z\mid X}(z \mid x)$ by a distribution $Q_{\eta}(z)$ from a sufficiently rich parametric family of distributions (parametrized by $\eta$).

**Variational Bayes Solution**

To make the approximation good, we minimize the Kullback-Leibler divergence $D_{KL}$ from the true posterior $P_{Z \mid X}(z \mid x^{obs})$ to $Q_{\eta}(z)$:

\begin{equation}
D_{KL}\big( Q_{\eta}(z) \parallel P_{Z \mid X}(z \mid x^{obs}) \big) = \int Q_{\eta}(z) \mathrm{log} \bigg( \frac{Q_{\eta}(z)}{P_{Z \mid X}(z \mid x^{obs})} \bigg) dz
\label{eq:variational_KL}
\tag{2}
\end{equation}

Since $P_{Z \mid X}(z \mid x^{obs}) = P_{Z, X}(z, x^{obs}) / P_{X}(x^{obs})$, we can rewrite the right-hand side of \eqref{eq:variational_KL} as

\begin{equation}
\int Q_{\eta}(z) \Bigg [ \mathrm{log} \bigg( \frac{Q_{\eta}(z)}{P_{Z, X}(z, x^{obs})} \bigg) + \mathrm{log}\big( P_{X}(x^{obs})\big) \Bigg ] dz
\label{eq:trick_KL}
\tag{3}
\end{equation}

Let's further denote 

\begin{equation}
\mathcal{L}(Q_{\eta}) = - \int Q_{\eta}(z) \mathrm{log} \bigg( \frac{Q_{\eta}(z)}{P_{Z, X}(z, x^{obs})} \bigg) dz
\end{equation}

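Since $Q_{\eta}$ integrates to one, the term $\mathrm{log}\big( P_{X}(x^{obs})\big)$ in \eqref{eq:trick_KL} factors out of the integral, and we obtain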

\begin{equation}
D_{KL}\big( Q_{\eta}(z) \parallel P_{Z \mid X}(z \mid x^{obs}) \big) = \mathrm{log} \big ( P_{X}(x^{obs}) \big ) - \mathcal{L}(Q_{\eta})
\label{eq:KL_to_likelihood_and_L}
\tag{4}
\end{equation}
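
The identity \eqref{eq:KL_to_likelihood_and_L} can be checked numerically. The sketch below assumes a hypothetical two-state latent variable, where every integral becomes a sum, and an arbitrary candidate distribution `q`. All numbers are illustrative.

```python
import math

# Hypothetical two-state model: prior P_Z(z) and likelihood P_{X|Z}(x_obs | z).
prior = [0.5, 0.5]
likelihood = [0.2, 0.6]

joint = [l * p for l, p in zip(likelihood, prior)]  # P_{Z,X}(z, x_obs)
evidence = sum(joint)                               # P_X(x_obs)
post = [j / evidence for j in joint]                # P_{Z|X}(z | x_obs)

# An arbitrary approximating distribution Q(z); any strictly positive choice works.
q = [0.4, 0.6]

# Left-hand side: D_KL(Q || posterior).
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, post))

# Lower bound L(Q) = -sum_z Q(z) log(Q(z) / P_{Z,X}(z, x_obs)).
lower_bound = -sum(qi * math.log(qi / ji) for qi, ji in zip(q, joint))

# Right-hand side: log evidence minus L(Q).
rhs = math.log(evidence) - lower_bound
```

Up to floating-point error, `kl` and `rhs` agree. Note that `lower_bound` is computed from the joint alone, without ever forming the posterior.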

Notice that minimizing $D_{KL}$ is equivalent to maximizing $\mathcal{L}$, since $\mathrm{log} \big ( P_{X}(x^{obs}) \big )$ (sometimes called the **log evidence**) does not depend on $Q_{\eta}$. Because $D_{KL} \geq 0$, \eqref{eq:KL_to_likelihood_and_L} also gives $\mathcal{L}(Q_{\eta}) \leq \mathrm{log} \big ( P_{X}(x^{obs}) \big )$, which is why $\mathcal{L}$ is sometimes called the **(variational) lower bound** on the log marginal likelihood or the **(negative) variational free energy**. With an appropriate choice of the parametric family $Q_{\eta}$, the variational lower bound $\mathcal{L}$ becomes tractable and we are able to maximize it.
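
To make the maximization concrete, here is a minimal sketch under assumptions that are not part of the text above: a toy model with prior $Z \sim N(0, 1)$ and likelihood $X \mid Z = z \sim N(z, 1)$, and a Gaussian family $Q_{\eta} = N(m, s^2)$ with $\eta = (m, s)$. For this choice $\mathcal{L}$ has a closed form, so plain gradient ascent is enough. The observation `x_obs`, the step size, and the iteration count are arbitrary illustrative values.

```python
import math

x_obs = 1.0  # hypothetical observed value

def lower_bound(m, s):
    """Closed-form L(Q) for Q = N(m, s^2) under the assumed toy model."""
    # E_Q[log P_{X|Z}(x_obs | z)] with X | Z = z ~ N(z, 1)
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x_obs - m) ** 2 + s ** 2)
    # E_Q[log P_Z(z)] with Z ~ N(0, 1)
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    # Entropy of Q, i.e. -E_Q[log Q(z)]
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)
    return expected_log_lik + expected_log_prior + entropy

# Gradient ascent on eta = (m, s), using the analytic gradients of lower_bound.
m, s = 0.0, 1.0
step = 0.1
for _ in range(200):
    grad_m = x_obs - 2.0 * m      # d L / d m
    grad_s = 1.0 / s - 2.0 * s    # d L / d s
    m += step * grad_m
    s += step * grad_s

# For this model the marginal is X ~ N(0, 2), so the log evidence is known exactly.
log_evidence = -0.5 * math.log(4 * math.pi) - x_obs ** 2 / 4
```

In this toy model the exact posterior is $N(x^{obs}/2, 1/2)$, which lies inside the chosen family, so the iterates approach $m = 0.5$ and $s = \sqrt{0.5}$ and `lower_bound(m, s)` approaches `log_evidence`. When the family does not contain the posterior, a gap equal to the remaining $D_{KL}$ persists.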