# Chapter 7: Variational methods

## 1. Modeling interactions using hidden variables

Complex interactions among observed variables can be modeled by introducing hidden (latent) variables. Marginalizing (integrating) over the hidden variables induces correlations among the remaining observed variables. In clustering, the observed variables are the data points and the latent variables are their class assignments. In topic modeling, the observed variables are the words in the documents, and the latent variables are the topic assignment of each word and the topic distribution of each document.
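As a minimal sketch of how marginalization induces correlations, consider a hypothetical toy model (not taken from any dataset): a binary latent class $z$ and two binary observations $x_1, x_2$ that are independent given $z$. All probabilities below are made-up illustration values.

```python
import itertools

# Hypothetical toy model: latent class z in {0, 1} and two binary observations
# x1, x2 that are conditionally independent given z.
p_z = [0.5, 0.5]
p_on_given_z = [0.9, 0.1]  # P(x_k = 1 | z) for z = 0 and z = 1 (same for k = 1, 2)

def bernoulli(x, p):
    return p if x == 1 else 1.0 - p

def joint(x1, x2, z):
    """p(x1, x2, z) = p(z) p(x1 | z) p(x2 | z)."""
    return p_z[z] * bernoulli(x1, p_on_given_z[z]) * bernoulli(x2, p_on_given_z[z])

# Marginalize the latent variable: p(x1, x2) = sum_z p(x1, x2, z).
marginal = {
    (x1, x2): sum(joint(x1, x2, z) for z in (0, 1))
    for x1, x2 in itertools.product((0, 1), repeat=2)
}

p_x1 = marginal[(1, 0)] + marginal[(1, 1)]   # P(x1 = 1)
p_x2 = marginal[(0, 1)] + marginal[(1, 1)]   # P(x2 = 1)
covariance = marginal[(1, 1)] - p_x1 * p_x2  # Cov(x1, x2) under the marginal

print(f"P(x1=1, x2=1) = {marginal[(1, 1)]:.3f}")
print(f"P(x1=1) P(x2=1) = {p_x1 * p_x2:.3f}")
print(f"Cov(x1, x2) = {covariance:.3f}")
```

Given $z$ the two observations are independent, yet after summing over $z$ their covariance is nonzero: the latent variable alone carries the correlation.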


## 2. Expectation Maximization algorithm

### 2.1 Mathematical formulation

Assume the unobserved variables $z$ and the observed variables $x$ follow the joint distribution $p(z, x \mid \theta)$, and that we approximate the posterior over $z$ with a variational distribution $q(z\mid x)$. The variational free energy $F_q(\theta, x)$ can be written as

\begin{align}
F_q(\theta, x) &= -\langle \log p(z, x \mid \theta)\rangle_q - H_q \\
&= -\int dz q(z\mid x) \log p(z, x \mid \theta) + \int dz q(z\mid x)\log q(z\mid x)
\end{align}

where the negative log joint probability plays the role of the energy and $H_q = -\int dz\, q(z\mid x)\log q(z\mid x)$ is the entropy of $q$. The entropy term rewards distributions that keep enough uncertainty in $z$: minimizing $F_q$ trades off fitting the joint probability against keeping $q$ spread out. The (true) free energy $F_p(\theta, x)$ can be written as

\begin{equation}
F_p(\theta, x) = -\log p(x \mid \theta)
\end{equation}

There is no entropy term since there is no variational distribution. In the language of spin models, $x$ plays the role of the quenched disorder $J$ and $z$ plays the role of the spins. Note that the term free energy is used loosely here: despite the resemblance to the free energy of statistical physics, the two are not strictly the same.

It can be shown that, for a given parameter value $\theta^{(t-1)}$, the variational distribution $q(z\mid x)$ that minimizes $F_q(\theta^{(t-1)}, x)$ has the form

\begin{align}
q(z\mid x) = \arg\min_q F_q(\theta^{(t-1)}, x) = p(z \mid x, \theta^{(t-1)}) \label{E_step}\tag{Eq 2.1.1}
\end{align}

where $p(z \mid x, \theta) = p(z,x\mid \theta)/p(x\mid \theta)$ by Bayes' theorem. The above equation is the Expectation step of the EM algorithm. Once $q(z\mid x)$ has been updated, $\theta$ can be updated with $q$ held fixed,

\begin{align}
\theta^{(t)} = \arg\min_\theta F_q(\theta, x)
\end{align}

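which is the Maximization step of EM: with $q$ fixed, the entropy term does not depend on $\theta$, so minimizing $F_q$ is the same as maximizing the expected log joint probability $\langle \log p(z, x \mid \theta)\rangle_q$.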
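The two steps can be sketched on a hypothetical example: a two-component 1-D Gaussian mixture fitted to synthetic data. The model, the data, the initial values, and all function names below are assumptions made for illustration. The E-step sets $q$ to the posterior as in Eq 2.1.1, the M-step uses the closed-form minimizer of $F_q$ for a Gaussian mixture, and $F_q$ is recorded after each iteration.

```python
import math
import random

def log_gauss(x, mu, var):
    return -0.5 * (x - mu) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)

def e_step(xs, theta):
    """q(z | x_n) = p(z | x_n, theta): posterior responsibilities (Eq 2.1.1)."""
    pis, mus, variances = theta
    q = []
    for x in xs:
        w = [pis[k] * math.exp(log_gauss(x, mus[k], variances[k])) for k in range(2)]
        total = sum(w)
        q.append([wk / total for wk in w])
    return q

def m_step(xs, q):
    """theta = argmin_theta F_q(theta, x) for fixed q (closed form for this model)."""
    pis, mus, variances = [], [], []
    for k in range(2):
        nk = sum(qn[k] for qn in q)
        mu = sum(qn[k] * x for qn, x in zip(q, xs)) / nk
        var = sum(qn[k] * (x - mu) ** 2 for qn, x in zip(q, xs)) / nk
        pis.append(nk / len(xs))
        mus.append(mu)
        variances.append(max(var, 1e-6))  # safety floor, not reached on this data
    return pis, mus, variances

def free_energy(xs, q, theta):
    """F_q = -<log p(z, x | theta)>_q + <log q>_q, summed over data points."""
    pis, mus, variances = theta
    f = 0.0
    for x, qn in zip(xs, q):
        for k in range(2):
            if qn[k] > 0.0:
                log_joint = math.log(pis[k]) + log_gauss(x, mus[k], variances[k])
                f += qn[k] * (math.log(qn[k]) - log_joint)
    return f

# Synthetic data from two well-separated clusters (hypothetical values).
random.seed(0)
xs = [random.gauss(-2.0, 1.0) for _ in range(100)]
xs += [random.gauss(3.0, 1.0) for _ in range(100)]

theta = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # initial (weights, means, variances)
history = []
for t in range(30):
    q = e_step(xs, theta)      # Expectation step
    theta = m_step(xs, q)      # Maximization step
    history.append(free_energy(xs, q, theta))

print("fitted means:", [round(mu, 2) for mu in theta[1]])
print("F_q at first and last iteration:", round(history[0], 3), round(history[-1], 3))
```

Because each step minimizes $F_q$ over one argument with the other held fixed, the recorded values in `history` should never increase from one iteration to the next.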

Even though the variational distribution $q$ in \ref{E_step} is set equal to a conditional distribution of the model $p$, it is not the 'true' posterior, because it is evaluated at $\theta^{(t-1)}$ rather than at the optimal $\theta^*$. In that sense, $q$ is only an approximation of $p(z\mid x, \theta^*)$ whenever $\theta^{(t-1)} \neq \theta^*$. Strictly speaking, this example does not involve a genuinely variational distribution, since $q$ has the same functional form as the model posterior. As we will see in the spin glass and LDA examples, the variational distribution can also have a very different functional form from the true distribution.

The difference between $F_q(\theta, x)$ and $F_p(\theta, x)$ is given by the KL divergence

\begin{equation}
D_{KL}(q(z\mid x) \mid\mid p(z\mid x, \theta)) = F_q(\theta, x) - F_p(\theta, x) \label{KL_free_energy}\tag{Eq 2.1.2}
\end{equation}
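This identity follows in a few lines from the definitions of $F_q$ and $F_p$, using only the factorization $p(z, x\mid\theta) = p(z\mid x,\theta)\,p(x\mid\theta)$ and the normalization $\int dz\, q(z\mid x) = 1$:

\begin{align}
F_q(\theta, x) &= \int dz\, q(z\mid x) \log\frac{q(z\mid x)}{p(z, x\mid\theta)} \\
&= \int dz\, q(z\mid x) \log\frac{q(z\mid x)}{p(z\mid x, \theta)} - \log p(x\mid\theta)\int dz\, q(z\mid x) \\
&= D_{KL}(q(z\mid x) \mid\mid p(z\mid x, \theta)) + F_p(\theta, x)
\end{align}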

Since the KL divergence is non-negative, the variational free energy $F_q$ is always an upper bound on the true free energy $F_p$. For fixed $\theta$, $F_p$ does not depend on $q$, so minimizing $F_q$ over $q$ is equivalent to minimizing the KL divergence, which brings $F_q$ closer to the true free energy. Note that computing the KL divergence directly requires the marginal distribution $p(x\mid \theta)$ (through Bayes' rule $p(z\mid x,\theta)=p(z,x\mid \theta)/p(x\mid \theta)$), which is in general intractable. Minimizing $F_q$ instead only involves the joint distribution $p(x,z\mid \theta)$, which requires no marginalization and is thus tractable.
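Eq 2.1.2 and the resulting bound can be checked numerically. The sketch below assumes a hypothetical discrete latent variable with four states and one fixed observation $x$; the joint probabilities and the distribution $q$ are random illustration values, not derived from any model in this chapter.

```python
import math
import random

random.seed(1)
K = 4  # number of latent states (hypothetical toy size)

# Toy joint p(z, x | theta) for one fixed observation x: a positive vector
# over z whose sum is the evidence p(x | theta), here scaled to 0.3.
raw = [random.random() for _ in range(K)]
joint = [0.3 * r / sum(raw) for r in raw]
evidence = sum(joint)                       # p(x | theta)
posterior = [j / evidence for j in joint]   # p(z | x, theta)

# An arbitrary variational distribution q(z | x).
w = [random.random() for _ in range(K)]
q = [wi / sum(w) for wi in w]

def variational_free_energy(q, joint):
    """F_q = -<log p(z, x | theta)>_q + <log q>_q; needs only the joint."""
    return sum(qz * (math.log(qz) - math.log(jz)) for qz, jz in zip(q, joint))

def kl(q, p):
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

F_q = variational_free_energy(q, joint)
F_p = -math.log(evidence)

print(f"F_q - F_p = {F_q - F_p:.6f}")
print(f"KL(q || posterior) = {kl(q, posterior):.6f}")
print(f"F_q at q = posterior: {variational_free_energy(posterior, joint):.6f}, F_p: {F_p:.6f}")
```

Note that `variational_free_energy` never touches `evidence`, which mirrors the tractability argument: the bound can be evaluated and minimized from the joint alone.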

### 2.2 Example 1: Mean-Field approximation in spin glass model

The probability distribution of a spin configuration $s$ in a spin glass model is

\begin{align}
p(s)&=\exp(-\beta E(s))/Z, \ \ \text{where}  \\
Z&=\text{Tr}_{\{s\}}\exp(-\beta E(s)) \\
E(s) &= -\sum_{i<j}J_{ij}s_i s_j - \sum_{i}h_i s_i
\end{align}

with $s_i \in \{-1, +1\}$ and symmetric couplings $J_{ij} = J_{ji}$. $Z$ is in general computationally intractable (the trace runs over $2^N$ configurations), so $p(s)$ is difficult to compute. The purpose of variational inference is to approximate $p(s)$ with a simpler, factorized *variational* distribution $q(s,\theta)$:

$q(s,\theta)=\exp(\sum_i s_i\theta_i)/Z_q$, with $Z_q = \prod_i 2\cosh\theta_i$.

The variational distribution assumes there is no correlation among the $s_i$, so it factorizes over spins. This is the assumption of mean-field theory (MFT). The parameters $\theta$ are found by minimizing the gap between the free energy under $q$, $F_q(\theta) = \beta\langle E\rangle_q - H_q$, and the free energy under $p$, $F_p(\beta) = -\ln Z$. The two free energies are related via the KL divergence:

$F_q(\theta) = F_p(\beta) + D_{KL}(q\mid\mid p)$

Since $F_p$ does not depend on $\theta$, minimizing $F_q$ is equivalent to minimizing the KL divergence. The EM algorithm results in the following pair of self-consistent equations for $\theta_i$ and the mean magnetizations $m_i = \langle s_i \rangle_q$:

\begin{align}
&\theta_i=\beta \Big(\sum_{j\neq i} J_{ij}m_j + h_i\Big) \\
&m_i=\tanh(\theta_i)
\end{align}

The first equation is the *maximization* step and the second the *expectation* step. The two equations are iterated asynchronously (one spin at a time) until convergence.
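A minimal sketch of the asynchronous mean-field iteration is given below, checked against brute-force enumeration on a system small enough to sum over all configurations. It assumes the ferromagnetic sign convention $E(s) = -\sum_{i<j}J_{ij}s_is_j - \sum_i h_i s_i$ with $s_i = \pm 1$, symmetric $J$, and dimensionless free energies $F_q = \beta\langle E\rangle_q - H_q$ and $F_p = -\ln Z$. The system size, temperature, couplings, and fields are hypothetical illustration values.

```python
import itertools
import math
import random

random.seed(0)
N = 6        # small enough for exact enumeration of 2**N states
beta = 0.3   # inverse temperature (hypothetical, high-temperature regime)

# Hypothetical disorder: symmetric Gaussian couplings, zero diagonal, random fields.
J = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        J[i][j] = J[j][i] = random.gauss(0.0, 1.0) / math.sqrt(N)
h = [random.gauss(0.0, 0.5) for _ in range(N)]

def energy(s):
    """E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i (assumed sign convention)."""
    pair = sum(J[i][j] * s[i] * s[j] for i in range(N) for j in range(i + 1, N))
    return -pair - sum(h[i] * s[i] for i in range(N))

def local_field(m, i):
    """theta_i = beta * (sum_j J_ij m_j + h_i); J_ii = 0, so j = i drops out."""
    return beta * (sum(J[i][j] * m[j] for j in range(N)) + h[i])

def mean_field(n_sweeps=500, tol=1e-12):
    m = [0.0] * N
    for _ in range(n_sweeps):
        biggest_change = 0.0
        for i in range(N):  # asynchronous: each update sees the latest m_j
            new_m = math.tanh(local_field(m, i))
            biggest_change = max(biggest_change, abs(new_m - m[i]))
            m[i] = new_m
        if biggest_change < tol:
            break
    return m

def variational_free_energy(m):
    """F_q = beta <E>_q - H_q; for a factorized q, <E>_q = E(m)."""
    entropy = 0.0
    for mi in m:
        for p in ((1.0 + mi) / 2.0, (1.0 - mi) / 2.0):
            if p > 0.0:
                entropy -= p * math.log(p)
    return beta * energy(m) - entropy

# Exact reference by brute force over all 2**N configurations.
states = list(itertools.product((-1, 1), repeat=N))
weights = [math.exp(-beta * energy(s)) for s in states]
Z = sum(weights)
exact_m = [sum(w * s[i] for w, s in zip(weights, states)) / Z for i in range(N)]

m = mean_field()
F_q = variational_free_energy(m)
F_p = -math.log(Z)
print("mean-field m:", [round(x, 3) for x in m])
print("exact m:     ", [round(x, 3) for x in exact_m])
print(f"F_q = {F_q:.4f}, F_p = {F_p:.4f}")
```

The mean-field magnetizations only approximate the exact ones, since the factorized $q$ ignores correlations between spins, but $F_q$ stays above $F_p$ as the KL relation requires.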

### 2.3 Example 2: LDA


#### Inference problem

The inference problem in LDA is to compute the posterior distribution of the topic assignments $z$ (one topic $z_i$ per word $w_i$) and the topic proportions $\theta$, given the words $w$:

\begin{align}
p(z,\theta \mid w,\alpha, \beta) = \frac{p(z, w, \theta \mid\alpha, \beta)}{p(w\mid\alpha, \beta)}
\end{align}

The evidence $p(w \mid \alpha, \beta)$ is obtained by marginalizing over the hidden variables $\theta$ and $z$, and it is intractable due to the coupling between $\beta$ and $\theta$. The problem becomes tractable if we assume a variational distribution that removes this coupling. Instead of calculating $p(z,\theta \mid w,\alpha, \beta)$, we calculate the following factorized variational distribution, with free parameters $\gamma$ and $\phi$:

\begin{align}
q(z,\theta \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{j=1}^N q(z_j \mid \phi_j)
\end{align}

$\gamma$ and $\phi$ are found by minimizing the KL divergence between $q(z,\theta \mid \gamma, \phi)$ and $p(z,\theta \mid w,\alpha, \beta)$ using the EM algorithm. As discussed in Section 2.1, evaluating the KL divergence directly is intractable. However, minimizing it is equivalent to minimizing $F_q$

\begin{equation}
F_q = -\langle \log p(z, w, \theta \mid \alpha, \beta)\rangle_q + \langle \log q(z, \theta \mid \gamma, \phi)\rangle_q
\end{equation}

which only involves the joint probability $p(z, w, \theta \mid \alpha, \beta)$; no marginalization is required, so it is tractable.
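The chapter does not derive the update equations for $\gamma$ and $\phi$. As a hedged sketch, the code below uses the coordinate updates commonly quoted from the original LDA paper (Blei, Ng and Jordan, 2003), $\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i))$ and $\gamma_i = \alpha_i + \sum_n \phi_{ni}$, where $\Psi$ is the digamma function. The toy vocabulary, the topic-word matrix, the document, and the helper names are all hypothetical, and the digamma helper is a simple series approximation written here to keep the example dependency-free.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then the standard asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def lda_e_step(words, alpha, beta, n_iter=100):
    """Fit gamma (Dirichlet parameters) and phi (per-word topic probabilities)
    for a single document, with alpha and beta held fixed."""
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [alpha[i] + N / K for i in range(K)]
    for _ in range(n_iter):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(words):
            # phi_{n,i} is proportional to beta_{i,w_n} * exp(digamma(gamma_i))
            unnorm = [beta[i][w] * exp_psi[i] for i in range(K)]
            total = sum(unnorm)
            phi[n] = [u / total for u in unnorm]
        # gamma_i = alpha_i + sum_n phi_{n,i}
        gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

# Hypothetical toy problem: 2 topics over a 4-word vocabulary.
alpha = [0.5, 0.5]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 prefers words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 prefers words 2 and 3
]
words = [0, 1, 0, 1, 1, 2]     # word indices of one document

gamma, phi = lda_e_step(words, alpha, beta)
print("gamma:", [round(g, 3) for g in gamma])
print("phi for first word:", [round(p, 3) for p in phi[0]])
```

Each $\phi_n$ is a distribution over topics for word $n$, and $\gamma$ collects the prior counts $\alpha$ plus the expected number of words assigned to each topic, so its entries sum to $\sum_i \alpha_i + N$.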

#### Parameter estimation

The parameter estimation problem in LDA is to estimate $\alpha$ and $\beta$ by maximizing the log-likelihood of all the observed words, summed over the $D$ documents

\begin{equation}
l(\alpha, \beta) = \sum_{d=1}^D \log p(w_d \mid \alpha, \beta)
\end{equation}

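Again, computing $p(w_d \mid \alpha, \beta)$ is intractable, and the variational method used for the inference problem can be leveraged here: the log-likelihood is replaced by its variational lower bound $-F_q$, which is then maximized with respect to $\alpha$ and $\beta$.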

Note that EM works for parametric models, where the number of parameters does not grow with the amount of data. For non-parametric models (e.g. the Dirichlet process (DP), where the dimension of $\theta$ grows with $N$), other techniques such as MCMC can be used.

## Appendix

### KL divergence and free energy

Another derivation of the relationship between the KL divergence and the free energy is as follows. Start from the KL divergence between $q(x)$ and $p(x)$

\begin{align}
D_{KL}(q(x)\| p(x)) = \sum_{x} q(x) \ln\Big(\frac{q(x)}{p(x)}\Big)
\end{align}

Assuming $p(x)$ is a Boltzmann distribution, $p(x) = \exp(-\beta E(x))/Z$,

\begin{align}
D_{KL}(q(x)\| p(x)) &= \sum_{x} q(x) \ln\Big(\frac{Z q(x)}{\exp(-\beta E(x))}\Big) \\
&= \sum_{x} q(x) \Big(\ln Z + \ln q(x) +\beta E(x)\Big) \\
&= \Big(\beta\sum_{x} q(x)E(x) + \sum_{x} q(x)\ln q(x)\Big) + \ln Z \\
&= F + \ln Z
\end{align}

where $F = \beta\langle E\rangle_q - H_q$ is the Helmholtz free energy of $q$ in units of $1/\beta$, and $H_q = -\sum_x q(x)\ln q(x)$ is the entropy of $q$. Since the KL divergence is non-negative, $F \geq -\ln Z$, and the KL divergence attains its minimum of zero when $F$ equals $-\ln Z$, i.e. when $q = p$.