## Coordinate Ascent Update for Mean-Field Variational Bayes
Variational Bayes (also called variational inference) is a functional optimization procedure that finds the approximating distribution $q({\bf Z})$ with the minimum Kullback-Leibler divergence from the posterior distribution of the latent variables given the data, $p({\bf Z}\big\rvert {\bf X})$. Here ${\bf Z}$ and ${\bf X}$ denote the collections of latent and observed variables, respectively.

### Problem Statement
We follow the interpretation presented in [1]. Let ${\bf X}$ be a set of observations (random variables) and let ${\bf Z}$ be a set of unknown latent random variables. From Bayes's theorem:
$$p({\bf X}) = \frac{p({\bf X},{\bf Z})}{p({\bf Z}\big\rvert {\bf X})}
$$
$$\log(p({\bf X})) = \log{p({\bf X},{\bf Z})} - \log{p({\bf Z}\big\rvert {\bf X})}
$$
Taking the expectation with respect to some distribution, $q({\bf Z})$, gives:

$$
\log(p({\bf X})) = E_q[\log{p({\bf X,\bf Z})}] - E_q[\log{p({\bf Z\big\rvert \bf X})}],
$$

where $E_q[f({\bf X,\bf Z})] = \int q({\bf Z})f({\bf X ,\bf Z})d{\bf Z}$ and we have used $\log(p({\bf X})) = E_q[\log(p({\bf X}))]$, since $p({\bf X})$ does not depend on ${\bf Z}$. After adding $E_q[\log(q({\bf Z}))]$ to and subtracting it from the r.h.s: 

$$
\log(p({\bf X})) = E_q[\log{\frac{p({\bf X, \bf Z})}{q({\bf Z})}}] - E_q[\log{\frac{p({\bf Z\big\rvert \bf X})}{q({\bf Z})}}].
$$

The marginal log-likelihood $\log(p({\bf X}))$ does not depend on $q$. The second term on the r.h.s., $-E_q[\log{\frac{p({\bf Z}\big\rvert {\bf X})}{q({\bf Z})}}] = E_q[\log{\frac{q({\bf Z})}{p({\bf Z}\big\rvert {\bf X})}}]$, is the KL divergence between $q({\bf Z})$ and the posterior distribution, and it is non-negative. Minimizing it is therefore equivalent to maximizing $\mathcal{L}(q) = E_q[\log{\frac{p({\bf X},{\bf Z})}{q({\bf Z})}}]$, which is a lower bound on $\log(p({\bf X}))$. This maximization forms the basis of Variational Bayes.
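The identity $\log(p({\bf X})) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p({\bf Z}\rvert{\bf X}))$ can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy model with a single discrete latent variable with four states and one fixed observation; the array `joint` and its values are illustrative only and do not come from [1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: joint[z] stands for p(X = x_obs, Z = z)
# for one fixed observation x_obs and a latent Z with 4 states.
joint = rng.random(4)
joint *= 0.3 / joint.sum()           # chosen so that p(X = x_obs) = 0.3

evidence = joint.sum()               # p(X)
posterior = joint / evidence         # p(Z | X)

# An arbitrary approximating distribution q(Z)
q = rng.random(4)
q /= q.sum()

# L(q) = E_q[log p(X, Z) - log q(Z)]
elbo = np.sum(q * (np.log(joint) - np.log(q)))
# KL(q || p(Z | X)) = E_q[log q(Z) - log p(Z | X)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print("log p(X)   :", np.log(evidence))
print("L(q) + KL  :", elbo + kl)
```

Because the KL term is non-negative, $\mathcal{L}(q)$ never exceeds $\log(p({\bf X}))$ for any choice of `q` in this sketch.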

The next section shows how to carry out the VB maximization for a special family of approximating distributions, the mean-field approximation, where $q({\bf Z}) = \prod_{i=1}^{K}q_i({\bf Z}_i)$ and $K$ is the number of latent variables.


### Maximizing $\mathcal{L}(q)$
We start by inserting the mean-field approximation into the components of $\mathcal{L}(q) = E_q[\log{p({\bf X},{\bf Z})}]-E_q[\log{q({\bf Z})}]$. The following derivation is adapted from [2].

$$
E_q[\log{q({\bf Z})}] = \int{q({\bf Z})\log{q({\bf Z})}}d{\bf Z}
$$
Under the mean-field approximation (M.F.A.):
$$
\begin{aligned}
E_q[\log{q({\bf Z})}] &= \int{\prod_{k}q_k({\bf Z}_k)\log{\prod_{i}q_i({\bf Z}_i)}}\,d{\bf Z}_1\dots d{\bf Z}_K\\
                      &= \int{\prod_{k}q_k({\bf Z}_k)\sum_{i}\log{q_i({\bf Z}_i)}}\,d{\bf Z}_1\dots d{\bf Z}_K\\
                      &= \sum_{i}\int{\prod_{k}q_k({\bf Z}_k)\log{q_i({\bf Z}_i)}}\,d{\bf Z}_1\dots d{\bf Z}_K\\
                      &= \sum_{i}E_{q_i}[\log{q_i({\bf Z}_i)}]
\hspace{1cm}(1)
\end{aligned}
$$
The last step uses the fact that every factor $q_k$ with $k \ne i$ integrates to one. For any fixed $j$, the term $\sum_iE_{q_i}[\log(q_i({\bf Z}_i))]$ can be decomposed into $E_{q_j}[\log(q_j({\bf Z}_j))] + \sum_{i\ne j}E_{q_i}[\log(q_i({\bf Z}_i))]$.
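Result (1) can be verified on a small discrete example. The sketch below assumes three hypothetical latent variables with 3, 4, and 2 states; the factor names `q1`, `q2`, `q3` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_dist(n):
    """Return a random probability vector of length n."""
    p = rng.random(n)
    return p / p.sum()

# Hypothetical mean-field factors q_1, q_2, q_3
q1, q2, q3 = random_dist(3), random_dist(4), random_dist(2)

# Full factorized distribution q(Z1, Z2, Z3) = q1(Z1) q2(Z2) q3(Z3)
q = np.einsum("i,j,k->ijk", q1, q2, q3)

# Left-hand side of (1): E_q[log q(Z)] over the full joint table
lhs = np.sum(q * np.log(q))
# Right-hand side of (1): sum of the per-factor terms E_{q_i}[log q_i(Z_i)]
rhs = sum(np.sum(qi * np.log(qi)) for qi in (q1, q2, q3))

print(lhs, rhs)
```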

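Another useful reformulation is obtained by applying the chain rule to $p({\bf X},{\bf Z})$:

$$
\begin{aligned}
p({\bf X},{\bf Z}_1, \dots, {\bf Z}_K) &= p({\bf X})\, p({\bf Z}_1\big\rvert{\bf X})\, p({\bf Z}_2\big\rvert{\bf Z}_1,{\bf X}) \dots p({\bf Z}_K\big\rvert{\bf Z}_1,\dots,{\bf Z}_{K-1},{\bf X})\\
                                       &= p({\bf X})\prod_i p({\bf Z}_i\big\rvert{\bf Z}_1,\dots,{\bf Z}_{i-1},{\bf X}).
\end{aligned}
$$
The ordering of the latent variables in the chain rule is arbitrary, so any ${\bf Z}_i$ can be placed last. Its factor is then $p({\bf Z}_i\big\rvert{\bf Z}_{\bar{i}},{\bf X})$, conditioned on all the other latent variables, and none of the remaining factors involve ${\bf Z}_i$.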
The term $E_{q}\big[\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\big]$ can be broken down further:

$$
\begin{aligned}
E_{q}\big[\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\big] &=
\int q({\bf Z})\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\,d{\bf Z}\\
&= \int q_i({\bf Z}_i)\prod_{k\ne i}q_k({\bf Z}_k)\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\,d{\bf Z}_i\,d{\bf Z}_{\bar{i}}\\
&= \int_{{\bf Z}_i}\int_{{\bf Z}_{\bar{i}}} q_i({\bf Z}_i)\prod_{k\ne i}q_k({\bf Z}_k)\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\,d{\bf Z}_{\bar{i}}\,d{\bf Z}_i\\
&= \int_{{\bf Z}_i} q_i({\bf Z}_i)\int_{{\bf Z}_{\bar{i}}}\prod_{k\ne i}q_k({\bf Z}_k)\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\,d{\bf Z}_{\bar{i}}\,d{\bf Z}_i\\
&= \int_{{\bf Z}_i}q_i({\bf Z}_i)E_{\bar{q}_i}\big[\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\big]\,d{\bf Z}_i\\
&= E_{q_i}\big[E_{\bar{q}_i}\big[\log p({\bf Z}_i\big\rvert {\bf X},{\bf Z}_{\bar{i}})\big]\big]\hspace{1cm}(2)
\end{aligned}
$$

where ${\bf Z}_{\bar{i}}$ is short for the collection of all latent variables except ${\bf Z}_i$, and $E_{\bar{q}_i}$ denotes the expectation with respect to $\prod_{k\ne i}q_k({\bf Z}_k)$. Using (1) and (2), we can now rewrite $\mathcal{L}(q)$:

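$$
\begin{aligned}
\mathcal{L}(q) = {} & \log(p({\bf X})) + E_{q_j}\big[E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big]\big] + C_{\bar{j}}\\
                    & - E_{q_j}[\log(q_j({\bf Z}_j))] - \sum_{i\ne j}E_{q_i}[\log(q_i({\bf Z}_i))]
\end{aligned}
$$

Here $C_{\bar{j}}$ collects the expected logs of the remaining chain-rule factors when ${\bf Z}_j$ is placed last in the chain rule; none of these factors involve ${\bf Z}_j$. The term $\log(p({\bf X}))$ does not depend on $q$ at all, while $C_{\bar{j}}$ and $\sum_{i\ne j}E_{q_i}[\log(q_i({\bf Z}_i))]$ do not depend on $q_j$. Note that the choice of $j$ is completely arbitrary.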
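The claim that only two terms of $\mathcal{L}(q)$ depend on $q_j$ can be checked numerically: after subtracting $E_{q_j}\big[E_{\bar{q}_j}[\log p({\bf Z}_j\rvert {\bf X},{\bf Z}_{\bar{j}})]\big] - E_{q_j}[\log q_j({\bf Z}_j)]$, what remains should be the same for any choice of $q_j$. The sketch below assumes a hypothetical model with two discrete latent variables and takes $j = 1$; the table `joint` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def random_dist(n):
    """Return a random probability vector of length n."""
    p = rng.random(n)
    return p / p.sum()

# Hypothetical table: joint[z1, z2] = p(X = x_obs, Z1 = z1, Z2 = z2)
joint = rng.random((3, 4))
joint *= 0.25 / joint.sum()

# log p(Z1 | X, Z2) = log p(X, Z1, Z2) - log p(X, Z2)
log_cond1 = np.log(joint) - np.log(joint.sum(axis=0, keepdims=True))

def elbo(q1, q2):
    """L(q) for the factorized q(Z1, Z2) = q1(Z1) q2(Z2)."""
    q = np.outer(q1, q2)
    return np.sum(q * (np.log(joint) - np.log(q)))

def q1_terms(q1, q2):
    """E_{q1}[E_{q2}[log p(Z1 | X, Z2)]] - E_{q1}[log q1(Z1)]."""
    return q1 @ (log_cond1 @ q2) - np.sum(q1 * np.log(q1))

q2 = random_dist(4)
q1_a, q1_b = random_dist(3), random_dist(3)   # two different choices of q_1

# Everything in L(q) that is left after removing the q_1-dependent terms
rest_a = elbo(q1_a, q2) - q1_terms(q1_a, q2)
rest_b = elbo(q1_b, q2) - q1_terms(q1_b, q2)

print(rest_a, rest_b)
```

The two printed values agree because the remainder involves only $q_2$ and the model.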

### Coordinate Ascent
Separating each individual $q_j$ in the expression for $\mathcal{L}(q)$ allows a step-by-step maximization with respect to one $q_j$ at a time, with the other factors held fixed. This procedure is referred to as coordinate ascent. To find the maximum with respect to $q_j$, we set the functional derivative of $\mathcal{L}(q)$ with respect to $q_j$ to zero.

$$
\begin{aligned}
\frac{\partial\mathcal{L}(q)}{\partial q_j} &= \frac{\partial}{\partial q_j}\Bigg[\int q_{j}({\bf Z}_j)E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big]d{\bf Z}_j
                                   - \int q_j({\bf Z}_j)\log(q_j({\bf Z}_j))d{\bf Z}_j\Bigg] = 0 \\
                                   &\Rightarrow\; E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big] - \log(q_j({\bf Z}_j)) - 1 = 0
\end{aligned}
$$
where we have used the well-known functional derivative of the negative entropy, $\frac{\partial }{\partial p}\int p\log(p)dx = \log(p)+1$. The stationary point above gives:

$$
\begin{aligned}
\log(q_j({\bf Z}_j)) + \log(e) &= E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big]\\
\log(q_j({\bf Z}_j)e) &= E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big]\\
q_j({\bf Z}_j) &\propto \exp\big(E_{\bar{q}_j}\big[\log p({\bf Z}_j\big\rvert {\bf X},{\bf Z}_{\bar{j}})\big]\big)
\end{aligned}
$$
The r.h.s. does not depend on $q_j$, but it does depend on the other factors $q_i$, $i \ne j$. The factors are therefore updated one at a time, each with the others held fixed, and the updates are repeated until $\mathcal{L}(q)$ converges. The proportionality constant is fixed by requiring $q_j$ to integrate to one.
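The update $q_j({\bf Z}_j) \propto \exp\big(E_{\bar{q}_j}[\log p({\bf Z}_j\rvert {\bf X},{\bf Z}_{\bar{j}})]\big)$ can be sketched for a small discrete model. The code below is an illustration under stated assumptions, not the implementation from [1] or [2]: it assumes two hypothetical discrete latent variables, a fixed observation, and a fully known table `joint`; the helper names `normalize_log`, `elbo`, and `coordinate_ascent` are made up for this example.

```python
import numpy as np

def normalize_log(log_w):
    """Exponentiate log-weights and normalize them to sum to one (stably)."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def elbo(joint, q1, q2):
    """L(q) = E_q[log p(X, Z) - log q(Z)] for q(Z1, Z2) = q1(Z1) q2(Z2)."""
    q = np.outer(q1, q2)
    return np.sum(q * (np.log(joint) - np.log(q)))

def coordinate_ascent(joint, n_sweeps=30, seed=0):
    """Alternate the mean-field updates for q1 and q2; return both and the L(q) trace."""
    rng = np.random.default_rng(seed)
    q1 = rng.random(joint.shape[0])
    q1 /= q1.sum()
    q2 = rng.random(joint.shape[1])
    q2 /= q2.sum()
    # log p(Z1 | X, Z2) and log p(Z2 | X, Z1) from the joint table
    log_cond1 = np.log(joint) - np.log(joint.sum(axis=0, keepdims=True))
    log_cond2 = np.log(joint) - np.log(joint.sum(axis=1, keepdims=True))
    history = [elbo(joint, q1, q2)]
    for _ in range(n_sweeps):
        # q1(Z1) proportional to exp(E_{q2}[log p(Z1 | X, Z2)])
        q1 = normalize_log(log_cond1 @ q2)
        # q2(Z2) proportional to exp(E_{q1}[log p(Z2 | X, Z1)])
        q2 = normalize_log(q1 @ log_cond2)
        history.append(elbo(joint, q1, q2))
    return q1, q2, np.array(history)

# Hypothetical table: joint[z1, z2] = p(X = x_obs, Z1 = z1, Z2 = z2)
rng = np.random.default_rng(42)
joint = rng.random((3, 4))
joint *= 0.2 / joint.sum()

q1, q2, history = coordinate_ascent(joint)
print("final L(q):", history[-1])
print("log p(X)  :", np.log(joint.sum()))
```

In this sketch each update maximizes $\mathcal{L}(q)$ exactly with respect to one factor, so the recorded values of $\mathcal{L}(q)$ should never decrease and should stay below $\log(p({\bf X}))$. Since $\log p({\bf Z}_j\rvert {\bf X},{\bf Z}_{\bar{j}})$ and $\log p({\bf X},{\bf Z})$ differ only by a term that does not involve ${\bf Z}_j$, using the log joint in place of the log conditional would give the same normalized update.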

### References
[1] Bishop, C., 2007. Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. 2006. corr. 2nd printing edn. Springer, New York.

[2] Neiswanger, W., 2017. Probabilistic Graphical Models, Spring 2017 lectures, Carnegie Mellon University.