# EM for Bayesian GMM Variational Inference 

### The Target Distribution
Recall that in our model, we suppose that our data, $\mathbf{X}=\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, are drawn from a mixture of $K$ Gaussian distributions. For each observation $\mathbf{x}_n$ we have a latent variable $\mathbf{z}_n$ that is a 1-of-$K$ binary vector with elements $z_{nk}$. We denote the set of latent variables by $\mathbf{Z}$. Recall that the distribution of $\mathbf{Z}$ given the mixing coefficients, $\pi$, is given by
\begin{align}
p(\mathbf{Z} | \pi) = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} 
\end{align}
Recall also that the likelihood of the data is given by,
\begin{align}
p(\mathbf{X} | \mathbf{Z}, \mu, \Lambda) =\prod_{n=1}^N \prod_{k=1}^K \mathcal{N}\left(\mathbf{x}_n| \mu_k, \Lambda^{-1}_k\right)^{z_{nk}}
\end{align}
Finally, in our basic model, we choose a Dirichlet prior for $\pi$ 
\begin{align}
p(\pi) = \mathrm{Dir}(\pi | \alpha_0) = C(\alpha_0) \prod_{k=1}^K \pi_k^{\alpha_0 -1},
\end{align}
where $C(\alpha_0)$ is the normalizing constant for the Dirichlet distribution. We also choose a Normal-Wishart prior for the mean and the precision of the likelihood function
\begin{align}
p(\mu, \Lambda) = p(\mu | \Lambda) p(\Lambda) = \prod_{k=1}^K \mathcal{N}\left(\mu_k | \mathbf{m}_0, (\beta_0\Lambda_k)^{-1}\right) \mathcal{W}(\Lambda_k|\mathbf{W}_0, \nu_0).
\end{align}
Thus, the joint distribution of all the random variables is given by
\begin{align}
p(\mathbf{X}, \mathbf{Z}, \pi, \mu, \Lambda) = p(\mathbf{X} | \mathbf{Z}, \mu, \Lambda) p(\mathbf{Z} | \pi) p(\pi) p(\mu | \Lambda) p(\Lambda)
\end{align}
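
The factorization of the joint distribution also describes how to simulate from the model: draw $\pi$, then $(\mu_k, \Lambda_k)$, then $\mathbf{Z}$, then $\mathbf{X}$. The following is a minimal sketch of that ancestral sampling, not part of the original derivation. The function name `sample_bayesian_gmm` and the default hyperparameters ($\mathbf{m}_0 = \mathbf{0}$, $\mathbf{W}_0 = \mathbf{I}$, $\nu_0 = D + 1$) are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import wishart


def sample_bayesian_gmm(N, K, D, alpha0=1.0, beta0=1.0, nu0=None, seed=0):
    """Draw (X, Z, pi, mu, Lam) from the joint distribution by ancestral sampling."""
    rng = np.random.default_rng(seed)
    nu0 = D + 1 if nu0 is None else nu0      # assumed default; the Wishart needs nu0 > D - 1
    m0 = np.zeros(D)                         # assumed prior mean
    W0 = np.eye(D)                           # assumed prior scale matrix

    # p(pi) = Dir(pi | alpha0), symmetric in all K components
    pi = rng.dirichlet(np.full(K, alpha0))

    # p(Lambda_k) = W(Lambda_k | W0, nu0)
    Lam = np.stack([
        np.reshape(wishart.rvs(df=nu0, scale=W0, random_state=rng), (D, D))
        for _ in range(K)
    ])

    # p(mu_k | Lambda_k) = N(mu_k | m0, (beta0 * Lambda_k)^{-1})
    mu = np.stack([
        rng.multivariate_normal(m0, np.linalg.inv(beta0 * Lam[k]))
        for k in range(K)
    ])

    # p(Z | pi): one component label per observation, stored as 1-of-K rows
    labels = rng.choice(K, size=N, p=pi)
    Z = np.eye(K, dtype=int)[labels]

    # p(X | Z, mu, Lambda): each x_n comes from its assigned component
    X = np.stack([
        rng.multivariate_normal(mu[k], np.linalg.inv(Lam[k]))
        for k in labels
    ])
    return X, Z, pi, mu, Lam


X, Z, pi, mu, Lam = sample_bayesian_gmm(N=50, K=3, D=2)
```

Each row of `Z` is a 1-of-$K$ vector, matching the definition of $\mathbf{z}_n$ above.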

### Variational Approximation
We consider a variational distribution that factorizes between the latent variables and the parameters
\begin{align}
q(\mathbf{Z}, \pi, \mu, \Lambda) = q(\mathbf{Z})q(\pi, \mu, \Lambda).
\end{align}
Recall that we can decompose the log marginal probability as follows
\begin{align}
\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\| p),
\end{align}
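where $\mathbf{Z}$ temporarily stands for all unobserved variables (the latent variables together with $\pi$, $\mu$, and $\Lambda$), $\mathrm{KL}(q\| p)$ is the Kullback–Leibler divergence between $q(\mathbf{Z})$ and the posterior $p(\mathbf{Z} | \mathbf{X})$, and
\begin{align}
\mathcal{L}(q) = \int q(\mathbf{Z}) \ln\left \{\frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} \right\}d \mathbf{Z}.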
\end{align}
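
The decomposition can be checked numerically on a small discrete example, where the integral becomes a sum and $\mathcal{L}(q) = \sum_z q(z) \ln\{p(x, z)/q(z)\}$. The sketch below is only an illustration: the joint table `p_xz` and the trial distribution `q` are made-up numbers, not quantities from the mixture model.

```python
import numpy as np

# Hypothetical joint p(x, z) for one fixed observation x and a latent z with 3 states.
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()              # marginal likelihood p(x)
posterior = p_xz / p_x        # exact posterior p(z | x)


def lower_bound(q):
    """L(q) = sum_z q(z) * ln( p(x, z) / q(z) )."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))


def kl_to_posterior(q):
    """KL(q || p) = sum_z q(z) * ln( q(z) / p(z | x) )."""
    return float(np.sum(q * (np.log(q) - np.log(posterior))))


q = np.array([0.5, 0.3, 0.2])  # an arbitrary trial distribution

# The two terms add up to ln p(x) for any valid q.
gap = lower_bound(q) + kl_to_posterior(q) - np.log(p_x)

# At q = posterior the KL term vanishes and the bound is tight.
kl_at_posterior = kl_to_posterior(posterior)
```

Here `gap` and `kl_at_posterior` are both zero up to floating-point error, and `lower_bound(q)` never exceeds `np.log(p_x)`.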
When $q(\mathbf{Z})$ is exactly equal to the posterior distribution $p(\mathbf{Z} | \mathbf{X})$ we have that $\mathrm{KL}(q\| p) = 0$. Thus, fitting our approximate distribution $q$ to the target distribution $p$ is a matter of minimizing $\mathrm{KL}(q\| p)$. Since $\ln p(\mathbf{X})$ does not depend on $q$, minimizing $\mathrm{KL}(q\| p)$ is equivalent to maximizing $\mathcal{L}(q)$ (see Chapter 10 in Bishop for details). Finally, for our factorized distribution, the solution of both optimization problems, $q^*$, satisfies
\begin{align}
\ln q^*(\mathbf{Z}) &= \mathbb{E}_{\pi, \mu, \Lambda} \left[\ln p(\mathbf{X}, \mathbf{Z}, \pi, \mu, \Lambda) \right] + \text{const}\\
\ln q^*(\pi, \mu, \Lambda) &= \mathbb{E}_{\mathbf{Z}} \left[\ln p(\mathbf{X}, \mathbf{Z}, \pi, \mu, \Lambda) \right] + \text{const}
\end{align}
We will alternately re-estimate the parameters of an approximation $q^{(i)}$ of $q^*$ according to the following update rules.
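
Before stating the updates, it may help to carry out the first expectation explicitly, since this is where the responsibilities $r_{nk}$ come from. The following is a sketch along the lines of Bishop, Section 10.2.1. Only $p(\mathbf{Z} | \pi)$ and $p(\mathbf{X} | \mathbf{Z}, \mu, \Lambda)$ depend on $\mathbf{Z}$, so all other factors of the joint distribution are absorbed into the constant. Writing $D$ for the dimension of $\mathbf{x}_n$,
\begin{align}
\ln q^*(\mathbf{Z}) &= \mathbb{E}_{\pi} \left[\ln p(\mathbf{Z} | \pi) \right] + \mathbb{E}_{\mu, \Lambda} \left[\ln p(\mathbf{X} | \mathbf{Z}, \mu, \Lambda) \right] + \text{const}\\
&= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \ln \rho_{nk} + \text{const}\\
\ln \rho_{nk} &= \mathbb{E}\left[\ln \pi_k\right] + \frac{1}{2}\mathbb{E}\left[\ln |\Lambda_k|\right] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}\mathbb{E}_{\mu_k, \Lambda_k}\left[(\mathbf{x}_n - \mu_k)^\top \Lambda_k (\mathbf{x}_n - \mu_k)\right]
\end{align}
Exponentiating and normalizing over $k$ for each $n$ gives a product of categorical distributions with probabilities $r_{nk} \propto \rho_{nk}$.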

### Variational M-step
At the $i$-th step, holding the responsibilities $r_{nk}$ (defined in the E-step below) fixed, we have that
\begin{align}
q^*(\pi) = \mathrm{Dir}(\pi | \alpha),
\end{align}
where
\begin{align}
\alpha &= (\alpha_k)_{k=1}^K,\;\; \alpha_k = \alpha_0 + N_k\\
N_k &= \sum_{n=1}^N r_{nk}
\end{align}
We also have 
\begin{align}
q^*(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k | \mathbf{m}_k, (\beta_k\Lambda_k)^{-1})\mathcal{W}(\Lambda_k | \mathbf{W}_k, \nu_k)
\end{align}
where 
\begin{align}
\mathbf{m}_k &= \frac{1}{\beta_k} (\beta_0\mathbf{m}_0 + N_k\overline{\mathbf{x}}_k)\\
\beta_k &= \beta_0 + N_k\\
\mathbf{W}_k^{-1} &=\mathbf{W}_0^{-1} + N_k \mathbf{S}_k + \frac{\beta_0N_k}{\beta_0+N_k}(\overline{\mathbf{x}}_k - \mathbf{m}_0)(\overline{\mathbf{x}}_k - \mathbf{m}_0)^\top\\
\nu_k &= \nu_0 + N_k\\
\overline{\mathbf{x}}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk}\mathbf{x}_n\\
\mathbf{S}_k &= \frac{1}{N_k} \sum_{n=1}^N r_{nk} (\mathbf{x}_n - \overline{\mathbf{x}}_k)(\mathbf{x}_n - \overline{\mathbf{x}}_k)^\top
\end{align}


### Variational E-step
At the $i$-th step, holding $q(\pi, \mu, \Lambda)$ fixed, we have that
\begin{align}
q^*(\mathbf{Z}) = \prod_{n=1}^N\prod_{k=1}^K r_{nk}^{z_{nk}},
\end{align}
where 
\begin{align}
r_{nk} &= \frac{\rho_{nk}}{\sum_{j=1}^K \rho_{nj}}\\
\rho_{nk} &= \widetilde{\pi}_k \widetilde{\Lambda}_k^{1/2} \exp\left\{-\frac{1}{2} \left[D\beta_k^{-1} + \nu_k(\mathbf{x}_n - \mathbf{m}_k)^\top \mathbf{W}_k (\mathbf{x}_n - \mathbf{m}_k) \right] \right\}\\
\ln \widetilde{\pi}_k &= \mathbb{E}\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi\left(\sum_{j=1}^K \alpha_j\right)\\
\ln \widetilde{\Lambda}_k &= \mathbb{E}\left[\ln |\Lambda_k|\right] = \sum_{d=1}^D \psi\left(\frac{\nu_k + 1 - d}{2}\right) + D \ln 2 + \ln |\mathbf{W}_k|
\end{align}
where $D$ is the dimension of each vector $\mathbf{x}_n$ and $\psi$ is the digamma function. The factor $(2\pi)^{-D/2}$ is the same for every $k$ and cancels in the normalization of $r_{nk}$.
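
One sweep of the two updates can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not a reference implementation. The function names, the small constant added to $N_k$ to avoid division by zero, the toy data, and the crude initial assignment are all assumptions. The responsibility update follows Bishop's Eq. 10.67 and therefore includes the factors $\mathbb{E}[\ln \pi_k]$ and $\mathbb{E}[\ln |\Lambda_k|]$ in $\ln \rho_{nk}$. It is evaluated in log space for numerical stability.

```python
import numpy as np
from scipy.special import digamma, logsumexp


def update_parameter_factor(X, r, alpha0, beta0, m0, W0, nu0):
    """Update q(pi) and q(mu_k, Lambda_k) from responsibilities r of shape (N, K)."""
    Nk = r.sum(axis=0) + 1e-10                    # guard against empty components (assumption)
    xbar = (r.T @ X) / Nk[:, None]                # weighted means, shape (K, D)
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nk
    K, D = xbar.shape
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        Nk_Sk = (r[:, k, None] * diff).T @ diff   # N_k * S_k
        d0 = (xbar[k] - m0)[:, None]
        W_inv = np.linalg.inv(W0) + Nk_Sk + (beta0 * Nk[k] / beta[k]) * (d0 @ d0.T)
        W[k] = np.linalg.inv(W_inv)
    return alpha, beta, m, W, nu


def update_responsibilities(X, alpha, beta, m, W, nu):
    """Update q(Z): return r of shape (N, K), each row summing to one."""
    N, D = X.shape
    K = alpha.shape[0]
    ln_pi = digamma(alpha) - digamma(alpha.sum())             # E[ln pi_k]
    ln_rho = np.empty((N, K))
    for k in range(K):
        ln_Lam = (digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
                  + D * np.log(2.0) + np.linalg.slogdet(W[k])[1])  # E[ln |Lambda_k|]
        diff = X - m[k]
        quad = np.einsum("nd,de,ne->n", diff, W[k], diff)     # (x_n - m_k)^T W_k (x_n - m_k)
        ln_rho[:, k] = ln_pi[k] + 0.5 * ln_Lam - 0.5 * (D / beta[k] + nu[k] * quad)
    # Normalize over k in log space
    return np.exp(ln_rho - logsumexp(ln_rho, axis=1, keepdims=True))


# Toy data: two well-separated clusters (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])

# Crude initial hard assignment by the sign of the first coordinate (assumption).
r = np.eye(2)[(X[:, 0] > 0).astype(int)]

for _ in range(20):
    params = update_parameter_factor(X, r, alpha0=1.0, beta0=1.0,
                                     m0=np.zeros(2), W0=np.eye(2), nu0=2.0)
    r = update_responsibilities(X, *params)
```

In practice one would monitor $\mathcal{L}(q)$ across iterations and stop when it no longer increases. The fixed count of 20 sweeps here is only for brevity.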
