# Expectation-Maximization Algorithm

Reference: [Expectation-Maximization Algorithm - Wikipedia](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)

**Attention: The notation used in this notebook may differ from the textbook.**


Given the statistical model which generates a set of observed data $\bm{X}$, a set of unobserved latent data or missing values $\bm{Z}$, and a vector of unknown parameters $\bm{\theta}$, along with a likelihood function $L(\bm{\theta}; \bm{X}, \bm{Z}) = p(\bm{X}, \bm{Z} | \bm{\theta})$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data
$$
L(\bm{\theta}; \bm{X}) = p(\bm{X} | \bm{\theta}) = \int_{\bm{Z}} p(\bm{X}, \bm{Z} | \bm{\theta}) d\bm{Z} = \int_{\bm{Z}} p(\bm{X} | \bm{Z}, \bm{\theta}) p(\bm{Z} | \bm{\theta}) d\bm{Z}
$$
However, this quantity is often intractable, since $\bm{Z}$ is unobserved and the distribution of $\bm{Z}$ is unknown until $\bm{\theta}$ has been estimated.<br>
In this situation, we call $(\bm{X},\bm{Z})$ the complete data and $L_{c}(\bm{\theta};\bm{X},\bm{Z}) = \log p(\bm{X},\bm{Z}|\bm{\theta})$ the complete-data log likelihood. For any distribution $q(\bm{Z})$, we have
$$
\left< L_{c}(\bm{\theta};\bm{X},\bm{Z})\right>_{q} = \int_{\bm{Z}} q(\bm{Z}) L_{c}(\bm{\theta};\bm{X},\bm{Z}) d\bm{Z} = \int_{\bm{Z}} q(\bm{Z}) \log p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z}
$$
By Jensen's inequality, we have
$$
\begin{align}
\log p(\bm{X}|\bm{\theta}) &= \log \int_{\bm{Z}} p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z} \\
&= \log \int_{\bm{Z}} q(\bm{Z}) \frac{p(\bm{X},\bm{Z}|\bm{\theta})}{q(\bm{Z})} d\bm{Z} \\
&\geq \int_{\bm{Z}} q(\bm{Z}) \log \frac{p(\bm{X},\bm{Z}|\bm{\theta})}{q(\bm{Z})} d\bm{Z} \\
&= \left<  L_{c}(\bm{\theta};\bm{X},\bm{Z})\right>_{q} - \left< \log q(\bm{Z}) \right>_{q}
\end{align}
$$
Furthermore, we define the entropy of $q$ as $H(q) = - \left< \log q(\bm{Z}) \right>_{q}$. The last line of the derivation above is therefore a lower bound on $\log p(\bm{X}|\bm{\theta})$ that holds for any distribution $q$, and we define it as the free energy $F(q,\bm{\theta})$:
$$
F(q,\bm{\theta}) = \left<  L_{c}(\bm{\theta};\bm{X},\bm{Z}) \right>_{q} + H(q)
$$
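
The lower bound can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: the latent variable takes `K` values, the list `joint` holds made-up values of $p(\bm{X}, \bm{Z}=k | \bm{\theta})$ at a fixed $\bm{\theta}$, and the helper names `random_distribution` and `free_energy` are hypothetical, not part of any library.

```python
import math
import random

random.seed(0)

# Hypothetical toy model: Z takes K discrete values and X is one fixed observation.
# joint[k] stands for p(X, Z=k | theta) at a fixed theta; the numbers are illustrative.
K = 4
joint = [random.uniform(0.01, 0.2) for _ in range(K)]
log_evidence = math.log(sum(joint))  # log p(X | theta)


def random_distribution(k):
    """Draw a strictly positive probability vector of length k."""
    weights = [random.uniform(0.05, 1.0) for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def free_energy(q, joint):
    """F(q, theta) = <log p(X, Z | theta)>_q + H(q) for a discrete Z."""
    expected_complete_ll = sum(qk * math.log(pk) for qk, pk in zip(q, joint))
    entropy = -sum(qk * math.log(qk) for qk in q)
    return expected_complete_ll + entropy


# The gap log p(X | theta) - F(q, theta) should never be negative.
gaps = [log_evidence - free_energy(random_distribution(K), joint) for _ in range(1000)]
print("smallest gap over 1000 random q:", min(gaps))
```

Every randomly drawn $q$ should give a non-negative gap, which is what Jensen's inequality guarantees.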
The EM algorithm can be viewed as two alternating maximization steps, that is, as an example of **coordinate ascent** (equivalently, coordinate descent on $-F$). Consider the free energy function:
$$
F(q,\bm{\theta}) = E_{q}(L_{c}(\bm{\theta};\bm{X},\bm{Z}) ) + H(q)
$$
where $q$ is any distribution over the latent variables $\bm{Z}$, and $H(q)$ is the entropy of $q$.<br>
This function can be written as 
$$
F(q,\bm{\theta}) = -KL(q(\bm{Z})||p(\bm{Z}|\bm{X},\bm{\theta})) + \log p(\bm{X}|\bm{\theta})
$$
where $KL(q(\bm{Z})||p(\bm{Z}|\bm{X},\bm{\theta}))$ is the Kullback-Leibler divergence between $q(\bm{Z})$ and $p(\bm{Z}|\bm{X},\bm{\theta})$.<br>
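
One way to see this identity, sketched in the same notation, is to factor the joint density as $p(\bm{X},\bm{Z}|\bm{\theta}) = p(\bm{Z}|\bm{X},\bm{\theta})\, p(\bm{X}|\bm{\theta})$ and use $\int_{\bm{Z}} q(\bm{Z}) d\bm{Z} = 1$:
$$
\begin{align}
F(q,\bm{\theta}) &= \int_{\bm{Z}} q(\bm{Z}) \log \frac{p(\bm{X},\bm{Z}|\bm{\theta})}{q(\bm{Z})} d\bm{Z} \\
&= \int_{\bm{Z}} q(\bm{Z}) \log \frac{p(\bm{Z}|\bm{X},\bm{\theta})}{q(\bm{Z})} d\bm{Z} + \log p(\bm{X}|\bm{\theta}) \int_{\bm{Z}} q(\bm{Z}) d\bm{Z} \\
&= -KL(q(\bm{Z})||p(\bm{Z}|\bm{X},\bm{\theta})) + \log p(\bm{X}|\bm{\theta})
\end{align}
$$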
Then the steps in the EM algorithm may be viewed as:
1. E-step: $q^{(t+1)}(\bm{Z}) = \arg \max_{q(\bm{Z})} F(q,\bm{\theta}^{(t)})$
2. M-step: $\bm{\theta}^{(t+1)} = \arg \max_{\bm{\theta}} F(q^{(t+1)},\bm{\theta})$
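
The KL form of the free energy, and the fact that the posterior solves the E-step, can be verified on a tiny discrete example. This is a minimal sketch with made-up numbers: `joint` holds hypothetical values of $p(\bm{X}, \bm{Z}=k | \bm{\theta})$, and `free_energy` and `kl` are illustrative helper names.

```python
import math

# Hypothetical joint values p(X, Z=k | theta) for a three-state latent variable.
joint = [0.10, 0.03, 0.07]
evidence = sum(joint)                      # p(X | theta)
posterior = [p / evidence for p in joint]  # p(Z | X, theta)


def free_energy(q):
    """F(q, theta) = sum_k q_k log p(X, Z=k | theta) - sum_k q_k log q_k."""
    return sum(qk * (math.log(pk) - math.log(qk)) for qk, pk in zip(q, joint))


def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for strictly positive vectors."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))


q = [0.5, 0.2, 0.3]  # an arbitrary distribution over Z
lhs = free_energy(q)
rhs = -kl(q, posterior) + math.log(evidence)
f_at_posterior = free_energy(posterior)

print("F(q, theta)           :", lhs)
print("-KL + log p(X | theta):", rhs)
print("F at the posterior    :", f_at_posterior)
print("log p(X | theta)      :", math.log(evidence))
```

The first two printed values should agree, and the free energy evaluated at the posterior should equal $\log p(\bm{X}|\bm{\theta})$, so the bound is tight exactly at the E-step solution.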
    


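First, in the KL form of $F$, the term $\log p(\bm{X}|\bm{\theta}^{(t)})$ does not depend on $q$, and the KL divergence is non-negative and vanishes only when its two arguments coincide. The E-step is therefore solved by $q^{(t+1)}(\bm{Z}) = p(\bm{Z}|\bm{X},\bm{\theta}^{(t)})$.
Then we have $F(q^{(t+1)},\bm{\theta}) =  \left<  L_{c}(\bm{\theta};\bm{X},\bm{Z}) \right>_{q^{(t+1)}} + H(q^{(t+1)})$. Since $H(q^{(t+1)})$ does not depend on $\bm{\theta}$, the M-step only needs to maximize the first term of $F(q^{(t+1)},\bm{\theta})$. It can be written as: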
$$
\begin{align}
\bm{\theta}^{(t+1)} &= \arg \max_{\bm{\theta}} F(q^{(t+1)},\bm{\theta}) \\
&= \arg \max_{\bm{\theta}} \left<  L_{c}(\bm{\theta};\bm{X},\bm{Z}) \right>_{q^{(t+1)}} \\
&= \arg \max_{\bm{\theta}} \int_{\bm{Z}} q^{(t+1)}(\bm{Z}) \log p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z}
\end{align}
$$
For simplicity, we define $Q(\bm{\theta},\bm{\theta}^{(t)}) = \int_{\bm{Z}} q^{(t+1)}(\bm{Z}) \log p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z}= \int_{\bm{Z}} p(\bm{Z}|\bm{X},\bm{\theta}^{(t)}) \log p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z}$, which gives the equivalent, more familiar form of the EM algorithm:
$$
\begin{align}
&\text{E-Step:  } Q(\bm{\theta},\bm{\theta}^{(t)}) = \int_{\bm{Z}} p(\bm{Z}|\bm{X},\bm{\theta}^{(t)}) \log p(\bm{X},\bm{Z}|\bm{\theta}) d\bm{Z} = E_{\bm{Z}|\bm{X},\bm{\theta}^{(t)}}(\log L(\bm{\theta};\bm{X},\bm{Z})) \\
&\text{M-Step:  } \bm{\theta}^{(t+1)} = \arg \max_{\bm{\theta}} Q(\bm{\theta},\bm{\theta}^{(t)})
\end{align}
$$
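
As a concrete illustration of the two steps, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians, using only the Python standard library. The model choice, the synthetic data, the initial values, and the helper names (`normal_pdf`, `log_likelihood`, `em_step`) are all assumptions made for this example. For this model the posterior over $\bm{Z}$ reduces to per-point responsibilities, and the maximizer of $Q$ has the well-known closed form of weighted means and variances.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """Observed-data log likelihood log p(X | theta)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def em_step(xs, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: responsibilities resp[i][k] = p(Z_i = k | x_i, theta^(t)).
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])

    # M-step: closed-form maximizer of Q(theta, theta^(t)).
    n = len(xs)
    new_weights, new_means, new_vars = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nk
        new_weights.append(nk / n)
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))  # small floor (an added safeguard, not part of EM)
    return new_weights, new_means, new_vars


random.seed(0)
# Synthetic data, purely illustrative: two clusters centred at -2 and 3.
xs = [random.gauss(-2.0, 1.0) for _ in range(150)]
xs += [random.gauss(3.0, 1.0) for _ in range(150)]

weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(xs, weights, means, variances)]
for _ in range(30):
    weights, means, variances = em_step(xs, weights, means, variances)
    history.append(log_likelihood(xs, weights, means, variances))

print("estimated means:", means)
print("log likelihood :", history[0], "->", history[-1])
```

The list `history` records $\log p(\bm{X}|\bm{\theta}^{(t)})$ after each iteration. It should never decrease, because each E-step makes the bound tight and each M-step raises the free energy.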