Please consult the end of chapter 9.4 of _Pattern Recognition and Machine Learning_.

# Motivation for Online Mixture of Gaussians

Online algorithms, also called "incremental" algorithms, consider a single data point at a time rather than an entire batch.

When applying Expectation Maximization to a Gaussian mixture on a large data set, each iteration of the batch method takes time proportional to the number of data points. In the online formulation, each Expectation and Maximization step operates on a single data point, so its cost does not depend on the size of the data set. This can provide a significant performance boost when you don't need to consider the entire batch of data every iteration.

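In the online formulation of Mixture of Gaussians, the parameters are updated after every data point rather than once per pass over the data, so later E steps already use the improved parameters. As a result, the algorithm can converge in fewer passes than the batch approach.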

# Derivation of Online Gaussian Mixture

We derive equations 9.78 and 9.79 of the text, which define an incremental update for the mean of a Gaussian component. We start from 9.17 and 9.18, the definitions of the mean and of the effective number of points $N_k$.

\begin{align}
    \mathbf{\mu}_k &= \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \mathbf{x}_n \\
    N_k &= \sum_{n=1}^{N} \gamma(z_{nk})
\end{align}

### Updating N
We consider an update in which the responsibilities are recomputed for a single data point, $\mathbf{x}_m$, while the responsibilities of all other points keep their old values. We start from the definition above.

\begin{align}
    N_{k}^{old} &= \sum_{n} \gamma^{old}(z_{nk}) \\
    N_{k}^{new} &= \sum_{n \neq m} \gamma^{old}(z_{nk}) + \gamma^{new}(z_{mk})
\end{align}

Substituting $N_{k}^{old}$ into the latter equation gives 9.79. (Subtracting the $n = m$ term from $N_{k}^{old}$ leaves exactly the sum over $n \neq m$ in the latter equation.)

\begin{equation}
    N_{k}^{new} = N_{k}^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})
\end{equation}
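As a quick numerical check of the count update, here is a minimal NumPy sketch. The function name `update_counts` and the toy responsibilities are assumptions made for illustration; they are not part of the text.

```python
import numpy as np

def update_counts(N_k, gamma_old_m, gamma_new_m):
    """Incremental update of N_k for all K components at once (hypothetical helper)."""
    return N_k + gamma_new_m - gamma_old_m

rng = np.random.default_rng(0)
# Toy responsibilities for 5 points and 3 components; each row sums to 1.
gamma = rng.dirichlet(np.ones(3), size=5)
N_k = gamma.sum(axis=0)

# Pretend the E step produced new responsibilities for point m only.
m = 2
gamma_new_m = np.array([0.7, 0.2, 0.1])
N_k_incremental = update_counts(N_k, gamma[m], gamma_new_m)

# Reference: overwrite row m and recompute the full sum.
gamma[m] = gamma_new_m
N_k_batch = gamma.sum(axis=0)
```

The incremental result should agree with the full recomputation up to floating-point rounding, while touching only one row of the responsibilities.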

### Updating the mean
Now, let us derive the mean update. We use a similar update technique.

\begin{align}
    \mathbf{\mu}_k^{old} &= \frac{1}{N_k^{old}} \sum_{n=1}^{N} \gamma^{old} (z_{nk}) \mathbf{x}_n
\end{align}

Now we recompute the responsibilities, $\gamma(z_{mk})$, from a single point.

\begin{align}
    \mathbf{\mu}_k^{new} &= \frac{1}{N_k^{new}} \Big(
                            \sum_{n \neq m} \gamma^{old}(z_{nk}) \mathbf{x}_n + \gamma^{new}(z_{mk}) \mathbf{x}_m \Big) \\
    &= \frac{1}{N_k^{new}}
        \Big( N_k^{old} \mathbf{\mu}_k^{old} - \gamma^{old}(z_{mk}) \mathbf{x}_m + \gamma^{new}(z_{mk}) \mathbf{x}_m \Big) \\
    &= \frac{1}{N_k^{new}}
        \Big( \big ( N_{k}^{new} - \gamma^{new}(z_{mk}) + \gamma^{old}(z_{mk}) \big ) \mathbf{\mu}_k^{old} - \gamma^{old}(z_{mk}) \mathbf{x}_m + \gamma^{new}(z_{mk}) \mathbf{x}_m \Big) \\
    &= \mathbf{\mu}_k^{old} + 
        \Big( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_{k}^{new}} \Big)
        (\mathbf{x}_m - \mathbf{\mu}_k^{old})
\end{align}
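The mean update can be checked the same way. The sketch below is only an illustration for a single component $k$; `update_mean` and the random toy data are assumed names and values, not taken from the text.

```python
import numpy as np

def update_mean(mu_old, N_new, gamma_old_mk, gamma_new_mk, x_m):
    """Incremental mean update for one component (hypothetical helper)."""
    return mu_old + (gamma_new_mk - gamma_old_mk) / N_new * (x_m - mu_old)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                 # 6 points in 2 dimensions
gamma_k = rng.uniform(0.1, 0.9, size=6)     # responsibilities of component k

N_old = gamma_k.sum()
mu_old = gamma_k @ X / N_old

# New responsibility for point m only.
m, gamma_new_mk = 3, 0.95
N_new = N_old + gamma_new_mk - gamma_k[m]
mu_incremental = update_mean(mu_old, N_new, gamma_k[m], gamma_new_mk, X[m])

# Reference: recompute the weighted mean from scratch.
gamma_k[m] = gamma_new_mk
mu_batch = gamma_k @ X / gamma_k.sum()
```

Because no approximation enters the derivation of the mean update, the two results should match up to rounding error.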

### Updating the covariances

The covariance update follows the same pattern. We start from the definition of the covariance.

\begin{align}
    \mathbf{\Sigma}_k^{old} &= \frac{1}{N_k^{old}} \sum_{n} \gamma^{old} (z_{nk}) (\mathbf{x}_n - \mathbf{\mu}_k^{old})
        (\mathbf{x}_n - \mathbf{\mu}_k^{old})^{T}
\end{align}

Now we recompute the responsibilities, $\gamma(z_{mk})$, for a single point. The second line below replaces $\sum_{n} \gamma^{old}(z_{nk}) (\mathbf{x}_n - \mathbf{\mu}_k^{new}) (\mathbf{x}_n - \mathbf{\mu}_k^{new})^{T}$ with $N_k^{old} \mathbf{\Sigma}_k^{old}$, which is defined around $\mathbf{\mu}_k^{old}$ rather than $\mathbf{\mu}_k^{new}$. The result is therefore an approximation that is accurate when the mean moves only slightly.

\begin{align}
    \mathbf{\Sigma}_k^{new} &= \frac{1}{N_k^{new}} \Big( \sum_{n \neq m}
        \gamma^{old} (z_{nk}) (\mathbf{x}_n - \mathbf{\mu}_k^{new}) (\mathbf{x}_n - \mathbf{\mu}_k^{new})^{T} +
        \gamma^{new} (z_{mk}) (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T} \Big) \\
    &\approx \frac{1}{N_k^{new}} \Big(
        N_k^{old} \mathbf{\Sigma}_k^{old} - 
        \gamma^{old} (z_{mk}) (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T} +
        \gamma^{new} (z_{mk}) (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T} \Big)
\end{align}

To save space, define 

\begin{align}
    \mathbf{A} &= (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T}
\end{align}

Substituting $N_k^{old} = N_{k}^{new} - \gamma^{new}(z_{mk}) + \gamma^{old}(z_{mk})$, as in the mean update, gives

\begin{align}
    \mathbf{\Sigma}_k^{new} &\approx
        \frac{1}{N_k^{new}} \Big(
        N_k^{old} \mathbf{\Sigma}_k^{old} - 
        \gamma^{old} (z_{mk}) \mathbf{A} +
        \gamma^{new} (z_{mk}) \mathbf{A} \Big) \\
    &= \frac{1}{N_k^{new}} \Big(
        N_{k}^{new} \mathbf{\Sigma}_k^{old} - 
        \gamma^{new}(z_{mk}) \mathbf{\Sigma}_k^{old} + 
        \gamma^{old}(z_{mk}) \mathbf{\Sigma}_k^{old} - 
        \gamma^{old} (z_{mk}) \mathbf{A} +
        \gamma^{new} (z_{mk}) \mathbf{A} \Big) \\
    &= \mathbf{\Sigma}_k^{old} +
        \Big( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_{k}^{new}} \Big)
        (\mathbf{A} - \mathbf{\Sigma}_k^{old}) \\
    &= \mathbf{\Sigma}_k^{old} +
        \Big( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_{k}^{new}} \Big)
        \Big( (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T} - \mathbf{\Sigma}_k^{old} \Big)
\end{align}
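The sketch below applies the covariance update to one component and compares it with a full recomputation. Substituting $N_k^{old} \mathbf{\Sigma}_k^{old}$, which is centered on $\mathbf{\mu}_k^{old}$, for a sum centered on $\mathbf{\mu}_k^{new}$ makes the update approximate, so the comparison should show a small but nonzero gap. The helper names `update_cov` and `batch_stats` and the toy data are assumptions for illustration only.

```python
import numpy as np

def update_cov(Sigma_old, mu_new, N_new, gamma_old_mk, gamma_new_mk, x_m):
    """Incremental covariance update as derived above (hypothetical helper)."""
    d = x_m - mu_new
    A = np.outer(d, d)
    return Sigma_old + (gamma_new_mk - gamma_old_mk) / N_new * (A - Sigma_old)

def batch_stats(X, gamma_k):
    """Batch N_k, mean and covariance for one component."""
    N_k = gamma_k.sum()
    mu = gamma_k @ X / N_k
    diff = X - mu
    Sigma = (gamma_k[:, None] * diff).T @ diff / N_k
    return N_k, mu, Sigma

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
gamma_k = rng.uniform(0.1, 0.9, size=200)   # responsibilities of component k
N_old, mu_old, Sigma_old = batch_stats(X, gamma_k)

# New responsibility for point m only.
m, gamma_new_mk = 7, 0.95
gamma_old_mk = gamma_k[m]
N_new = N_old + gamma_new_mk - gamma_old_mk
mu_new = mu_old + (gamma_new_mk - gamma_old_mk) / N_new * (X[m] - mu_old)
Sigma_incremental = update_cov(Sigma_old, mu_new, N_new,
                               gamma_old_mk, gamma_new_mk, X[m])

# Reference: recompute everything from scratch.
gamma_k[m] = gamma_new_mk
_, _, Sigma_batch = batch_stats(X, gamma_k)
max_abs_error = np.abs(Sigma_incremental - Sigma_batch).max()
```

With a few hundred points, `max_abs_error` should be small compared with the entries of the covariance, because the discrepancy scales with the square of the ratio between the change in responsibility and $N_k^{new}$.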

### Updating the mixing coefficients

The mixing coefficients are defined as $\pi_{k} = \frac{N_{k}}{N}$, so their update follows directly from the update for $N_k$.

\begin{equation}
    \pi_{k}^{old} = \frac{N_{k}^{old}}{N}
\end{equation}

After recomputing the responsibilities for $\mathbf{x}_m$:

\begin{align}
    \pi_{k}^{new} &= \frac{N_{k}^{new}}{N} \\
        &= \frac{1}{N} \Big( \sum_{n \neq m} \gamma^{old}(z_{nk}) + \gamma^{new}(z_{mk}) \Big) \\
        &= \frac{1}{N} \Big( N_{k}^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \Big)
\end{align}
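A short sketch of the mixing-coefficient update follows. It assumes that every row of the responsibility matrix sums to 1, in which case the updated coefficients still sum to 1. The name `update_mixing` and the toy values are hypothetical.

```python
import numpy as np

def update_mixing(N_k_old, gamma_old_m, gamma_new_m, N):
    """Incremental update of the mixing coefficients (hypothetical helper)."""
    return (N_k_old + gamma_new_m - gamma_old_m) / N

rng = np.random.default_rng(3)
N, K = 8, 3
gamma = rng.dirichlet(np.ones(K), size=N)   # each row sums to 1
N_k_old = gamma.sum(axis=0)

# New responsibilities for point m only; they also sum to 1.
m = 5
gamma_new_m = np.array([0.1, 0.6, 0.3])
pi_new = update_mixing(N_k_old, gamma[m], gamma_new_m, N)

# Reference: overwrite row m and recompute N_k / N.
gamma[m] = gamma_new_m
pi_batch = gamma.sum(axis=0) / N
```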

# Algorithm for Online Gaussian Mixture

1. As usual, initialize the means $\mathbf{\mu}_k$, covariances $\mathbf{\Sigma}_k$, and mixing coefficients $\pi_k$, and evaluate the initial value of the log likelihood. In addition, initialize the responsibilities, $\gamma(z_{nk})$, to random numbers between 0 and 1. Use these to compute the initial $N_k$:

    \begin{equation}
        N_{k} = \sum_{n=1}^{N} \gamma(z_{nk}).
    \end{equation}
    
2. **E Step** Evaluate the responsibilities for a single data point, $\mathbf{x}_m$, using the current parameter values.

    \begin{equation}
        \gamma(z_{mk}) = \frac{\pi_{k} 
                                \mathcal{N}( \mathbf{x}_m \mid \mathbf{\mu}_k, \mathbf{\Sigma}_k)
                                }{\sum_{j=1}^{K} \pi_{j} \mathcal{N}( \mathbf{x}_m \mid \mathbf{\mu}_j, \mathbf{\Sigma}_j)}
    \end{equation}

3. **M Step** Update the parameters using the current responsibilities.

    \begin{align}
        N_{k}^{new} &= N_{k}^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \\
        \mathbf{\mu}_k^{new} &= \mathbf{\mu}_k^{old} + 
            \Big( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_{k}^{new}} \Big)
            (\mathbf{x}_m - \mathbf{\mu}_k^{old}) \\
        \mathbf{\Sigma}_k^{new} &= \mathbf{\Sigma}_k^{old} +
            \Big( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_{k}^{new}} \Big)
            \Big( (\mathbf{x}_m - \mathbf{\mu}_k^{new}) (\mathbf{x}_m - \mathbf{\mu}_k^{new})^{T} - \mathbf{\Sigma}_k^{old} \Big) \\
        \pi_{k}^{new} &= \frac{1}{N} \Big( N_{k}^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \Big)
    \end{align}
    
4. Evaluate the log likelihood

    \begin{equation}
        \ln p(\mathbf{X} \mid \mathbf{\mu}, \mathbf{\Sigma}, \mathbf{\pi}) =
            \sum_{n=1}^{N} \ln 
            \Big\{ \sum_{k=1}^{K} \pi_{k} \mathcal{N}(\mathbf{x}_n \mid \mathbf{\mu}_k, \mathbf{\Sigma}_{k}) \Big\}
    \end{equation}
    
    and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
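
The four steps can be put together in a short NumPy sketch. It is an illustration under stated assumptions rather than a definitive implementation, and all function names (`gaussian_pdf`, `responsibilities`, `online_gmm`) and the toy data are hypothetical. It departs from the steps above in two places:

- It initializes the responsibilities with one batch E step and one batch M step instead of random numbers. The incremental updates preserve any mismatch between the stored responsibilities and $N_k$, $\mathbf{\mu}_k$, so starting from a consistent state seemed the safer choice here.
- It evaluates the log likelihood once per sweep over the data rather than after every single-point update.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) for each row x of X, where X has shape (n, D)."""
    D = X.shape[1]
    diff = X - mu
    maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def responsibilities(X, pi, mu, Sigma):
    """E step for the rows of X; also returns the mixture density of each row."""
    weighted = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                         for k in range(len(pi))], axis=1)   # shape (n, K)
    total = weighted.sum(axis=1, keepdims=True)
    return weighted / total, total[:, 0]

def online_gmm(X, mu_init, n_sweeps=50, tol=1e-6):
    N, D = X.shape
    K = len(mu_init)
    # Step 1 (assumption): one batch E step and M step, so that the stored
    # responsibilities agree with N_k, mu_k and Sigma_k.
    mu = np.array(mu_init, dtype=float)
    Sigma = np.stack([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    gamma, _ = responsibilities(X, pi, mu, Sigma)
    N_k = gamma.sum(axis=0)
    for k in range(K):
        mu[k] = gamma[:, k] @ X / N_k[k]
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
    pi = N_k / N

    log_lik = np.log(responsibilities(X, pi, mu, Sigma)[1]).sum()
    for _ in range(n_sweeps):
        for m in range(N):
            # Step 2: E step for the single point x_m.
            gamma_new, _ = responsibilities(X[m:m + 1], pi, mu, Sigma)
            delta = gamma_new[0] - gamma[m]
            # Step 3: incremental M step.
            N_k = N_k + delta
            for k in range(K):
                mu[k] += delta[k] / N_k[k] * (X[m] - mu[k])
                d = X[m] - mu[k]              # deviation from the new mean
                Sigma[k] += delta[k] / N_k[k] * (np.outer(d, d) - Sigma[k])
            pi = N_k / N
            gamma[m] = gamma_new[0]
        # Step 4: check the log likelihood once per sweep (assumption).
        new_log_lik = np.log(responsibilities(X, pi, mu, Sigma)[1]).sum()
        converged = abs(new_log_lik - log_lik) < tol
        log_lik = new_log_lik
        if converged:
            break
    return pi, mu, Sigma, log_lik

# Toy data: two well-separated clusters of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
pi, mu, Sigma, log_lik = online_gmm(X, mu_init=[[-1.0, -1.0], [1.0, 1.0]])
```

On this toy data the recovered means should land near the two cluster centers. For real data, a production implementation would likely work with log densities and guard the covariances against becoming singular, both of which this sketch omits.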