# Gaussian Mixture Model

## Overview

K-means produces clusters in which each sample belongs to exactly one cluster, and the resulting clusters tend to have roughly equal spatial extent.
However, in practice neither assumption may be justified. In this section we will discuss a clustering methodology where each sample belongs to every cluster with a given probability.
Namely, we will discuss <a href="https://scikit-learn.org/stable/modules/mixture.html">Gaussian mixture model</a> or GMM clustering. In this approach a cluster
is represented by three elements: a mean, a variance (a covariance matrix in more than one dimension) and a weight.

## Gaussian mixture model

Let's consider a dataset $\mathbf{D}$ with samples $\mathbf{d}_i$. As always, a generating process $p_{\mathbf{D}}$ is implied about the data. We assume that the whole
distribution is a weighted sum of $k$ Gaussian distributions, one per cluster. Given cluster $j$, a sample is distributed as $p(\mathbf{d}_i | C=j) = N\left(\mathbf{d}_i|\mu_j, \Sigma_j\right)$.
Therefore, the probability density of observing a sample, marginalized over the clusters, can be expressed according to

\begin{equation}
p(\mathbf{d}_i) = \sum_{j=1}^{k}w_j N\left(\mathbf{d}_i|\mu_j, \Sigma_j\right)
\end{equation}

where $w_j = p(C=j)$ is the weight associated with the $j$-th Gaussian distribution. In order for the expression above to represent a valid probability density,
the weights must be non-negative and we need to have

\begin{equation}
\sum_j w_j = 1
\end{equation}
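As a minimal sketch of the mixture density above, the following evaluates a weighted sum of univariate Gaussians and checks numerically that it integrates to roughly one. The one-dimensional setting, the function names `gaussian_pdf` and `mixture_pdf`, and the parameter values are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Mixture density: sum over j of w_j * N(x | mu_j, var_j)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture (values chosen only for illustration)
weights = [0.3, 0.7]
means = [-2.0, 3.0]
variances = [1.0, 4.0]

# Riemann sum on [-30, 30]: should be close to 1 for a valid density
step = 0.01
total = sum(mixture_pdf(-30.0 + i * step, weights, means, variances) * step
            for i in range(6000))
print(total)
```

If the weights did not sum to one, the same Riemann sum would return the sum of the weights instead, which is why the constraint is needed.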

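Clustering of a point can then be done by assigning it to the cluster with the highest posterior probability, which follows from Bayes' rule:

\begin{equation}
C_i = \arg\max_{j} p(C=j | \mathbf{d}_i) = \arg\max_{j} \frac{w_j N\left(\mathbf{d}_i|\mu_j, \Sigma_j\right)}{\sum_{l=1}^{k} w_l N\left(\mathbf{d}_i|\mu_l, \Sigma_l\right)}
\end{equation}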
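The assignment rule can be sketched in a few lines. The example below computes, for a univariate mixture with known parameters, the probability that a point belongs to each cluster, and then picks the most probable one. The helper names `responsibilities` and `assign` and all parameter values are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Probability of each cluster given x: w_j * N_j(x) divided by the mixture density."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # can underflow to 0 for points very far from every cluster
    return [p / total for p in joint]

def assign(x, weights, means, variances):
    """Index of the most probable cluster (ties go to the lowest index)."""
    post = responsibilities(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical mixture: two equally weighted clusters centred at 0 and 4
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

for x in (0.0, 2.0, 4.0):
    post = responsibilities(x, weights, means, variances)
    print(x, [round(p, 3) for p in post], assign(x, weights, means, variances))
```

The point at 2.0 lies exactly halfway between the two centres, so both clusters receive the same probability; a hard assignment hides this ambiguity while the probabilities keep it.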

The question now arises: how do we estimate the weights $w_j$, together with the means $\mu_j$ and covariances $\Sigma_j$? The classical and natural method for computing the
maximum-likelihood estimates (MLEs) for mixture distributions is the <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">Expectation-Maximization</a> (EM) algorithm. We will not go into the details here, as the calculations can become rather lengthy.

In the E-step, the algorithm apportions the unit weight of each observation
among the $k$ clusters. If the observation is close to the
centroid of a particular cluster, and far from the others, it will receive a
mass close to one for that cluster. On the other hand, observations halfway
between two clusters will get approximately equal weight for both.
In the M-step, each observation is used $k$ times, once for each component
density and with a different weight for each, to re-estimate the
parameters $w_j$, $\mu_j$ and $\Sigma_j$. The algorithm
requires initialization, which can have an impact, since mixture likelihoods
are generally multimodal. One common strategy is to fit
a K-means clustering model, with multiple random starts,
to the data. This partitions the observations into $k$ disjoint groups, from
which an initial weight matrix, consisting of zeros and ones, is created.
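The two steps can be sketched for univariate data in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the data are synthetic, a fixed number of iterations replaces a convergence test, and the zero-one starting weight matrix comes from a single nearest-centre assignment rather than a full K-means fit with multiple random starts.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Split the unit weight of every observation among the clusters."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    return resp

def m_step(data, resp, k, min_var=1e-6):
    """Re-estimate weight, mean and variance of each cluster from weighted data."""
    n = len(data)
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)  # effective number of points in cluster j
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, min_var))  # floor avoids a collapsed variance
    return weights, means, variances

def fit_gmm_1d(data, centres, n_iter=50):
    """Alternate M- and E-steps, starting from a zero-one weight matrix."""
    k = len(centres)
    # Hard initial assignment: each point gets weight 1 for its nearest centre
    resp = []
    for x in data:
        nearest = min(range(k), key=lambda j: abs(x - centres[j]))
        resp.append([1.0 if j == nearest else 0.0 for j in range(k)])
    for _ in range(n_iter):
        weights, means, variances = m_step(data, resp, k)
        resp = e_step(data, weights, means, variances)
    return weights, means, variances

# Synthetic data: 300 points around -2 and 700 points around 3
random.seed(0)
data = ([random.gauss(-2.0, 1.0) for _ in range(300)]
        + [random.gauss(3.0, 0.5) for _ in range(700)])

weights, means, variances = fit_gmm_1d(data, centres=[min(data), max(data)])
print(weights, means, variances)
```

Because mixture likelihoods are multimodal, a different choice of `centres` can lead to a different solution; running from several starting points and keeping the fit with the highest likelihood is a common remedy.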

## Summary

GMM can also be thought of as a prototype method, similar in spirit to K-means;
each cluster is described in terms of a Gaussian density, which has a centroid, just like in K-means, and a covariance matrix. 
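For a concrete view of these per-cluster quantities, the sketch below fits scikit-learn's `GaussianMixture` to synthetic two-dimensional data and prints the fitted weights, centroids and covariance matrices, followed by the membership probabilities of a few samples. The data, the number of components and the chosen settings are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: a tight blob near (0, 0) and a wider blob near (4, 4)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(200, 2)),
])

# n_init > 1 repeats the fit from several initializations and keeps the best
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0)
gmm.fit(X)

print("weights:", gmm.weights_)            # one weight per cluster
print("centroids:", gmm.means_)            # one mean vector per cluster
print("covariances:", gmm.covariances_)    # one 2x2 matrix per cluster

proba = gmm.predict_proba(X[:3])   # probability of each cluster for each sample
labels = gmm.predict(X[:3])        # most probable cluster for each sample
print(proba, labels)
```

Unlike K-means, the fitted covariance matrices let the two clusters have different sizes and shapes, and `predict_proba` exposes how confidently each sample is attributed to a cluster.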

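The comparison becomes crisper if we restrict the component
Gaussians to have a scalar covariance matrix. The two steps
of the alternating EM algorithm are very similar to the two steps in K-means.
The E-step plays the role of the assignment step, except that each sample is assigned softly, with a weight for every cluster.
The M-step plays the role of the centroid update, except that every sample contributes to the mean (and covariance) of every cluster in proportion to its weight.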
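
To make the scalar-covariance comparison concrete, here is a short worked derivation. It assumes that every component shares the covariance $\Sigma_j = \sigma^2 I$, and it introduces the symbol $\gamma_{ij}$ (not used elsewhere in this section) for the weight of sample $\mathbf{d}_i$ in cluster $j$. Under this assumption the Gaussian normalizing constants cancel and the E-step weight becomes

\begin{equation}
\gamma_{ij} = \frac{w_j \exp\left(-\|\mathbf{d}_i - \mu_j\|^2 / 2\sigma^2\right)}{\sum_{l=1}^{k} w_l \exp\left(-\|\mathbf{d}_i - \mu_l\|^2 / 2\sigma^2\right)}
\end{equation}

while the M-step update of the centroid is the weighted mean

\begin{equation}
\mu_j = \frac{\sum_i \gamma_{ij} \mathbf{d}_i}{\sum_i \gamma_{ij}}
\end{equation}

As $\sigma^2 \to 0$, and provided all $w_j > 0$ and the closest centroid is unique, $\gamma_{ij}$ tends to one for the centroid closest to $\mathbf{d}_i$ and to zero for all others. The E-step then reduces to the K-means assignment step and the M-step to the K-means centroid update.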

## References