# Gaussian Mixture Models - 2

## Content

- GMM: Optimization
- Generative Methods in the context of GMM
- Maximum Likelihood Estimation in GMM
- Maximum Likelihood Estimation: Algorithm
- Visualization of GMM
- Advantages and Disadvantages of GMMs and Expectation Maximization


***

## GMM: Optimization

#### **Let's start with a 1-Dimensional Gaussian Distribution**

- That is, a Gaussian Distribution having only 1 feature.

- Let's say we have n observations of a random variable $x$ in our data:

    $x_1, x_2, x_3, ..., x_n$

- If we know that this random variable follows a Gaussian Distribution $N(\mu, \sigma)$, we can estimate the **Mean** $\mu$ and the **Standard Deviation** $\sigma$ (or **Variance** $\sigma^2$) of the Distribution as follows:

<img src='https://drive.google.com/uc?id=1YpHuscmOGZz7SAqUcggbbJC75ycQB5tk'>
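
As a rough illustration of these estimates, the sketch below computes them with NumPy on synthetic data. The sample itself (1,000 draws with mean 5 and standard deviation 2) is an assumption made for the example, and the variance divides by $n$, which is the maximum-likelihood form; the figure above may use $n - 1$ instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 1,000 draws from a Gaussian with mu = 5 and sigma = 2
x = rng.normal(loc=5.0, scale=2.0, size=1000)

n = x.shape[0]
mu_hat = x.sum() / n                        # estimated mean
var_hat = ((x - mu_hat) ** 2).sum() / n     # estimated variance (divides by n)
sigma_hat = np.sqrt(var_hat)                # estimated standard deviation

print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
```

With enough observations, the two estimates should land close to the values used to generate the data.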











#### **Now, moving on to d-Dimensional Data**

- We'll have to use **Multi-Dimensional Gaussian Distributions**.

- Let's say we again have $n$ data points in our dataset:
    
 $x_1, x_2, x_3, ..., x_n$

 Where, each $x_i$ is a d-Dimensional data point, i.e., each data point has $d$ features.

 And, the random variable $X$ can be represented as a d-Dimensional Gaussian Distribution $N_d(\mu^{(d)}, \Sigma^{(d \times d)})$

- Remember? The **Vector of Means** $\mu^{(d)}$ and the **Covariance Matrix** $\Sigma^{(d \times d)}$ are the **parameters of a d-Dimensional Gaussian Distribution**.

- We can compute the vector of Means $\mu^{(d)}$ and the $d \times d$ Covariance Matrix $\Sigma^{(d \times d)}$ for the Gaussian Distribution of d-dimensional data as follows:

<img src='https://drive.google.com/uc?id=1YDl8OcDuRlbl0qo_67n9M2j8YN5trFtQ'>
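
The same computation in $d$ dimensions can be sketched as follows. The data is synthetic (the names `true_mean` and `true_cov` are made up for the example), and the covariance again divides by $n$, which is an assumption about the exact form used in the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-dimensional data drawn from a known Gaussian
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=2000)   # shape (n, d)

n, d = X.shape
mu_hat = X.mean(axis=0)                  # vector of means, shape (d,)
centered = X - mu_hat                    # subtract the mean from every point
Sigma_hat = centered.T @ centered / n    # covariance matrix, shape (d, d)

print(mu_hat.round(2))
print(Sigma_hat.round(2))
```

The diagonal of `Sigma_hat` holds the per-feature variances, and the off-diagonal entries hold the covariances between pairs of features.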






### **Now, What is our objective here?**

#### **We want to find the K-Gaussians so that we can develop a Gaussian Mixture Model from the given data.**

- Given d-Dimensional data,

  For each of the $K$ Gaussians, we need to compute the d-dimensional vector of means $\mu^{(d)}$ and the $d \times d$ Covariance Matrix $\Sigma^{(d \times d)}$

  That is, find the K-Gaussian Distributions.



<img src='https://drive.google.com/uc?id=1Du-ZB3JEiL7Fw2tI0QmBepzxPolMYjkn'>





***

#### **Now the question is: How can we find these K-Gaussian Distributions?**

## Generative Methods in the context of GMM

- Generative methods start by making some assumptions on how the data has been generated.

#### **What do you think will be the assumption in Gaussian Mixture Models on how the data has been generated?**

- It assumes that the **data was generated from a Multi-modal Gaussian Distribution**, i.e., from a mixture of several Gaussians.

- For K-Gaussian Distributions, there is a probability of each data point belonging to each of the K-Gaussians, which can be mathematically represented as:

 $P(Y = j) ∀j : 1 → K$

 where $Y$ is the random variable that can take values from 1 to $K$,

 and $j$ refers to the $j^{th}$ Gaussian (cluster).

- It tells us the probability of a data point belonging to a Gaussian (or Cluster).

- This probability function can be applied to all the data points in our dataset. Essentially, it gives us the probability of a data point $x_i$ belonging to each of the $K$-Gaussians (hence, $∀j : 1 → K$) in the Gaussian Mixture Model.

<img src='https://drive.google.com/uc?id=1OsanNAzR8EM3pP2Xdp_g6P-jRfbd7WKo'>

- So, GMMs are probabilistic models that assume that the instances were generated from a mixture of several Gaussian distributions (a.k.a Normal Distributions) whose parameters are unknown.

- All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid.

- Each cluster is modelled according to a different Gaussian distribution, with a different Mean $\mu_j$ and Covariance $\Sigma_j$.

- Each cluster can have a different ellipsoidal shape, size, density, orientation and weights relative to each other.

When you observe an instance, you know it was generated from one of the Gaussian distributions, but you are not told which one, and you do not know what the parameters of these distributions are.

Note: The computational complexity of training a GaussianMixture model depends on the number of instances `n`, the number of dimensions `d`, the number of clusters `K`, and the constraints on the covariance matrices.

Each cluster is modelled according to a different Gaussian distribution. Mathematically, we can define a GMM as a mixture of K Gaussian distributions, that is, a weighted average of K Gaussian distributions. So we can write the data distribution as:

![gmm_formula.png](attachment:gmm_formula.png)

Where $N(x \mid \mu_k, \Sigma_k)$ represents a cluster in the data with mean $\mu_k$, covariance $\Sigma_k$ and weight $\pi_k$.




### **Multi-nomial Random Variable**

#### **What is the distribution of the random variable Y in case of K-Gaussian Distributions?**

- Remember what we call a random variable that can take only two values, 0 and 1, with a certain probability of taking each value?

  - **Bernoulli Variable**

- Similarly, a random variable $Y$ that can take values from 1 to $K$ with a probability associated with each value is called a **Multi-nomial Variable** (for a single draw, this is also known as a categorical variable).

<img src='https://drive.google.com/uc?id=1V3N_4q4JrEgsic2bZU2Uq7h9UO2WYldV'>

#### **Now, Let's say we know these probabilities, i.e., we know the probability of a data point coming from the $j^{th}$ Gaussian**.

- That is, $P(Y = j) ∀j : 1 → K$ is known to us.

#### **Let's say we also know the parameters of Gaussian Distribution**

- That is, the d-Dimensional vector of Means $\mu^{(d)}$ and the $d \times d$ Covariance Matrix $\Sigma^{(d \times d)}$ for each underlying Gaussian are known to us.

- Then we can generate a sample data point in two steps: first draw a value of $Y$ (for example, $Y = 1, 2, 3 ...$) according to these probabilities, and then draw a point from the Gaussian Distribution that $Y$ selects.

<img src='https://drive.google.com/uc?id=1wzO5Zc0mtwei0wjJxWsYm_SOGjtpfzjV'>


Now, if we repeat the above steps many times, we can generate a whole set of data points, i.e., a dataset that is very similar to our original observed d-Dimensional dataset.

- This works because of the **assumption of Generative Methods in the context of GMM**, that the **data is generated from (or comes from) a d-Dimensional Multi-modal Gaussian Distribution** $N_d(\mu^{(d)}, \Sigma^{(d \times d)})$.

<img src='https://drive.google.com/uc?id=1BLIpEZmoXA08VAR_IAOxMtPdaAR1NgnE'>
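
The generative process described above can be sketched in a few lines of NumPy. All parameter values here (`weights`, `means`, `covs`) are hypothetical and chosen only for illustration; the helper name `sample_gmm` is likewise made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters for K = 3 Gaussians in d = 2 dimensions
weights = np.array([0.5, 0.3, 0.2])                     # P(Y = j), sums to 1
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
covs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.8], [0.8, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
])

def sample_gmm(n_samples):
    # Step 1: draw the cluster index Y for every point
    y = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: draw each point from the Gaussian selected by its Y
    X = np.empty((n_samples, means.shape[1]))
    for j in range(len(weights)):
        idx = np.flatnonzero(y == j)
        X[idx] = rng.multivariate_normal(means[j], covs[j], size=idx.size)
    return X, y

X, y = sample_gmm(1000)
print(X.shape, np.bincount(y) / len(y))
```

The printed cluster proportions should be close to `weights`, since each point picks its Gaussian with those probabilities.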






### **Difference from Discriminative Methods**

- All the methods and algorithms we have studied before this were **Discriminative Methods**.

- It means that given a data point in the dataset, all previous techniques tried to discriminate whether the point belongs to Cluster 1, or Cluster 2, ... so on.

- They did not try to generate the data. They only discriminated the existing data into different groups.

- However, now with GMMs, we are studying **Generative Methods of Clustering**.

<img src='https://drive.google.com/uc?id=1JGxf6uPSvE4QAHXLkah347M7g8QjafV6'>


***

## Maximum Likelihood Estimation in GMM

#### **A Gaussian Mixture Model is very similar to K-Means**

<img src='https://drive.google.com/uc?id=1ZoJWbJ8tcmZECoWanozwARwL4NJ7HuJx'>

- In **GMM**, we do a **soft cluster assignment** → Because every data point has a probability of belonging to each of the Gaussians (Clusters).

- Whereas, **K-Means** is a **hard cluster assignment** → Because every data point is assigned to only one cluster.

- Under the hood, the Generative Method of determining K-Gaussians in a GMM relies on a principle called **Maximum Likelihood Estimation**.

- As we have seen so far, a Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose **parameters** are unknown.

- The problem here is to find those parameters, i.e., for a d-Dimensional K-Gaussian Mixture Model (GMM), we need to determine all the $K$ probabilities, all the $K$ d-Dimensional vectors of Means and all the $K$ $d \times d$ Covariance Matrices.

- All these form the set of parameters $\theta$ that we need to find.

#### **Mathematical Formulation of Maximum Likelihood Estimation Problem**

So, we need to find:

 $\theta = [P(Y = j) ∀j : 1 → K; \mu_j^{(d)}, \Sigma_j^{(d \times d)} ∀j : 1 → K]$

<img src='https://drive.google.com/uc?id=10zEQ_TiIU9xzagh8eyoP9LGn18GWnt5N'>


- When we observe an instance, we know it was generated from one of the Gaussian distributions, but we are not told which one, and we do not know what the parameters of these distributions are.

### **Problem Statement with Maximum Likelihood Estimation**

Given the dataset $D: \{x_i\}_1^n$,

Find the parameters $\theta$, such that the probability of generating the dataset $D$ is maximum.

<img src='https://drive.google.com/uc?id=10zEQ_TiIU9xzagh8eyoP9LGn18GWnt5N'>

- This is the basic idea behind Maximum Likelihood Estimation. We need to find the set of parameters such that the probability of generating the given dataset is maximized.










It is called **Maximum Likelihood Estimation** because we find the set of parameters $\theta$, such that the probability $P(D)$ of observing the dataset $D$ is maximized.

- Now the dataset $D$ consists of data points $x_1, x_2, x_3, ..., x_n$.

- Another basic assumption in the MLE approach is that all the data points are independent of each other.

- So, $\max_\theta P(D) = \max_\theta P(x_1, x_2, x_3, ..., x_n)$
    
    $= \max_\theta P(x_1) \cdot P(x_2) \cdot P(x_3) \cdots P(x_n)$

    $= \max_\theta \prod_{i=1}^{n} P(x_i)$

<img src='https://drive.google.com/uc?id=1sgY8Z4HbaHV-w-esd_7aTmTfGV0Cv7SD'>

For a **K-Gaussian Mixture Model**, the problem can be formulated as follows:

<img src='https://drive.google.com/uc?id=1OPcl2RqtvebUgaNfAbU-myp4-g7cfJ5H'>

- This is something similar to **Conditional Probability**, where we have **Cluster Priors** (similar to Class Priors) and the **Likelihood of generating a data point $x_i$ using the $j^{th}$ Gaussian**.

<img src='https://drive.google.com/uc?id=11fpuSLONlp8XITRHgMR3N6k6PGww8-cH'>
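
To make the objective concrete, here is a minimal sketch that evaluates it for a given set of parameters. It works with the logarithm of the product, which turns the product over data points into a sum and has the same maximizer. The function names `gaussian_pdf` and `gmm_log_likelihood` and the tiny dataset are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mean
    # (x - mu)^T Sigma^{-1} (x - mu) for each row of X
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_log_likelihood(X, weights, means, covs):
    """log of prod_i sum_j P(Y = j) * N(x_i | mu_j, Sigma_j)."""
    per_point = sum(
        w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)
    )
    return np.log(per_point).sum()

# Hypothetical data: two points near (0, 0) and two near (5, 5)
X = np.array([[0.0, 0.1], [4.9, 5.2], [0.2, -0.1], [5.1, 4.8]])
weights = np.array([0.5, 0.5])
covs = np.array([np.eye(2), np.eye(2)])

good = gmm_log_likelihood(X, weights, np.array([[0.0, 0.0], [5.0, 5.0]]), covs)
bad = gmm_log_likelihood(X, weights, np.array([[2.0, 2.0], [8.0, 8.0]]), covs)
print(good, bad)
```

Parameters whose means sit on the data give a higher log-likelihood than parameters whose means sit away from it, which is exactly what the maximization searches for.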





***

## Maximum Likelihood Estimation: Algorithm

### **This optimization problem is solved with the help of Expectation Maximization**

- **Expectation Maximization is essentially a 2-step process: Expectation or E step and Maximization or M step**.


- First, we choose starting guesses for the location and shape of the Gaussians; the number of Gaussians $K$ is fixed in advance.

- Then, we repeat until convergence:

    - **E-step:** For each point, find weights encoding the probability of membership in each cluster.
    
    - **M-step:** For each cluster, update its location, normalization, and shape based on all data points, making use of the weights.

<img src='https://drive.google.com/uc?id=12HZqAmGODyZZ9wHmrGUPTpWznqJf4cTm'>




#### **Let's expand on the two steps of Expectation Maximization a bit more**

- Let's say we are given $n$ data points: $x_i$'s where $i$ ranges from 1 to $n$.

- Our ultimate goal is to determine which Gaussian (Cluster) each $x_i$ belongs to.

<img src='https://drive.google.com/uc?id=1oI-l6oUQyIW-jKDDMkOjdiqXRsn76e8a'>

1. **Expectation (E) Step** - For each $x_i$, we compute the probability of it belonging to $j^{th}$ cluster.

<img src='https://drive.google.com/uc?id=1H9OVGASNCuVDVcoMiFombHyoeVRuYI4J'>

<img src='https://drive.google.com/uc?id=1R2YMk171T5x8As24z3nrO4zbmg67NpLH'>



2. **Maximization (M) Step** - Re-estimate the Gaussian parameters for all Gaussians.

<img src='https://drive.google.com/uc?id=1LwLrPC2A1j87bVyKozr01Ir_W-2NZqJQ'>
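
The two steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a production implementation: the initialization (one random point, then the points farthest from the means already chosen), the fixed number of iterations, and the small `reg` term added to each covariance are all choices made for this example.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) at every row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def e_step(X, weights, means, covs):
    """Responsibilities resp[i, j] = P(Y = j | x_i) under the current parameters."""
    joint = np.column_stack([
        w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)
    ])
    return joint / joint.sum(axis=1, keepdims=True)

def m_step(X, resp, reg=1e-6):
    """Re-estimate weights, means and covariances from the responsibilities."""
    n, d = X.shape
    nk = resp.sum(axis=0)                    # effective number of points per cluster
    weights = nk / n
    means = (resp.T @ X) / nk[:, None]
    covs = []
    for j in range(resp.shape[1]):
        diff = X - means[j]
        cov = (resp[:, j, None] * diff).T @ diff / nk[j]
        covs.append(cov + reg * np.eye(d))   # small ridge (an assumption) keeps cov invertible
    return weights, means, np.array(covs)

def fit_gmm(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumed initialization: a random point, then the farthest remaining points
    means = [X[rng.integers(n)]]
    for _ in range(1, k):
        dist = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[np.argmax(dist)])
    means = np.array(means)
    weights = np.full(k, 1.0 / k)
    covs = np.array([np.atleast_2d(np.cov(X, rowvar=False))] * k)
    for _ in range(n_iter):
        resp = e_step(X, weights, means, covs)      # E-step
        weights, means, covs = m_step(X, resp)      # M-step
    return weights, means, covs, resp

# Hypothetical data: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([6, 6], 1.0, size=(150, 2))])
weights, means, covs, resp = fit_gmm(X, k=2)
print(means.round(2))
```

Each row of `resp` sums to 1, which is the soft assignment of that point across the clusters.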








### **Expectation Maximization (EM) is a Coordinate Ascent**

- It is **Ascent** because it is a Maximization problem.

- In **Coordinate Ascent**, in order to reach the optimal point, we maximize the objective with respect to 1 parameter (or 1 group of parameters) while all other parameters are held fixed. This is then followed by maximizing with respect to another parameter while keeping all the others fixed.

- We do this ascent till we reach the optimal value of the combination of parameters.

<img src='https://drive.google.com/uc?id=1GT3KKznEezx5e31PGXmzt91g5dx1dEQO'>
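
The idea of coordinate ascent can be illustrated on a toy problem that is unrelated to GMMs. The objective `f` below is a made-up concave function of two variables; each update maximizes it exactly over one variable while the other is held fixed.

```python
def f(x, y):
    # Hypothetical concave objective with a single maximum
    return -(x - 1) ** 2 - (y + 2) ** 2 - x * y

x, y = 0.0, 0.0
history = [f(x, y)]
for _ in range(20):
    x = 1 - y / 2      # best x for the current y (solves df/dx = 0)
    y = -2 - x / 2     # best y for the current x (solves df/dy = 0)
    history.append(f(x, y))

print(round(x, 4), round(y, 4), round(history[-1], 4))
```

The objective value never decreases from one pass to the next. EM follows the same pattern, alternating between the cluster membership probabilities (E-step) and the Gaussian parameters (M-step).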


***


## Visualization of GMM

- As we mentioned earlier, Gaussian Mixture Models are probabilistic models that assume that the instances were generated from a mixture of several Gaussian Distributions whose parameters are unknown.

- All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid.

- Each cluster is modelled according to a different Gaussian distribution, with a different Mean $\mu_j$ and Covariance $\Sigma_j$.

- Each cluster can have a different ellipsoidal shape, size, density, orientation and weights relative to each other.

- When we observe an instance, we know it was generated from one of the Gaussian distributions, but we are not told which one, and we do not know what the parameters of these distributions are.

**Note:** The computational complexity of training a GaussianMixture model depends on the number of instances `n`, the number of dimensions `d`, the number of clusters `K`, and the constraints on the Covariance Matrices.

- Each cluster is modelled according to a different Gaussian distribution. Mathematically, we can define a GMM as a mixture of K Gaussian Distributions, that is, a weighted average of K Gaussian Distributions.


- Let's say we have 3 Clusters (Gaussians) to begin with - Red, Green and Blue contours that you see below:

<img src='https://drive.google.com/uc?id=1nY-WkObSQ2t3YMT5hLzBA8q5aI6ZzQc5'>

- The small circles represent the data points.

- Each circle has portions of Red, Green and Blue colors, where each colored portion represents the initial probability of the data point belonging to each of the clusters.

- For example, if a data point is more Red, it has higher probability of belonging to Red cluster, and if it is more Blue, it will have higher probability of belonging to Blue cluster, ... and so on. And if a data point has equal portions of all three colors, it has equal probability of belonging to each of the 3 clusters.

- In the beginning, the vector of Means $\mu^{(d)}$ and the Covariance Matrix $\Sigma^{(d \times d)}$ of each Gaussian are initialized at random.


- There are $n$ random variables $Y(i)$ (from $Y(1)$ to $Y(n)$) and $n$ random variables $x_i$. There are also $K$ means $\mu_j$ and $K$ covariance matrices $\Sigma_j$. Lastly, there is just one weight vector $\pi$ (containing all the mixing weights $\pi_1$ to $\pi_K$, i.e., the probabilities $P(Y = j)$).

- Each variable $Y(i)$ is drawn from the categorical distribution with weights $\pi$. Each variable $x_i$ is drawn from the normal distribution, with the mean and covariance matrix defined by its cluster $Y(i)$.


As shown in the picture below, after the first iteration of the Expectation and Maximization Steps, the Green Gaussian Distribution starts moving towards the first set of points, the Red Gaussian towards the second set of points and the Blue Gaussian towards the third set of points.

<img src='https://drive.google.com/uc?id=1DaLsOyjG2mplbey_ySTa0RTZ4r6X7oSx'>

#### **At each iteration, the parameters of each of the K-Gaussians are updated so that the objective $\max_\theta \prod_{i=1}^{n} P(x_i)$ moves towards its optimal (here, maximum) value**.

- For each iteration, during the **Expectation Step**, the algorithm estimates, for each instance, the probability that it belongs to each cluster (based on the current cluster parameters).

- Then, during the **Maximization Step**, each cluster is updated using all the instances in the dataset, with each instance weighted by the estimated probability that it belongs to that cluster. These probabilities are also called the **responsibilities** of the clusters for the instances.

- During the **Maximization Step**, each cluster’s update will mostly be impacted by the instances it is most responsible for.


- After a few more iterations of the E and M steps, the clusters start becoming much clearer. The probability of each data point belonging to a specific cluster starts increasing, and its probability of belonging to the other clusters starts decreasing.

<img src='https://drive.google.com/uc?id=1OMNPRxXvxs1JSjQLsPaMQgF5T6If_-qd'>

- After more iterations, see how the clusters become well-formed.

- The probability of each data point belonging to a specific cluster becomes very high, and the probability of it belonging to other clusters becomes extremely low (but non-zero). Hence, it's a **soft assignment**.

<img src='https://drive.google.com/uc?id=1Fyd1uZFnOq8yQZD5FqWRws4corcL0RuH'>

- As we can see, each colored cluster is represented by its respective contour.

- We can think of EM as a generalization of K-Means that not only finds the cluster centers ($\mu_1$ to $\mu_K$), but also their size, shape, and orientation ($\Sigma_1$ to $\Sigma_K$), as well as their relative weights (the mixing weights $P(Y = j)$, written $\pi_1$ to $\pi_K$).

- That is why we said that, unlike K-Means, Expectation Maximization uses **soft cluster assignments**, not hard assignments.


So, this is how Expectation Maximization works to identify the shape of Clusters in a Gaussian Mixture Model, as well as grouping the data points into those Clusters.
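
For readers who want to try this, the sketch below fits a GMM with scikit-learn's `GaussianMixture` and contrasts the soft and hard assignments. The three synthetic clusters are an assumption made for the example, and this is only one possible way to set it up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-dimensional data with three clusters
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(200, 2)),
    rng.normal([6, 0], 1.0, size=(200, 2)),
    rng.normal([3, 5], 1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                      # runs Expectation Maximization internally

soft = gmm.predict_proba(X)     # shape (n, K): probability of each cluster per point
hard = gmm.predict(X)           # shape (n,): index of the most probable cluster

print(gmm.weights_.round(2))    # relative weights of the clusters
print(gmm.means_.round(2))      # cluster centers
print(soft[:3].round(3))        # soft assignment of the first three points
```

The fitted `means_`, `covariances_` and `weights_` correspond to the centers, shapes and relative weights discussed above, while `predict_proba` exposes the soft assignment that K-Means does not provide.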

***

## Advantages and Disadvantages of GMMs and Expectation Maximization

### **Advantages**

- The two basic steps of the EM algorithm, i.e., the E-step and the M-step, are often easy to implement for many machine learning problems.

- The solution to the M-step often exists in closed form.

- The likelihood is guaranteed not to decrease after any iteration.

- **Speed:** It is the fastest algorithm for learning mixture models.

- **Agnostic:** As this algorithm maximizes only the likelihood, it will NOT bias the means towards zero, or bias the cluster sizes to have specific structures that might or might not apply.

### **Disadvantages**

- It can be slow to converge.

- It is sensitive to the starting point and is only guaranteed to converge to a local optimum.

- It cannot discover K (the likelihood keeps growing with the number of clusters).

- It requires both forward and backward probabilities, whereas numerical optimization requires only the forward probabilities.

- **Singularities:** When one has insufficiently many points per mixture component, estimating the covariance matrices becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood unless one regularizes the covariances artificially.

- **Number of components:** This algorithm will always use all the components it has access to, needing held-out data or information theoretical criteria to decide how many components to use in the absence of external cues.

<img src='https://drive.google.com/uc?id=1B6TuTHzM5NKwggEIFpCkadDnX-XilVhB'>

That is why Gaussian Mixture Models are used less in practice.
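
One common workaround for the number-of-components issue listed above is to fit several models and compare an information-theoretic criterion such as BIC. The sketch below uses scikit-learn's `GaussianMixture.bic` on synthetic data; the data and the candidate range of 1 to 6 components are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-dimensional data with three clusters
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(200, 2)),
    rng.normal([8, 0], 1.0, size=(200, 2)),
    rng.normal([4, 7], 1.0, size=(200, 2)),
])

candidate_ks = range(1, 7)
bics = {}
for k in candidate_ks:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)        # lower BIC is better

best_k = min(bics, key=bics.get)
print(best_k, {k: round(v, 1) for k, v in bics.items()})
```

BIC penalizes the extra parameters that each additional Gaussian brings, so it does not simply keep improving as K grows the way the raw likelihood does. For the singularity issue, `GaussianMixture` also has a `reg_covar` parameter that adds a small value to the diagonal of each covariance matrix.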




***