# Gaussian Mixture Models (Clustering)

In [2]:
from utils import *

**Context:** Sometimes our data contains hidden structure that we'd like to uncover in order to answer a scientific question. For example, recall the data set we analyzed in the unit on continuous probability from IHH's Center for Telekinesis Research (CTR). The researchers at the IHH's CTR study the propensity of intergalactic beings for telekinesis, the ability to move objects with the mind. They were interested in understanding how different physiological conditions affect a being's telekinetic abilities. That is, they observed each patient's telekinetic ability and wanted to understand how it related to an underlying condition (allergic reaction, intoxication, or entangled antennas). In their specific case, the data did contain the underlying condition. Often, however, our data doesn't contain this information. In such cases, our goal is to *uncover* the underlying types of patients. Doing so may help identify patients who benefit from different treatments. For example, in addition to each patient's underlying physiological condition, their telekinetic ability could have been affected by environmental factors growing up, their genetics, etc. It's hard to know a priori which of these factors are truly important, so it's not worth investing in collecting all of this data (which is expensive).

**Challenge:** But how can we possibly uncover a variable that's not in the data? By making assumptions about the distribution of this variable, as well as how it relates to the other variables in the data, we can! In statistical lingo, such *unobserved* variables are called *latent* variables. As we will show here, there's only one rule of probability we need to learn in order to use our existing toolkit to model latent variables. 

**Outline:** 
* Introduce latent variable models, as well as our first latent variable model (LVM)---the Gaussian Mixture Model (GMM)
* Introduce the law of total probability (in the discrete case), which will allow us to compute the MLE for LVMs
* Compute the MLE for the GMM
* Implement a GMM in `NumPyro`

**Data:** We will start by modeling the data introduced in the chapter on continuous probability. The data includes two variables---the patient's telekinetic ability, and their underlying condition. We will *pretend* that we did not observe their underlying condition. Our goal will then be to *infer* it given their telekinetic ability. Let's remind ourselves what the data looks like:

In [1]:
# Import a bunch of libraries we'll be using below
import pandas as pd
import matplotlib.pyplot as plt

# Load the data into a pandas dataframe
csv_fname = 'data/IHH-CTR.csv'
data = pd.read_csv(csv_fname, index_col='Patient ID')

# Print a random sample of patients, just to see what's in the data
data.sample(15, random_state=0)

| Patient ID | Condition | Telekinetic-Ability |
|---|---|---|
| 398 | Allergic Reaction | 0.510423 |
| 3833 | Allergic Reaction | 0.47996 |
| 4836 | Intoxication | 2.043218 |
| 4572 | Allergic Reaction | -0.443333 |
| 636 | Intoxication | 1.42319 |
| 2545 | Intoxication | 1.392568 |
| 1161 | Intoxication | 2.110151 |
| 2230 | Intoxication | 2.102866 |
| 148 | Intoxication | 1.865081 |
| 2530 | Allergic Reaction | 0.401414 |


## Latent Variable Models (LVMs)

**Overview.** Latent variable models allow us to model variables we actually did not observe. We will do this using the very same toolkit we've used so far; we'll write down a joint distribution for all variables---observed and latent---as well as a directed graphical model. We will then perform MLE on the resultant model. We will instantiate everything with a specific model---Gaussian Mixture Model (GMM)---which will help us find patient types in the above IHH data.

**Gaussian Mixture Models (GMMs).** In our IHH example, we assume the patients' underlying type in some way "explains" their observed data. We can encode this into a model by saying that each patient's data is generated by:
1. Sampling their *patient type* from some distribution. We'll call the latent type $z_n$, and assume it's drawn from a Categorical distribution with parameter $\pi$:
    \begin{align}
    z_n &\sim \mathrm{Cat}(\pi)
    \end{align}
2. Sampling their *observed data*, $x_n$, given the type. As the name suggests, we'll set this distribution to be a Gaussian whose mean and variance are selected by $z_n$. That is, if the patient has underlying type 1 (i.e. $z_n = 1$), then their observed data is sampled from $\mathcal{N}(\mu_1, \sigma^2_1)$. Similarly, if they have underlying type 2, their observed data is sampled from $\mathcal{N}(\mu_2, \sigma^2_2)$. Putting this together, we have:
    \begin{align}
    x_n | z_n &\sim \mathcal{N}(\mu_{z_n}, \sigma^2_{z_n})
    \end{align}

For 1-dimensional $x$, the final data-generating process is then:
\begin{align}
z_n &\sim p_Z(\cdot; \pi) = \mathrm{Cat}(\pi) \quad (\text{mixture})\\
x_n | z_n &\sim p_{X | Z}(\cdot | z_n; \mu_0, \dots, \mu_{K-1}, \sigma_0, \dots, \sigma_{K-1}) = \mathcal{N}(\mu_{z_n}, \sigma^2_{z_n}) \quad (\text{components})
\end{align}
The distribution over the latent variable is often called the "mixture," and each Gaussian is called a "component" in the mixture. From here on, we'll use $\theta = \{ \pi, \mu_0, \dots, \mu_{K-1}, \sigma_0, \dots, \sigma_{K-1}\}$ to refer to the model's parameters.
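
To make the two-step generative process concrete, here is a minimal sketch that samples from a 1-dimensional GMM using only Python's standard library. The function name `sample_gmm` and the parameter values are hypothetical choices for illustration; they are not estimated from the IHH data.

```python
import random

def sample_gmm(n, pi, mu, sigma, seed=0):
    """Draw n (z, x) pairs from a 1-D GMM by sampling z first, then x given z."""
    rng = random.Random(seed)
    K = len(pi)
    samples = []
    for _ in range(n):
        # Step 1: sample the latent type, z_n ~ Cat(pi)
        z = rng.choices(range(K), weights=pi, k=1)[0]
        # Step 2: sample the observation, x_n | z_n ~ N(mu[z], sigma[z]^2)
        x = rng.gauss(mu[z], sigma[z])
        samples.append((z, x))
    return samples

# Hypothetical parameters for K = 3 components (illustration only)
pi = [0.5, 0.3, 0.2]      # mixture probabilities, must sum to 1
mu = [-2.0, 0.0, 3.0]     # component means
sigma = [0.5, 1.0, 0.8]   # component standard deviations

samples = sample_gmm(5000, pi, mu, sigma)

# The empirical frequency of each latent type should be close to pi
zs = [z for z, _ in samples]
freqs = [zs.count(k) / len(zs) for k in range(len(pi))]
print(freqs)
```

In a real analysis we would only get to see the `x` values; the `z` values are exactly what the model treats as latent.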

**Directed Graphical Model.** Graphically, we can depict a GMM as follows:

<div class="canva-centered-embedding">
  <div class="canva-iframe-container">
    <iframe loading="lazy" class="canva-iframe"
      src="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAGIs1vP9i0&#x2F;-IWykjjF-dWy5DOBnqfudA&#x2F;view?embed">
    </iframe>
  </div>
</div>

As you can see, our observed data, $x_n$, depends on the latent patient type, $z_n$. Since $z_n$ is not observed, *its circle is left white* (i.e. not shaded in).

**What are GMMs useful for?** To better understand what GMMs are useful for, let's visualize them. Here's an example GMM:

```{figure} _static/figs/example_1d_gmm.png
---
name: fig-gmm-1d
align: center
---
The PDF of a GMM's mixture components (left) and data marginal (right).
```

On the left, you can see the GMM's mixture components (i.e. each Gaussian), and on the right, you can see the probability of the *observed* data, $p_X(\cdot; \theta)$. Looking at the above figure, you can see two things:
1. *Clustering.* Looking at the left plot, you can see that GMMs can "cluster" the observed data; every observation likely belongs to one of three Gaussians---we just need to figure out which observation belongs to which cluster.
2. *Complicated Distributions.* Looking at the right plot, you can see that GMMs can describe more complicated distributions. Unlike the continuous distributions we've used so far, which all have one mode (or one "bump"), using a GMM we can easily describe a distribution with multiple bumps.

**Challenges Deriving the MLE for GMMs.** Now that we have our directed graphical model and our data-generating process, we can try to derive the MLE for GMMs. Unfortunately, as you will see, we'll run into some issues. Since the observations are independent and identically distributed, and only $x_n$ is observed, our joint data likelihood (which we'd like to maximize) is:
\begin{align}
p(\mathcal{D}; \theta) &= \prod\limits_{n=1}^N p(\mathcal{D}_n; \theta) \\
&= \prod\limits_{n=1}^N p_X(x_n; \theta)
\end{align}
Looking at the above, what is $p_X(x_n; \theta)$? Our data-generating process gives us the following joint distribution:
\begin{align}
p_{X, Z}(x_n, z_n; \theta) &= p_{X | Z}(x_n | z_n; \theta) \cdot p_Z(z_n; \theta)
\end{align}
Somehow, we need to compute the marginal $p_X(x_n; \theta)$ from the joint $p_{X, Z}(x_n, z_n; \theta)$.

As we will show next, we can compute $p_X(x_n; \theta)$ as follows:
\begin{align}
p_X(x_n; \theta) &= \sum\limits_{z_n \in S} p_{X, Z}(x_n, z_n; \theta),
\end{align}
where $S = \{0, \dots, K - 1 \}$ is the support of $Z$, and $K$ is the number of clusters. This formula shows that we can compute $p_X(x_n; \theta)$ by summing the joint over every value of $z_n$. 
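
As a sanity check on this sum, the sketch below computes the GMM's marginal density by adding up the joint over every value of $z_n$, and then verifies numerically that the result integrates to roughly 1. The helper names and parameter values are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_joint_pdf(x, z, pi, mu, sigma):
    """p_{X,Z}(x, z) = p_{X|Z}(x | z) * p_Z(z)."""
    return normal_pdf(x, mu[z], sigma[z]) * pi[z]

def gmm_marginal_pdf(x, pi, mu, sigma):
    """p_X(x): sum the joint over the support S = {0, ..., K-1} of Z."""
    return sum(gmm_joint_pdf(x, z, pi, mu, sigma) for z in range(len(pi)))

# Hypothetical parameters for K = 3 components (illustration only)
pi = [0.5, 0.3, 0.2]
mu = [-2.0, 0.0, 3.0]
sigma = [0.5, 1.0, 0.8]

# Riemann-sum approximation of the integral of p_X over [-10, 10]
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
area = sum(gmm_marginal_pdf(x, pi, mu, sigma) for x in grid) * step
print(area)  # should be very close to 1, since p_X is a valid density
```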

## The Law of Total Probability (Discrete)

**Definition.** Suppose you have two random variables, $A$ and $B$, and suppose that $A$ is discrete with support $S$. Then the law of total probability says we can compute the marginal $p_B(b)$ from the joint $p_{A, B}(a, b)$ as follows:
\begin{align}
p_B(b) &= \sum\limits_{a \in S} p_{A, B}(a, b)
\end{align}

**Intuition.** To get intuition, let's depict $A$ and $B$ as follows, each with support $S = \{0, 1\}$:

<div class="canva-centered-embedding">
<div class="canva-iframe-container">
  <iframe loading="lazy" class="canva-iframe"
    src="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAGMdzJ3f3k&#x2F;PhWElQiQSXIIPklC6ED39w&#x2F;view?embed">
  </iframe>
</div>
</div>

In this diagram, each shaded area represents the probability of an event, i.e. area is proportional to probability. The marginal probability of $B = 1$ is therefore the ratio of the area of the blue square to the area of the whole space (the gray square):

<div class="canva-centered-embedding">
<div class="canva-iframe-container">
  <iframe loading="lazy" class="canva-iframe"
    src="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAGMdw0ybko&#x2F;3Aap1TJZMrHHD8kPFhfEzQ&#x2F;view?embed">
  </iframe>
</div>
</div>

Using the law of total probability, we can equivalently compute the marginal probability $p_B(1)$ as follows:
\begin{align}
p_B(1) &= \sum\limits_{a \in S} p_{A, B}(a, 1) \\
&= p_{A, B}(0, 1) + p_{A, B}(1, 1)
\end{align}
Re-writing this equation visually, we get:

<div class="canva-centered-embedding">
<div class="canva-iframe-container">
  <iframe loading="lazy" class="canva-iframe"
    src="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAGMdzAtxTY&#x2F;406MzXJ_IH4tJuHLwoYFEg&#x2F;view?embed">
  </iframe>
</div>
</div>

As you can see from the diagram, this formula holds. Now let's add to our pictorial intuition by assigning meaning to $A$ and $B$. Suppose $B = 1$ is the event in which a patient has pneumonia ($B = 0$ means they don't have pneumonia), and suppose that $A = 1$ is the event of rain ($A = 0$ means no rain). We can attribute meaning to the law of total probability as follows:
\begin{align}
\underbrace{p_B(1)}_{\text{has pneumonia}} &= \underbrace{p_{A, B}(0, 1)}_{\text{has pneumonia and no rain}} + \underbrace{p_{A, B}(1, 1)}_{\text{has pneumonia and rain}}
\end{align}
In words, this formula aggregates the probability of a patient having pneumonia across all possible scenarios: rain or no rain.
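
Here is a small numerical version of the rain and pneumonia example. The joint probabilities below are made up purely for illustration; the point is only that summing the joint over every value of $A$ recovers the marginal of $B$.

```python
# Hypothetical joint probabilities p_{A,B}(a, b), keyed by (a, b).
# A = 1 means rain; B = 1 means the patient has pneumonia.
joint = {
    (0, 0): 0.63,  # no rain, no pneumonia
    (0, 1): 0.07,  # no rain, pneumonia
    (1, 0): 0.24,  # rain, no pneumonia
    (1, 1): 0.06,  # rain, pneumonia
}

def marginal_b(joint, b, support_a=(0, 1)):
    """Law of total probability: p_B(b) = sum over a in S of p_{A,B}(a, b)."""
    return sum(joint[(a, b)] for a in support_a)

p_pneumonia = marginal_b(joint, 1)     # p_{A,B}(0, 1) + p_{A,B}(1, 1)
p_no_pneumonia = marginal_b(joint, 0)  # p_{A,B}(0, 0) + p_{A,B}(1, 0)
print(p_pneumonia, p_no_pneumonia)
```

Because the two marginal probabilities cover every outcome of $B$, they add up to 1 (up to floating-point rounding).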

## Maximum Likelihood for GMMs

**MLE Objective.** Using the law of total probability, we can now write our MLE objective:
\begin{align}
\theta^\text{MLE} &= \mathrm{argmax}_\theta \log p(\mathcal{D}; \theta) \\
&= \mathrm{argmax}_\theta \log \prod\limits_{n=1}^N p(\mathcal{D}_n; \theta) \\
&= \mathrm{argmax}_\theta \sum\limits_{n=1}^N \log p(\mathcal{D}_n; \theta) \\
&= \mathrm{argmax}_\theta \sum\limits_{n=1}^N \log p_X(x_n; \theta) \\
&= \mathrm{argmax}_\theta \sum\limits_{n=1}^N \log \sum\limits_{z_n \in S} p_{X, Z}(x_n, z_n; \theta)
\end{align}
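
To see what evaluating this objective involves, here is a minimal sketch that computes the log-likelihood for a fixed $\theta$. It only evaluates the objective; it does not maximize it. The inner sum is computed with the log-sum-exp trick for numerical stability, which is a standard technique rather than something specific to this chapter. The data values are a handful of telekinetic-ability measurements copied from the sample printed earlier, and both parameter settings are hypothetical.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def logsumexp(values):
    """Stable log(sum(exp(v))) computed by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gmm_log_likelihood(xs, pi, mu, sigma):
    """sum_n log sum_{z_n} p(x_n, z_n; theta) for a 1-D GMM."""
    total = 0.0
    for x in xs:
        # log p(x_n, z_n = k) = log pi_k + log N(x_n; mu_k, sigma_k^2)
        log_joint = [
            math.log(pi[k]) + normal_logpdf(x, mu[k], sigma[k])
            for k in range(len(pi))
        ]
        total += logsumexp(log_joint)
    return total

# A few observations taken from the data sample shown earlier
xs = [0.510423, 0.47996, 2.043218, -0.443333, 1.42319]

# Two hypothetical parameter settings with K = 2 components
pi = [0.5, 0.5]
sigma = [0.5, 0.5]
mu_good = [0.0, 2.0]   # means placed near the observations
mu_bad = [-5.0, 5.0]   # means placed far from the observations

ll_good = gmm_log_likelihood(xs, pi, mu_good, sigma)
ll_bad = gmm_log_likelihood(xs, pi, mu_bad, sigma)
print(ll_good, ll_bad)  # the setting that explains the data better scores higher
```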

**Intractability.**

**Expectations.**

## Multivariate GMMs

## GMMs in `NumPyro`

* Ask to generate samples
* Ask to infer clusters