# Probability-based Clustering

So far, the clustering methods that we have discussed result in hard clustering or partitioning: each point belongs to only one cluster. In this notebook, we will look at algorithms that perform soft clustering: a point's membership in each cluster is described by a probability or weighting.

## EM: Expectation-Maximization Algorithm

The general approach to probability-based clustering is to assume that each cluster is characterized by a generative model. We then fit the generative models and their weights until they best fit the observed data. The simplest way to do this is with the expectation-maximization (EM) algorithm.

The EM algorithm finds maximum likelihood estimates in latent variable models. It alternates between two steps:
1. **E-Step:** Estimate the missing (latent) variables in the dataset.
2. **M-Step:** Optimize the parameters of the model given the data and the estimated variables.

The most popular approach to probability-based clustering is the use of mixture models. We will discuss the Gaussian mixture model in this notebook.

## Gaussian Mixture Model

A mixture model consists of an unspecified combination of multiple probability distribution functions. Here, the learning algorithm estimates the parameters of the probability distributions to best fit the density of a given training dataset.

The Gaussian Mixture Model (GMM) uses a combination of Gaussian (normal) probability distributions, each of which requires estimating its mean and standard deviation (or, in more than one dimension, its covariance matrix). The most common estimation method is *maximum likelihood estimation*.

Consider a dataset whose points are generated by **two different processes**. The points from each process follow a Gaussian probability distribution, but the data are combined and the distributions are so similar that it is **not obvious** which distribution a given point belongs to.

The process that generated each data point is a **latent variable**, e.g. process 0 or process 1. A latent variable influences the data but is not observable. In this case, the EM algorithm is appropriate for estimating the parameters of the distributions.

In the EM algorithm, the expectation step estimates the latent variable for each data point, and the maximization step optimizes the parameters of the probability distributions to best capture the **density of the data**. The process is repeated until the latent values and parameters converge to a (possibly local) maximum of the likelihood.

1. **E-Step:** Estimate the expected value for each latent variable.
2. **M-Step:** Optimize the parameters of the distribution using maximum likelihood.
3. Repeat until the parameters converge to a set that best fits the data.
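
The loop above can be sketched for the two-process example in one dimension. This is a minimal illustration rather than a production implementation: the function name `em_gmm_1d`, the quartile-based initialization, and the synthetic data are all assumptions made for this sketch.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    """Fit a two-component 1D Gaussian mixture with plain EM (illustrative only)."""
    # Crude initialization (an assumption): means at the quartiles, shared variance
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: r[i, k] = probability that point i came from process k
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        total = weighted.sum(axis=1, keepdims=True)
        r = weighted / total
        # M-step: weighted maximum likelihood updates of the parameters
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
        # Stop once the log-likelihood no longer improves noticeably
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, r

# Synthetic data from two processes: N(0, 1) and N(5, 1)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
pi, mu, var, r = em_gmm_1d(x)
print(pi, mu, var)
```

The returned `r` holds the soft assignments from the last E-step, so each row sums to 1.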

The diagram below shows the covariance matrix and the Gaussian formula in one dimension.
<img src = 'figures/cov_gmm.PNG' width= 300>

### Mathematics of GMM

GMM is a soft clustering algorithm in which each cluster is governed by its own probability distribution. Each cluster corresponds to a generative model, and the aim is to discover the parameters of that distribution (e.g., mean and covariance) for each cluster. The probability of the latent assignments $y_i$ given the observed points $x_i$ follows from the definition of conditional probability:

$P(y_1,...,y_n|x_1,...,x_n, \theta) = \dfrac{P(x_1,...,x_n, y_1,...,y_n|\theta)}{P(x_1,...,x_n|\theta)}$

Here the numerator is the joint probability of the observed and latent variables.

Learning consists of fitting a Gaussian model to the data points using maximum likelihood estimation. The Gaussian mixture model assumes that each cluster follows a normal distribution in n-dimensional space. The purpose is to find the parameters $\theta$ that maximize the probability of the observed data.

$\theta_{ML} = \underset{\theta}{\arg\max}\, P(x_1,...,x_n|\theta)$
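
To make the maximization concrete, the sketch below evaluates the log-likelihood of a 1D two-component mixture for two candidate parameter sets. The helper `mixture_loglik`, the synthetic data, and the candidate values are assumptions chosen for illustration. Points are treated as independent, so the log-likelihood is a sum over points.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data from two processes: N(0, 1) and N(5, 1)
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

def mixture_loglik(x, weights, means, stds):
    """log P(x_1..x_n | theta) = sum_i log sum_k weight_k * N(x_i | mean_k, std_k)."""
    dens = sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))
    return np.log(dens).sum()

# Candidate parameters near the generating values vs. a deliberately poor guess
good = mixture_loglik(data, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
poor = mixture_loglik(data, [0.5, 0.5], [2.0, 3.0], [1.0, 1.0])
print(good, poor)
```

Maximum likelihood estimation searches for the parameters with the highest such value. EM is one way to carry out that search when the assignments are latent.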

The goal is to compute the conditional distribution of the latent attributes given the observed dataset.
$P(x_{n+1}, y_{n+1}|x_1,...,x_n, \theta)$

Finally, the algorithm finds a class that maximizes the probability of the future data given the learned parameters $\theta$:
$\underset{c}{\arg\max}\, P(x_{n+1}|\theta_c)$
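
As a small illustration of this last step, the sketch below assigns a new point to the class whose Gaussian gives it the highest likelihood. The parameter values in `params` and the point `x_new` are hypothetical and are not learned from any data in this notebook.

```python
from scipy.stats import norm

# Hypothetical learned parameters theta_c = (mean, std) for two classes
params = {0: (0.0, 1.0), 1: (5.0, 1.0)}
x_new = 1.2

# Likelihood of the new point under each class-conditional Gaussian
likelihoods = {c: norm.pdf(x_new, loc=m, scale=s) for c, (m, s) in params.items()}

# Pick the class with the largest likelihood
best_class = max(likelihoods, key=likelihoods.get)
print(best_class, likelihoods)
```

Because both classes share the same standard deviation here, the rule reduces to choosing the nearest mean.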


## Implementation
We start off by demonstrating the advantage of GMM over k-means.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

In [None]:
# Generate some data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
X = X[:, ::-1] # flip axes for better plotting

Weaknesses of k-means:
1. It is non-probabilistic, so it gives no measure of uncertainty in its cluster assignments.
2. It assigns cluster membership using simple distance from the cluster center, which leads to poor performance in many real-world situations.



In [None]:
# Plot the data with K Means Labels
from sklearn.cluster import KMeans
kmeans = KMeans(4, random_state=0)
labels = kmeans.fit(X).predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');

From the example above, we may not be very confident about the result, as there appears to be a slight overlap between the two middle clusters. One of the motivations for the GMM is that $k$-means clustering has no intrinsic measure of probability or uncertainty of cluster assignments.

One way to think about the $k$-means model is that it places a circle (in 2D) at the center of each cluster whose radius is set by the most distant point in the cluster and becomes the cut-off for cluster assignment within the training set. Any point outside this circle is not considered a member of the cluster. Let us visualize it.

In [None]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    print(centers.shape, centers)
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

In [None]:
kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X)

But what if we transform the data? Can k-means still cluster it well?

In [None]:
rng = np.random.RandomState(13)
X_stretched = np.dot(X, rng.randn(2, 2))
kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X_stretched)

We see that circular k-means cannot cluster the data well at the lower right area and that the circular cut-offs overlap.

These two disadvantages of k-means, its lack of flexibility in cluster shape and its lack of probabilistic cluster assignment, result in poor clustering.

To address these issues, we can allow (1) uncertainty in cluster assignment, by comparing the distances of each point to all cluster centers, and (2) cluster boundaries to be ellipses rather than circles. These two are the essential components of a different type of clustering model: the Gaussian mixture model.

A Gaussian mixture model (GMM) attempts to find a mixture of multi-dimensional Gaussian probability distributions that best models the input dataset. In the simplest case, GMMs can be used for finding clusters in the same manner as k-means:

In [None]:
from sklearn import mixture
model = mixture.GaussianMixture(n_components=4, covariance_type='full')

gmm = model.fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');

GMM also allows us to find probabilistic cluster assignments, which measure the probability that a point belongs to each cluster.

In [None]:
probs = gmm.predict_proba(X)
print(probs[:5].round(3)) # [n_samples, n_clusters]
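
These probabilities can be reproduced from the fitted parameters with Bayes' rule. The sketch below is a self-contained check under stated assumptions: it refits a model with a fixed `random_state` and uses the names `X_chk`, `gmm_chk`, and `manual_probs`, which are introduced here only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_chk, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
gmm_chk = GaussianMixture(n_components=4, covariance_type='full',
                          random_state=0).fit(X_chk)

# Weighted density of every point under every component:
# weight_k * N(x_i | mean_k, covariance_k)
weighted = np.column_stack([
    w * multivariate_normal(mean=m, cov=c).pdf(X_chk)
    for w, m, c in zip(gmm_chk.weights_, gmm_chk.means_, gmm_chk.covariances_)
])

# Bayes' rule: normalize each row so the memberships sum to 1
manual_probs = weighted / weighted.sum(axis=1, keepdims=True)
print(np.allclose(manual_probs, gmm_chk.predict_proba(X_chk), atol=1e-6))
```

Each row of `manual_probs` is the soft assignment of one point across the four clusters.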

We can visualize this uncertainty by scaling the size of each point with the certainty of its prediction. The smaller points at the boundaries between clusters reflect the uncertainty of cluster assignment.

In [None]:
size = 50 * probs.max(1) ** 2  # square emphasizes differences
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size);

Recall:

1. E-step: for each point, find weights encoding the probability of membership in each cluster
2. M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights. The result of this step is each cluster associated not with a hard-edged sphere, but with a smooth Gaussian model. Just as in the k-means expectation–maximization approach, this algorithm can sometimes miss the globally optimal solution, and thus in practice multiple random initializations are used.
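
In scikit-learn, multiple initializations can be requested through the `n_init` parameter of `GaussianMixture`, which keeps the run with the best lower bound on the log-likelihood. The sketch below is a minimal comparison. The variable names and the choice of 10 restarts are assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_init, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# One EM run vs. the best of several runs from different initializations
single = GaussianMixture(n_components=4, n_init=1, random_state=0).fit(X_init)
multi = GaussianMixture(n_components=4, n_init=10, random_state=0).fit(X_init)

# lower_bound_ is the log-likelihood bound of the retained fit
print(single.lower_bound_, multi.lower_bound_)
```

More restarts cost proportionally more computation, so the value of `n_init` is a trade-off between runtime and robustness to poor starting points.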

Let's create a function that will help us visualize the locations and shapes of the GMM clusters by drawing ellipses based on the GMM output:

In [None]:
from matplotlib.patches import Ellipse

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    
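    # Draw the ellipse at 1, 2, and 3 standard deviations
    # (angle is passed by keyword, as newer Matplotlib versions require)
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))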
        
def plot_gmm(gmm, X, label=True, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, ax=ax, alpha=w * w_factor)

In [None]:
gmm = model.fit(X)
plot_gmm(gmm, X)

We also demonstrate that, unlike k-means, GMM is effective at clustering the transformed dataset. Note that `plot_gmm` refits the model on the data it is given.

In [None]:
plot_gmm(gmm, X_stretched)

**Exercise 6 [3 pts]**

Perform GMM clustering on the `pickup_longitude`, `pickup_latitude`, `dropoff_longitude` and `dropoff_latitude` of `/mnt/data/public/nyctaxi/trip_data/trip_data_4.csv`. Clean data as appropriate. Justify choice of hyperparameters.

**Exercise 7 [2 pts]**

Compare the results of the previous exercises. Based on these, what are the advantages and disadvantages of GMM?

## References

* https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html
* https://towardsdatascience.com/17-clustering-algorithms-used-in-data-science-mining-49dbfa5bf69a#a536
* https://machinelearningmastery.com/expectation-maximization-em-algorithm/