# Automatic Gaussian Mixture Modeling

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Seed NumPy's global generator, which produces the synthetic data below
np.random.seed(0)

Clustering is a foundational data analysis task in which members of a data set are sorted into groups, or "clusters", according to measured similarities between them. Under some quantitative criterion, members of the same cluster are similar and members of distinct clusters are different.

The Automatic Gaussian Mixture Model (AutoGMM) is a clustering algorithm that first applies Sklearn's hierarchical agglomerative clustering and then fits a Gaussian mixture model (GMM). The algorithm tries different combinations of agglomeration settings, GMM settings, and cluster numbers, and chooses the clustering with the best value of the selection criterion (BIC or AIC).
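
The selection idea can be sketched with Sklearn alone. This is a minimal illustration, not AutoGMM's implementation: it skips the agglomerative step and simply fits `GaussianMixture` for each combination of component count and covariance type, keeping the fit with the lowest BIC. The helper name `select_gmm_by_bic`, the toy data, and the search ranges are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, max_components=4,
                      covariance_types=("spherical", "diag", "tied", "full"),
                      random_state=0):
    """Fit one GMM per (n_components, covariance_type) pair; return the lowest-BIC fit."""
    best_model, best_bic = None, np.inf
    for cov_type in covariance_types:
        for k in range(1, max_components + 1):
            model = GaussianMixture(n_components=k, covariance_type=cov_type,
                                    random_state=random_state).fit(X)
            bic = model.bic(X)  # in Sklearn's convention, lower BIC is better
            if bic < best_bic:
                best_model, best_bic = model, bic
    return best_model, best_bic


# Toy data (hypothetical): two well-separated 2-D Gaussian blobs
rng = np.random.default_rng(0)
X_demo = np.vstack((rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(8.0, 1.0, size=(200, 2))))

best_model, best_bic = select_gmm_by_bic(X_demo)
print(best_model.n_components, best_model.covariance_type, round(best_bic, 1))
```

An exhaustive grid like this is only practical for small search spaces, but it shows how a single criterion can choose both the number of clusters and the covariance structure.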

The Gaussian mixture model (GMM) is a statistical model of clustered data that, simply put, is a weighted sum of multiple normal distributions. Each cluster has a weight $w_k$ assigned to it (the weights are nonnegative and sum to one), and the combined probability distribution, $f(x)$, is of the form:

$f(x) = \sum\limits_{k = 1}^K {w_{k}f_{k}(x)} = \sum\limits_{k = 1}^K {\frac{w_{k}}{(2\pi)^{\frac{d}{2}}|\Sigma_{k}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x - \mu_{k})^{T}\Sigma_{k}^{-1}(x - \mu_{k})}}$

where $K$ is the total number of clusters, $d$ is the dimensionality of the data, and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of cluster $k$.
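
To make the density concrete, the sketch below evaluates the Gaussian mixture density $f(x)$ at a single point directly with NumPy and checks the result against `scipy.stats.multivariate_normal`. The weights, means, and covariances are arbitrary illustrative values, and `gmm_density` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_density(x, weights, means, covariances):
    """Evaluate f(x) = sum_k w_k * N(x; mu_k, Sigma_k) at a single point x."""
    d = x.shape[0]
    total = 0.0
    for w, mu, sigma in zip(weights, means, covariances):
        diff = x - mu
        # Normalizing constant: (2*pi)^(d/2) * |Sigma_k|^(1/2)
        norm_const = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(sigma))
        # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
        quad = diff @ np.linalg.solve(sigma, diff)
        total += w * np.exp(-0.5 * quad) / norm_const
    return total


# Illustrative parameters (assumed): K = 2 components in d = 2 dimensions
weights = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covariances = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.5]])]

x = np.array([1.0, 0.5])
f_manual = gmm_density(x, weights, means, covariances)
f_scipy = sum(w * multivariate_normal(mean=m, cov=s).pdf(x)
              for w, m, s in zip(weights, means, covariances))
print(np.isclose(f_manual, f_scipy))
```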

Expectation Maximization (EM) algorithms are then run to estimate model parameters and the fitted GMM is used to cluster the data.  
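
As a rough illustration of what EM does, the sketch below fits a two-component, one-dimensional GMM by alternating an E-step (computing each point's responsibility under each component) and an M-step (re-estimating weights, means, and variances from those responsibilities). It is a simplified teaching example with assumed toy data, a fixed iteration count, and a hypothetical function name `em_gmm_1d`; library implementations add convergence checks, regularization, and more careful initialization.

```python
import numpy as np


def em_gmm_1d(x, n_components=2, n_iter=100):
    """Plain EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    # Simple deterministic initialization: spread the means over data quantiles
    means = np.quantile(x, np.linspace(0.25, 0.75, n_components))
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to w_k * N(x_i; mu_k, var_k)
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2.0 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: responsibility-weighted maximum-likelihood updates
        n_k = resp.sum(axis=0)
        weights = n_k / x.shape[0]
        means = (resp * x[:, None]).sum(axis=0) / n_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k

    return weights, means, variances


# Toy data (assumed): two 1-D Gaussians centered at 0 and 5
rng = np.random.default_rng(0)
x_demo = np.concatenate((rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)))

weights, means, variances = em_gmm_1d(x_demo)

# Hard cluster assignment: the component with the largest weighted density
weighted_dens = (weights * np.exp(-0.5 * (x_demo[:, None] - means) ** 2 / variances)
                 / np.sqrt(2.0 * np.pi * variances))
labels = np.argmax(weighted_dens, axis=1)
print(np.round(means, 2), np.round(weights, 2))
```

The final `argmax` is what "using the fitted GMM to cluster the data" amounts to: each point is assigned to the component under which it is most probable.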
Let's look at an example using synthetic data.

In [None]:
from graspologic.cluster.autogmm import AutoGMMCluster

# Synthetic data

# Dim 1
class_1 = np.random.randn(150, 1)
class_2 = 2 + np.random.randn(150, 1)
dim_1 = np.vstack((class_1, class_2))

# Dim 2
class_3 = np.random.randn(150, 1)
class_4 = 2 + np.random.randn(150, 1)
dim_2 = np.vstack((class_3, class_4))

X = np.hstack((dim_1, dim_2))

# Labels
label_1 = np.zeros((150, 1))
label_2 = 1 + label_1

c = np.vstack((label_1, label_2)).reshape(300,)

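# Configure the model to search over all affinities, linkages, and covariance types
autogmm_ = AutoGMMCluster(affinity='all', linkage='all', covariance_type='all')

# Fit the model and get the estimated labels (c is the optional true labeling)
c_hat_autogmm = autogmm_.fit_predict(X, c)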

We can compare our method to the existing implementation of GMM in Sklearn. Our method builds on the Sklearn framework by automatically searching for the Gaussian mixture model parameters that optimize the selection criterion, rather than requiring the user to specify them. In particular, AutoGMM outputs the selected number of components, `n_components_`, and covariance type, `covariance_type_`. If we create a Sklearn GMM with these parameters, we expect a fit comparable to the one AutoGMM found.

In [None]:
from sklearn import mixture
from sklearn.metrics import confusion_matrix
from scipy.optimize import linear_sum_assignment

# Ideal parameters from AutoGMM
n_comp = autogmm_.n_components_

cov = autogmm_.covariance_type_

# Sklearn's GaussianMixture needs the number of components specified a priori
gmm_ = mixture.GaussianMixture(n_components=n_comp, covariance_type=cov)
# Baseline: two components with the default ('full') covariance type
gmm_default_ = mixture.GaussianMixture(2)

# Predicted Labels
c_hat_gmm = gmm_.fit_predict(X)
c_hat_gmm_default = gmm_default_.fit_predict(X)

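# Function to remap labels for direct comparison
def remap_labels(y_true, y_pred):
    """Relabel y_pred to best match y_true (assumes labels are 0, 1, ..., n-1)."""
    # Rows are true labels, columns are predicted labels
    confusion_mat = confusion_matrix(y_true, y_pred)
    # Hungarian algorithm: pair labels so that total agreement is maximized
    row_inds, col_inds = linear_sum_assignment(confusion_mat, maximize=True)
    label_map = dict(zip(col_inds, row_inds))
    remapped_y_pred = np.vectorize(label_map.get)(y_pred)

    return remapped_y_pred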

# Remap Predicted labels
c_hat_autogmm = remap_labels(c, c_hat_autogmm)
c_hat_gmm = remap_labels(c, c_hat_gmm)
c_hat_gmm_default = remap_labels(c, c_hat_gmm_default)
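
To see why the remapping step is needed, note that cluster labels are arbitrary: a clustering that is perfect up to a permutation of label names looks wrong when compared element-wise. The toy example below uses hypothetical labels and a standalone helper, `remap_to_truth`, that follows the same matching approach as `remap_labels`: it runs the Hungarian algorithm (`linear_sum_assignment`) on the confusion matrix to find the label pairing that maximizes agreement. It assumes labels are the integers 0 through n-1, so that confusion-matrix indices coincide with label values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def remap_to_truth(y_true, y_pred):
    """Permute predicted label names to best agree with y_true."""
    cm = confusion_matrix(y_true, y_pred)  # rows: true labels, columns: predicted
    row_inds, col_inds = linear_sum_assignment(cm, maximize=True)
    label_map = dict(zip(col_inds, row_inds))  # predicted label -> matched true label
    return np.array([label_map[label] for label in y_pred])


# Hypothetical labels: the prediction is perfect up to a relabeling
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])

raw_agreement = np.mean(y_true == y_pred)
remapped_agreement = np.mean(y_true == remap_to_truth(y_true, y_pred))
print(raw_agreement, remapped_agreement)  # 0.0 before remapping, 1.0 after
```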

In [None]:
# Plotting Synthetic clusters
import matplotlib as mp
import seaborn as sns
sns.set()
sns.set_context("talk", font_scale=1.10)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Labelings
c_list = ['red', 'green', 'blue','orange','purple','yellow','gray']

# Figure
fig = plt.figure(figsize=(15,15))

# Synthetic True Clustering
plt.subplot(2, 2, (1,1))
plt.title('True Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c))
plt.scatter(X[:,0],X[:,1],c=c,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic AutoGMM Model
plt.subplot(2, 2, (2,2))
plt.title('AutoGMM Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_autogmm))
plt.scatter(X[:,0],X[:,1],c=c_hat_autogmm,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic Optimal SKlearn Model
plt.subplot(2, 2, (3,3))
plt.title('Optimal SKlearn Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_gmm))
plt.scatter(X[:,0],X[:,1],c=c_hat_gmm,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xlabel('First Dimension',fontsize=24)
plt.ylabel('Second Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic Default SKlearn Model
plt.subplot(2, 2, (4,4))
plt.title('Default SKlearn Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_gmm_default))
plt.scatter(X[:,0],X[:,1],c=c_hat_gmm_default,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()