# Automatic Gaussian Mixture Modeling

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Seed NumPy's generator, which produces the synthetic data below
np.random.seed(10)

Clustering is a foundational data analysis task, where members of the data set are sorted into groups or "clusters" according to measured similarities between the objects.

The Automatic Gaussian Mixture Model (AutoGMM) is a clustering algorithm that uses scikit-learn's hierarchical agglomerative clustering followed by Gaussian mixture model (GMM) fitting. The algorithm tries different combinations of agglomeration settings, GMM settings, and cluster numbers, and chooses the clustering with the best value of the selection criterion, either the Bayesian information criterion (BIC) or the Akaike information criterion (AIC).
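
The search described above can be sketched with plain scikit-learn. This is a simplified illustration, not the AutoGMM implementation: the helper name `select_gmm_by_bic`, the Ward linkage, the candidate range, and the choice to seed only the component means are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, k_range=range(1, 5),
                      covariance_types=("full", "tied", "diag", "spherical"),
                      random_state=0):
    """Hypothetical helper: agglomerative initialization, GMM fit, keep lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        # Hierarchical agglomerative clustering supplies an initial partition
        if k == 1:
            init_labels = np.zeros(len(X), dtype=int)
        else:
            init_labels = AgglomerativeClustering(
                n_clusters=k, linkage="ward"
            ).fit_predict(X)
        # The cluster means of that partition seed the EM fit
        means_init = np.vstack(
            [X[init_labels == j].mean(axis=0) for j in range(k)]
        )
        for cov_type in covariance_types:
            gmm = GaussianMixture(
                n_components=k,
                covariance_type=cov_type,
                means_init=means_init,
                random_state=random_state,
            ).fit(X)
            bic = gmm.bic(X)  # scikit-learn convention: lower BIC is better
            if bic < best_bic:
                best_model, best_bic = gmm, bic
    return best_model, best_bic


# Two well-separated blobs as toy input
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(10, 1, (150, 2))])
model, bic = select_gmm_by_bic(X_toy)
print(model.n_components, model.covariance_type)
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` would select by AIC instead.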

The Gaussian mixture model (GMM) is a statistical model of clustered data that, simply put, is a composition of multiple normal distributions. Each cluster has a weight $w_k$ assigned to it, and the combined probability distribution, $f(x)$, is of the form:

$$f(x) = \sum\limits_{k = 1}^K {w_{k}f_{k}(x)} = \sum\limits_{k = 1}^K {\frac{w_{k}}{(2\pi)^{\frac{d}{2}}\big|\Sigma_{k}\big|^{\frac{1}{2}}}e^{-\frac{1}{2}(x - \mu_{k})^{T}\Sigma_{k}^{-1}(x - \mu_{k})}}$$

where $K$ is the total number of clusters, $d$ is the dimensionality of the data, and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of cluster $k$.
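
As a quick numerical check of the mixture density, the sketch below evaluates $f(x)$ as a weighted sum of multivariate normal densities using `scipy.stats.multivariate_normal`. The function name `gmm_density` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_density(x, weights, means, covariances):
    """Evaluate f(x) = sum_k w_k f_k(x) for points x of shape (n, d)."""
    x = np.atleast_2d(x)
    density = np.zeros(x.shape[0])
    for w, mu, cov in zip(weights, means, covariances):
        # Each term is one weighted normal component
        density += w * multivariate_normal(mean=mu, cov=cov).pdf(x)
    return density


# Illustrative two-component mixture in d = 2 dimensions
weights = [0.5, 0.5]
means = [np.zeros(2), np.full(2, 2.0)]
covariances = [np.eye(2), np.eye(2)]

print(gmm_density(np.array([[0.0, 0.0], [1.0, 1.0]]), weights, means, covariances))
```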

An expectation-maximization (EM) algorithm is then run to estimate the model parameters ($w_k$, $\mu_k$, and $\Sigma_k$), and the fitted GMM is used to cluster the data.
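
To make the EM step concrete, here is a minimal one-dimensional sketch written directly in NumPy. It is an illustration under simplifying assumptions (scalar variances, quantile-based initialization, a fixed iteration count, and a small variance floor), not the routine that scikit-learn or AutoGMM runs; the name `em_gmm_1d` is hypothetical.

```python
import numpy as np


def em_gmm_1d(x, n_components=2, n_iter=200):
    """Hypothetical helper: EM for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    weights = np.full(n_components, 1.0 / n_components)
    # Spread the initial means over interior quantiles of the data
    means = np.quantile(x, np.linspace(0, 1, n_components + 2)[1:-1])
    variances = np.full(n_components, x.var())
    for _ in range(n_iter):
        # E-step: responsibility of component k for point i
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances, resp


# Toy data: two normal clusters centered at 0 and 6
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])
weights, means, variances, resp = em_gmm_1d(x_toy)

# Hard cluster assignment: the most responsible component for each point
labels = resp.argmax(axis=1)
```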

## Using AutoGMM on Synthetic Data

In [None]:
# Synthetic data

# Dim 1
class_1 = np.random.randn(150, 1)
class_2 = 2 + np.random.randn(150, 1)
dim_1 = np.vstack((class_1, class_2))

# Dim 2
class_3 = np.random.randn(150, 1)
class_4 = 2 + np.random.randn(150, 1)
dim_2 = np.vstack((class_3, class_4))

X = np.hstack((dim_1, dim_2))

# Labels
label_1 = np.zeros((150, 1))
label_2 = 1 + label_1

c = np.vstack((label_1, label_2)).reshape(300,)

In scikit-learn's existing GMM implementation, the user must choose the model parameters, including the number of components, a priori. A poor choice can lead to an inaccurate model.

In [None]:
from sklearn import mixture
from sklearn.metrics import confusion_matrix
from scipy.optimize import linear_sum_assignment

# Suppose the user provides an inaccurate estimate of the number of components
gmm_ = mixture.GaussianMixture(n_components=3)

# Predicted Labels
c_hat_gmm = gmm_.fit_predict(X)

# Function to remap labels for direct comparison
def remap_labels(y_true, y_pred):
    # Match predicted labels to true labels by maximizing confusion-matrix overlap
    confusion_mat = confusion_matrix(y_true, y_pred)
    row_inds, col_inds = linear_sum_assignment(confusion_mat, maximize=True)
    # Map each predicted label (column) to its matched true label (row)
    label_map = dict(zip(col_inds, row_inds))
    remapped_y_pred = np.vectorize(label_map.get)(y_pred)

    return remapped_y_pred

# Remap Predicted labels
c_hat_gmm = remap_labels(c, c_hat_gmm)
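
The remapping step can be traced on a toy example. The label vectors below are made up for illustration: the predicted labels describe nearly the same grouping as the true labels but use permuted names and one extra cluster.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# Hypothetical labels: same grouping, permuted names, one stray cluster "2"
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 2])

# Rows index true labels, columns index predicted labels
confusion_mat = confusion_matrix(y_true, y_pred)

# Hungarian algorithm: pair rows and columns to maximize total overlap
row_inds, col_inds = linear_sum_assignment(confusion_mat, maximize=True)
label_map = {int(col): int(row) for row, col in zip(row_inds, col_inds)}

remapped = np.array([label_map[int(p)] for p in y_pred])
print(remapped)  # [0 0 0 1 1 2]
```

After remapping, matching clusters share a label (and therefore a color in a plot), while the unmatched extra cluster keeps its own label.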

Our method expands upon the existing scikit-learn framework by automatically searching for the parameters of a Gaussian mixture model that give the best clustering under the chosen selection criterion. In particular, AutoGMM finds the ideal `n_components_` from a range of possible values given by the user.

In [None]:
from graspologic.cluster.autogmm import AutoGMMCluster

# Fit model
autogmm_ = AutoGMMCluster(affinity='all', linkage='all', covariance_type='all')

# Estimated Labels
c_hat_autogmm = autogmm_.fit_predict(X,c)

# Remap Labels for Color Accuracy
c_hat_autogmm = remap_labels(c, c_hat_autogmm)

We can plot the clusters to compare the accuracy of each method.

In [None]:
# Plotting Synthetic clusters
import matplotlib as mp
import seaborn as sns
sns.set()
sns.set_context("talk", font_scale=1.10)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Labelings
c_list = ['red', 'green', 'blue']

# Plotting Function for Clustering
def plot(title_str, c_hat, X):
    plt.title(title_str, fontsize=24, fontweight='bold')
    max_c = int(np.max(c_hat))
    plt.scatter(X[:, 0], X[:, 1], c=c_hat, cmap=mp.colors.ListedColormap(c_list[0:max_c + 1]))
    # Set tick label sizes on the current subplot
    plt.xticks(fontsize=18)
    plt.yticks(fontsize=18)


fig = plt.figure(figsize=(20, 5))

# True Clustering
plt.subplot(1, 3, (1,1))
plot('True Clustering', c, X)
plt.ylabel("2nd Dimension")

# AutoGMM
plt.subplot(1, 3, (2,2))
plot('AutoGMM Clustering', c_hat_autogmm, X)
plt.xlabel("1st Dimension")

# scikit-learn GMM
plt.subplot(1, 3, (3,3))
plot('scikit-learn Clustering', c_hat_gmm, X)

plt.show()