# Clustering Models

Clustering is a foundational data analysis task, where members of the data set are sorted into groups or "clusters" according to measured similarities between the objects. According to some quantitative criteria, members of the same cluster are similar and members of distinct clusters are different.

In [None]:
import numpy as np
import pandas as pd
import graspy
import matplotlib.pyplot as plt
%matplotlib inline

## Automatic Gaussian Mixture Model (AutoGMM)

The Automatic Gaussian Mixture Model, or AutoGMM, is a clustering algorithm that uses Sklearn's hierarchical agglomerative clustering followed by Gaussian mixture model (GMM) fitting. The algorithm tries different combinations of agglomeration settings, GMM settings, and numbers of clusters, and chooses the clustering with the best value of the selection criterion (BIC or AIC).
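
The search described above can be sketched with Sklearn alone. This is only an illustration under stated assumptions, not graspy's implementation: the helper `agglomerative_gmm_search`, its parameter grid, and the toy data are all hypothetical, and the sketch uses BIC as computed by Sklearn, where lower values are better.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture


def agglomerative_gmm_search(X, k_range=(2, 3, 4), linkages=("ward", "complete"),
                             covariance_types=("full", "diag"), seed=0):
    """Hypothetical helper: try each combination and keep the lowest-BIC GMM."""
    best = None
    for k in k_range:
        for linkage in linkages:
            # Step 1: agglomerative clustering provides an initial partition.
            init_labels = AgglomerativeClustering(
                n_clusters=k, linkage=linkage
            ).fit_predict(X)
            # The centroids of that partition become the GMM's starting means.
            means_init = np.vstack(
                [X[init_labels == j].mean(axis=0) for j in range(k)]
            )
            for cov in covariance_types:
                # Step 2: fit a GMM by EM, started from those means.
                gmm = GaussianMixture(
                    n_components=k, covariance_type=cov,
                    means_init=means_init, random_state=seed,
                ).fit(X)
                bic = gmm.bic(X)
                if best is None or bic < best["bic"]:
                    best = {"bic": bic, "k": k, "linkage": linkage,
                            "covariance_type": cov, "model": gmm}
    return best


rng = np.random.default_rng(0)
# Made-up data: three 2-D blobs of 50 points each.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

best = agglomerative_gmm_search(X)
labels = best["model"].predict(X)
print(best["k"], best["linkage"], best["covariance_type"], best["bic"])
```

Starting EM from the agglomerative centroids, rather than from a random start, is the design choice this sketch is meant to highlight; the grid of linkages and covariance types is arbitrary.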

The underlying model is a Gaussian mixture model (GMM), a statistical model of clustered data that, simply put, is a weighted combination of multiple normal distributions. Each cluster $k$ has a weight $w_k$ assigned to it (the weights are non-negative and sum to 1), and the combined probability distribution, $f(x)$, is of the form:

$f(x) = \sum\limits_{k = 1}^K {w_{k}f_{k}(x)} = \sum\limits_{k = 1}^K {\frac{w_{k}}{(2\pi)^{\frac{d}{2}}|\Sigma_{k}|^{\frac{1}{2}}}e^{-\frac{1}{2}(x - \mu_{k})^{T}\Sigma_{k}^{-1}(x - \mu_{k})}}$

where $K$ is the total number of clusters, $d$ is the dimensionality of the data, and $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of cluster $k$.
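
To make the mixture density concrete, here is a small sketch that evaluates $f(x)$ at a single point as a weighted sum of multivariate normal densities, using `scipy.stats.multivariate_normal`. The helper name `gmm_density` and the two-component parameters are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_density(x, weights, means, covariances):
    """Evaluate f(x) = sum_k w_k * N(x | mu_k, Sigma_k) at a single point x."""
    return sum(
        w * multivariate_normal(mean=mu, cov=cov).pdf(x)
        for w, mu, cov in zip(weights, means, covariances)
    )


# Made-up two-component mixture in d = 2 dimensions.
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covariances = [np.eye(2), 2.0 * np.eye(2)]

density_at_origin = gmm_density(np.array([0.0, 0.0]), weights, means, covariances)
print(density_at_origin)
```

At the origin, nearly all of the density comes from the first component, because the second component's mean is far away relative to its spread.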

An Expectation-Maximization (EM) algorithm is then run to estimate the model parameters (the weights, means, and covariances), and the fitted GMM is used to cluster the data.
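
For intuition, the EM loop can be written out by hand in the one-dimensional case. The sketch below is a teaching example under simplifying assumptions (1-D data, a fixed number of iterations, a crude quantile-based start). It is not what graspy or Sklearn run internally, and the names `responsibilities` and `em_gmm_1d` are hypothetical.

```python
import numpy as np


def responsibilities(x, w, mu, var):
    """E-step: posterior probability of each component for each point."""
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return dens / dens.sum(axis=1, keepdims=True)


def em_gmm_1d(x, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture with a plain EM loop (illustrative only)."""
    # Crude initialisation: spread the means across the data quantiles.
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    var = np.full(n_components, x.var())
    w = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        resp = responsibilities(x, w, mu, var)  # E-step
        nk = resp.sum(axis=0)                   # effective cluster sizes
        # M-step: re-estimate weights, means, and variances.
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Cluster each point by its most probable component.
    labels = responsibilities(x, w, mu, var).argmax(axis=1)
    return w, mu, var, labels


rng = np.random.default_rng(0)
# Made-up data: two 1-D Gaussian clusters centred at -4 and 4.
samples = np.concatenate([rng.normal(-4.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
w, mu, var, labels = em_gmm_1d(samples)
print(w, mu, var)
```

The final `argmax` over responsibilities is the step that turns the fitted mixture into a hard clustering.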

Let's look at a simple example, in which the algorithm tries all affinity options (`affinity="all"`) on a basic set of ten samples, the rows of a 10 x 10 identity matrix.

In [None]:
from graspy.cluster.autogmm import AutoGMMCluster

# Toy data: the ten rows of a 10 x 10 identity matrix
x = np.identity(10)
AutoGMM = AutoGMMCluster(min_components=3, affinity="all")
AutoGMM.fit(x)

The results for every fitted model are stored in the `results_` attribute as a dataframe.

In [None]:
AutoGMM.results_

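Here is an example with synthetic data loaded from a local CSV file, in which the first column holds the true cluster labels and the remaining columns hold the features.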

In [None]:
# Load the synthetic data once: column 0 holds the true labels, the rest are features
data = np.genfromtxt('/home/caseypw/data/synthetic.csv', delimiter=',', skip_header=0)
c_true = data[:, 0]
x = data[:, 1:]

AutoGMM = AutoGMMCluster(min_components=3, affinity="all")
AutoGMM.fit(x)

## K-Means Clustering (kclust)

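kclust is a k-means clustering algorithm that fits a model for each candidate number of clusters up to `max_clusters` and keeps the one with the highest silhouette score, as computed by Sklearn. Silhouette scores range from -1 to 1, and higher values indicate better-separated clusters.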
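
The selection rule can be sketched with Sklearn's `KMeans` and `silhouette_score`, keeping the model with the highest score since a higher silhouette means better-separated clusters. This is an illustration of the idea, not graspy's code: the helper `best_kmeans_by_silhouette` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_kmeans_by_silhouette(X, max_clusters=5, seed=0):
    """Hypothetical helper: fit k-means for k = 2..max_clusters, keep the best."""
    best_score, best_model = -np.inf, None
    # The silhouette score needs at least two clusters, so start at k = 2.
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_score, best_model = score, model
    return best_model, best_score


rng = np.random.default_rng(0)
# Made-up data: three well-separated 2-D blobs of 50 points each.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

best_model, best_score = best_kmeans_by_silhouette(X)
print(best_model.n_clusters, best_score)
```

Unlike BIC, the silhouette score does not depend on a likelihood, which makes it a natural criterion for k-means.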

Here is the same simple example.

In [None]:
from graspy.cluster.kclust import KMeansCluster

# Toy data: the ten rows of a 10 x 10 identity matrix
x = np.identity(10)
KMeansClust = KMeansCluster(max_clusters=5)
KMeansClust.fit(x)

KMeansClust.model_

Here is the same synthetic-data example, this time passing the true labels to `fit`.

In [None]:
# Load the synthetic data once: column 0 holds the true labels, the rest are features
data = np.genfromtxt('/home/caseypw/data/synthetic.csv', delimiter=',', skip_header=0)
c_true = data[:, 0]
x = data[:, 1:]

KMeansClust.fit(x, c_true)

KMeansClust.model_

## GraspyClust (gclust)

gclust is the last clustering algorithm covered here. It is purely a GMM approach, with no agglomerative clustering step.

Here is the simple example one last time.

In [None]:
from graspy.cluster.gclust import GaussianCluster

# Toy data: the ten rows of a 10 x 10 identity matrix
x = np.identity(10)
GClust = GaussianCluster()
GClust.fit(x)

GClust.model_

And here is the synthetic-data example, again passing the true labels to `fit`.

In [None]:
# Load the synthetic data once: column 0 holds the true labels, the rest are features
data = np.genfromtxt('/home/caseypw/data/synthetic.csv', delimiter=',', skip_header=0)
c_true = data[:, 0]
x = data[:, 1:]

GClust.fit(x, c_true)

GClust.model_