In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(font_scale=1.75)
sns.set_style("white")

from graspologic.cluster import DivisiveCluster

Hierarchical clustering is similar to the clustering algorithms introduced above, such as AutoGMM and K-Means, but it produces a hierarchy of clusters rather than a single flat partition. The two major types of hierarchical clustering algorithms are agglomerative and divisive. The former starts with every data point in its own cluster and gradually merges clusters in a "bottom-up" fashion; the latter starts with all data points in a single cluster and gradually divides it in a "top-down" fashion.
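
For contrast with the divisive approach used in the rest of this tutorial, here is a minimal sketch of the "bottom-up" direction using SciPy's `linkage`. The four toy points and the choice of single linkage are assumptions made only for illustration; they are not part of the dataset used below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight pairs of points on a line
points = np.array([[0.0], [0.1], [5.0], [5.1]])

# Agglomerative clustering: every point starts alone and pairs are merged.
# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
Z = linkage(points, method="single")
print(Z)

# Cut the tree into at most 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The last row of `Z` is the final merge that joins everything into one cluster. A divisive algorithm walks the same kind of tree in the opposite direction, starting from that single cluster.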

The DivisiveCluster algorithm implements hierarchical clustering with a "divisive" approach built on a chosen clustering algorithm such as AutoGMM. It first obtains predictions on the full dataset from the chosen algorithm, say AutoGMM, and then passes the subset of data belonging to each predicted cluster to AutoGMM again with min_components=1. If the best model AutoGMM finds for a predicted cluster contains more than one subcluster, each subcluster is clustered recursively in the same way; otherwise, that cluster becomes a leaf cluster. The algorithm terminates when every branch of the recursion has ended in a leaf cluster.
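
The recursion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not graspologic's implementation: it swaps AutoGMM for scikit-learn's `GaussianMixture` with BIC-based selection over 1 to `max_components` components, and the helper names `best_gmm` and `divisive_paths`, the `min_size` and `max_depth` guards, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def best_gmm(X, max_components=2, seed=0):
    """Fit GMMs with 1..max_components components and keep the lowest-BIC one."""
    models = [
        GaussianMixture(n_components=k, random_state=seed).fit(X)
        for k in range(1, max_components + 1)
    ]
    return min(models, key=lambda m: m.bic(X))


def divisive_paths(X, max_components=2, min_size=10, max_depth=5, seed=0):
    """Return, for each sample, the tuple of cluster choices from root to leaf."""
    paths = [() for _ in range(len(X))]

    def split(indices, depth):
        # Guards (assumptions of this sketch): stop on tiny subsets or deep trees
        if len(indices) < min_size or depth >= max_depth:
            return
        model = best_gmm(X[indices], max_components, seed)
        if model.n_components == 1:
            return  # the best model is a single component, so this is a leaf
        sub = model.predict(X[indices])
        for k in np.unique(sub):
            child = indices[sub == k]
            for i in child:
                paths[i] = paths[i] + (int(k),)
            split(child, depth + 1)  # recurse into each subcluster

    split(np.arange(len(X)), 0)
    return paths


# Hypothetical, well-separated toy data: four tight blobs in 2D
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in (-3, -2, 2, 3)])
paths = divisive_paths(X_demo)
print(sorted({p[:2] for p in paths}))  # distinct clusters two levels below the root
```

Each path records one cluster choice per level, so truncating the paths at a given length yields a flat clustering at that level.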

Consider the following synthetic hierarchical data made up of two levels of four Gaussian distributions in 3D. Each Gaussian distribution has a standard deviation of 0.5 in every dimension. The 4 means (-3, -2, 2 and 3 in every dimension) are symmetric about 0; the smallest 2 and largest 2 means are symmetric about -2.5 and 2.5, respectively. Hence, this dataset can be grouped into 2 clusters that are each a mixture of 2 Gaussian components or, at a finer level, into 4 clusters of 1 Gaussian component each. These are the two levels of the clustering hierarchy, in order of increasing granularity.

In [None]:
# generate synthetic data

np.random.seed(1)

n = 100  # number of data points per Gaussian component
d = 3  # number of dimensions

# Let Xij denote the ith Gaussian component within the jth top-level cluster
X11 = np.random.normal(-3, 0.5, size=(n, d))
X21 = np.random.normal(-2, 0.5, size=(n, d))
X12 = np.random.normal(2, 0.5, size=(n, d))
X22 = np.random.normal(3, 0.5, size=(n, d))
X = np.vstack((X11, X21, X12, X22))

# true labels at each level
y_lvl1 = np.repeat([0, 1], 2 * n)
y_lvl2 = np.repeat([0, 1, 2, 3], n)

In [None]:
np.random.seed(1)

# fit model and predict on data
dc = DivisiveCluster(max_components=2, cluster_method="gmm")

# with fcluster=True, each column of the result is a flat clustering at one level
pred = dc.fit_predict(X, fcluster=True)

### Visualize the clustering dendrogram

A hierarchical dendrogram, or tree, represents clusters level by level, from the root to the leaves. We plot the hierarchical clustering results so that each cluster at each level is shown in a unique color and each node is colored by its predicted cluster at that level. Since the root level contains only one trivial cluster, we show only the levels below the root.

In [None]:
# reorder the labels so that the clusters in each column receive increasingly-indexed labels
def relabel(pred):
    for i in range(pred.shape[1]):
        temp = pred[:,i].copy()
        _, index = np.unique(temp, return_index=True)
        # return unique labels in the order of their appearance
        uni_labels = temp[np.sort(index)]
        for label, ul in enumerate(uni_labels):
            inds = temp == ul
            temp[inds] = -label-1
        pred[:,i] = -(temp+1)
        
    return pred
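
To make the effect of this relabeling concrete, here is a small self-contained sketch applied to a single column. The function name `relabel_first_appearance` and the toy column are hypothetical; the vectorized approach is an alternative written for illustration, not the code used in this notebook. Note that the negative-number trick in `relabel` above relies on the incoming labels being non-negative, whereas this variant makes no such assumption.

```python
import numpy as np


def relabel_first_appearance(column):
    """Map labels to 0, 1, 2, ... in the order in which they first appear."""
    _, first_idx, inverse = np.unique(
        column, return_index=True, return_inverse=True
    )
    # rank[j] is the new label for the j-th sorted unique value,
    # determined by where that value first occurs in the column
    rank = np.empty_like(first_idx)
    rank[np.argsort(first_idx)] = np.arange(len(first_idx))
    return rank[inverse]


col = np.array([2, 2, 0, 0, 1, 1])  # hypothetical labels from one level
new_col = relabel_first_appearance(col)
print(new_col)  # [0 0 1 1 2 2]
```

Because samples from the same true cluster are stored contiguously in `X`, labels that increase with first appearance make the heatmap colors progress from left to right.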

In [None]:
# function to plot hierarchical clustering assignments

def plot(labels, n_level, title):
    fig, axs = plt.subplots(n_level, 1, figsize=(20, n_level+1.5), sharex=True, sharey=True, squeeze=False)
    for i in range(n_level):
        ax = axs[i, 0]  # squeeze=False keeps axs 2D even when n_level == 1
        sns.heatmap(labels[:, i].reshape((1,-1))+1, cbar=False, xticklabels=100, yticklabels="", center=0, cmap='RdBu_r', ax=ax)
        if i < n_level-1:
            ax.set(xticklabels='')
        else:
            ax.set_xlabel("Node Index")
        ax.set_ylabel(i+1)
    fig.text(0.1, 0.5, "Level", rotation=90, va="center", ha="center")
    fig.suptitle(title)

In [None]:
# true clustering dendrogram

y_true = np.vstack((y_lvl1, y_lvl2)).T
n_level = 2

plot(y_true, n_level, "True Clustering Dendrogram")

In [None]:
# estimated clustering dendrogram

pred = relabel(pred)
n_level = pred.shape[1]

plot(pred, n_level, "Estimated Clustering Dendrogram")
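
Beyond the visual comparison, agreement at each level could also be summarized numerically, for example with the adjusted Rand index (ARI), which is invariant to how cluster labels are numbered. The sketch below uses small hypothetical label arrays rather than the results above; with the notebook's variables, the same loop would run over the columns that `y_true` and `pred` have in common, assuming `pred` has one column per level.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical two-level labels for 8 samples (columns are levels)
y_true_demo = np.array(
    [[0, 0], [0, 0], [0, 1], [0, 1], [1, 2], [1, 2], [1, 3], [1, 3]]
)
# Same partition at both levels, but with the labels permuted
pred_demo = np.array(
    [[1, 3], [1, 3], [1, 2], [1, 2], [0, 0], [0, 0], [0, 1], [0, 1]]
)

# Compare only the levels present in both arrays
n_shared = min(y_true_demo.shape[1], pred_demo.shape[1])
scores = [
    adjusted_rand_score(y_true_demo[:, i], pred_demo[:, i])
    for i in range(n_shared)
]
for level, score in enumerate(scores, start=1):
    print(f"Level {level}: ARI = {score:.2f}")
```

An ARI of 1 means the two partitions are identical up to relabeling, while values near 0 indicate agreement no better than chance.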