Skip to content
mstrosaker edited this page Nov 21, 2013 · 3 revisions

Overview

Hierarchical clustering provides the ability to perform cluster analysis by constructing a series of clusters in a hierarchical fashion. The hclust package provides two classes in support of this objective:

  • DistanceMatrix represents the data used to construct the clusters
  • HClust constructs the clusters, and provides the ability to retrieve them

The algorithm is implemented is a bottom-up (agglomerative) clustering approach. Different linkage criteria may be specified.

The following snippet explains the most common usage scenario in a nutshell:

from hclust import DistanceMatrix, HClust

with open('somefile.dist', 'rb') as f:
    dmat = DistanceMatrix(f)

hc = HClust(dmat)

Once an HClust object is created, the clusters can be obtained several different ways, detailed below.

Distance Matrices

A DistanceMatrix object must be created and passed to the HClust constructor. The DistanceMatrix constructor takes a file-like stream as a parameter; this stream contains data in the following format:

        Turtle  Human   Tuna    Chicken Moth    Monkey  
Human   19.00  
Tuna    27.00   31.00  
Chicken 8.00    18.00   26.00  
Moth    33.00   36.00   41.00   31.00  
Monkey  18.00   1.00    32.00   17.00   35.00  
Dog     13.00   13.00   29.00   14.00   28.00   12.00

There are a total of 7 observations in this distance matrix. Note that there is no row corresponding to the first observation, and no column corresponding to the last. The entries in each line must be tab-delimited. This matrix may be read in with dmat = DistanceMatrix(stream).

An alternative approach is to use a full matrix, as in the following:

Title   Turtle  Human   Tuna    Chicken Moth    Monkey    Dog  
Turtle  0.00    19.00   27.00   8.00    33.00   18.00   13.00  
Human   19.00   0.00    31.00   18.00   36.00   1.00    13.00  
Tuna    27.00   31.00   0.00    26.00   41.00   32.00   29.00  
Chicken 8.00    18.00   26.00   0.00    31.00   17.00   14.00  
Moth    33.00   36.00   41.00   31.00   0.00    35.00   28.00  
Monkey  18.00   1.00    32.00   17.00   35.00   0.00    12.00  
Dog     13.00   13.00   29.00   14.00   28.00   12.00   0.00

This will produce the same results as the prior matrix. Note that this full matrix is symmetrical across the diagonal; its bottom half (below the diagonal) is identical to the first presented matrix. Again, the entries in each line must be tab-delimited. This matrix may be read in as dmat = DistanceMatrix(stream, full=True); note that the full parameter must be specified as True in this case.

It is typical to store these matrices in files. This approach was selected as it allows easier interoperability with the R programming language.

HClust Objects

A DistanceMatrix object is passed to the HClust constructor. A linkage criterion may also be specified with the linkage_criterion parameter:

  • 'average' (default): mean (average) linkage clustering, also known as UPGMA
  • 'max': maximum, known as complete linkage clustering
  • 'min': minimum, known as single-linkage clustering

The following examples use an HClust object created with the above sample distance matrices. The clusters can be obtained through the following techniques:

Cutting at a specified depth

hclust.cut(5) produces a list of 5 tuples, each representing a cluster: [('Chicken', 'Turtle'), ('Monkey', 'Human'), ('Tuna',), ('Moth',), ('Dog',)]

hclust.cut(10) produces a list of 3 tuples, each representing a cluster: [('Dog', 'Monkey', 'Human', 'Chicken', 'Turtle'), ('Tuna',), ('Moth',)]

Retrieving a specified number of clusters

hc.n_clusters(4) produces a list of 4 tuples: [('Tuna',), ('Moth',), ('Chicken', 'Turtle'), ('Dog', 'Monkey', 'Human')]

hc.n_clusters(2) produces ``[('Moth',), ('Tuna', 'Dog', 'Monkey', 'Human', 'Chicken', 'Turtle')]`, indicating that the Moth observation is the most dissimilar from the others.

Retrieving the entire list of clusters

cluster_list = hc.clusters retrieves a list of hclust.Node objects. The Node objects contain properties for the node id, depth, parent, and children, allowing the dendrogram to be reconstructed in its entirety.

Clone this wiki locally