-
Notifications
You must be signed in to change notification settings - Fork 1
User's Guide
Hierarchical clustering provides the ability to perform cluster analysis by constructing a series of clusters in a hierarchical fashion. The hclust package provides two classes in support of this objective:
- DistanceMatrix represents the data used to construct the clusters
- HClust constructs the clusters, and provides the ability to retrieve them
The algorithm is implemented is a bottom-up (agglomerative) clustering approach. Different linkage criteria may be specified.
The following snippet explains the most common usage scenario in a nutshell:
from hclust import DistanceMatrix, HClust
with open('somefile.dist', 'rb') as f:
dmat = DistanceMatrix(f)
hc = HClust(dmat)
Once an HClust object is created, the clusters can be obtained several different ways, detailed below.
A DistanceMatrix object must be created and passed to the HClust constructor. The DistanceMatrix constructor takes a file-like stream as a parameter; this stream contains data in the following format:
Turtle Human Tuna Chicken Moth Monkey
Human 19.00
Tuna 27.00 31.00
Chicken 8.00 18.00 26.00
Moth 33.00 36.00 41.00 31.00
Monkey 18.00 1.00 32.00 17.00 35.00
Dog 13.00 13.00 29.00 14.00 28.00 12.00
There are a total of 7 observations in this distance matrix. Note that there is no row corresponding to the first observation, and no column corresponding to the last. The entries in each line must be tab-delimited. This matrix may be read in with dmat = DistanceMatrix(stream)
.
An alternative approach is to use a full matrix, as in the following:
Title Turtle Human Tuna Chicken Moth Monkey Dog
Turtle 0.00 19.00 27.00 8.00 33.00 18.00 13.00
Human 19.00 0.00 31.00 18.00 36.00 1.00 13.00
Tuna 27.00 31.00 0.00 26.00 41.00 32.00 29.00
Chicken 8.00 18.00 26.00 0.00 31.00 17.00 14.00
Moth 33.00 36.00 41.00 31.00 0.00 35.00 28.00
Monkey 18.00 1.00 32.00 17.00 35.00 0.00 12.00
Dog 13.00 13.00 29.00 14.00 28.00 12.00 0.00
This will produce the same results as the prior matrix. Note that this full matrix is symmetrical across the diagonal; its bottom half (below the diagonal) is identical to the first presented matrix. Again, the entries in each line must be tab-delimited. This matrix may be read in as dmat = DistanceMatrix(stream, full=True)
; note that the full
parameter must be specified as True
in this case.
It is typical to store these matrices in files. This approach was selected as it allows easier interoperability with the R programming language.
A DistanceMatrix object is passed to the HClust constructor. A linkage criterion may also be specified with the linkage_criterion
parameter:
-
'average'
(default): mean (average) linkage clustering, also known as UPGMA -
'max'
: maximum, known as complete linkage clustering -
'min'
: minimum, known as single-linkage clustering
The following examples use an HClust object created with the above sample distance matrices. The clusters can be obtained through the following techniques:
Cutting at a specified depth
hclust.cut(5)
produces a list of 5 tuples, each representing a cluster: [('Chicken', 'Turtle'), ('Monkey', 'Human'), ('Tuna',), ('Moth',), ('Dog',)]
hclust.cut(10)
produces a list of 3 tuples, each representing a cluster: [('Dog', 'Monkey', 'Human', 'Chicken', 'Turtle'), ('Tuna',), ('Moth',)]
Retrieving a specified number of clusters
hc.n_clusters(4)
produces a list of 4 tuples: [('Tuna',), ('Moth',), ('Chicken', 'Turtle'), ('Dog', 'Monkey', 'Human')]
hc.n_clusters(2)
produces ``[('Moth',), ('Tuna', 'Dog', 'Monkey', 'Human', 'Chicken', 'Turtle')]`, indicating that the Moth observation is the most dissimilar from the others.
Retrieving the entire list of clusters
cluster_list = hc.clusters
retrieves a list of hclust.Node objects. The Node objects contain properties for the node id
, depth
, parent
, and children
, allowing the dendrogram to be reconstructed in its entirety.