# Bottom-Up Hierarchical Clustering

An alternative approach to clustering is to “grow” clusters from the bottom up. We can do this in the following way:

1. Make each input its own cluster of one.
2. As long as there are multiple clusters remaining, find the two closest clusters and merge them.

At the end, we’ll have one giant cluster containing all the inputs. If we keep track of the merge order, we can recreate any number of clusters by unmerging. For example, if we want three clusters, we can just undo the last two merges.
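To make the unmerging idea concrete, here is a minimal sketch on one-dimensional points. It is not the implementation developed in this section: the names `gap`, `merge_history`, and `undo_merges` are hypothetical, and using the smallest gap between members as the cluster distance is an assumption made only for illustration. The sketch records the clustering after every merge, so "undoing" merges is just looking up an earlier snapshot.

```python
from itertools import combinations
from typing import List, Tuple

Cluster = Tuple[float, ...]  # hypothetical: a cluster is a tuple of 1-D points


def gap(c1: Cluster, c2: Cluster) -> float:
    """Smallest distance between a point of c1 and a point of c2 (assumed metric)."""
    return min(abs(x - y) for x in c1 for y in c2)


def merge_history(points: List[float]) -> List[List[Cluster]]:
    """Merge the two closest clusters until one remains,
    saving a snapshot of the clustering after each merge."""
    clusters: List[Cluster] = [(p,) for p in points]  # step 1: clusters of one
    history = [list(clusters)]
    while len(clusters) > 1:                          # step 2: merge closest pair
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append(list(clusters))
    return history


def undo_merges(history: List[List[Cluster]], num_clusters: int) -> List[Cluster]:
    """Return the snapshot that has exactly num_clusters clusters."""
    num_points = len(history[0])
    return history[num_points - num_clusters]


history = merge_history([1.0, 2.0, 9.0, 10.0, 25.0])
three = undo_merges(history, 3)  # same as undoing the last two merges
```

With five inputs there are four merges in total, so asking for three clusters returns the snapshot taken after the second merge.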

In [33]:
from typing import List, NamedTuple, Union
from scratch.linear_algebra import Vector

class Leaf(NamedTuple):
    value: Vector

In [34]:
class Merged(NamedTuple):
    children: tuple
    order: int

In [35]:
leaf1 = ([10, 20],) # to make a 1-tuple you need the trailing comma
leaf2 = ([30, -15],) # otherwise Python treats the parentheses as parentheses

merged = (1, [leaf1, leaf2])  # a 2-tuple: (merge order, children)

def is_leaf(cluster):
    """a cluster is a leaf if it has length 1""" 
    return len(cluster) == 1

def get_children(cluster):
    """returns the two children of this cluster if it's a merged cluster;
    raises an exception if this is a leaf cluster"""
    if is_leaf(cluster):
        raise TypeError("a leaf cluster has no children")
    else:
        return cluster[1]

def get_values(cluster):
    """returns the value in this cluster (if it's a leaf cluster)
    or all the values in the leaf clusters below it (if it's not)"""
    if is_leaf(cluster):
        return cluster  # is already a 1-tuple containing value
    else:
        return [value
                for child in get_children(cluster)
                for value in get_values(child)]

In [37]:
assert get_values(merged) == [[10, 20], [30, -15]]
get_values(merged)

[[10, 20], [30, -15]]

In order to merge the closest clusters, we need some notion of the distance between clusters. We’ll use the *minimum distance* between elements of the two clusters, which merges the two clusters that are closest to touching (but will sometimes produce large chain-like clusters that aren’t very tight). If we wanted tight spherical clusters, we might use the *maximum distance* instead, as it merges the two clusters that fit in the smallest ball. Both are common choices, as is the *average distance*:
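As a rough illustration of how the three choices differ, the sketch below computes all of them for two small clusters given as plain lists of points. The helper name `linkage_distance`, its `aggregate` parameter, and the use of `math.dist` for Euclidean distance are assumptions made for this example, not part of the code in this section.

```python
import math
from typing import Callable, List

Vector = List[float]


def linkage_distance(values1: List[Vector],
                     values2: List[Vector],
                     aggregate: Callable[[List[float]], float] = min) -> float:
    """Aggregate all pairwise Euclidean distances between two clusters.
    (Hypothetical helper; math.dist requires Python 3.8+.)"""
    return aggregate([math.dist(v, w) for v in values1 for w in values2])


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


a = [[0, 0], [0, 1]]
b = [[3, 0], [4, 0]]

closest = linkage_distance(a, b, min)    # distance between the nearest pair
farthest = linkage_distance(a, b, max)   # distance between the farthest pair
average = linkage_distance(a, b, mean)   # mean over all four pairs
```

Here the minimum is the distance from `[0, 0]` to `[3, 0]`, the maximum is the distance from `[0, 1]` to `[4, 0]`, and the average falls between the two.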