# 6.0002 Lecture 12: Clustering

**Speaker:** Prof. John Guttag

## Machine Learning Paradigm
- observe set of examples: **training data**
- infer something about process that generated that data
- use inference to make predictions about previously unseen data: **test data**
- *supervised*: given a set of feature/label pairs, find a rule that predicts the label associated with a previously unseen input
- *unsupervised*: given a set of feature vectors (without labels), group them into "natural clusters"

## Clustering is an optimization problem
$$\textrm{variability}(c) = \sum_{e\in c}\textrm{distance}(\textrm{mean}(c), e)^2$$
$$\textrm{dissimilarity}(C) = \sum_{c \in C} \textrm{variability}(c)$$
- why not divide variability by size of cluster?
    - big and bad worse than small and bad
- is optimization problem finding a $C$ that minimizes dissimilarity$(C)$?
    - no, otherwise could put each example in its own cluster
- need a constraint, e.g.
    - minimum distance between clusters
    - number of clusters

## Two popular methods
- hierarchical clustering
- K-means clustering

## Hierarchical Clustering
- 1.) start by assigning each item to a cluster, so that if you have $N$ items, you now have $N$ clusters, each containing just one item.
- 2.) find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one fewer cluster
- 3.) continue the process until all items are clustered into a single cluster of size $N$
- this is called **agglomerative** clustering
- what does distance mean?
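
The three steps above can be sketched as a naive loop. This is an illustration under stated assumptions rather than the lecture's implementation: `agglomerate`, `euclidean`, and `closest_members` are hypothetical names, the points are made up, and the distance between two clusters is assumed to be the distance between their closest members.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_members(c1, c2):
    """Assumed cluster distance: shortest distance between any two members."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerate(items, cluster_distance=closest_members):
    """Return the clustering after each merge, from N clusters down to 1."""
    clusters = [[item] for item in items]        # step 1: N singleton clusters
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                     # step 3: repeat until one cluster
        best = None
        for i in range(len(clusters)):           # step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]   # made-up 1-D data
history = agglomerate(points)
for clustering in history:
    print(len(clustering), 'clusters:', clustering)
```

Keeping the whole merge history is what lets you pick a number of clusters after the fact, by stopping at any level of the hierarchy.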

## Linkage metrics
- *single-linkage*: consider the distance between one cluster and another cluster to be equal to the **shortest** distance from any member of one cluster to any member of the other cluster
- *complete-linkage*: consider the distance between one cluster and another cluster to be equal to the **greatest** distance from any member of one cluster to any member of the other cluster
- *average-linkage*: consider the distance between one cluster and another cluster to be equal to the **average** distance from any member of one cluster to any member of the other cluster
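
As a small sketch of how the three criteria differ, the functions below compute each linkage for the same pair of clusters. The function names and the two made-up 1-D clusters are assumptions for illustration, and Euclidean distance is assumed between members.

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [euclidean(a, b) for a, b in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))       # shortest pairwise distance

def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))       # greatest pairwise distance

def average_linkage(c1, c2):
    dists = pairwise_distances(c1, c2)
    return sum(dists) / len(dists)               # mean pairwise distance

left = [(0.0,), (1.0,)]     # made-up cluster
right = [(4.0,), (6.0,)]    # made-up cluster
print(single_linkage(left, right))
print(complete_linkage(left, right))
print(average_linkage(left, right))
```

The four pairwise distances here are 4, 6, 3, and 5, so the single, complete, and average linkages are 3, 6, and 4.5 respectively.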

## Clustering Algorithms
- hierarchical clustering
    - greedy algorithm
        - makes locally optimal decisions at each point
    - can select number of clusters using dendrogram
    - deterministic
    - flexible with respect to linkage criteria
    - slow
        - naive algorithm is $O(n^3)$
        - $O(n^2)$ algorithms exist for some linkage criteria
- K-means is a much faster greedy algorithm
    - most useful when you know how many clusters you want

## K-means algorithm
- pseudocode:
    - randomly choose k examples as initial centroids
    - while True:
        - create k clusters by assigning each example to closest centroid
        - compute k new centroids by averaging examples in each cluster
        - if centroids don't change:
            - break
- what is complexity of one iteration?
    - $k\cdot n \cdot d$, where $k$ is the number of clusters, $n$ is the number of points, and $d$ is the time required to compute the distance between a pair of points
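
A single pass through the loop body can be sketched with NumPy as follows. `kmeans_step` is a hypothetical helper, the data are made up, and the sketch assumes no cluster ends up empty; it is meant only to show where the $k \cdot n$ distance computations come from.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One k-means iteration: assign points, then recompute centroids.

    Assumes every centroid ends up with at least one point.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # distances[i, j] is the Euclidean distance from point i to centroid j,
    # so filling this (n, k) array takes n * k distance computations.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)            # index of the closest centroid
    new_centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

points = [[0, 0], [0, 1], [10, 0], [10, 1]]      # made-up data
initial = [[0, 0], [10, 0]]                      # two examples as initial centroids
labels, centroids = kmeans_step(points, initial)
print(labels, centroids)

# A second step leaves the centroids unchanged, so the loop would stop here.
labels2, centroids2 = kmeans_step(points, centroids)
print(np.allclose(centroids, centroids2))
```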

## Issues with k-means
- choosing the "wrong" k can lead to strange results
- result can depend upon initial centroids
    - number of iterations
    - even final results
        - greedy algorithm can find different local optima

## How to choose K
- *a priori* knowledge about application domain
    - there are two kinds of people in the world: $k=2$
    - there are five different types of bacteria: $k=5$
- search for a good $k$
    - try different values of $k$ and evaluate quality of results
    - run hierarchical clustering on subset of data

## Mitigating dependence on initial centroids
- try multiple sets of randomly chosen initial centroids
- select "best" result
- pseudocode:
    - best = kMeans(points)
    - for t in range(numTrials):
        - C = kMeans(points)
        - if dissimilarity(C) < dissimilarity(best):
            - best = C
    - return best

## An example
- many patients with 4 features each
    - heart rate in beats per minute
    - number of past heart attacks
    - age
    - ST elevation (binary)
- outcome (death) based on features
    - probabilistic, not deterministic
    - e.g. older people with multiple heart attacks at higher risk
- cluster, and examine **purity** of clusters relative to outcomes
    - enriched by people who died?

## class example

In [1]:
import pylab

In [2]:
# plot parameters

#set line width
pylab.rcParams['lines.linewidth'] = 6
#set general font size 
pylab.rcParams['font.size'] = 12
#set font size for labels on axes
pylab.rcParams['axes.labelsize'] = 18
#set length of major tick marks on x-axis
pylab.rcParams['xtick.major.size'] = 5
#set length of major tick marks on y-axis
pylab.rcParams['ytick.major.size'] = 5
#set size of markers
pylab.rcParams['lines.markersize'] = 10

In [3]:
# Minkowski distance
def minkowskiDist(v1, v2, p):
    # assumes v1 and v2 are equal length arrays of numbers
    dist = 0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)
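
The parameter `p` selects the familiar special cases: `p = 1` gives the Manhattan distance and `p = 2` the Euclidean distance, which is the value `Example.distance` passes. The vectorized variant below is a sketch under the same assumptions (equal-length numeric vectors); the name `minkowski` and the sample vectors are illustrative.

```python
import numpy as np

def minkowski(v1, v2, p):
    """Minkowski distance of order p between two equal-length vectors."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float((np.abs(v1 - v2) ** p).sum() ** (1 / p))

a, b = [0, 0], [3, 4]            # made-up vectors
manhattan = minkowski(a, b, 1)   # |3| + |4|
euclid = minkowski(a, b, 2)      # sqrt(3**2 + 4**2)
print(manhattan, euclid)
```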

In [32]:
class Example(object):
    def __init__(self, name, features, label=None):
        # Assumes features is an array of floats
        self.name = name
        self.features = features
        self.label = label
    
    def dimensionality(self):
        return len(self.features)
    
    def getFeatures(self):
        return self.features[:]

    def getLabel(self):
        return self.label
    
    def getName(self):
        return self.name
    
    def distance(self, other):
        return minkowskiDist(self.features, other.getFeatures(), 2)
    
    def __str__(self):
        return self.name + ':' + str(self.features) + ':'\
                + str(self.label)

## class cluster

In [43]:
class Cluster(object):
    def __init__(self, examples):
        """Assumes examples is a non-empty list of Examples"""
        self.examples = examples
        self.centroid = self.computeCentroid()
    
    def update(self, examples):
        """Assumes examples is a non-empty list of Examples
            Replace examples; return amount centroid has changed"""
        oldCentroid = self.centroid
        self.examples = examples
        self.centroid = self.computeCentroid()
        return oldCentroid.distance(self.centroid)
    
    def computeCentroid(self):
        vals = pylab.array([0.0]*self.examples[0].dimensionality())
        for e in self.examples:  # compute mean
            vals += e.getFeatures()
        centroid = Example('centroid', vals/len(self.examples))
        return centroid

    def getCentroid(self):
        return self.centroid
    
    def variability(self):
        totDist = 0
        for e in self.examples:
            totDist += (e.distance(self.centroid))**2
        return totDist
    
    def members(self):
        for e in self.examples:
            yield e
    
    def __str__(self):
        names = []
        for e in self.examples:
            names.append(e.getName())
        names.sort()
        result = 'Cluster with centroid '\
                + str(self.centroid.getFeatures()) + ' contains:\n '
        for e in names: 
            result = result + e + ', '
        return result[:-2]  # remove trailing comma and space
        

## evaluating a clustering

In [44]:
def dissimilarity(clusters):
    """ Assumes clusters is a list of Clusters
        Returns a measure of the total dissimilarity of the
        clusters in the list"""
    totDist = 0
    for c in clusters:
        totDist += c.variability()
    return totDist

## Patients

In [45]:
import numpy, random

In [46]:
class Patient(Example):
    pass

In [47]:
# z-scaling
def scaleAttrs(vals):
    vals = pylab.array(vals)
    mean = sum(vals) / len(vals)
    sd = numpy.std(vals)
    vals = vals - mean
    return vals/sd
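
As a quick check of what z-scaling does, the sketch below rescales a made-up list of heart rates and confirms that the result has mean 0 and standard deviation 1. The function name `z_scale` and the data are illustrative assumptions.

```python
import numpy as np

def z_scale(vals):
    """Subtract the mean and divide by the (population) standard deviation."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.mean()) / vals.std()

heart_rates = [60, 70, 80, 90, 100]   # made-up values
scaled = z_scale(heart_rates)
print(scaled)
print(scaled.mean(), scaled.std())    # approximately 0 and 1
```

After scaling, every feature contributes on a comparable scale to the Euclidean distance, so a wide-ranging feature such as heart rate no longer swamps a binary one such as ST elevation.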

In [48]:
def getData(toScale=False):
    # read in data
    hrList, stElevList, ageList, prevACSList, classList = [],[],[],[],[]
    # each line holds: heart rate, ST elevation, age, previous ACS, outcome
    with open('cardiacData.txt', 'r') as cardiacData:
        for l in cardiacData:
            l = l.split(',')
            hrList.append(int(l[0]))
            stElevList.append(int(l[1]))
            ageList.append(int(l[2]))
            prevACSList.append(int(l[3]))
            classList.append(int(l[4]))
    if toScale:
        hrList = scaleAttrs(hrList)
        stElevList = scaleAttrs(stElevList)
        ageList = scaleAttrs(ageList)
        prevACSList = scaleAttrs(prevACSList)
    #Build points
    points = []
    for i in range(len(hrList)):
        features = pylab.array([hrList[i], prevACSList[i],\
                                stElevList[i], ageList[i]])
        pIndex = str(i)
        points.append(Patient('P'+ pIndex, features, classList[i]))
    return points

## kmeans

In [53]:
def kmeans(examples, k, verbose=False):
    # get k randomly chosen initial centroids,
    # create cluster for each
    initialCentroids = random.sample(examples, k)
    clusters = []
    for e in initialCentroids:
        clusters.append(Cluster([e]))
    
    # iterate until centroids do not change
    converged = False
    numIterations = 0
    while not converged:
        numIterations += 1
        # create a list containing k distinct empty lists
        newClusters = []
        for i in range(k):
            newClusters.append([])
        
        # associate each example with closest centroid
        for e in examples:
            # find the centroid closest to e
            smallestDistance = e.distance(clusters[0].getCentroid())
            index = 0
            for i in range(1, k):
                distance = e.distance(clusters[i].getCentroid())
                if distance < smallestDistance:
                    smallestDistance = distance
                    index = i
            # add e to the list of examples for appropriate cluster
            newClusters[index].append(e)
        
        for c in newClusters:  # avoid having empty clusters
            if len(c) == 0:
                raise ValueError('Empty Cluster')
        
        # update each cluster; check if a centroid has changed
        converged = True
        for i in range(k):
            if clusters[i].update(newClusters[i]) > 0.0:
                converged = False
        if verbose:
            print('Iteration #' + str(numIterations))
            for c in clusters:
                print(c)
            print('')  # add blank line
    return clusters

In [60]:
def trykmeans(examples, numClusters, numTrials, verbose = False):
    """Calls kmeans numTrials times and returns the result with the
          lowest dissimilarity"""
    best = kmeans(examples, numClusters, verbose)
    minDissimilarity = dissimilarity(best)
    trial = 1
    while trial < numTrials:
        try:
            clusters = kmeans(examples, numClusters, verbose)
        except ValueError:
            continue #If failed, try again
        currDissimilarity = dissimilarity(clusters)
        if currDissimilarity < minDissimilarity:
            best = clusters
            minDissimilarity = currDissimilarity
        trial += 1
    return best

## examining results

In [61]:
def printClustering(clustering):
    """Assumes: clustering is a sequence of clusters
       Prints information about each cluster
       Returns list of fraction of pos cases in each cluster"""
    posFracs = []
    for c in clustering:
        numPts = 0
        numPos = 0
        for p in c.members():
            numPts += 1
            if p.getLabel() == 1:
                numPos += 1
        fracPos = numPos/numPts
        posFracs.append(fracPos)
        print('Cluster of size', numPts, 'with fraction of positives =',
              round(fracPos, 4))
    return pylab.array(posFracs)

def testClustering(patients, numClusters, seed = 0, numTrials = 5):
    random.seed(seed)
    bestClustering = trykmeans(patients, numClusters, numTrials)
    posFracs = printClustering(bestClustering)
    return posFracs

In [62]:
patients = getData()
for k in (2,):
    print('\n     Test k-means (k = ' + str(k) + ')')
    posFracs = testClustering(patients, k, 2)


     Test k-means (k = 2)
Cluster of size 118 with fraction of positives = 0.3305
Cluster of size 132 with fraction of positives = 0.3333


In [63]:
# now use toScale
patients = getData(True)
for k in (2,):
    print('\n     Test k-means (k = ' + str(k) + ')')
    posFracs = testClustering(patients, k, 2)


     Test k-means (k = 2)
Cluster of size 224 with fraction of positives = 0.2902
Cluster of size 26 with fraction of positives = 0.6923


## How many positives are there?

In [64]:
numPos = 0
for p in patients:
    if p.getLabel() == 1:
        numPos += 1
print('Total number of positive patients =', numPos)

Total number of positive patients = 83
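
The outputs above are enough for a short arithmetic check of how enriched the small cluster from the scaled run is. The sketch uses only numbers printed in this notebook; the variable names are illustrative.

```python
# Cluster sizes and positive fractions copied from the printed results above.
total_patients = 118 + 132                     # 250 patients in all
total_positives = 83
base_rate = total_positives / total_patients   # overall fraction of positives

# Scaled k = 2 run: recover the positive counts from the printed fractions.
small_pos = round(0.6923 * 26)
large_pos = round(0.2902 * 224)

print('base rate =', round(base_rate, 4))
print('positives per cluster =', small_pos, large_pos)
print('enrichment of small cluster =', round(0.6923 / base_rate, 2))
```

The overall positive rate is 0.332, so the unscaled clusters (0.3305 and 0.3333) are no better than the population as a whole, while the 26-patient cluster from the scaled run has roughly twice the base rate.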


## A hypothesis
- different subgroups of positive patients have different characteristics
- how might we test this?
- try some other values of k

In [65]:
patients = getData()
for k in (2, 4, 6):
    print('\n     Test k-means (k = ' + str(k) + ')')
    posFracs = testClustering(patients, k, 2)


     Test k-means (k = 2)
Cluster of size 118 with fraction of positives = 0.3305
Cluster of size 132 with fraction of positives = 0.3333

     Test k-means (k = 4)
Cluster of size 53 with fraction of positives = 0.2642
Cluster of size 85 with fraction of positives = 0.3529
Cluster of size 41 with fraction of positives = 0.3902
Cluster of size 71 with fraction of positives = 0.3239

     Test k-means (k = 6)
Cluster of size 41 with fraction of positives = 0.3902
Cluster of size 38 with fraction of positives = 0.2105
Cluster of size 38 with fraction of positives = 0.4211
Cluster of size 42 with fraction of positives = 0.381
Cluster of size 27 with fraction of positives = 0.3333
Cluster of size 64 with fraction of positives = 0.2812
