### CS4102 - Geometric Foundations of Data Analysis I
Prof. Götz Pfeiffer<br />
School of Mathematical and Statistical Sciences<br />
University of Galway

# Week 11: Clustering

1. Implement the Clustering Algorithm (single-linkage)
2. Generate random clustered points
3. Compute distance matrix for given points
4. Apply (1) to (3)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## 0. Distances

* From last week's discussion of abstract classes:

In [None]:
class Distance:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def __call__(self, p, q):
        raise NotImplementedError(f"don't know yet how to {self}(p, q).")
        
    def flatten(self, p, q):
        p = np.array(p).flatten()
        q = np.array(q).flatten()
        return p - q

* Then all the subclasses are essentially one-liners ...

In [None]:
class EuclideanDist(Distance):
    def __call__(self, p, q):
        return np.sqrt(np.sum(self.flatten(p, q)**2))

In [None]:
e = EuclideanDist("e")
e([1,2,3],[4,5,6])

In [None]:
class TaxicabDist(Distance):
    def __call__(self, p, q):
        return np.sum(np.abs(self.flatten(p, q)))   

In [None]:
t = TaxicabDist("t")
t([1,2,3],[4,5,6])

In [None]:
class InfinityDist(Distance):
    def __call__(self, p, q):
        return np.max(np.abs(self.flatten(p, q)))   

In [None]:
i = InfinityDist("i")
i([1,2,3],[4,5,6])
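
* The three distances above are the cases $p = 2$, $p = 1$ and $p \to \infty$ of one family. As a sketch only (the class name `MinkowskiDist` and its parameter `exponent` are not part of the lecture code, and the base class is repeated so that the block runs on its own), a fourth subclass might look like this:

```python
import numpy as np

class Distance:
    # same base class as above, repeated to keep this sketch self-contained
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def flatten(self, p, q):
        p = np.array(p).flatten()
        q = np.array(q).flatten()
        return p - q

class MinkowskiDist(Distance):
    # hypothetical subclass: stores the exponent alongside the name
    def __init__(self, name, exponent):
        super().__init__(name)
        self.exponent = exponent

    def __call__(self, p, q):
        diff = np.abs(self.flatten(p, q))
        return np.sum(diff**self.exponent)**(1 / self.exponent)

m1 = MinkowskiDist("m1", 1)   # agrees with TaxicabDist
m2 = MinkowskiDist("m2", 2)   # agrees with EuclideanDist
print(m1([1,2,3],[4,5,6]), m2([1,2,3],[4,5,6]))
```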

## 2. Random Clusters

In [None]:
import random

* The `random` module has a `normalvariate` function that generates normally distributed values

In [None]:
random.normalvariate?

In [None]:
values = [random.normalvariate(2, 1) for _ in range(1000)]

In [None]:
plt.plot(values, 'r.')

* Let's check how closely these random data follow a bell curve

In [None]:
from collections import Counter
bins = [int(np.floor(2*x)) for x in values]  # floor, not int(): int() rounds towards 0 and would make bin 0 twice as wide
counter = Counter(bins)
print(counter)

In [None]:
plt.plot(counter.keys(), counter.values(), 'bo')
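
* A rough numerical check, as a sketch: for a normal distribution, about 68% of the values lie within one standard deviation of the mean and about 95% within two. The seed, the sample size and the helper name `fraction_within` are arbitrary choices made for this illustration.

```python
import random

def fraction_within(values, mean, radius):
    # fraction of the values at distance at most radius from the mean
    return sum(abs(x - mean) <= radius for x in values) / len(values)

random.seed(0)  # fixed seed, only to make the illustration reproducible
sample = [random.normalvariate(2, 1) for _ in range(10000)]
one_sigma = fraction_within(sample, 2, 1)
two_sigma = fraction_within(sample, 2, 2)
print(one_sigma, two_sigma)   # expect roughly 0.68 and 0.95
```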

* Now for a $2$-dimensional cluster ...

In [None]:
centre = (1,2)
count = 100
xx = [random.normalvariate(centre[0], 1) for _ in range(count)]
yy = [random.normalvariate(centre[1], 1) for _ in range(count)]
plt.plot(xx, yy, 'bo')

In [None]:
def cluster_points(centre, count, spread):
    xx = [random.normalvariate(centre[0], spread) for _ in range(count)]
    yy = [random.normalvariate(centre[1], spread) for _ in range(count)]
    return xx, yy

In [None]:
pp = cluster_points((0,0), 10, 1)
print(pp)

In [None]:
plt.plot(*pp, 'g*')

In [None]:
centres = [(x, y) for x in range(4) for y in range(4)]
print(centres)

In [None]:
xx = []
yy = []
for c in centres:
    pp = cluster_points(c, 100, 0.2)
    xx.extend(pp[0])
    yy.extend(pp[1])
plt.plot(xx, yy, 'm.')

In [None]:
def random_clusters(centres, count, spread):
    xx = []
    yy = []
    for c in centres:
        pp = cluster_points(c, count, spread)
        xx.extend(pp[0])
        yy.extend(pp[1])
    return xx, yy

In [None]:
cc = random_clusters(centres, 10, 0.1)
plt.plot(*cc, 'g+')

## 1. Clustering Algorithm

In [None]:
plt.plot(*cluster_points(centres[0], 10, 0.1 ),'bo')

* Sample data from the lectures.

In [None]:
distances = [
    [],
    [11],
    [10, 3],
    [14, 13, 12],
    [22, 21, 20, 16],
]
for j in range(5):
    for i in range(j):
        distances[i].append(distances[j][i])
    distances[j].append(0)
distances = np.array(distances, dtype=float)

In [None]:
distances

### Find

1. the minimal nonzero value
2. its $(i,j)$-position

* We'll try a 'naive' and straightforward method first

In [None]:
min_val = np.inf  # larger than any distance in the matrix
min_pos = None
r, c = distances.shape
r, c

In [None]:
for i in range(r):
    for j in range(c):
        val = distances[i, j]
        if val and val < min_val:
            min_val = val
            min_pos = (i, j)
            

In [None]:
min_val

In [None]:
min_pos

* Next, let's see how Numpy could help
* There is a function `np.min`

In [None]:
np.min?

In [None]:
np.min(distances)

* Oops. Of course we want a *nonzero* minimum.
* Let's try `np.max` instead.

In [None]:
np.max(distances)

* That looks right.
* But where in the matrix is it?

In [None]:
np.max?

In [None]:
np.argmax(distances)

* And what is that supposed to mean?

In [None]:
np.argmax?

* Ah: it's the index in the flattened version of `distances`.
* To unravel its position in terms of rows and columns of the original matrix, we're told to use `unravel_index`.

In [None]:
np.unravel_index(np.argmax(distances), distances.shape)
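
* What `unravel_index` does for a matrix with `c` columns (in Numpy's default row-by-row order) can be spelled out with integer division: the flat index `k` sits in row `k // c` and column `k % c`. A small sketch on a made-up $2 \times 3$ array (the name `grid` is just for this example):

```python
import numpy as np

grid = np.array([[4, 9, 2],
                 [7, 1, 8]])
k = np.argmax(grid)                       # index into the flattened array
row, col = divmod(int(k), grid.shape[1])  # quotient = row, remainder = column
unravelled = np.unravel_index(k, grid.shape)
print(k, (row, col), unravelled)
```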

* Now for the minimum.
* In order to have Numpy ignore $0$s, we apply a trick and turn all $0$ entries into `np.nan`.
* This works because Numpy exposes the underlying IEEE 754 [floating point arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic), which has NaN ('not a number') as a special value.

In [None]:
for i in range(r):
    distances[i][i] = np.nan
distances

In [None]:
np.min(distances)

In [None]:
np.min?

In [None]:
np.nanmin(distances)

In [None]:
np.nanmin?

In [None]:
np.nanargmin(distances)

In [None]:
i, j = np.unravel_index(np.nanargmin(distances), distances.shape)
i, j
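
* An alternative to the NaN trick, sketched here for comparison only: put `np.inf` on the diagonal instead. Infinity compares larger than every finite distance, so the ordinary `np.min` and `np.argmin` already skip it. The array `sample` is a made-up example, not the lecture data.

```python
import numpy as np

sample = np.array([[0., 5., 2.],
                   [5., 0., 7.],
                   [2., 7., 0.]])
masked = sample.copy()            # leave the original matrix untouched
np.fill_diagonal(masked, np.inf)  # inf is never the minimum
smallest = np.min(masked)
row, col = np.unravel_index(np.argmin(masked), masked.shape)
print(smallest, (row, col))
```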

In [None]:
distances[[i,j]]

In [None]:
np.delete?

In [None]:
mat = np.delete(distances, (i,j), axis=1)
mat

In [None]:
mat[[i,j]]

In [None]:
new = np.min(mat[[i,j]], axis=0)
new

In [None]:
mat = np.delete(mat, (i,j), axis=0)
mat

In [None]:
mat = np.append(mat, [new], axis=0)
mat

In [None]:
new = np.append(new, [np.nan], axis=0)
new

In [None]:
mat = np.append(mat, new.reshape(-1, 1), axis=1)
mat

* Collate these operations into a function (assuming that the diagonal values of `distances` are `np.nan`):

In [None]:
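def collapse(distances):
    """Merge the two closest clusters; return the merge distance and the reduced matrix."""
    min_val = np.nanmin(distances)
    i, j = np.unravel_index(np.nanargmin(distances), distances.shape)
    # drop columns i and j, then take the entrywise minimum of rows i and j:
    # these are the single-linkage distances from the merged cluster to all others
    mat = np.delete(distances, (i,j), axis=1)
    new = np.min(mat[[i,j]], axis=0)
    # drop rows i and j, and append the merged cluster as last row and last column
    mat = np.delete(mat, (i,j), axis=0)
    mat = np.append(mat, [new], axis=0)
    new = np.append(new, [np.nan], axis=0)
    mat = np.append(mat, new.reshape(-1, 1), axis=1)
    return min_val, mat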

In [None]:
collapse(distances)

In [None]:
L = [0]
mat = distances

In [None]:
val, mat = collapse(mat)
L.append(val)
mat

In [None]:
val, mat = collapse(mat)
L.append(val)
mat

In [None]:
val, mat = collapse(mat)
L.append(val)
mat

In [None]:
val, mat = collapse(mat)
L.append(val)
mat

In [None]:
L

### Clustering Algorithm

In [None]:
def clustering(distances):
    """Return the merge distances of single-linkage clustering, starting with 0.0."""
    L = [0.0]
    mat = distances
    while len(mat) > 1:
        val, mat = collapse(mat)
        L.append(val)
    return L

In [None]:
clustering(distances)
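
* An independent cross-check, as a sketch rather than part of the lecture's method: the merge distances of single-linkage clustering coincide with the edge weights of a minimum spanning tree, which Kruskal's algorithm collects in increasing order. The function name `merge_heights` and the union-find helper are introduced here for illustration only; the weights are the sample data from above.

```python
def merge_heights(lower):
    # lower[j][i] is the distance between points i < j (lower triangle)
    n = len(lower)
    edges = sorted((lower[j][i], i, j) for j in range(n) for i in range(j))
    parent = list(range(n))       # union-find: each point starts as its own cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    heights = [0.0]
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # the edge joins two different clusters
            parent[ri] = rj
            heights.append(float(weight))
    return heights

lower = [[], [11], [10, 3], [14, 13, 12], [22, 21, 20, 16]]
print(merge_heights(lower))       # prints [0.0, 3.0, 10.0, 12.0, 16.0]
```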

## 3. Distance Matrix

In [None]:
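def distances(points, dist):
    # note: this definition rebinds the name `distances` used for the sample matrix above
    # dtype=float, so that NaN can be stored even if dist returns integers
    mat = np.array([[dist(x, y) for x in points] for y in points], dtype=float)
    np.fill_diagonal(mat, np.nan)
    return mat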

In [None]:
e

In [None]:
points = np.array(cc).T
points[0]

In [None]:
dd = distances(points, e)
dd
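
* For the Euclidean distance, the double loop can be replaced by broadcasting. This is a sketch under the assumption that the points are given one per row; the function name `euclidean_matrix` is made up for this example, and it handles only the Euclidean case, not a general `dist` object.

```python
import numpy as np

def euclidean_matrix(points):
    points = np.asarray(points, dtype=float)
    # diff[a, b] is the vector points[a] - points[b], by broadcasting
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    mat = np.sqrt(np.sum(diff**2, axis=-1))
    np.fill_diagonal(mat, np.nan)   # same convention as above: NaN on the diagonal
    return mat

pts = [(0, 0), (3, 4), (0, 8)]
print(euclidean_matrix(pts))
```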

## 4. Apply (1) to (3)

In [None]:
clu = clustering(dd)
print(clu)

* Barcodes:

In [None]:
plt.plot(clu, range(len(clu)), 'r>')

In [None]:
plt.bar(range(len(clu)), clu)

In [None]:
clu[160 - 16]

In [None]:
clu[160 - 16 + 1]
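
* How to read the list of merge distances: each merge reduces the number of clusters by one, so at scale $t$ the number of clusters is the number of points minus the number of merges at distance at most $t$. The widest gap between consecutive merge distances then suggests a natural number of clusters. A sketch using the sample values from the lectures (the helper names `clusters_at` and `widest_gap` are made up here):

```python
def clusters_at(heights, t):
    # heights[0] == 0.0 stands for the start; heights[1:] are the merge distances
    n_points = len(heights)
    merges = sum(1 for h in heights[1:] if h <= t)
    return n_points - merges

def widest_gap(heights):
    # index k such that heights[k+1] - heights[k] is largest
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    return max(range(len(gaps)), key=gaps.__getitem__)

heights = [0.0, 3.0, 10.0, 12.0, 16.0]
k = widest_gap(heights)
print(clusters_at(heights, 5), k, len(heights) - k)
```

* For 160 points drawn around 16 well-separated centres, one would expect the widest gap at `k = 160 - 16`, which is what the last two cells inspect.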