## Contents
- <a href='#7-1-1'>Exercise 7.1.1</a>
- <a href='#7-1-2'>Exercise 7.1.2</a>
- <a href='#7-2-1'>Exercise 7.2.1</a>
- <a href='#7-2-2'>Exercise 7.2.2</a>
- <a href='#7-2-3'>Exercise 7.2.3</a>
    - <a href='#7-2-3-a'>Exercise 7.2.3 (a)</a>
    - <a href='#7-2-3-b'>Exercise 7.2.3 (b)</a>
- <a href='#7-2-4'>Exercise 7.2.4</a>
- <a href='#7-2-5'>Exercise 7.2.5 Computing clustroids</a>

In [1]:
from __future__ import division
import numpy as np

<a id='7-1-1'></a>
# 7.1.1 

Monte Carlo integration of E(|X-Y|), where X and Y are independent U(0,1) random variables. The exact value is 1/3, so the estimate below should be close to 0.333.

In [2]:
n = 100000
sum = 0
for i in range(n):
    x = np.random.rand()
    y = np.random.rand()
    sum += np.abs(x-y)
sum/n

0.33451502070000289
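
To judge how far the estimate can be trusted, the same simulation can also report its standard error. The following is a minimal sketch using only the standard library; the helper name `mc_mean_abs_diff` and the fixed seed are assumptions made for illustration and are not part of the notebook.

```python
import math
import random

def mc_mean_abs_diff(n=100000, seed=0):
    """Return (estimate, standard_error) for E|X - Y| with X, Y ~ U(0,1)."""
    rng = random.Random(seed)  # seeded only so the run is repeatable
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        d = abs(rng.random() - rng.random())
        total += d
        total_sq += d * d
    mean = total / n
    variance = total_sq / n - mean ** 2  # variance of the sampled |X - Y|
    return mean, math.sqrt(variance / n)

estimate, std_err = mc_mean_abs_diff()
print(estimate)
print(std_err)
```

With n = 100000 the standard error is well below 0.001, so an estimate within a few thousandths of 1/3 is what one would expect.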

<a id='7-1-2'></a>
# 7.1.2 

Simulation to estimate the expected Euclidean distance between two random points of a unit square.

In [6]:
n = 100000
sum_ed = 0
for i in range(n):
    x = np.random.rand(2)
    y = np.random.rand(2)
    sum_ed += np.sqrt((x[0]-y[0])**2+(x[1]-y[1])**2)
sum_ed/n

The estimate should be close to the exact value, approximately 0.5214.
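
The expected distance between two uniform random points of the unit square has a known closed form, (2 + sqrt(2) + 5 ln(1 + sqrt(2)))/15, which gives a target for the simulation. The sketch below compares a standard-library simulation against that value; `mc_mean_distance` and the seed are hypothetical choices for illustration.

```python
import math
import random

def mc_mean_distance(n=100000, seed=0):
    """Monte Carlo estimate of the mean distance between two random points."""
    rng = random.Random(seed)  # seeded only so the run is repeatable
    total = 0.0
    for _ in range(n):
        dx = rng.random() - rng.random()
        dy = rng.random() - rng.random()
        total += math.sqrt(dx * dx + dy * dy)
    return total / n

# closed-form expected distance for the unit square
exact = (2 + math.sqrt(2) + 5 * math.log(1 + math.sqrt(2))) / 15
estimate = mc_mean_distance()
print(exact)
print(estimate)
```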

<a id='7-2-1'></a>
# 7.2.1

In [432]:
def Euclidean(x,y):
    """
    This function returns the Euclidean distance between two vectors
    of a Euclidean space.
    """
    xc = np.array(x)
    yc = np.array(y)
    return np.sqrt(np.dot(xc-yc,xc-yc))

In [382]:
# sqrt[(4-1)^2+(4-2)^2]
Euclidean(np.array([4,4]), np.array([1,2]))

3.6055512754639891

In [383]:
Euclidean([1],[0])

1.0

In [384]:
def mean(x):
    """
    This function takes as input a list of points (i.e. a cluster) and
    returns their component-wise mean, i.e. the centroid. The output is
    a tuple so that it is hashable and can be used as a dictionary key
    when looking up the cluster index.
    """
    N = len(x)
    n = len(x[0])
    sum_vec = np.zeros(n)
    for point in x:
        sum_vec += np.array(point)
    mean_vec = sum_vec / N
    return tuple(mean_vec)

# mean([[1],[2]])
# mean([[1,2],[3,4],[5,6]])

In [428]:
def agg_(clusters, print_summary = True):
    """
    This function takes as input a list of clusters in Euclidean space,
    each stored as a [centroid, points] pair with the centroid as a tuple,
    and performs agglomerative clustering by repeatedly merging the two
    clusters whose centroids are closest.
    
    Note that the agglomerative clustering is done in place on the
    clusters list passed in.
    """
    step = 1
    while len(clusters) > 1:
#     while step < 3:
        # clusters hash table (use centroids as hash keys)
        clusters_ix = {el[0]:i for i,el in enumerate(clusters)}
        # double loop to consider the minimal distance between all pairs of clusters
        n = len(clusters)
        min_dist = 2**32-1
        c1 = None
        c2 = None
        for i in range(n-1):
            for j in range(i+1,n):
                # the distance between centroids of cluster i and cluster j
                distance_ij = Euclidean(clusters[i][0], clusters[j][0])
                if distance_ij < min_dist:
                    min_dist = distance_ij
                    c1 = clusters[i]
                    c2 = clusters[j]
        # merge the two clusters that result in minimum Euclidean distance
        new_cluster = c1[1] + c2[1]
        new_centroid = mean(new_cluster)
        clusters.append([new_centroid, new_cluster])
        # remove the merged clusters from the list 
        del clusters[max(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        del clusters[min(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        if print_summary:
            print 'Step %d:' % step
            print 'Merged clusters: %s and %s' %(str(c1[1]),str(c2[1]))
            print 'Minimum distance: %f' % min_dist
            print 'New clusters list:'
            print [el[1] for el in clusters] 
            print 'New centroids:'
            print [el[0] for el in clusters]
            print ''
            print '--------------------------------------------------------'
            print ''
        step += 1

In [429]:
# an array storing centroid and cluster
clusters = [[(i**2,), [[i**2]]] for i in range(1,10)]
clusters

[[(1,), [[1]]],
 [(4,), [[4]]],
 [(9,), [[9]]],
 [(16,), [[16]]],
 [(25,), [[25]]],
 [(36,), [[36]]],
 [(49,), [[49]]],
 [(64,), [[64]]],
 [(81,), [[81]]]]

In [430]:
agg_(clusters)

Step 1:
Merged clusters: [[1]] and [[4]]
Minimum distance: 3.000000
New clusters list:
[[[9]], [[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[1], [4]]]
New centroids:
[(9,), (16,), (25,), (36,), (49,), (64,), (81,), (2.5,)]

--------------------------------------------------------

Step 2:
Merged clusters: [[9]] and [[1], [4]]
Minimum distance: 6.500000
New clusters list:
[[[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[9], [1], [4]]]
New centroids:
[(16,), (25,), (36,), (49,), (64,), (81,), (4.666666666666667,)]

--------------------------------------------------------

Step 3:
Merged clusters: [[16]] and [[25]]
Minimum distance: 9.000000
New clusters list:
[[[36]], [[49]], [[64]], [[81]], [[9], [1], [4]], [[16], [25]]]
New centroids:
[(36,), (49,), (64,), (81,), (4.666666666666667,), (20.5,)]

--------------------------------------------------------

Step 4:
Merged clusters: [[36]] and [[49]]
Minimum distance: 13.000000
New clusters list:
[[[64]], [[81]], [[9], [1], [4]], [[16],

In [431]:
#final clusters list
clusters

[[(31.666666666666668,), [[9], [1], [4], [16], [25], [36], [49], [64], [81]]]]
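
The merge order above can be cross-checked with a compact sketch of the same centroid rule. `centroid_merge_distances` is a hypothetical helper, not part of the notebook; it tracks clusters by list position rather than by centroid key, which is an assumption made here so that two clusters sharing a centroid could not collide.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = float(len(points))
    return [sum(coords) / n for coords in zip(*points)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_merge_distances(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the centroid distance recorded at each merge, in order.
    """
    clusters = [[list(p)] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        # delete the larger index first so the smaller one stays valid
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
        history.append(d)
    return history

history = centroid_merge_distances([[k ** 2] for k in range(1, 10)])
print(history[:4])
```

For the squares 1, 4, ..., 81 the first four merge distances are 3, 6.5, 9 and 13, in agreement with Steps 1 to 4 above.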

<a id='7-2-2'></a>
# 7.2.2

(a) Using the minimum distance between any two points, one from each cluster, as the criterion for selecting the two clusters to merge.

In [343]:
def mins(x,y):
    """
    This function takes as input two clusters, each represented as a list of
    points (i.e. vectors). The output is the minimum distance between any two
    points, one from each cluster.
    """
    running_min = float('inf')
    for pt_x in x:
        for pt_y in y:
            # compute each pairwise distance only once
            distance_xy = Euclidean(pt_x,pt_y)
            if distance_xy < running_min:
                running_min = distance_xy
    return running_min

In [344]:
Euclidean([1],[3])

2.0

In [345]:
mins([[1]],[[3],[2],[0.5]])

0.5
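
The minimum-distance rule can also be written as a single `min` over all cross-cluster pairs. This is a minimal standalone sketch; `euclid` and `single_link` are illustrative names, not functions from the notebook.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(x, y):
    """Smallest distance between a point of x and a point of y."""
    return min(euclid(p, q) for p in x for q in y)

print(single_link([[1]], [[3], [2], [0.5]]))  # closest pair is 1 and 0.5
print(single_link([[9]], [[1], [4]]))         # closest pair is 9 and 4
```

The second call gives 5, which is the distance reported when [[9]] is merged with [[1], [4]] under this rule.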

In [425]:
def agg_(clusters, print_summary = True, dist = 'Euclidean'):
    """
    This function takes as input a list of clusters in Euclidean space,
    each stored as a [centroid, points] pair with the centroid as a tuple,
    and performs agglomerative clustering by repeatedly merging the two
    clusters that are closest under the measure selected by dist.
    
    Note that the agglomerative clustering is done in place on the
    clusters list passed in.
    """
    
    # specifying the distance function used
    if dist == 'Euclidean':
        f_dist = Euclidean
        r_ = 0
    if dist == 'mins':
        f_dist = mins
        r_ = 1
    step = 1    
    while len(clusters) > 1:
#     while step < 3:
        # clusters hash table (use centroids as hash keys)
        clusters_ix = {el[0]:i for i,el in enumerate(clusters)}
        # double loop to consider the minimal distance between all pairs of clusters
        n = len(clusters)
        min_dist = 2**32-1
        c1 = None
        c2 = None
        for i in range(n-1):
            for j in range(i+1,n):
                # the distance between cluster i and cluster j under the chosen measure
                distance_ij = f_dist(clusters[i][r_], clusters[j][r_])
                if distance_ij < min_dist:
                    min_dist = distance_ij
                    c1 = clusters[i]
                    c2 = clusters[j]
        # merge the two clusters that result in minimum Euclidean distance
        new_cluster = c1[1] + c2[1]
        new_centroid = mean(new_cluster)
        clusters.append([new_centroid, new_cluster])
        # remove the merged clusters from the list 
        del clusters[max(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        del clusters[min(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        if print_summary:
            print 'Step %d:' % step
            print 'Merged clusters: %s and %s' %(str(c1[1]),str(c2[1]))
            print 'Minimum distance: %f' % min_dist
            print 'New clusters list:'
            print [el[1] for el in clusters] 
            print 'New centroids:'
            print [el[0] for el in clusters]
            print ''
            print '--------------------------------------------------------'
            print ''
        step += 1

In [426]:
# an array storing centroid and cluster
clusters = [[(i**2,), [[i**2]]] for i in range(1,10)]
clusters

[[(1,), [[1]]],
 [(4,), [[4]]],
 [(9,), [[9]]],
 [(16,), [[16]]],
 [(25,), [[25]]],
 [(36,), [[36]]],
 [(49,), [[49]]],
 [(64,), [[64]]],
 [(81,), [[81]]]]

In [427]:
agg_(clusters, dist = 'mins')

Step 1:
Merged clusters: [[1]] and [[4]]
Minimum distance: 3.000000
New clusters list:
[[[9]], [[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[1], [4]]]
New centroids:
[(9,), (16,), (25,), (36,), (49,), (64,), (81,), (2.5,)]

--------------------------------------------------------

Step 2:
Merged clusters: [[9]] and [[1], [4]]
Minimum distance: 5.000000
New clusters list:
[[[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[9], [1], [4]]]
New centroids:
[(16,), (25,), (36,), (49,), (64,), (81,), (4.666666666666667,)]

--------------------------------------------------------

Step 3:
Merged clusters: [[16]] and [[9], [1], [4]]
Minimum distance: 7.000000
New clusters list:
[[[25]], [[36]], [[49]], [[64]], [[81]], [[16], [9], [1], [4]]]
New centroids:
[(25,), (36,), (49,), (64,), (81,), (7.5,)]

--------------------------------------------------------

Step 4:
Merged clusters: [[25]] and [[16], [9], [1], [4]]
Minimum distance: 9.000000
New clusters list:
[[[36]], [[49]], [[64]], [[81]], 

(b) Using the average of the distances between all pairs of points, one from each of the two clusters, as the distance measure.

In [360]:
def avg(x,y):
    """
    This function takes as input two clusters, each represented as a list of
    points (i.e. vectors). The output is the average distance over all pairs
    of points, one from each of the two clusters.
    """
    nx = len(x)
    ny = len(y)
    running_sum = 0
    for pt_x in x:
        for pt_y in y:
            running_sum += Euclidean(pt_x,pt_y)
    return running_sum/(nx*ny) # total number of pairs is nx*ny (i.e., by multiplication rule)
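
As a worked check of the averaging rule on the squares data, the sketch below evaluates two candidate merges by hand. `euclid` and `average_link` are illustrative names used only in this example.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_link(x, y):
    """Mean distance over all len(x) * len(y) cross-cluster pairs."""
    total = sum(euclid(p, q) for p in x for q in y)
    return total / float(len(x) * len(y))

print(average_link([[9]], [[1], [4]]))        # (8 + 5) / 2
print(average_link([[16]], [[9], [1], [4]]))  # (7 + 15 + 12) / 3
```

The first value is 6.5. The second is about 11.33, which is larger than the distance 9 between [[16]] and [[25]], so under this rule 16 joins 25 rather than the cluster {9, 1, 4}.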

In [422]:
def agg_(clusters, print_summary = True, dist = 'Euclidean'):
    """
    This function takes as input a list of clusters in Euclidean space,
    each stored as a [centroid, points] pair with the centroid as a tuple,
    and performs agglomerative clustering by repeatedly merging the two
    clusters that are closest under the measure selected by dist.
    
    Note that the agglomerative clustering is done in place on the
    clusters list passed in.
    """
    
    # specifying the distance function used
    if dist == 'Euclidean':
        f_dist = Euclidean
        r_ = 0 # index to be used in the argument
    if dist == 'mins':
        f_dist = mins
        r_ = 1
    if dist == 'avg':
        f_dist = avg
        r_ = 1
    
    # start main code to conduct clustering
    step = 1    
    while len(clusters) > 1:
#     while step < 3:
        # clusters hash table (use centroids as hash keys)
        clusters_ix = {el[0]:i for i,el in enumerate(clusters)}
        # double loop to consider the minimal distance between all pairs of clusters
        n = len(clusters)
        min_dist = 2**32-1
        c1 = None
        c2 = None
        for i in range(n-1):
            for j in range(i+1,n):
                # the distance between cluster i and cluster j under the chosen measure
                distance_ij = f_dist(clusters[i][r_], clusters[j][r_])
                if distance_ij < min_dist:
                    min_dist = distance_ij
                    c1 = clusters[i]
                    c2 = clusters[j]
        # merge the two clusters that result in minimum Euclidean distance
        new_cluster = c1[1] + c2[1]
        new_centroid = mean(new_cluster)
        clusters.append([new_centroid, new_cluster])
        # remove the merged clusters from the list 
        del clusters[max(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        del clusters[min(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        if print_summary:
            print 'Step %d:' % step
            print 'Merged clusters: %s and %s' %(str(c1[1]),str(c2[1]))
            print 'Minimum distance: %f' % min_dist
            print 'New clusters list:'
            print [el[1] for el in clusters] 
            print 'New centroids:'
            print [el[0] for el in clusters]
            print ''
            print '--------------------------------------------------------'
            print ''
        step += 1
    
# Alternatively, can use np.mean to create the new centroid
# new_centroid = tuple(np.mean(np.array(new_cluster),axis=0))

In [423]:
# an array storing centroid and cluster
clusters = [[(i**2,), [[i**2]]] for i in range(1,10)]
clusters

[[(1,), [[1]]],
 [(4,), [[4]]],
 [(9,), [[9]]],
 [(16,), [[16]]],
 [(25,), [[25]]],
 [(36,), [[36]]],
 [(49,), [[49]]],
 [(64,), [[64]]],
 [(81,), [[81]]]]

In [424]:
agg_(clusters,dist='avg')

Step 1:
Merged clusters: [[1]] and [[4]]
Minimum distance: 3.000000
New clusters list:
[[[9]], [[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[1], [4]]]
New centroids:
[(9,), (16,), (25,), (36,), (49,), (64,), (81,), (2.5,)]

--------------------------------------------------------

Step 2:
Merged clusters: [[9]] and [[1], [4]]
Minimum distance: 6.500000
New clusters list:
[[[16]], [[25]], [[36]], [[49]], [[64]], [[81]], [[9], [1], [4]]]
New centroids:
[(16,), (25,), (36,), (49,), (64,), (81,), (4.666666666666667,)]

--------------------------------------------------------

Step 3:
Merged clusters: [[16]] and [[25]]
Minimum distance: 9.000000
New clusters list:
[[[36]], [[49]], [[64]], [[81]], [[9], [1], [4]], [[16], [25]]]
New centroids:
[(36,), (49,), (64,), (81,), (4.666666666666667,), (20.5,)]

--------------------------------------------------------

Step 4:
Merged clusters: [[36]] and [[49]]
Minimum distance: 13.000000
New clusters list:
[[[64]], [[81]], [[9], [1], [4]], [[16],

<a id='7-2-3'></a>
# 7.2.3

Using the Euclidean distance between centroids, as in Example 7.2.

In [413]:
clusters = [[(4,10),[[4,10]]], [(7,10),[[7,10]]], [(4,8),[[4,8]]],
           [(6,8),[[6,8]]],[(3,4),[[3,4]]],[(2,2),[[2,2]]],[(5,2),[[5,2]]],
           [(12,6),[[12,6]]],[(10,5),[[10,5]]],[(11,4),[[11,4]]],[(9,3),[[9,3]]],
           [(12,3),[[12,3]]]]

In [414]:
clusters

[[(4, 10), [[4, 10]]],
 [(7, 10), [[7, 10]]],
 [(4, 8), [[4, 8]]],
 [(6, 8), [[6, 8]]],
 [(3, 4), [[3, 4]]],
 [(2, 2), [[2, 2]]],
 [(5, 2), [[5, 2]]],
 [(12, 6), [[12, 6]]],
 [(10, 5), [[10, 5]]],
 [(11, 4), [[11, 4]]],
 [(9, 3), [[9, 3]]],
 [(12, 3), [[12, 3]]]]

In [415]:
agg_(clusters)

Step 1:
Merged clusters: [[10, 5]] and [[11, 4]]
Minimum distance: 1.414214
New clusters list:
[[[4, 10]], [[7, 10]], [[4, 8]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]]]
New centroids:
[(4, 10), (7, 10), (4, 8), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5)]

--------------------------------------------------------

Step 2:
Merged clusters: [[4, 10]] and [[4, 8]]
Minimum distance: 2.000000
New clusters list:
[[[7, 10]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]], [[4, 10], [4, 8]]]
New centroids:
[(7, 10), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5), (4.0, 9.0)]

--------------------------------------------------------

Step 3:
Merged clusters: [[12, 6]] and [[10, 5], [11, 4]]
Minimum distance: 2.121320
New clusters list:
[[[7, 10]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[9, 3]], [[12, 3]], [[4, 10], [4, 8]], [[12, 6], [10, 5], [11, 4]]]

Double-checking the mean() function against NumPy's built-in np.mean.

In [416]:
mean([[9, 3], [12, 3], [12, 6], [10, 5], [11, 4], [4, 10], [4, 8], [7, 10], [6, 8], [5, 2], [3, 4], [2, 2]])

(7.083333333333333, 5.416666666666667)

In [417]:
a = np.array([[9, 3], [12, 3], [12, 6], [10, 5], [11, 4], [4, 10], [4, 8], [7, 10], [6, 8], [5, 2], [3, 4], [2, 2]])

In [418]:
np.mean(a, axis = 0)

array([ 7.08333333,  5.41666667])

<a id='7-2-3-a'></a>
## 7.2.3 (a) 

In [448]:
def radius(x,y=[]):
    """
    This function takes as input two clusters, each represented as a list of
    points (i.e. vectors). The output is the radius of the cluster that
    results from merging x and y.
    
    If only one cluster is given, the output is the radius of that
    cluster.
    """
    # merge two clusters x and y (x + y builds a new list, so y is never mutated)
    merged_clus = x + y 
    # the centroid of the new merged cluster
    merged_cent = mean(merged_clus)
    # the radius is the maximum distance from any point to the centroid
    max_dist = 0
    for pt in merged_clus:
        dist_pt = Euclidean(pt,merged_cent)
        if dist_pt > max_dist:
            max_dist = dist_pt
    return max_dist

In [442]:
# checking this function with the result of Example 7.4
radius([[12,6]],[[10,5],[11,4],[12,3],[9,3]])

2.1633307652783942

In [444]:
# can also use radius on a single cluster
radius([[12,6],[10,5],[11,4],[12,3],[9,3]])

2.1633307652783942
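
The value 2.1633 can be reproduced by hand: the centroid of the five points is (10.8, 4.2), and the farthest points from it, (12, 6) and (9, 3), both lie at distance sqrt(1.44 + 3.24) = sqrt(4.68). A minimal standalone sketch of that calculation, with `centroid_of` and `radius_of` as illustrative names:

```python
import math

def centroid_of(points):
    n = float(len(points))
    return [sum(coords) / n for coords in zip(*points)]

def radius_of(points):
    """Largest distance from any point of the cluster to its centroid."""
    center = centroid_of(points)
    return max(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
               for p in points)

cluster = [[12, 6], [10, 5], [11, 4], [12, 3], [9, 3]]
print(centroid_of(cluster))
print(radius_of(cluster))
```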

In [445]:
def agg_(clusters, print_summary = True, dist = 'Euclidean'):
    """
    This function takes as input a list of clusters in Euclidean space,
    each stored as a [centroid, points] pair with the centroid as a tuple,
    and performs agglomerative clustering by repeatedly merging the two
    clusters that are closest under the measure selected by dist.
    
    Note that the agglomerative clustering is done in place on the
    clusters list passed in.
    """
    
    # specifying the distance function used
    if dist == 'Euclidean':
        f_dist = Euclidean
        r_ = 0 # index to be used in the argument
    if dist == 'mins':
        f_dist = mins
        r_ = 1
    if dist == 'avg':
        f_dist = avg
        r_ = 1
    if dist == 'radius':
        f_dist = radius
        r_ = 1
    
    # start main code to conduct clustering
    step = 1    
    while len(clusters) > 1:
#     while step < 3:
        # clusters hash table (use centroids as hash keys)
        clusters_ix = {el[0]:i for i,el in enumerate(clusters)}
        # double loop to consider the minimal distance between all pairs of clusters
        n = len(clusters)
        min_dist = 2**32-1
        c1 = None
        c2 = None
        for i in range(n-1):
            for j in range(i+1,n):
                # the distance between cluster i and cluster j under the chosen measure
                distance_ij = f_dist(clusters[i][r_], clusters[j][r_])
                if distance_ij < min_dist:
                    min_dist = distance_ij
                    c1 = clusters[i]
                    c2 = clusters[j]
        # merge the two clusters that result in minimum Euclidean distance
        new_cluster = c1[1] + c2[1]
        new_centroid = mean(new_cluster)
        clusters.append([new_centroid, new_cluster])
        # remove the merged clusters from the list 
        del clusters[max(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        del clusters[min(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        if print_summary:
            print 'Step %d:' % step
            print 'Merged clusters: %s and %s' %(str(c1[1]),str(c2[1]))
            print 'Minimum distance: %f' % min_dist
            print 'New clusters list:'
            print [el[1] for el in clusters] 
            print 'New centroids:'
            print [el[0] for el in clusters]
            print ''
            print '--------------------------------------------------------'
            print ''
        step += 1
    
# Alternatively, can use np.mean to create the new centroid
# new_centroid = tuple(np.mean(np.array(new_cluster),axis=0))

In [446]:
clusters = [[(4,10),[[4,10]]], [(7,10),[[7,10]]], [(4,8),[[4,8]]],
           [(6,8),[[6,8]]],[(3,4),[[3,4]]],[(2,2),[[2,2]]],[(5,2),[[5,2]]],
           [(12,6),[[12,6]]],[(10,5),[[10,5]]],[(11,4),[[11,4]]],[(9,3),[[9,3]]],
           [(12,3),[[12,3]]]]

In [447]:
agg_(clusters, dist = 'radius')

Step 1:
Merged clusters: [[10, 5]] and [[11, 4]]
Minimum distance: 0.707107
New clusters list:
[[[4, 10]], [[7, 10]], [[4, 8]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]]]
New centroids:
[(4, 10), (7, 10), (4, 8), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5)]

--------------------------------------------------------

Step 2:
Merged clusters: [[4, 10]] and [[4, 8]]
Minimum distance: 1.000000
New clusters list:
[[[7, 10]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]], [[4, 10], [4, 8]]]
New centroids:
[(7, 10), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5), (4.0, 9.0)]

--------------------------------------------------------

Step 3:
Merged clusters: [[7, 10]] and [[6, 8]]
Minimum distance: 1.118034
New clusters list:
[[[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]], [[4, 10], [4, 8]], [[7, 10], [6, 8]]]
New centr

<a id='7-2-3-b'></a>
## 7.2.3 (b)

In [450]:
def diameter(x,y=[]):
    """
    This function takes as input two clusters, each represented as a list of
    points (i.e. vectors). The output is the diameter of the cluster that
    results from merging x and y.
    
    If only one cluster is given, the output is the diameter of that
    cluster.
    """
    # merge two clusters x and y
    merged_clus = x + y 
    n = len(merged_clus)
    # determine the diameter of this merged cluster
    # diameter is the maximum distance between any two points of the cluster
    diameter = 0
    for i in range(n-1):
        for j in range(i+1,n):
            distance_ij = Euclidean(merged_clus[i],merged_clus[j])
            if distance_ij > diameter:
                diameter = distance_ij
    return diameter

In [451]:
# checking result with Example 7.4
diameter([[12,6]],[[10,5],[11,4],[12,3],[9,3]])

4.2426406871192848

In [452]:
# using only one argument
diameter([[12,6],[10,5],[11,4],[12,3],[9,3]])

4.2426406871192848
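
It can be useful to see which pair of points realizes the diameter. The sketch below uses `itertools.combinations` to enumerate each unordered pair once; `diameter_with_pair` is a hypothetical helper and assumes the cluster has at least two points.

```python
import math
from itertools import combinations

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def diameter_with_pair(points):
    """Return (diameter, (p, q)) where p and q are the farthest-apart points."""
    best = max(combinations(points, 2), key=lambda pair: euclid(*pair))
    return euclid(*best), best

cluster = [[12, 6], [10, 5], [11, 4], [12, 3], [9, 3]]
d, pair = diameter_with_pair(cluster)
print(d)
print(pair)
```

Here the farthest pair is (12, 6) and (9, 3), at distance sqrt(9 + 9) = sqrt(18), roughly 4.2426.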

In [454]:
def agg_(clusters, print_summary = True, dist = 'Euclidean'):
    """
    This function takes as input a list of clusters in Euclidean space,
    each stored as a [centroid, points] pair with the centroid as a tuple,
    and performs agglomerative clustering by repeatedly merging the two
    clusters that are closest under the measure selected by dist.
    
    Note that the agglomerative clustering is done in place on the
    clusters list passed in.
    """
    
    # specifying the distance function used
    # r_ = 0 implies we consider centroids of the two clusters in merge step
    # r_ = 1 means that we consider the points of the two clusters themselves in merge step
    if dist == 'Euclidean':
        f_dist = Euclidean
        r_ = 0 
    if dist == 'mins':
        f_dist = mins
        r_ = 1 
    if dist == 'avg':
        f_dist = avg
        r_ = 1
    if dist == 'radius':
        f_dist = radius
        r_ = 1
    if dist == 'diameter':
        f_dist = diameter
        r_ = 1
    
    # start main code to conduct clustering
    step = 1    
    while len(clusters) > 1:
#     while step < 3:
        # clusters hash table (use centroids as hash keys)
        clusters_ix = {el[0]:i for i,el in enumerate(clusters)}
        # double loop to consider the minimal distance between all pairs of clusters
        n = len(clusters)
        min_dist = 2**32-1
        c1 = None
        c2 = None
        for i in range(n-1):
            for j in range(i+1,n):
                # the distance between cluster i and cluster j under the chosen measure
                distance_ij = f_dist(clusters[i][r_], clusters[j][r_])
                if distance_ij < min_dist:
                    min_dist = distance_ij
                    c1 = clusters[i]
                    c2 = clusters[j]
        # merge the two clusters that result in minimum Euclidean distance
        new_cluster = c1[1] + c2[1]
        new_centroid = mean(new_cluster)
        clusters.append([new_centroid, new_cluster])
        # remove the merged clusters from the list 
        del clusters[max(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        del clusters[min(clusters_ix[c1[0]],clusters_ix[c2[0]])]
        if print_summary:
            print 'Step %d:' % step
            print 'Merged clusters: %s and %s' %(str(c1[1]),str(c2[1]))
            print 'Minimum distance: %f' % min_dist
            print 'New clusters list:'
            print [el[1] for el in clusters] 
            print 'New centroids:'
            print [el[0] for el in clusters]
            print ''
            print '--------------------------------------------------------'
            print ''
        step += 1
    
# Alternatively, can use np.mean to create the new centroid
# new_centroid = tuple(np.mean(np.array(new_cluster),axis=0))
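
The growing chain of `if dist == ...` tests could be replaced by a lookup table, which also makes an unknown name fail with a clear message instead of leaving `f_dist` undefined. The following is only a sketch of that design: the registry `LINKAGES`, the helper `get_linkage`, and the choice to have every measure accept two lists of points (so no `r_` index is needed) are assumptions, not the notebook's code.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_of(points):
    n = float(len(points))
    return [sum(coords) / n for coords in zip(*points)]

def centroid_link(x, y):
    return euclid(centroid_of(x), centroid_of(y))

def single_link(x, y):
    return min(euclid(p, q) for p in x for q in y)

def average_link(x, y):
    return sum(euclid(p, q) for p in x for q in y) / float(len(x) * len(y))

# hypothetical registry: measure name -> function of two point lists
LINKAGES = {
    'Euclidean': centroid_link,
    'mins': single_link,
    'avg': average_link,
}

def get_linkage(name):
    """Look up a measure by name, failing loudly on an unknown name."""
    try:
        return LINKAGES[name]
    except KeyError:
        raise ValueError('unknown dist %r; expected one of %s'
                         % (name, sorted(LINKAGES)))

f_dist = get_linkage('avg')
print(f_dist([[9]], [[1], [4]]))
```

Adding a new measure then means adding one dictionary entry rather than editing the body of the clustering function.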

In [455]:
clusters = [[(4,10),[[4,10]]], [(7,10),[[7,10]]], [(4,8),[[4,8]]],
           [(6,8),[[6,8]]],[(3,4),[[3,4]]],[(2,2),[[2,2]]],[(5,2),[[5,2]]],
           [(12,6),[[12,6]]],[(10,5),[[10,5]]],[(11,4),[[11,4]]],[(9,3),[[9,3]]],
           [(12,3),[[12,3]]]]

In [456]:
agg_(clusters, dist = 'diameter')

Step 1:
Merged clusters: [[10, 5]] and [[11, 4]]
Minimum distance: 1.414214
New clusters list:
[[[4, 10]], [[7, 10]], [[4, 8]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]]]
New centroids:
[(4, 10), (7, 10), (4, 8), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5)]

--------------------------------------------------------

Step 2:
Merged clusters: [[4, 10]] and [[4, 8]]
Minimum distance: 2.000000
New clusters list:
[[[7, 10]], [[6, 8]], [[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]], [[4, 10], [4, 8]]]
New centroids:
[(7, 10), (6, 8), (3, 4), (2, 2), (5, 2), (12, 6), (9, 3), (12, 3), (10.5, 4.5), (4.0, 9.0)]

--------------------------------------------------------

Step 3:
Merged clusters: [[7, 10]] and [[6, 8]]
Minimum distance: 2.236068
New clusters list:
[[[3, 4]], [[2, 2]], [[5, 2]], [[12, 6]], [[9, 3]], [[12, 3]], [[10, 5], [11, 4]], [[4, 10], [4, 8]], [[7, 10], [6, 8]]]
New centr

<a id='7-2-4'></a>
# 7.2.4 

(a) Computing the density of each cluster, defined as the number of points in the cluster divided by the square of the radius.

In [478]:
# lists of the three clusters of Figure 7.2
c1 = [[4,10],[4,8],[6,8],[7,10]]
c2 = [[3,4],[2,2],[5,2]]
c3 = [[10,5],[9,3],[11,4],[12,3],[12,6]]

# printing the centers of these clusters
print mean(c1)
print mean(c2)
print mean(c3)

(5.25, 9.0)
(3.3333333333333335, 2.6666666666666665)
(10.800000000000001, 4.2000000000000002)


In [481]:
# compute the radius of each cluster
radius1 = radius(c1)
radius2 = radius(c2)
radius3 = radius(c3)

In [482]:
# compute density
density1 = len(c1)/radius1**2
density2 = len(c2)/radius2**2
density3 = len(c3)/radius3**2

[density1, density2, density3]

[0.98461538461538478, 0.93103448275862088, 1.0683760683760679]

In [490]:
# density of merging any two of these clusters
c12 = c1 + c2
c13 = c1 + c3
c23 = c2 + c3

radius12 = radius(c12)
radius13 = radius(c13)
radius23 = radius(c23)

density12 = len(c12)/radius12**2
density13 = len(c13)/radius13**2
density23 = len(c23)/radius23**2

[density12, density13, density23]

[0.28847771236333053, 0.27931034482758615, 0.20703598867771936]

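**The densities decrease substantially, from about 0.93 to 1.07 for the original clusters down to about 0.21 to 0.29 after any merge. Hence, merging any two of these clusters does not produce an adequate cluster.**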
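
The radius-based densities can be verified with exact fractions: the squared radii of the three clusters are 4.0625, 29/9 and 4.68, giving 4/4.0625, 27/29 and 5/4.68. Below is a minimal standalone sketch; `density_by_radius` and its helpers are illustrative names, not notebook functions.

```python
import math

def centroid_of(points):
    n = float(len(points))
    return [sum(coords) / n for coords in zip(*points)]

def radius_of(points):
    center = centroid_of(points)
    return max(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
               for p in points)

def density_by_radius(points):
    """Number of points divided by the squared radius."""
    return len(points) / radius_of(points) ** 2

c1 = [[4, 10], [4, 8], [6, 8], [7, 10]]
c2 = [[3, 4], [2, 2], [5, 2]]
c3 = [[10, 5], [9, 3], [11, 4], [12, 3], [12, 6]]

for cluster in (c1, c2, c3, c1 + c2):
    print(density_by_radius(cluster))
```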

(b) Repeating (a), except that the density is now the number of points divided by the diameter.

In [486]:
# compute the diameter of each cluster
diameter1 = diameter(c1)
diameter2 = diameter(c2)
diameter3 = diameter(c3)

[diameter1,diameter2,diameter3]

[3.6055512754639891, 3.0, 4.2426406871192848]

In [485]:
# compute density
density1 = len(c1)/diameter1
density2 = len(c2)/diameter2
density3 = len(c3)/diameter3

[density1, density2, density3]

[1.1094003924504583, 1.0, 1.1785113019775793]

In [492]:
# density of merging any two of these clusters
c12 = c1 + c2
c13 = c1 + c3
c23 = c2 + c3

diameter12 = diameter(c12)
diameter13 = diameter(c13)
diameter23 = diameter(c23)

density12 = len(c12)/diameter12
density13 = len(c13)/diameter13
density23 = len(c23)/diameter23

[density12, density13, density23]

[0.74199851600445199, 0.84664878154523748, 0.74278135270820744]

**The densities again decrease, from about 1.0 to 1.18 for the original clusters down to about 0.74 to 0.85 after any merge. Hence, merging any two of these clusters does not produce an adequate cluster.**
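
For the diameter-based densities, each merged cluster's diameter is set by one far-apart pair: (7, 10) and (2, 2) for c1 + c2, (4, 10) and (12, 3) for c1 + c3, and (2, 2) and (12, 6) for c2 + c3. The sketch below recomputes the densities from scratch; `density_by_diameter` is an illustrative name.

```python
import math
from itertools import combinations

def diameter_of(points):
    """Largest distance between any two points of the cluster."""
    return max(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
               for p, q in combinations(points, 2))

def density_by_diameter(points):
    """Number of points divided by the diameter."""
    return len(points) / diameter_of(points)

c1 = [[4, 10], [4, 8], [6, 8], [7, 10]]
c2 = [[3, 4], [2, 2], [5, 2]]
c3 = [[10, 5], [9, 3], [11, 4], [12, 3], [12, 6]]

for cluster in (c1, c2, c3, c1 + c2, c1 + c3, c2 + c3):
    print(density_by_diameter(cluster))
```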

<a id='7-2-5'></a>
# 7.2.5 Computing clustroids 

Criterion for selecting the clustroid: the point with the minimum sum of distances to the other points in the cluster.

In [503]:
def clustroid(x):
    """
    This function takes as an input a cluster x (i.e. a list of points)
    and returns the clustroid of that cluster. The clustroid is 
    a point within the cluster itself which "best" represents the cluster
    to which it belongs.
    
    The criterion used here is the minimum sum of distances to the 
    other points in the cluster.
    
    Might be a good idea to use a heap.
    """
    sums = {}
    for pt in x:
        # restart the running total for every candidate point
        total = 0
        for pt_compare in x:
            total += Euclidean(pt,pt_compare) # can do this because Euclidean(x,x)=0
        sums[tuple(pt)] = total
    # the clustroid is the point (dictionary key) with the smallest total
    return min(sums, key=sums.get)

In [504]:
# distance sums, rounded: (4,10) 7.83, (4,8) 7.61, (6,8) 7.06, (7,10) 8.84
clustroid(c1)

(6, 8)

In [505]:
# distance sums, rounded: (3,4) 5.06, (2,2) 5.24, (5,2) 5.83
clustroid(c2)

(3, 4)

In [506]:
# distance sums, rounded: (10,5) 8.71, (9,3) 11.71, (11,4) 7.30, (12,3) 10.24, (12,6) 11.71
clustroid(c3)

(11, 4)
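
Because each pairwise distance contributes to the totals of both of its endpoints, it only needs to be computed once. The sketch below applies the same minimum-sum criterion that way; `clustroid_by_pairs` is a hypothetical helper and assumes the points of a cluster are distinct, since they are used as dictionary keys.

```python
import math
from itertools import combinations

def clustroid_by_pairs(points):
    """Return (clustroid, totals) under the minimum sum-of-distances rule."""
    totals = dict((tuple(p), 0.0) for p in points)
    for p, q in combinations(points, 2):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        # one distance updates the totals of both endpoints
        totals[tuple(p)] += d
        totals[tuple(q)] += d
    best = min(totals, key=totals.get)
    return best, totals

c1 = [[4, 10], [4, 8], [6, 8], [7, 10]]
c2 = [[3, 4], [2, 2], [5, 2]]
c3 = [[10, 5], [9, 3], [11, 4], [12, 3], [12, 6]]

for cluster in (c1, c2, c3):
    print(clustroid_by_pairs(cluster)[0])
```

For the three clusters of Figure 7.2 this selects (6, 8), (3, 4) and (11, 4) respectively.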