# Max-spacing k-clustering

## Question 1

In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering.

The file clustering1.txt describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]

[edge 1 node 1] [edge 1 node 2] [edge 1 cost]

[edge 2 node 1] [edge 2 node 2] [edge 2 cost]

...

There is one edge (i,j) for each choice of 1≤i<j≤n, where n is the number of nodes.

For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.
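As a minimal sketch of reading this format, the helper below parses a node count followed by `u v cost` triples. The name `parse_edge_list` and the 3-node sample are assumptions made for illustration; the sample is not taken from clustering1.txt.

```python
import io

def parse_edge_list(lines):
    """Parse the format above: a node count, then one 'u v cost' triple per line."""
    rows = [ln.split() for ln in lines if ln.strip()]  # skip blank lines
    n_nodes = int(rows[0][0])
    edges = [(int(u), int(v), int(c)) for u, v, c in rows[1:]]
    return n_nodes, edges

# Hypothetical 3-node sample with a repeated cost (distances need not be distinct).
sample = io.StringIO("3\n1 2 7\n1 3 5\n2 3 7\n")
n_nodes, edges = parse_edge_list(sample)

# A complete graph on n nodes has n*(n-1)/2 edges.
assert len(edges) == n_nodes * (n_nodes - 1) // 2
```

The same function accepts an open file handle, since iterating over a file also yields its lines.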

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?
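The task can be illustrated on a toy instance. The sketch below assumes the clustering algorithm from lecture is the greedy one that repeatedly merges the two closest clusters (Kruskal's algorithm stopped early); `max_spacing` and the four-point example are hypothetical and only meant for hand-checking.

```python
def max_spacing(n_nodes, edges, k):
    """Greedy single-link clustering: merge the closest clusters until k remain.

    Nodes are assumed to be labeled 1..n_nodes; edges are (u, v, cost) triples.
    Returns the smallest cost among edges that still cross two clusters,
    or None if no crossing edge is left.
    """
    parent = list(range(n_nodes + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n_nodes
    for u, v, cost in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge lies inside one cluster
        if clusters == k:
            return cost  # cheapest edge separating two of the k clusters
        parent[ru] = rv
        clusters -= 1
    return None

# Four points on a line at positions 0, 1, 5, 6 (nodes 1..4).
toy_edges = [(1, 2, 1), (1, 3, 5), (1, 4, 6), (2, 3, 4), (2, 4, 5), (3, 4, 1)]
print(max_spacing(4, toy_edges, 2))  # 4: clusters {1,2} and {3,4}
print(max_spacing(4, toy_edges, 3))  # 1
```

Because the edges are processed in sorted order, the first cross-cluster edge seen once k clusters remain is the spacing, so no second pass over the edges is needed in this sketch.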

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!



# Union Find
Implement the union-find data structure with union by rank and path compression.

In [233]:
import random
import numpy as np

class union_find_pc(object):    
    def __init__(self, nodes):
        self.leader = [i for i in range(len(nodes))]
        self.rank = [0 for i in range(len(nodes))]
        self.clusters = set([i for i in range(len(nodes))])
        
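    def FIND(self, x):
        # First pass: follow the leader pointers up to the root of x's cluster.
        root = x
        while self.leader[root] != root:
            root = self.leader[root]

        # Second pass (path compression): point every node on the path at the root.
        while self.leader[x] != root:
            self.leader[x], x = root, self.leader[x]
        return root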
     
    def UNION(self, a, b):
        # Work with the leaders (roots) of the two clusters.
        a = self.FIND(a)
        b = self.FIND(b)
        if a == b:
            # already in the same cluster, nothing to merge
            return

        if self.rank[a] == self.rank[b]:
            # Equal ranks: either root may become the leader, so break the tie at random.
            # The new leader's rank grows by one.
            if random.random() > 0.5:
                self.leader[b] = a
                self.rank[a] += 1
            else:
                self.leader[a] = b
                self.rank[b] += 1
        elif self.rank[a] > self.rank[b]:
            # a has the higher rank, so a becomes the leader of b's cluster
            self.leader[b] = a
        else:
            # b has the higher rank, so b becomes the leader of a's cluster
            self.leader[a] = b
            
    def getClusters(self): 
        n_clusters = []
        for i in range(len(self.leader)):
            if self.leader[i] == i:
                n_clusters.append(i)
        return n_clusters

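Test the Union-Find data structure on a simple test case. Ties between equal ranks are broken at random, so the leaders printed below can differ from run to run.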

In [12]:
obj = union_find_pc([0,1,2,3,4])

print ("=UNION(0,1)")
obj.UNION(0,1)
print (obj.FIND(1))
print (obj.FIND(2))
print ("leaders", obj.leader)
print ("rank", obj.rank)
print ("clusters", obj.getClusters())

print ("=UNION(2,3)")
obj.UNION(2,3)
print ("leaders", obj.leader)
print ("rank", obj.rank)
print ("clusters", obj.getClusters())

print ("=UNION(0,2)")

obj.UNION(0,2)
print ("leaders", obj.leader)
print ("rank", obj.rank)
print ("clusters", obj.getClusters())

print ("=UNION(0,4)")
obj.UNION(0,4)
print ("leaders", obj.leader)
print ("rank", obj.rank)

print ("clusters", obj.getClusters())

=UNION(0,1)
0
2
leaders [0, 0, 2, 3, 4]
rank [1, 0, 0, 0, 0]
clusters [0, 2, 3, 4]
=UNION(2,3)
leaders [0, 0, 3, 3, 4]
rank [1, 0, 0, 1, 0]
clusters [0, 3, 4]
=UNION(0,2)
leaders [3, 0, 3, 3, 4]
rank [1, 0, 0, 2, 0]
clusters [3, 4]
=UNION(0,4)
leaders [3, 0, 3, 3, 3]
rank [1, 0, 0, 2, 0]
clusters [3]


## Problem 1 Solution

In [262]:
FILE = "./clustering1.txt"
# FILE = "./clustering1-example-500-solution-2639.txt"
K =  4 
fp = open(FILE, 'r')

n_nodes = int(fp.readline())

edges = []
vertices = set()
MAX_WEIGHT = 0

for row in fp.readlines():
    r = row.strip().split(" ")
    
    vertices.add(int(r[0]))
    vertices.add(int(r[1]))
    weight = int(r[2])
    if weight > MAX_WEIGHT:
        MAX_WEIGHT = weight
        
    edges.append([int(r[0]),int(r[1]), int(r[2])])
    
sortedEdges = sorted(edges, key=lambda x: x[2])
# print (sortedEdges)

vertices = list(vertices)
v_to_idx = {vertices[i]:i for i in range(len(vertices))}

obj = union_find_pc([i for i in range(n_nodes)])

for i, edge in enumerate(sortedEdges):  
    v1 = v_to_idx[edge[0]]
    v2 = v_to_idx[edge[1]]
    w = edge[2]
    
    obj.UNION(v1, v2)
    
    if len(obj.getClusters()) == K:
        print ("final Clusters", obj.getClusters())
        break
        
# The spacing is the smallest cost among edges whose endpoints lie in different clusters.
minW = MAX_WEIGHT
for i, edge in enumerate(sortedEdges):  
    v1 = v_to_idx[edge[0]]
    v2 = v_to_idx[edge[1]]
    w = edge[2]
    
    if obj.FIND(v1) != obj.FIND(v2):
        if w < minW:
            minW = w
print ("maximum spacing", minW)            

final Clusters [125, 383, 413, 461]
maximum spacing 106


## Question 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is in the file clustering_big.txt.

The format is:

[# of nodes] [# of bits for each node's label]

[first bit of node 1] ... [last bit of node 1]

[first bit of node 2] ... [last bit of node 2]

...

For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance (the number of differing bits) between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).
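The example above can be checked with a short sketch. Both helper names are hypothetical; the second variant assumes the labels are packed into integers, so that XOR marks the differing bits.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit labels differ."""
    a_bits, b_bits = a.split(), b.split()
    assert len(a_bits) == len(b_bits)
    return sum(x != y for x, y in zip(a_bits, b_bits))

def hamming_int(a, b):
    """Same distance, computed on integers: XOR sets exactly the differing bits."""
    x = int(a.replace(" ", ""), 2)
    y = int(b.replace(" ", ""), 2)
    return bin(x ^ y).count("1")

node2 = "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1"
other = "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1"
print(hamming(node2, other))      # 3
print(hamming_int(node2, other))  # 3
```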

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?
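One way to read the question: link every pair of nodes at distance 2 or less and count the connected components that result. The brute-force sketch below does exactly that by comparing all pairs. It is only a reference for small test inputs, not a method for the full data set, and `count_clusters_bruteforce` and the 6-bit labels are made up for illustration.

```python
from itertools import combinations

def count_clusters_bruteforce(labels, max_dist=2):
    """Count connected components when nodes at Hamming distance <= max_dist are linked."""
    values = [int(s.replace(" ", ""), 2) for s in labels]
    parent = list(range(len(values)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Quadratic in the number of nodes: fine for tiny inputs only.
    for i, j in combinations(range(len(values)), 2):
        if bin(values[i] ^ values[j]).count("1") <= max_dist:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(values))})

# Hypothetical labels: nodes 0, 1, 4 are mutually close, as are nodes 2 and 3.
toy_labels = [
    "0 0 0 0 0 0",
    "0 0 0 0 1 1",
    "1 1 1 1 1 1",
    "1 1 1 1 1 0",
    "0 0 0 0 0 0",
]
print(count_clusters_bruteforce(toy_labels))  # 2
```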

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?
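One possible answer to the hint, sketched under the assumption that labels are stored as integers: instead of comparing all pairs, flip zero, one, or two bits of each label and look the result up in a hash table. For 24-bit labels that is 1 + 24 + 276 = 301 lookups per distinct label. The function name `close_pairs` is hypothetical.

```python
from itertools import combinations

def close_pairs(labels, n_bits):
    """Yield index pairs (i, j), i < j, whose labels differ in at most 2 bits.

    labels is a list of ints; duplicate labels are allowed.
    """
    # Group node indices by label value for constant-time lookups.
    by_value = {}
    for i, v in enumerate(labels):
        by_value.setdefault(v, []).append(i)

    # XOR masks with zero, one, or two bits set.
    masks = [0]
    masks += [1 << i for i in range(n_bits)]
    masks += [(1 << i) | (1 << j) for i, j in combinations(range(n_bits), 2)]

    for v, indices in by_value.items():
        for mask in masks:
            for j in by_value.get(v ^ mask, ()):
                for i in indices:
                    if i < j:  # report each unordered pair once
                        yield (i, j)

toy = [0b000000, 0b000011, 0b111111, 0b111110, 0b000000]
print(sorted(close_pairs(toy, 6)))  # [(0, 1), (0, 4), (1, 4), (2, 3)]
```

Feeding these pairs into a union-find structure and counting the remaining leaders gives the number of clusters without ever materializing the full edge list.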

In [263]:
import collections
from itertools import combinations

def distance_0_items(data):
    """Return index pairs [i, j] of nodes whose labels are identical (distance 0)."""
    # Group node indices by label.
    groups = collections.defaultdict(list)
    for i, d in enumerate(data):
        groups[d].append(i)

    dist0_list = []
    for indices in groups.values():
        # every pair of nodes sharing a label is a distance-0 edge
        for a, b in combinations(indices, 2):
            dist0_list.append([a, b])

    return dist0_list

def distance_1_items(data, n_bits):
    
    def dist_1(a, n_bits, out, index=None):
        a_val = int(a, 2)  # parse the bit string once
        for i in range(n_bits):
            # flip bit i to get a label at Hamming distance exactly 1
            neighbor = a_val ^ (1 << i)
            # different nodes can generate the same neighbor label
            out[neighbor].append(index)
            
    dist1_present = collections.defaultdict(list)  # label -> nodes carrying it that have an earlier neighbor
    dist1_keys = collections.defaultdict(list)     # label -> nodes at distance 1 from that label
    dist1_list = []
    
    # For each node, record its index under every label at distance 1 from its own label.
    # If a node's own label was already recorded by an earlier node, add the node to dist1_present.
    for i, d in enumerate(data):
        d_val = int(d,2)
        if d_val in dist1_keys:
            # an earlier node lies at distance 1 from d, store d in dist1_present
            dist1_present[d_val].append(i)
        dist_1(d, n_bits, dist1_keys, i)

    for key, val in dist1_present.items():
        # pair each node labeled `key` with every node at distance 1 from `key`
        for d in val:
            for c in dist1_keys[key]:
                dist1_list.append([d, c])

    return dist1_list


def distance_2_items(data, n_bits):
    def dist_2(a, n_bits, out, index=None):
        a_val = int(a, 2)  # parse the bit string once
        for i in range(0,n_bits-1):
            for j in range(i+1, n_bits):
                # flip bits i and j to get a label at Hamming distance exactly 2
                neighbor = a_val ^ (1 << i) ^ (1 << j)
                # different nodes can generate the same neighbor label
                out[neighbor].append(index)
        
    dist2_present = collections.defaultdict(list)  # label -> nodes carrying it that have an earlier neighbor
    dist2_keys = collections.defaultdict(list)     # label -> nodes at distance 2 from that label
    dist2_list = []
    
    # For each node, record its index under every label at distance 2 from its own label.
    # If a node's own label was already recorded by an earlier node, add the node to dist2_present.
    for i, d in enumerate(data):
        d_val = int(d,2)
        if d_val in dist2_keys:
            # an earlier node lies at distance 2 from d, store d in dist2_present
            dist2_present[d_val].append(i)
        dist_2(d, n_bits, dist2_keys, i)

    for key, val in dist2_present.items():
        # pair each node labeled `key` with every node at distance 2 from `key`
        for d in val:
            for c in dist2_keys[key]:
                dist2_list.append([d, c])

    return dist2_list
    
def solve_p2(data, n_bits, n_nodes):
    dist0_list = distance_0_items(data)
    dist1_list = distance_1_items(data, n_bits)
    dist2_list = distance_2_items(data, n_bits)

    obj = union_find_pc([i for i in range(n_nodes)])
    for i, edge in enumerate(dist0_list):  
        v1 = edge[0]
        v2 = edge[1]    
        obj.UNION(v1, v2)

    for i, edge in enumerate(dist1_list):  
        v1 = edge[0]
        v2 = edge[1]    
        obj.UNION(v1, v2)

    for i, edge in enumerate(dist2_list):  
        v1 = edge[0]
        v2 = edge[1]    
        obj.UNION(v1, v2)

    # Every pair of nodes at distance 0, 1, or 2 now shares a cluster, so any two nodes in
    # different clusters are at least 3 apart. The number of clusters that remain is
    # therefore the largest k with spacing at least 3.
    return len(obj.getClusters())

In [264]:
TEST_CASES = [  
                ["./clustering2-example-200-12-solution-4.txt", 4],
                ["./clustering2-example-200-12-solution-6.txt", 6],
                ["clustering2-example-2000-24-solution-1575.txt", 1575]
             ]

for test in TEST_CASES:
    file = test[0]
    solution = test[1]
    
    fp = open(file, 'r')
    n_nodes, n_bits = fp.readline().strip().split(" ")
    n_nodes, n_bits = int(n_nodes), int(n_bits)

    data = []
    for row in fp.readlines():
        a = "".join(row.strip().split(" "))
        data.append(a)

    solved = solve_p2(data, n_bits, n_nodes)

    assert (solved == solution), ("Expected {}, got {}, file {}").format(solution, solved, file)
    
print ("PASSED ALL TESTS!")

PASSED ALL TESTS!


## Problem 2 Solution

In [None]:
FILE = "./clustering_big.txt"

fp = open(FILE, 'r')
n_nodes, n_bits = fp.readline().strip().split(" ")
n_nodes, n_bits = int(n_nodes), int(n_bits)

data = []
for row in fp.readlines():
    a = "".join(row.strip().split(" "))
    data.append(a)

solved = solve_p2(data, n_bits, n_nodes)

print ("P2 Solution", solved)
