In [1]:
import time
import numpy as np
import itertools
np.set_printoptions(suppress=True)

In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering.

Download the text file below.

This file describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]

[edge 1 node 1] [edge 1 node 2] [edge 1 cost]

[edge 2 node 1] [edge 2 node 2] [edge 2 cost]

...

There is one edge (i,j) for each choice of 1 ≤ i < j ≤ n, where n is the number of nodes.

For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.
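As a minimal sketch of reading this format, the hypothetical helper `parse_edges` below parses an in-memory string instead of the real file. Only the line "1 3 5250" comes from the description above; the other two costs are made up for illustration.

```python
import io

def parse_edges(stream):
    """Parse '[number_of_nodes]' followed by 'u v cost' lines; blank lines are skipped."""
    num_nodes = None
    edges = []
    for raw in stream:
        fields = raw.split()
        if not fields:                 # blank line
            continue
        if num_nodes is None:          # first non-blank line holds the node count
            num_nodes = int(fields[0])
        else:
            u, v, cost = (int(x) for x in fields[:3])
            edges.append((u, v, cost))
    return num_nodes, edges

# Made-up sample in the same layout (only "1 3 5250" is taken from the text).
sample = "3\n\n1 2 6808\n\n1 3 5250\n\n2 3 74\n"
num_nodes, edges = parse_edges(io.StringIO(sample))
print(num_nodes, edges)
```

A complete graph on n nodes has n(n-1)/2 edges, so checking `len(edges)` against that count is a cheap sanity test after parsing.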

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!
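One way to build such a small test case is sketched below. The function `max_spacing` and the four-point example are illustrative assumptions, not part of the assignment: the points sit on a line at positions 0, 1, 5 and 6, so the expected spacings can be checked by hand.

```python
def max_spacing(num_nodes, edges, k):
    """Greedy single-link clustering: merge the closest pairs until k clusters remain."""
    parent = list(range(num_nodes + 1))    # nodes are 1-indexed; slot 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = num_nodes
    for u, v, cost in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                       # edge lies inside one cluster
        if clusters == k:
            return cost                    # cheapest edge crossing two clusters
        parent[ru] = rv
        clusters -= 1
    return None                            # no crossing edge is left

# Points at 0, 1, 5, 6 on a line; note the two tied distances of 1 and of 5.
edges = [(1, 2, 1), (1, 3, 5), (1, 4, 6), (2, 3, 4), (2, 4, 5), (3, 4, 1)]
print(max_spacing(4, edges, 2))  # 4
print(max_spacing(4, edges, 3))  # 1
```

The tied distances are deliberate, since the problem statement warns that costs need not be distinct.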

In [2]:
class UnionFind:
    """Weighted quick-union with path compression.
    The original Java implementation is introduced at
    https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
    >>> uf = UnionFind(10)
    >>> for (p, q) in [(3, 4), (4, 9), (8, 0), (2, 3), (5, 6), (5, 9),
    ...                (7, 3), (4, 8), (6, 1)]:
    ...     uf.union(p, q)
    >>> uf._parent
    [8, 3, 3, 3, 3, 3, 3, 3, 3, 3]
    >>> uf.connected(0, 1)
    True
    >>> uf._parent
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
    """
        
    def __init__(self,n):
        self._parent = list(range(n))
        self._sz = [1]*n
        self.n_comps = n
    
    def find(self,i):
        j = i
        while (j != self._parent[j]): #if it is not root
            self._parent[j] = self._parent[self._parent[j]] #install shortcut to the grandparents
            j = self._parent[j]
        return j

    def connected(self,p,q):
        return self.find(p) == self.find(q)
    
    def union(self,p,q):
        i = self.find(p)
        j = self.find(q)
        if i == j:
            return
        elif (self._sz[i] < self._sz[j]):
            self._parent[i] = j
            self._sz[j] += self._sz[i]
            self.n_comps -=1
        else:
            self._parent[j] = i
            self._sz[i] += self._sz[j]
            self.n_comps -=1

In [3]:
def clustering(filename,k):
    """Return the maximum spacing of a k-clustering (greedy, Kruskal-style merging)."""
    num_nodes = None
    graph = []
    with open('week10_file/'+filename) as f:
        for line in f:
            line = line.split()
            if not line:                  # skip blank lines
                continue
            if num_nodes is None:         # first non-blank line: number of nodes
                num_nodes = int(line[0])
            else:
                v = int(line[0])
                w = int(line[1])
                cost = int(line[2])
                graph.append([v,w,cost])

    graph = sorted(graph,key=lambda t: t[2])
    uf = UnionFind(num_nodes+1)   # nodes are 1-indexed, so slot 0 is an unused extra component
    counter = 0
    while (uf.n_comps-1) != k:    # subtract the unused slot 0
        p,q,cost = graph[counter]
        uf.union(p,q)
        counter+=1

    # the spacing is the cheapest remaining edge that crosses two clusters
    spacing = None
    for i in range(counter,len(graph)):
        p,q,cost = graph[i]
        if not uf.connected(p,q):
            spacing = cost
            break

    return spacing

In [4]:
start_time = time.time()
print(clustering('week10_1_test1.txt',2)) #5
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
print(clustering('week10_1_test1.txt',3)) #2
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
print(clustering('week10_1_test1.txt',5)) #1
print("--- %s seconds ---" % (time.time() - start_time))

5
--- 0.003537893295288086 seconds ---
2
--- 0.0013532638549804688 seconds ---
1
--- 0.0008089542388916016 seconds ---


In [5]:
start_time = time.time()
print(clustering('week10_1_test2.txt',2)) #8
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
print(clustering('week10_1_test2.txt',3)) #4
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
print(clustering('week10_1_test2.txt',4)) #1
print("--- %s seconds ---" % (time.time() - start_time))

8
--- 0.00238800048828125 seconds ---
4
--- 0.0005519390106201172 seconds ---
1
--- 0.0012269020080566406 seconds ---


In [6]:
start_time = time.time()
print(clustering('week10_1_test3.txt',2)) #100
print("--- %s seconds ---" % (time.time() - start_time))

100
--- 0.0018990039825439453 seconds ---


In [7]:
start_time = time.time()
print(clustering('week10_1.txt',4)) #106
print("--- %s seconds ---" % (time.time() - start_time))

106
--- 0.8143727779388428 seconds ---


In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is below.

The format is:

[# of nodes] [# of bits for each node's label]

[first bit of node 1] ... [last bit of node 1]

[first bit of node 2] ... [last bit of node 2]

...

For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance (the number of differing bits) between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).
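The example above can be checked with a short sketch. The helper name `hamming` is an assumption for illustration; it converts each label to an integer, XORs the two, and counts the set bits.

```python
def hamming(a, b):
    """Number of positions where two space-separated bit labels differ."""
    x = int(a.replace(" ", ""), 2)
    y = int(b.replace(" ", ""), 2)
    return bin(x ^ y).count("1")   # XOR leaves a 1 exactly where the labels differ

node2 = "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1"
other = "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1"
print(hamming(node2, other))  # 3
```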

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?
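One possible answer to the hint, sketched under assumptions: instead of comparing every pair, flip 1 or 2 bits of each label and look the result up in a hash table. The function `count_clusters` and the six 6-bit labels below are made up for illustration; labels are stored as integers so that a flip is a single XOR with a mask.

```python
from itertools import combinations

def count_clusters(labels, num_bits, min_spacing=3):
    """Count clusters after merging every pair of labels closer than min_spacing."""
    # Duplicate labels are at distance 0, so collapsing them into one key is safe.
    parent = {label: label for label in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every mask with 1 .. min_spacing-1 bits set; XOR with a mask flips those bits.
    masks = [sum(1 << p for p in pos)
             for d in range(1, min_spacing)
             for pos in combinations(range(num_bits), d)]

    clusters = len(parent)
    for label in list(parent):
        for mask in masks:
            neighbour = label ^ mask
            if neighbour in parent:        # only labels that actually occur matter
                a, b = find(label), find(neighbour)
                if a != b:
                    parent[a] = b
                    clusters -= 1
    return clusters

# {000000, 000011} merge (distance 2), {111111, 111110} merge (distance 1),
# and 010101 is at distance 3 or more from everything else.
labels = [0b000000, 0b000011, 0b111111, 0b111110, 0b111111, 0b010101]
print(count_clusters(labels, num_bits=6))  # 3
```

With 24-bit labels there are only 24 + 276 = 300 masks per node, so the work grows linearly with the number of nodes rather than quadratically.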



In [403]:
def generate_bit_recursive(bit,i,changesLeft,generated_bit,G):
    """Collect every label in G obtained by flipping exactly changesLeft of bits 0..i."""
    if changesLeft == 0:
        bit_str = ''.join(bit)
        if bit_str in G:
            generated_bit.append(bit_str)
        return generated_bit
    
    if i < 0:
        return generated_bit
    
    if bit[i] == '0':
        bit[i] = '1'
    else:
        bit[i] = '0'
    
    generated_bit = generate_bit_recursive(bit,i-1,changesLeft-1,generated_bit,G)
    
    if bit[i] == '0':
        bit[i] = '1'
    else:
        bit[i] = '0'
    generated_bit = generate_bit_recursive(bit,i-1,changesLeft,generated_bit,G)
    
    return generated_bit
    

In [404]:
def clustering_big(filename,min_dist):
    """Return the largest k such that a k-clustering has spacing at least min_dist."""
    first = True
    vertex_no = 0
    G = {}  # bit string -> index of the first vertex carrying that label
    with open('week10_file/'+filename) as f:
        for line in f:
            line = line.strip()
            if not line:                      # skip blank lines
                continue
            if first:                         # header: [# of nodes] [# of bits]
                line = line.split()
                num_nodes = int(line[0])
                num_bits = int(line[1])
                uf = UnionFind(num_nodes)
                first = False
            else:
                bit = line.replace(" ", "")
                if bit not in G:
                    G[bit] = vertex_no
                else:                         # duplicate label: distance 0
                    uf.union(vertex_no,G[bit])
                vertex_no +=1

    for bit in G.keys():
        for i in range(1,min_dist):           # flip exactly 1 .. min_dist-1 bits
            result = generate_bit_recursive(list(bit),len(bit)-1,i,[],G)
            for bit_generated in result:      # result only holds labels present in G
                uf.union(G[bit],G[bit_generated])
    return uf.n_comps

In [405]:
start_time = time.time()
print(clustering_big('week10_2_test1.txt',3)) #2
print("--- %s seconds ---" % (time.time() - start_time))

2
--- 0.00949406623840332 seconds ---


In [406]:
start_time = time.time()
print(clustering_big('week10_2_test2.txt',3)) #6
print("--- %s seconds ---" % (time.time() - start_time))

6
--- 0.010624885559082031 seconds ---


In [407]:
start_time = time.time()
print(clustering_big('week10_2.txt',3)) #6118
print("--- %s seconds ---" % (time.time() - start_time))

6118
--- 107.31809592247009 seconds ---


In [None]:
import itertools

In [178]:
for pair in itertools.combinations('001', 2):
        bit_1, bit_2 = pair
        print(bit_1,bit_2)

0 0
0 1
0 1


In [174]:
tuple('001')

('0', '0', '1')

In [175]:
tuple(001)

SyntaxError: invalid token (<ipython-input-175-7b9d5caf2546>, line 1)

In [176]:
a = tuple('001')

In [177]:
a[1] = 1

TypeError: 'tuple' object does not support item assignment

In [184]:
a = list('001')

In [188]:
a[1] ='1'

In [189]:
a

['0', '1', '1']

In [191]:
''.join(a)

'011'

In [213]:
0 ^ 0

0

In [214]:
1 ^ 1

0

In [220]:
a = "110"
b = "100"
y = int(a, 2)^int(b,2)
print(bin(y)[2:].zfill(len(a)))

010


In [382]:
def precompute_bit(length,max_dist):
    precomputed_bit = []
    for i in itertools.product(['0', '1'], repeat=length):
        if sum(list(map(int,i))) <= max_dist:
            bit_str = ''.join(i)
            precomputed_bit.append(int(bit_str,2))
    return precomputed_bit
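`precompute_bit` filters all 2**length bit tuples, which is about 16.7 million candidates for 24-bit labels. A possible alternative, sketched here with the hypothetical names `masks_by_combinations` and `masks_by_product`, is to choose the positions of the set bits directly with `itertools.combinations`; the second function mirrors the filtering approach so the two can be compared on a small length.

```python
from itertools import combinations, product

def masks_by_combinations(length, max_dist):
    """Integer masks of `length` bits with at most max_dist bits set."""
    return sorted(sum(1 << p for p in pos)
                  for d in range(max_dist + 1)
                  for pos in combinations(range(length), d))

def masks_by_product(length, max_dist):
    """Same set, found by filtering all 2**length bit tuples."""
    return [int(''.join(bits), 2)
            for bits in product('01', repeat=length)
            if bits.count('1') <= max_dist]

print(masks_by_combinations(4, 2))
print(masks_by_combinations(4, 2) == masks_by_product(4, 2))  # True
print(len(masks_by_combinations(24, 2)))                      # 301
```

For 24 bits and distance at most 2 this visits 1 + 24 + 276 = 301 masks instead of enumerating every 24-bit tuple.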

In [390]:
def generate_bit_iterative(bit,max_dist,precomputed_bit):
    generated_bit = np.bitwise_xor(precomputed_bit,bit)
    #generated_bit = [np.binary_repr(i,width=len_bit) for i in generated_bit]
    #generated_bit.append(bin(xor_result)[2:].zfill(len_bit))
            
    return generated_bit
    

In [426]:
def clustering_big_2(filename,min_dist):
    """Same as clustering_big, but labels are ints and neighbours come from XOR masks."""
    first = True
    vertex_no = 0
    G = {}  # integer label -> index of the first vertex carrying that label
    with open('week10_file/'+filename) as f:
        for line in f:
            line = line.strip()
            if not line:                      # skip blank lines
                continue
            if first:                         # header: [# of nodes] [# of bits]
                line = line.split()
                num_nodes = int(line[0])
                num_bits = int(line[1])
                uf = UnionFind(num_nodes)
                first = False
            else:
                bit = int(line.replace(" ", ""),2)
                if bit not in G:
                    G[bit] = vertex_no
                else:                         # duplicate label: distance 0
                    uf.union(vertex_no,G[bit])
                vertex_no +=1

    # masks with at most min_dist-1 bits set, computed once for all labels
    precomputed_bit = precompute_bit(num_bits,min_dist-1)
    for bit in G.keys():
        # a single XOR pass already covers every distance below min_dist
        result = generate_bit_iterative(bit,min_dist-1,precomputed_bit)
        for bit_generated in result:
            if bit_generated in G:
                uf.union(G[bit],G[bit_generated])
    return uf.n_comps

In [429]:
start_time = time.time()
print(clustering_big_2('week10_2_test2.txt',3)) #6
print("--- %s seconds ---" % (time.time() - start_time))

6
--- 90.89142441749573 seconds ---


In [331]:
def precompute_bit_2(length,max_dist):
    precomputed_bit = []
    for i in itertools.product([0, 1], repeat=length):
        if sum(i) <= max_dist:
            precomputed_bit.append(list(i))
    return precomputed_bit

In [360]:
precompute_bit(4,2)

0
1
2
3
4
5
6
8
9
10
12


['0000',
 '0001',
 '0010',
 '0011',
 '0100',
 '0101',
 '0110',
 '1000',
 '1001',
 '1010',
 '1100']

In [None]:
def generate_bit_iterative_2(bit,max_dist):
    bit_bin = int(bit,2)
    len_bit = len(bit)
    precomputed_bit = precompute_bit(len(bit),max_dist)  # integer masks
    # XOR each mask with the label, then format the result back as a padded bit string
    generated_bit = [bin(mask ^ bit_bin)[2:].zfill(len_bit) for mask in precomputed_bit]

    return generated_bit
    

In [351]:
def generate_bit_iterative(bit,max_dist,precomputed_bit):
    bit_bin = int(bit,2)
    len_bit = len(bit)
    print(precomputed_bit)
    generated_bit = np.binary_repr(np.bitwise_xor(precomputed_bit,bit_bin))
    #generated_bit.append(bin(xor_result)[2:].zfill(len_bit))
            
    return generated_bit
    

In [352]:
start_time = time.time()
for i in range(100000):
    precomputed_bit = precompute_bit(len('011'),2)
    generate_bit_iterative('011',2,precomputed_bit)
print("--- %s seconds ---" % (time.time() - start_time))

['000', '001', '010', '011', '100', '101', '110']


TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [346]:
start_time = time.time()
for i in range(100000):
    generate_bit_recursive(list('011'),2,2,[],{})
    generate_bit_recursive(list('011'),2,1,[],{})
    
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.928779125213623 seconds ---


In [265]:
lst = [i for i in itertools.product([0, 1], repeat=3) if sum(i) == 1 or sum (i) == 2 ]

In [266]:
lst

[(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0)]

In [254]:
rest = list(map(str,lst))

In [255]:
rest

['[0, 0, 1]', '[0, 1, 0]', '[0, 1, 1]', '[1, 0, 0]', '[1, 0, 1]', '[1, 1, 0]']

In [263]:
int('100',2)

4

In [264]:
int('110',2)

6

In [274]:
1^0

1

In [275]:
1^1

0

In [276]:
0^0

0

In [277]:
1^0

1

In [333]:
[0,0,0]^[1,2,3]

TypeError: unsupported operand type(s) for ^: 'list' and 'list'

In [348]:
001^001

SyntaxError: invalid token (<ipython-input-348-a784e0424a91>, line 1)