# Union Find Data Structure

## Two operations
1. FIND($X$): Return the name of the group that $X$ belongs to.
2. UNION($p$, $q$): Fuse the groups containing $p$ and $q$ into a single one.

## Eager approach
1. **Invariant**: Each vertex points to the leader of its component.
2. Data structure:
  - a list (```leaders```) storing the leader of each point
  - $p$ and $q$ are connected if they have the same leader.
3. FIND($X$): Return ```leaders```[$X$], in ${\cal O}(1)$ time.
4. UNION($p$, $q$): Change **all** entries equal to ```leaders```[$p$] to ```leaders```[$q$], which takes ${\cal O}(n)$ time.<br>
  For example: Merge 1 and 4, in array representation: 1 1 1 4 4 4 $\rightarrow$ 4 4 4 4 4 4
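
The eager scheme above can be sketched in a few lines. This is an illustration only: the class name `QuickFind` and the zero-based node numbering are assumptions, so the 1 1 1 4 4 4 example becomes 0 0 0 3 3 3 here.

```python
class QuickFind:
    """Eager union-find: leaders[x] is always the leader of x's group."""

    def __init__(self, n):
        self.leaders = list(range(n))

    def find(self, x):
        # O(1): the leader is stored directly
        return self.leaders[x]

    def union(self, p, q):
        # O(n): relabel every member of p's group with q's leader
        old, new = self.leaders[p], self.leaders[q]
        if old == new:
            return
        for i, leader in enumerate(self.leaders):
            if leader == old:
                self.leaders[i] = new


qf = QuickFind(6)
qf.leaders = [0, 0, 0, 3, 3, 3]  # two groups, led by 0 and 3
qf.union(0, 3)
print(qf.leaders)  # [3, 3, 3, 3, 3, 3]
```

The full scan in `union` is what makes this approach expensive when many merges are needed.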

## Lazy union
1. Update only **one** pointer each merge.<br>
  For example: Merge 1 and 4, in array representation: 1 1 1 4 4 4 $\rightarrow$ 4 1 1 4 4 4
2. UNION reduces to two FINDs and one pointer update: link the root FIND($p$) to the root FIND($q$).
3. FIND must follow the path of parent pointers until it reaches the root.
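
A minimal sketch of the lazy variant, again with zero-based nodes. The class name `QuickUnion` is hypothetical, and the choice to always hang the first root under the second is an assumption made to reproduce the example above.

```python
class QuickUnion:
    """Lazy union-find: parent[x] points one step closer to the root."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Follow parent pointers until a node points to itself
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, p, q):
        # Two FINDs, then a single pointer update
        root_p, root_q = self.find(p), self.find(q)
        if root_p != root_q:
            self.parent[root_p] = root_q


qu = QuickUnion(6)
qu.parent = [0, 0, 0, 3, 3, 3]  # two groups, rooted at 0 and 3
qu.union(0, 3)
print(qu.parent)   # [3, 0, 0, 3, 3, 3]
print(qu.find(2))  # 3, reached via 2 -> 0 -> 3
```

Only one entry changes per merge, but `find` may now need several hops, which is the problem union by rank addresses.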

## Union by rank
1. Purpose: to prevent the trees from becoming too tall.
2. Rank: ```rank```[$x$] = 1 + (max rank of $x$’s children) $\Rightarrow$ rank of leaf is 0.
3. **Invariant**: ```rank```[$x$] = maximum number of hops from some leaf to $x$.
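4. UNION($p$, $q$):
  ```python
  s1, s2 = FIND(p), FIND(q)
  if s1 != s2:  # already in the same group: nothing to do
      if rank[s1] > rank[s2]:
          parent[s2] = s1
      else:
          parent[s1] = s2
          # to restore the invariant: the new root's rank grows
          # only when two trees of equal rank are merged
          if rank[s1] == rank[s2]:
              rank[s2] += 1
  ```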
5. Worst-case running time of FIND is ${\cal O}(\log n)$. This follows from the rank lemma: there are at most $n/2^r$ objects with rank $r$, so no rank exceeds $\log_2 n$.
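
The rank lemma can also be checked empirically. The sketch below is an illustration rather than a proof: the helper name `union_by_rank_forest`, the random merge order, and the seed are all assumptions. It merges singletons with union by rank (no path compression) and then counts how many nodes hold each rank.

```python
import random
from collections import Counter


def union_by_rank_forest(n, seed=0):
    """Merge n singletons in random order using union by rank only."""
    rng = random.Random(seed)
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    components = n
    while components > 1:
        s1, s2 = find(rng.randrange(n)), find(rng.randrange(n))
        if s1 == s2:
            continue  # same group, pick another pair
        if rank[s1] > rank[s2]:
            parent[s2] = s1
        else:
            parent[s1] = s2
            if rank[s1] == rank[s2]:
                rank[s2] += 1
        components -= 1
    return parent, rank


n = 1024
parent, rank = union_by_rank_forest(n)
rank_counts = Counter(rank)
# Rank lemma: at most n / 2^r objects have rank r
lemma_holds = all(count <= n / 2.0 ** r for r, count in rank_counts.items())
max_rank = max(rank)
print("lemma holds: {0}, max rank: {1}".format(lemma_holds, max_rank))
```

With $n = 1024$ the largest rank can never exceed $\log_2 1024 = 10$, whatever merge order is used.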

## Path Compression
1. During FIND($x$), repoint every node on the path from $x$ to the ```root``` directly at the ```root```.
2. Maintain ```rank``` **EXACTLY** as without path compression. Now ```rank```[$x$] is only an upper bound on the maximum number of hops on a path from a leaf to $x$.
3. Running time:<br>
  With Union by Rank and path compression, $m$ Union $+$ Find operations take ${\cal O}(m \log^* n)$ time. **[Hopcroft--Ullman]**<br>
  With Union by Rank and path compression, $m$ Union $+$ Find operations take ${\cal O}(m \, \alpha(n))$ time, where $\alpha(n)$ is the inverse Ackermann function. **[Tarjan]**
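
To get a feel for how slowly $\log^* n$ grows, here is a small integer-only sketch. The function name `log_star` is an assumption; it counts how tall a tower of twos must be to reach $n$, which equals the number of times $\log_2$ has to be applied to $n$ before the result drops to 1 or below.

```python
def log_star(n):
    """Smallest k such that a tower of k twos (2^2^...^2) is >= n."""
    count, tower = 0, 1
    while tower < n:
        tower = 2 ** tower
        count += 1
    return count


print(log_star(16))          # 3
print(log_star(65536))       # 4
print(log_star(10 ** 9))     # 5
print(log_star(2 ** 65536))  # 5
```

For any input size that fits in memory, $\log^* n$ is at most 5, so the Hopcroft-Ullman bound is linear in $m$ for practical purposes.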

In [1]:
class UnionFind:
    """ The union-find data structure.
    Union by rank and path compression.
    """
    
    def __init__(self, n):
        """ Initialize each node point to itself """
        self.parents = [i for i in range(n)]
        self.rank = [0] * n
        self.size = n
        return
    
    def __repr__(self):
        out = "{0:>10s}  {1:>10s}  {2:>10s}\n".format("Node", "Parent", "Rank")
        for i in range(self.size):
            out += "{0:10d}  {1:10d}  {2:10d}\n".format(i, self.parents[i],
                                                        self.rank[i])
        return out
        
    def find(self, x):
        """ The FIND operation.
        Return name of group that x belongs to.
        """
        visited = []
        while x != self.parents[x]:
            visited.append(x)
            x = self.parents[x]
        for i in visited:
            self.parents[i] = x
        return x
    
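    def union(self, p, q):
        """ The UNION operation.
        Merge the groups containing p and q by linking their roots.
        """
        s1, s2 = self.find(p), self.find(q)
        if s1 == s2:
            # Already in the same group; without this guard the root's
            # rank would be incremented even though nothing was merged.
            return
        if self.rank[s1] > self.rank[s2]:
            self.parents[s2] = s1
        else:
            self.parents[s1] = s2
            if self.rank[s1] == self.rank[s2]:
                self.rank[s2] += 1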

In [2]:
# test union find
uf = UnionFind(10)
print(uf)

uf.union(1,4)
uf.union(4,6)
uf.union(6,7)
uf.union(2,5)
uf.union(3,5)
uf.union(7,5)
print(uf)

print("root of node 1:  {0}".format(uf.find(1)))
print(uf)

      Node      Parent        Rank
         0           0           0
         1           1           0
         2           2           0
         3           3           0
         4           4           0
         5           5           0
         6           6           0
         7           7           0
         8           8           0
         9           9           0

      Node      Parent        Rank
         0           0           0
         1           4           0
         2           5           0
         3           5           0
         4           5           1
         5           5           2
         6           4           0
         7           4           0
         8           8           0
         9           9           0

root of node 1:  5
      Node      Parent        Rank
         0           0           0
         1           5           0
         2           5           0
         3           5           0
         4           5           1

# Problem 1

In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing $k$-clustering.

This file describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]

[edge 1 node 1] [edge 1 node 2] [edge 1 cost]

[edge 2 node 1] [edge 2 node 2] [edge 2 cost]

...

There is one edge $(i,j)$ for each choice of $1 \leq i < j \leq n$, where $n$ is the number of nodes.

For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 [equivalently, the cost of the edge (1,3)] is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number $k$ of clusters is set to 4. What is the maximum spacing of a 4-clustering?

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!
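
For small test cases it helps to have an independent way to compute the spacing of a given clustering. The brute-force sketch below is only a checking aid, not the lecture algorithm; the function name `spacing` and the four-node toy instance are made up for illustration.

```python
def spacing(edges, cluster_of):
    """Smallest cost among edges whose endpoints lie in different clusters."""
    return min(cost for p, q, cost in edges
               if cluster_of[p] != cluster_of[q])


# Toy complete graph on 4 nodes: (node1, node2, cost), zero-based
edges = [(0, 1, 1), (0, 2, 7), (0, 3, 8), (1, 2, 6), (1, 3, 9), (2, 3, 2)]

# Two candidate 2-clusterings
good = {0: "a", 1: "a", 2: "b", 3: "b"}  # {0, 1} and {2, 3}
bad = {0: "a", 1: "b", 2: "b", 3: "b"}   # {0} and {1, 2, 3}

print(spacing(edges, good))  # 6
print(spacing(edges, bad))   # 1
```

Grouping the two cheapest pairs first gives the larger spacing, which is the intuition behind the greedy, Kruskal-style merging.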

In [3]:
DEBUG = 2

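def readfile1(filename):
    """ Read edge info from file.
    Shift node numbers to start from zero to match the union-find structure.
    Return (n, X), where X is a list of (node1, node2, cost) triplets.
    """
    n = None
    X = []
    with open(filename, 'r') as f:
        for line in f:
            fields = line.split()
            if len(fields) == 1:
                # header line: number of nodes
                n = int(fields[0])
            elif len(fields) == 3:
                # edge line: node1 node2 cost
                p, q, cost = map(int, fields)
                X.append((p - 1, q - 1, cost))
    
    if DEBUG > 1:
        print("Edges read from file {0}:".format(filename))
        print(X)
    
    return n, X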

def clustering1(k, n, Xsorted):
    """ Modified Kruskal's algorithm for clustering.
    Return the maximum spacing -- min distance between p and q for p, q not in same cluster
    k -- number of clusters
    n -- number of nodes in Xsorted
    Xsorted -- a list of triplets (node1, node2, cost) sorted by cost
    """
    uf = UnionFind(n)
    nclusters = n
    maxspacing = None
    for p, q, cost in Xsorted:
        if uf.find(p) == uf.find(q):
            # both endpoints already share a cluster: skip this edge
            continue
        if nclusters == k:
            # We already have k clusters, so the cheapest edge that still
            # joins two different clusters is the maximum spacing.
            maxspacing = cost
            break
        uf.union(p, q)
        nclusters -= 1
    
    if DEBUG > 0:
        print("Union-Find after clustering:")
        print(uf)
    
    return maxspacing

def main1(filename, k):
    """ Sequence of functions for small clustering problem. """
    n, X = readfile1(filename)
    Xsorted = sorted(X, key=lambda x: x[2])
    if DEBUG > 1:
        print("Sorted edges:")
        print(Xsorted)
    maxspacing = clustering1(k, n, Xsorted)
    print("Max spacing of {0} is {1}.".format(filename, maxspacing))
    return maxspacing

In [4]:
# test case: 21
assert main1("test1.txt", 4) == 21, "main1 does not pass the test!"
print("main1 passes the test!")

Edges read from file test1.txt:
[(0, 1, 32), (0, 2, 46), (0, 3, 50), (0, 4, 57), (0, 5, 57), (0, 6, 32), (0, 7, 51), (1, 2, 50), (1, 3, 35), (1, 4, 1), (1, 5, 17), (1, 6, 56), (1, 7, 19), (2, 3, 21), (2, 4, 22), (2, 5, 42), (2, 6, 29), (2, 7, 44), (3, 4, 27), (3, 5, 38), (3, 6, 25), (3, 7, 18), (4, 5, 6), (4, 6, 53), (4, 7, 9), (5, 6, 27), (5, 7, 22), (6, 7, 46)]
Sorted edges:
[(1, 4, 1), (4, 5, 6), (4, 7, 9), (1, 5, 17), (3, 7, 18), (1, 7, 19), (2, 3, 21), (2, 4, 22), (5, 7, 22), (3, 6, 25), (3, 4, 27), (5, 6, 27), (2, 6, 29), (0, 1, 32), (0, 6, 32), (1, 3, 35), (3, 5, 38), (2, 5, 42), (2, 7, 44), (0, 2, 46), (6, 7, 46), (0, 3, 50), (1, 2, 50), (0, 7, 51), (4, 6, 53), (1, 6, 56), (0, 4, 57), (0, 5, 57)]
Union-Find after clustering:
      Node      Parent        Rank
         0           0           0
         1           4           0
         2           2           0
         3           4           0
         4           4           1
         5           4           0
         6  

In [None]:
DEBUG = 0
spacing1 = main1("clustering.txt", 4)

# Problem 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The format of the file is:

[# of nodes] [# of bits for each node's label]

[first bit of node 1] ... [last bit of node 1]

[first bit of node 2] ... [last bit of node 2]

...

For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes $u$ and $v$ in this problem is defined as the <i>Hamming distance</i> --- the number of differing bits --- between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of $k$ such that there is a $k$-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?
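
One way to act on this hint is to store each label as an integer, so that flipping bits becomes an XOR with a mask. The sketch below illustrates that idea under stated assumptions; the helper names `parse_label`, `hamming`, and `neighbor_masks` are hypothetical and are not used by the solution code in this notebook.

```python
from itertools import combinations


def parse_label(line):
    """Turn a line such as '0 1 1 0' into the integer 0b0110."""
    return int("".join(line.split()), 2)


def hamming(a, b):
    """Hamming distance between two labels stored as integers."""
    return bin(a ^ b).count("1")


def neighbor_masks(nbits):
    """All XOR masks with exactly one or exactly two bits set."""
    masks = [1 << i for i in range(nbits)]
    masks += [(1 << i) | (1 << j) for i, j in combinations(range(nbits), 2)]
    return masks


# The two 24-bit labels quoted in the problem statement
u = parse_label("0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1")
v = parse_label("0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1")
print(hamming(u, v))  # 3

masks = neighbor_masks(24)
print(len(masks))     # 300 = 24 one-bit masks + 276 two-bit masks
```

Every label at distance 1 or 2 from `u` is `u ^ m` for one of these masks, so each node needs only 300 dictionary lookups instead of a comparison against every other node.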

In [5]:
def readfile2(filename):
    """ Read from file and return a bitstring-to-node map.
    NOTE: bits can be the same for different nodes!!!
    Return (n, nbits, X, repeated): X maps each distinct bit string to one
    representative node; repeated lists the groups of nodes that share a
    bit string.
    """
    X = {}
    n, nbits = None, None
    with open(filename, 'r') as f:
        for linenum, content in enumerate(f):
            if linenum == 0:
                n, nbits = map(int, content.split())
            else:
                bitstring = "".join(content.split())
                if not bitstring:
                    continue  # skip blank lines
                X.setdefault(bitstring, []).append(linenum - 1)
    
    repeated = []
    for bs, nodes in list(X.items()):
        if len(nodes) > 1:
            repeated.append(nodes)
        # keep only the first node as representative of this bit string
        X[bs] = nodes[0]
    
    if DEBUG > 1:
        print("cost-to-node map from {0}:".format(filename))
        print(X)
    if DEBUG > 0:
        print("repeated entries from {0}:".format(filename))
        print(repeated)
    
    return n, nbits, X, repeated

def flip_1bit(bitstring, nbits):
    """ Return a list of bit strings with Hamming distance 1 from the input bitstring. """
    out = []
    for i in range(nbits):
        bs = bitstring[:i] + str(1 ^ int(bitstring[i])) + bitstring[(i + 1):]
        out.append(bs)
    return out

def flip_2bit(bitstring, nbits):
    """ Return a list of bit strings with Hamming distance 2 from the input bitstring. """
    out = []
    for i in range(nbits):
        for j in range(i + 1, nbits):
            bs = list(bitstring)
            bi = str(1 ^ int(bitstring[i]))
            bj = str(1 ^ int(bitstring[j]))
            bs[i], bs[j] = bi, bj
            out.append("".join(bs))
    return out

def compute_hamming_distance(bs1, bs2):
    """ Return the Hamming distance between two bit strings. """
    l1, l2 = len(bs1), len(bs2)
    if l1 != l2:
        raise ValueError("Cannot compute Hamming distance for two bit strings of different lengths.")
    
    dist = 0
    for i in range(l1):
        dist += int(bs1[i]) ^ int(bs2[i])
    
    return dist

def clustering2(n, nbits, X, repeated):
    """ Compute k-clustering with spacing at least 3.
    Return -- k
    n -- number of nodes in X
    nbits -- number of bits associated to each node
    X -- a bitstring-to-node map
    repeated -- groups of nodes that share the same bit string
    """
    
    ncluster = n
    uf = UnionFind(n)
    
    # first fuse all nodes in the repeated list
    for l in repeated:
        for v in l[1:]:
            uf.union(l[0], v)
            ncluster -= 1
    
    # loop over bits associated to a node
    for bits in X:
        v1 = X[bits]
        
        flipped_bits = flip_1bit(bits, nbits)
        flipped_bits.extend(flip_2bit(bits, nbits))
        
        for fbs in flipped_bits:
            # only neighbors that actually occur in the input matter
            if fbs in X:
                v2 = X[fbs]
                if uf.find(v1) != uf.find(v2):
                    uf.union(v1, v2)
                    ncluster -= 1
    
    return ncluster

In [6]:
# small test case: 11
DEBUG = 2
n, nbits, X, repeated = readfile2("test2.txt")
k = clustering2(n, nbits, X, repeated)
assert k == 11, "clustering2 does not pass test2!"
print("clustering2 passes test2!")

# bigger test case: 127
DEBUG = 1
n, nbits, X, repeated = readfile2("test3.txt")
k = clustering2(n, nbits, X, repeated)
assert k == 127, "clustering2 does not pass test3!"
print("clustering2 passes test3!")

cost-to-node map from test2.txt:
{'1111111110': 9, '0100011011': 8, '1111001001': 2, '1010001110': 0, '0010010101': 13, '1000111000': 14, '1110100111': 12, '1101111101': 7, '0100001001': 6, '0000010000': 10, '1100101001': 3, '0011011101': 1, '0001110110': 5, '1011101100': 4, '1111111001': 11, '1101000110': 15}
repeated entries from test2.txt:
[]
clustering2 passes test2!
repeated entries from test3.txt:
[]
clustering2 passes test3!


In [7]:
# timer grabbed from 
# https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python
from timeit import default_timer as timer
class benchmark(object):
    def __init__(self, msg, fmt="%0.3g"):
        self.msg = msg
        self.fmt = fmt

    def __enter__(self):
        self.start = timer()
        return self

    def __exit__(self, *args):
        t = timer() - self.start
        print(("%s : " + self.fmt + " seconds") % (self.msg, t))
        self.time = t

In [8]:
DEBUG = 0

with benchmark("Read file clustering_big.txt") as r:
    n, nbits, X, repeated = readfile2("clustering_big.txt")
    print("Number of repeated entries: {0}".format(len(repeated)))

with benchmark("Clustering 2") as r:
    k = clustering2(n, nbits, X, repeated)

print("kmax with spacing at least 3: {0}".format(k))

Number of repeated entries: 1209
Read file clustering_big.txt : 0.894 seconds
Clustering 2 : 251 seconds
