This file describes an instance of the problem. It has the following format:

[number_of_symbols]

[weight of symbol #1]

[weight of symbol #2]

...

For example, the third line of the file is "6852892," indicating that the weight of the second symbol of the alphabet is 6852892. (We're using weights instead of frequencies, like in the "A More Complex Example" video.)
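
As a minimal sketch of the format above, assuming the file holds one integer per line (the helper name `parse_instance` and the three-symbol sample are hypothetical, not part of the course data):

```python
def parse_instance(text):
    """Parse an instance: the symbol count first, then one weight per line."""
    # Drop blank lines so a trailing newline does not break int().
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    weights = [int(ln) for ln in lines[1:]]
    if len(weights) != n:
        raise ValueError("expected {0} weights, found {1}".format(n, len(weights)))
    return n, weights

# Made-up instance with three symbols.
sample = "3\n10\n20\n30\n"
n, weights = parse_instance(sample)
print(weights)  # [10, 20, 30]
```

Checking the declared count against the number of weights read is a cheap way to catch a truncated file before running the algorithm.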

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!

## Problem 1
Your task in this problem is to run the Huffman coding algorithm from lecture on this data set. What is the maximum length of a codeword in the resulting Huffman code?

## Problem 2
Continuing the previous problem, what is the minimum length of a codeword in your Huffman code?

# Huffman Coding

## Input
Probability $p_i$ for each character $i \in \Sigma$, where $\Sigma$ is the set of characters.

## Output
A binary tree $T$ (with leaves $\leftrightarrow$ symbols of $\Sigma$) minimizing the average encoding length:
\begin{align}
  L(T) = \sum_{i \in \Sigma} p_i [\text{depth of } i \text{ in } T].
\end{align}
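
To make the objective concrete, here is a small sketch that evaluates $L(T)$ for a given tree. The three-symbol alphabet, its probabilities, and the depths are made up for illustration, and `average_length` is a hypothetical helper name.

```python
def average_length(probs, depths):
    """L(T) = sum over symbols i of p_i * (depth of i in T)."""
    return sum(probs[s] * depths[s] for s in probs)

# Hypothetical alphabet: A is encoded as "0", B as "10", C as "11".
probs = {"A": 0.5, "B": 0.25, "C": 0.25}
depths = {"A": 1, "B": 2, "C": 2}
print(average_length(probs, depths))  # 1.5
```

A fixed-length code for three symbols needs 2 bits per symbol, so this tree saves half a bit on average.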

## Algorithm
**Greedy heuristic**: repeatedly merge the two symbols with the smallest frequencies into one combined symbol, until a single tree remains.
```
def Huffman_coding(S, frequencies p_i):
  # S: set of symbols; Tprime: forest collecting the merged subtrees
  if size of S == 1: return the single-leaf tree

  while size of S > 1:
    a, b = the two symbols in S with the smallest frequencies
    S = S with a, b replaced by the combined symbol ab
    p_ab = p_a + p_b
    frequencies = frequencies with p_a, p_b replaced by p_ab
    Tprime += node(ab, a, b)

  T = traverse Tprime and extend from root to leaves
  return T
```
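
The pseudocode can be turned into a short runnable sketch. This version is deliberately naive (it re-sorts the forest before every merge) so that it stays close to the pseudocode. The function name `huffman_codes`, the tie-breaking counter, and the four-symbol example are assumptions for illustration, and symbols are assumed to be strings rather than tuples.

```python
def huffman_codes(freqs):
    """Return {symbol: codeword} by repeatedly merging the two lightest trees."""
    # Forest entries are (weight, tiebreak, tree); a tree is either a symbol
    # or a (left, right) pair of subtrees.
    forest = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    counter = len(forest)
    while len(forest) > 1:
        # Sort on (weight, tiebreak) only, so trees are never compared.
        forest.sort(key=lambda entry: entry[:2])
        (wa, _, a), (wb, _, b) = forest[0], forest[1]
        forest = forest[2:] + [(wa + wb, counter, (a, b))]
        counter += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            # A one-symbol alphabet still gets a one-bit codeword.
            codes[tree] = prefix or "0"

    walk(forest[0][2], "")
    return codes

codes = huffman_codes({"A": 60, "B": 25, "C": 10, "D": 5})
print(sorted(codes.items()))
# [('A', '1'), ('B', '01'), ('C', '001'), ('D', '000')]
```

Swapping left and right children at any merge gives a different but equally good code, so only the codeword lengths are determined here, not the bits themselves.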

## Implementation
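**Heap**: put the frequencies into a min-heap; each merge pops the two smallest entries and pushes their sum, so building the tree takes $O(n \log n)$ time.<br><br>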
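
If only the answers to Problems 1 and 2 are needed, the heap can carry depth bookkeeping instead of tree nodes. The sketch below is one possible way to do this, not the lecture's method: each heap entry records the shallowest and deepest leaf of its subtree. The function name `codeword_length_range` is hypothetical, at least one weight is assumed, and with tied weights a different tie-breaking rule may give different lengths.

```python
import heapq

def codeword_length_range(weights):
    """Return (max_len, min_len) of a Huffman code without building the tree."""
    # Heap entries: (weight, tiebreak, shallowest leaf depth, deepest leaf depth).
    heap = [(w, i, 0, 0) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        wa, _, lo_a, hi_a = heapq.heappop(heap)
        wb, _, lo_b, hi_b = heapq.heappop(heap)
        # Merging pushes every leaf of both subtrees one level deeper.
        merged = (wa + wb, counter, min(lo_a, lo_b) + 1, max(hi_a, hi_b) + 1)
        heapq.heappush(heap, merged)
        counter += 1
    _, _, lo, hi = heap[0]
    return hi, lo

# Weights of the 15-symbol test case used later in this notebook.
test_weights = [895, 121, 188, 953, 378, 849, 153, 579,
                144, 727, 589, 301, 442, 327, 930]
print(codeword_length_range(test_weights))  # (6, 3)
```
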
**Sorted array**:
```
1. Sort the original frequencies into increasing order; this is the first queue.
2. Initialize an empty second queue for the frequencies of combined symbols.
3. While there is more than one node in the two queues:
     3-1. Dequeue the two nodes (a, b) with the lowest weights, looking at the fronts of both queues.
     3-2. Create an internal node (ab) with weight p_ab = p_a + p_b.
     3-3. Append p_ab to the back of the second queue.
4. The remaining node is the root node.
5. Traverse the tree and print the coding.
```
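
The steps above can be sketched with two `collections.deque` objects, which pop from the front in constant time. The function name `huffman_lengths_two_queues` and the tuple layout of the nodes are assumptions for illustration; a non-empty weight list is assumed, and ties are broken in favour of leaves, which is one valid choice among several.

```python
from collections import deque

def huffman_lengths_two_queues(weights):
    """Return [(weight, codeword length)] in increasing weight order."""
    ordered = sorted(weights)
    # Nodes are (weight, left, right, leaf_index); leaves have no children.
    leaves = deque((w, None, None, i) for i, w in enumerate(ordered))
    merged = deque()  # internal nodes, created in nondecreasing weight order

    def pop_lightest():
        # The lightest remaining node sits at the front of one of the queues.
        if not merged or (leaves and leaves[0][0] <= merged[0][0]):
            return leaves.popleft()
        return merged.popleft()

    while len(leaves) + len(merged) > 1:
        a = pop_lightest()
        b = pop_lightest()
        merged.append((a[0] + b[0], a, b, None))

    root = merged[0] if merged else leaves[0]
    lengths = [0] * len(ordered)
    stack = [(root, 0)]  # iterative traversal, so deep trees cannot overflow
    while stack:
        node, depth = stack.pop()
        if node[1] is None:
            lengths[node[3]] = depth
        else:
            stack.append((node[1], depth + 1))
            stack.append((node[2], depth + 1))
    return list(zip(ordered, lengths))

print(huffman_lengths_two_queues([60, 25, 10, 5]))
# [(5, 3), (10, 3), (25, 2), (60, 1)]
```

After the initial sort, the merge loop itself runs in linear time, because merged weights are produced in nondecreasing order and never need re-sorting.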

In [1]:
import numbers
from functools import total_ordering

@total_ordering
class Node(object):
    """ Simple tree node; nodes compare by weight so they can live in a heap. """
    def __init__(self, name, weight):
        # numbers.Number covers int, float and complex on Python 2 and 3
        # (types.IntType and friends exist only on Python 2).
        if not isinstance(weight, numbers.Number):
            raise ValueError("Weight of a Node must be a number")
        
        self.name = name
        self.weight = weight
        self.left = None
        self.right = None
    
    def __repr__(self):
        lname = "None" if self.left is None else self.left.name
        rname = "None" if self.right is None else self.right.name
        return "Node: {0}, Weight: {1}, Left: {2}, Right: {3}".format(self.name,
                                                                      self.weight,
                                                                      lname, rname)
    
    def __eq__(self, other):
        return self.weight == other.weight
    
    def __ne__(self, other):
        return self.weight != other.weight
    
    def __lt__(self, other):
        return self.weight < other.weight
    
    # total_ordering derives __le__, __gt__ and __ge__ from __eq__ and __lt__.
    
    def set_left(self, left):
        self.left = left
    
    def set_right(self, right):
        self.right = right
    
    def set_children(self, left, right):
        self.left = left
        self.right = right

In [2]:
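DEBUG = 2

import heapq

def readfile(filename):
    """ Read an instance file: the symbol count, then one weight per line. """
    n = None
    data = []
    with open(filename, 'r') as f:
        for linenum, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. at the end of the file
            if linenum == 0:
                n = int(line)
            else:
                # Symbols are named by their line number, starting at 1.
                data.append(Node(linenum, int(line)))
    
    if DEBUG > 1:
        print("Read from file {0}".format(filename))
        for i in data:
            print(i)
    
    return n, data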

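def print_Huffman_tree(node, addup_coding, codingmap):
    """ Walk the tree depth-first and record each leaf's codeword in codingmap. """
    if node is None:
        return
    
    # Internal nodes carry the placeholder name "i"; only leaves get a codeword.
    # The map is keyed by weight, so this assumes all symbol weights are distinct.
    if node.name != "i":
        codingmap[node.weight] = addup_coding
    
    print_Huffman_tree(node.left, addup_coding + "0", codingmap)
    print_Huffman_tree(node.right, addup_coding + "1", codingmap)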

def Huffman_coding_heap(data):
    """ Build the Huffman tree with a min-heap; return the weight-to-codeword map. """
    heapq.heapify(data)

    # Repeatedly merge the two lightest nodes until only the root is left.
    while len(data) > 1:
        left = heapq.heappop(data)
        right = heapq.heappop(data)
        w = left.weight + right.weight
        node = Node("i", w)  # "i" marks an internal node
        node.set_children(left, right)
        heapq.heappush(data, node)
    
    codingmap = {}
    print_Huffman_tree(data[0], "", codingmap)
    
    if DEBUG > 1:
        print("Weight-to-coding map from Huffman_coding_heap:")
        print(codingmap)
    
    return codingmap

def analyze_codingmap(codingmap):
    """ Return the max and min codeword lengths of a Huffman coding. """
    lengths = [len(code) for code in codingmap.values()]
    return max(lengths), min(lengths)

In [3]:
# test case: max--6, min--3
n, data = readfile('test.txt')
codingmap = Huffman_coding_heap(data)
max_len, min_len = analyze_codingmap(codingmap)
print("maximum length of a codeword = {0}".format(max_len))
print("minimum length of a codeword = {0}".format(min_len))

Read from file test.txt
Node: 1, Weight: 895, Left: None, Right: None
Node: 2, Weight: 121, Left: None, Right: None
Node: 3, Weight: 188, Left: None, Right: None
Node: 4, Weight: 953, Left: None, Right: None
Node: 5, Weight: 378, Left: None, Right: None
Node: 6, Weight: 849, Left: None, Right: None
Node: 7, Weight: 153, Left: None, Right: None
Node: 8, Weight: 579, Left: None, Right: None
Node: 9, Weight: 144, Left: None, Right: None
Node: 10, Weight: 727, Left: None, Right: None
Node: 11, Weight: 589, Left: None, Right: None
Node: 12, Weight: 301, Left: None, Right: None
Node: 13, Weight: 442, Left: None, Right: None
Node: 14, Weight: 327, Left: None, Right: None
Node: 15, Weight: 930, Left: None, Right: None
Weight-to-coding map from Huffman_coding_heap:
{930: '100', 579: '1101', 327: '11110', 301: '11001', 589: '1110', 144: '110001', 849: '010', 121: '110000', 727: '000', 153: '111110', 953: '101', 378: '0010', 188: '111111', 442: '0011', 895: '011'}
maximum length of a codeword = 6
minimum length of a codeword = 3
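
As a quick sanity check on the debug output above, the sketch below verifies two properties any Huffman code should have: no codeword is a prefix of another, and the Kraft sum $\sum_i 2^{-\ell_i}$ equals exactly 1 because the tree is full. The codewords are copied from the printed map; the variable names are illustrative.

```python
from fractions import Fraction

# Codewords copied from the weight-to-coding map printed above.
codingmap = {930: '100', 579: '1101', 327: '11110', 301: '11001',
             589: '1110', 144: '110001', 849: '010', 121: '110000',
             727: '000', 153: '111110', 953: '101', 378: '0010',
             188: '111111', 442: '0011', 895: '011'}

codes = sorted(codingmap.values())
# After lexicographic sorting, a prefix relation always shows up between neighbours.
prefix_free = all(not b.startswith(a) for a, b in zip(codes, codes[1:]))
# Exact arithmetic avoids any floating-point doubt about the sum.
kraft_sum = sum(Fraction(1, 2 ** len(c)) for c in codes)
print("{0} {1}".format(prefix_free, kraft_sum))  # True 1
```
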

In [4]:
# timer grabbed from 
# https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python
from timeit import default_timer as timer
class benchmark(object):
    def __init__(self, msg, fmt="%0.3g"):
        self.msg = msg
        self.fmt = fmt

    def __enter__(self):
        self.start = timer()
        return self

    def __exit__(self, *args):
        t = timer() - self.start
        print(("%s : " + self.fmt + " seconds") % (self.msg, t))
        self.time = t

In [5]:
DEBUG = 0
n, data = readfile('huffman.txt')
with benchmark("Huffman coding using heap") as r:
    codingmap = Huffman_coding_heap(data)
max_len, min_len = analyze_codingmap(codingmap)
# print("maximum length of a codeword = {0}".format(max_len))
# print("minimum length of a codeword = {0}".format(min_len))

Huffman coding using heap : 0.0183 seconds


In [6]:
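def find_min(data, dlen, secondary, slen):
    """ Pop the lighter front node of the two sorted lists.

    Returns the node and the updated lengths (dlen, slen).
    On a tie, the node from secondary is taken.
    """
    # Note: list.pop(0) shifts every remaining element, so each call is O(n).
    if dlen == 0:
        slen -= 1
        return secondary.pop(0), dlen, slen
    
    if slen == 0:
        dlen -= 1
        return data.pop(0), dlen, slen
    
    if data[0] < secondary[0]:
        dlen -= 1
        return data.pop(0), dlen, slen
    else:
        slen -= 1
        return secondary.pop(0), dlen, slen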

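def Huffman_coding_sorted(n, data):
    """ Build the Huffman tree from a sorted list plus a queue of merged nodes. """
    data = sorted(data)
    secondary = []  # merged nodes, appended in nondecreasing weight order
    
    # n must equal len(data); dlen and slen track the two list lengths.
    dlen, slen = n, 0
    while dlen + slen > 1:
        left, dlen, slen = find_min(data, dlen, secondary, slen)
        right, dlen, slen = find_min(data, dlen, secondary, slen)
        w = left.weight + right.weight
        node = Node("i", w)  # "i" marks an internal node
        node.set_children(left, right)
        secondary.append(node)
        slen += 1
    
    codingmap = {}
    # With a single symbol no merge happens, so the root is still in data.
    root = secondary[0] if secondary else data[0]
    print_Huffman_tree(root, "", codingmap)
    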
    if DEBUG > 1:
        print("Weight-to-coding map from Huffman_coding_sorted:")
        print(codingmap)
    
    return codingmap

In [7]:
# test case: max--6, min--3
DEBUG = 2
n, data = readfile('test.txt')
codingmap = Huffman_coding_sorted(n, data)
max_len, min_len = analyze_codingmap(codingmap)
print("maximum length of a codeword = {0}".format(max_len))
print("minimum length of a codeword = {0}".format(min_len))

Read from file test.txt
Node: 1, Weight: 895, Left: None, Right: None
Node: 2, Weight: 121, Left: None, Right: None
Node: 3, Weight: 188, Left: None, Right: None
Node: 4, Weight: 953, Left: None, Right: None
Node: 5, Weight: 378, Left: None, Right: None
Node: 6, Weight: 849, Left: None, Right: None
Node: 7, Weight: 153, Left: None, Right: None
Node: 8, Weight: 579, Left: None, Right: None
Node: 9, Weight: 144, Left: None, Right: None
Node: 10, Weight: 727, Left: None, Right: None
Node: 11, Weight: 589, Left: None, Right: None
Node: 12, Weight: 301, Left: None, Right: None
Node: 13, Weight: 442, Left: None, Right: None
Node: 14, Weight: 327, Left: None, Right: None
Node: 15, Weight: 930, Left: None, Right: None
Weight-to-coding map from Huffman_coding_sorted:
{930: '100', 579: '1101', 327: '11110', 301: '11001', 589: '1110', 144: '110001', 849: '010', 121: '110000', 727: '000', 153: '111110', 953: '101', 378: '0010', 188: '111111', 442: '0011', 895: '011'}
maximum length of a codeword = 6
minimum length of a codeword = 3

In [8]:
DEBUG = 0
n, data = readfile('huffman.txt')
with benchmark("Huffman coding using sorted array") as r:
    codingmap = Huffman_coding_sorted(n, data)
max_len, min_len = analyze_codingmap(codingmap)
# print("maximum length of a codeword = {0}".format(max_len))
# print("minimum length of a codeword = {0}".format(min_len))

Huffman coding using sorted array : 0.0149 seconds
