# Huffman Encoding

Given the following code, develop a small app to compress and decompress a message using Huffman Encoding. Also, compute the actual efficiency of the compression. Ideally, you'll need at least two principal methods:

* `compress(message)` and
* `decompress(message, encoding_tree)`

## `compress`

The *formal* signature of this method should be

```python
def compress(message: str) -> Tuple[str, Huffman_Node, float]:
```

If you are not familiar with [Python type hints](https://docs.python.org/3/library/typing.html), it means that method `compress` takes a string as an argument and returns a tuple comprising a string, Huffman_Node, and a real number. The returned string is the compressed message and the returned Huffman Node is the root of the encoding tree. The real number is the compression ratio.

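If you build an intermediate encoding dictionary that maps each symbol to its Huffman code, the key and the value in the dictionary are both strings. A typical entry in that dictionary would look like `'H': '1100'`.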

The compression ratio is the size of the compressed message divided by the size of the message if it were left in ASCII code. The size in ASCII code is always `8*len(message)` bits because there are 8 bits per ASCII character.

For example, here are the Huffman and ASCII encodings of `HELLO WORLD`:

```text
Huffman: 11001101101001010011101100111001010
  ASCII: 0100100001000101010011000100110001001111001000000101011101001111010100100100110001000100

```
There are 35 bits in the Huffman encoding and 88 in the ASCII. The compression ratio is $\dfrac{35}{88}$. In other words, Huffman compressed the original message to about 40% of its original size.
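
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names `ascii_bits` and `compression_ratio` are made up for illustration and are not part of the required interface.

```python
def ascii_bits(message: str) -> str:
    """Return the 8-bit-per-character binary string for an ASCII message."""
    return "".join(format(ord(ch), "08b") for ch in message)


def compression_ratio(compressed_bits: str, message: str) -> float:
    """Size of the compressed bit string over the 8-bit ASCII size."""
    return len(compressed_bits) / (8 * len(message))


message = "HELLO WORLD"
# Huffman bit string copied from the example above.
huffman_bits = "11001101101001010011101100111001010"

print(len(huffman_bits), len(ascii_bits(message)))  # 35 88
print(round(compression_ratio(huffman_bits, message), 4))  # 0.3977
```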


## `decompress`

The formal signature of this method is

```python
def decompress(message: str, encoding_tree: Huffman_Node) -> str:
```

The method takes a compressed message and its encoding tree, and restores it in ASCII form (the actual symbols, not their binary representation) for displaying.


## Actual efficiency

Given a message $M$ whose ASCII encoding has $m > 0$ bits, and its Huffman compression $N$ with $n > 0$ bits, the compression ratio is defined as

$$ r = \frac{n}{m} $$

The better the compression, the closer $r$ gets to 0. A higher $r$ indicates poor compression.

The actual efficiency is the ratio

$$ r_\text{a} = \frac{n+t}{m} $$

where $t$ is the size of the encoding table. For relatively small values of $m$, we may get $r_\text{a} > 1$. Any value greater than 1 suggests that it's better to transmit the message in plain ASCII than attempt to compress it.

Consider the earlier example with `HELLO WORLD`. The Huffman encoding is 35 bits long, which is about 40% of the ASCII encoding. That seems like good compression, until we realize that in addition to those 35 bits we also have to transmit the encoding table so that whoever receives the compressed message can decompress it.

The Huffman encoding table for `HELLO WORLD` requires storage for its keys (`H`, `E`, `L`, `O`, `' '`, `W`, `R`, `D`) and its values. There are 8 keys, all ASCII symbols, requiring 8 bits each. The Huffman codes (`1100` for `H`, `1101` for `E`, etc.) require one byte each as well. That's a total of 16 bytes, or 128 bits. The actual efficiency is


$$ r_\text{a} = \frac{35+128}{88} \approx 1.85 $$

Realistically, the ratio is even higher, taking into consideration additional storage required to manage the data structure.
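
As a quick check of the arithmetic above, here is a small sketch. The function name `actual_efficiency` and its parameter names are assumptions chosen for illustration.

```python
def actual_efficiency(n_bits: int, table_bits: int, m_bits: int) -> float:
    """r_a = (n + t) / m, with every quantity measured in bits."""
    return (n_bits + table_bits) / m_bits


num_symbols = 8
# One byte per key plus one byte per Huffman code, as in the text.
table_bits = num_symbols * 8 + num_symbols * 8

r_a = actual_efficiency(n_bits=35, table_bits=table_bits, m_bits=88)
print(table_bits, round(r_a, 2))  # 128 1.85
```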

Modify your `decompress` method to return not only the decompressed message but also the actual efficiency of the compression. For $t$, assume 16 bits for each node in the Huffman tree.
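
To get a feel for $t$ under the 16-bits-per-node assumption: a Huffman tree with $k$ distinct symbols has $k$ leaves and $k-1$ internal nodes, so $2k-1$ nodes in total. The sketch below uses that fact together with the 35-bit encoding from the earlier example; the helper name `tree_bits` is hypothetical.

```python
BITS_PER_NODE = 16  # assumption stated in the text above


def tree_bits(distinct_symbols: int) -> int:
    """Size of a Huffman tree in bits: (2k - 1) nodes at 16 bits each."""
    return BITS_PER_NODE * (2 * distinct_symbols - 1)


message = "HELLO WORLD"
k = len(set(message))                 # number of distinct symbols
t = tree_bits(k)                      # 16 bits * 15 nodes
r_a = (35 + t) / (8 * len(message))   # 35 bits from the earlier example
print(k, t, r_a)  # 8 240 3.125
```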

Also, can you imagine/propose a way to ensure $r_\text{a} < 1 $ even for very small values of $m$?

## Testing

If you want to test your code with a very long message, consider a clear text version of a book from [Project Gutenberg](https://www.gutenberg.org/).


In [19]:
class Huffman_Node:

    def __init__(self, frequency, symbol=None):
        """
        Constructor for a new Huffman tree node.

        The node must always contain a frequency value. It may, or may not, contain
        a symbol value as well. As this is a node for a binary tree, it contains
        pointers to the left and the right children.
        """
        self.frequency = frequency
        self.symbol = symbol
        self.left = None
        self.right = None

    def __lt__(self, other):
        """Redefine < for comparison between Huffman_Node objects."""
        return self.frequency < other.frequency

    def postOrder(self):
        """ postOrder method to print the huffman encoded tree. """
        if self.left is not None:
            self.left.postOrder()
        if self.right is not None:
            self.right.postOrder()

        if self.symbol == " ":
            print(f"_={self.frequency} | ", end="")
        else:
            print(f"{self.symbol}={self.frequency} | ", end="")

    def countNodes(self):
        """ Method to count the number of nodes in the huffman tree """
        # I'm assuming we count all nodes, including ones without symbols.
        # This is because they're also adding cost to the compression ratio.
        count = 1
        if self.left is not None:
            # recurse left and count the nodes
            count += self.left.countNodes()
        if self.right is not None:
            # recurse right and count the nodes
            count += self.right.countNodes()
        return count


In [20]:
def frequency_dict(string):
  """ Parse a string and compute frequencies of its characters. """
  frequencies = dict()
  if string is not None:
    for symbol in string:
      if symbol in frequencies:
        frequencies[symbol] +=1
      else:
        frequencies[symbol] = 1
  return frequencies

def least_frequency(array_of_nodes):
  """
  Find and remove from an array, the node with the least frequency.

  Given an array of nodes, we want to remove the node with the least frequency
  value. This is done with a plain min search across the array. The identified
  node is removed from the array and the array length is reduced by 1.
  """
  # Assume smallest node is the first one in the array
  smallest_index = 0
  smallest = array_of_nodes[smallest_index]
  # Traverse the remaining array looking for nodes smaller than the smallest one
  # using the "reprogrammed" less-than operator in Huffman_Node.
  for i in range(1, len(array_of_nodes)):
    if array_of_nodes[i] < smallest:
      # Found node smaller than the smallest one
      smallest_index = i
      smallest = array_of_nodes[smallest_index]
  return array_of_nodes.pop(smallest_index)

In [21]:
def huffman_encoding(message):
  """Produces the Huffman encoding tree for a given message."""
  # Obtain the frequencies for every symbol in the message
  frequencies = frequency_dict(message)
  # Initialize a list for all the symbol nodes
  forest = list()
  # Create a symbol node and add it to the forest
  for item in frequencies:
    forest.append(Huffman_Node(frequencies[item], item))
  # Forest ready for exploration. Repeatedly remove the two nodes with the
  # lowest frequency from the forest, combine their frequencies into a new node
  # make the removed nodes the new node's children, and add new node to forest.
  while len(forest) > 1:
    # Remove nodes with lowest frequency
    t1 = least_frequency(forest)
    t2 = least_frequency(forest)
    # Create new node with sum of frequencies of removed nodes
    new_node = Huffman_Node(t1.frequency+t2.frequency)
    # Make removed nodes the children of the new node
    new_node.left = t1
    new_node.right = t2
    # Add new node back to the forest
    forest.append(new_node)
  # Return the only node in the forest. Effectively, this is the root node
  # of the Huffman tree.
  return forest[0]
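
The linear min search in `least_frequency` makes each removal cost time proportional to the forest size. One common alternative, sketched below, is to keep the forest in a binary heap using Python's standard `heapq` module. This is not the method used in this notebook; the names `Node`, `build_tree_with_heap`, and `encoded_length` are hypothetical, and `Node` is a stripped-down stand-in for `Huffman_Node` so that the block runs on its own.

```python
import heapq
from collections import Counter


class Node:
    """Minimal stand-in for Huffman_Node (hypothetical, for this sketch only)."""

    def __init__(self, frequency, symbol=None, left=None, right=None):
        self.frequency = frequency
        self.symbol = symbol
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq orders items with <, so this is all it needs.
        return self.frequency < other.frequency


def build_tree_with_heap(message: str) -> Node:
    """Build a Huffman tree for a non-empty message using a min-heap."""
    heap = [Node(freq, sym) for sym, freq in Counter(message).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least frequent nodes and push their combined parent.
        t1 = heapq.heappop(heap)
        t2 = heapq.heappop(heap)
        heapq.heappush(heap, Node(t1.frequency + t2.frequency, left=t1, right=t2))
    return heap[0]


def encoded_length(node: Node, depth: int = 0) -> int:
    """Total bits needed: sum of frequency * depth over all leaves."""
    if node.symbol is not None:
        return node.frequency * depth
    return encoded_length(node.left, depth + 1) + encoded_length(node.right, depth + 1)


root = build_tree_with_heap("HELLO WORLD")
print(root.frequency, encoded_length(root))  # 11 32
```

Ties between equal frequencies may be broken differently than in the list-based version, so the tree shape and individual codes can differ, but the total encoded length is the same.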

In [22]:
def create_encoded_dictionary(node, code="", encoded_dict=None) -> dict:
    """
    This method takes a root node for a huffman tree and returns a dictionary
    with all of the paths to each leaf node and its corresponding character.
    """
    # Create a fresh dictionary on the top-level call. A mutable default
    # argument ({}) would be shared across calls and leak codes from one
    # message into the next.
    if encoded_dict is None:
        encoded_dict = {}

    # Base case, stop the recursion
    if node is None:
        return encoded_dict

    # If we've hit a leaf node / symbol, add it to the dictionary
    if node.symbol is not None:
        encoded_dict[node.symbol] = code

    # Recurse left and right, when we go left, we add 0 to the encoding
    # when we go right, we add 1 to the encoding
    create_encoded_dictionary(node.left,  code + "0", encoded_dict)
    create_encoded_dictionary(node.right, code + "1", encoded_dict)

    # Return the dictionary that maps every symbol to its Huffman code
    return encoded_dict

def compress(message: str) -> tuple[str, Huffman_Node, float]:
    # Encode the message
    encoding = huffman_encoding(message)

    # Create the encoding dictionary: each key is a symbol from the
    # original message and each value is the Huffman code for that symbol
    encoded_dict = create_encoded_dictionary(encoding)

    # Define encoded_message as a list, then convert to string later.
    # We do this because lists are mutable and strings are not, making
    # this better for memory
    encoded_message = []
    # Add all of the encoded strings to the list
    for char in message:
        encoded_message.append(encoded_dict[char])
    # Convert the list into a string
    encoded_message = ''.join(encoded_message)

    # Calculate the compression ratio, which is the size of the huffman
    # string, divided by the size of the ascii value
    r = len(encoded_message) / (8 * len(message))

    # return a tuple of the encoded message, the huffman encoding tree, and
    # the compression ratio
    return (encoded_message, encoding, r)

compress("HELLO WORLD")

('11101111101011000000111001010011',
 <__main__.Huffman_Node at 0x7c561038dc00>,
 0.36363636363636365)

In [13]:
def decompress(message: str, encoding_tree: Huffman_Node) -> tuple[str, float]:
    """
    The method takes a compressed message and its encoding tree,
    and restores it in ASCII form (the actual symbols, not their
    binary representation) for displaying.

    I am assuming the encoding_tree is valid for the given string

    This returns the original message in english, and the actual
    compression ratio, including the tree
    """
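    # Return an empty message if there is nothing to decode. The ratio is
    # undefined for an empty message, so 0.0 is used as a placeholder to keep
    # the return type consistent.
    if not message:
        return "", 0.0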

    # Define the return variable, which will be the decoded message.
    # Will convert to a string later, I am using a list now because they
    # are mutable, whereas a string is not.
    decoded_message = []

    # Define the cursor, which will traverse the Huffman tree
    cursor = encoding_tree

    # Loop over each character in the encoded message,
    # this will be a '0' or a '1'
    for char in message:
        # Based on the character, we know whether we need to go
        # left or right in the Huffman tree.
        # Go left
        if char == '0':
            cursor = cursor.left
        # Go right
        elif char == '1':
            cursor = cursor.right

        # Check if we've hit a leaf node, i.e. we found a character.
        if cursor.symbol is not None:
            # Add that character to the message
            decoded_message.append(cursor.symbol)
            # Reset pointer back to top of tree for next element
            cursor = encoding_tree

    # convert the decoded message to a string
    decoded_message = ''.join(decoded_message)

    # Get the ascii size of the message and encoding size, to
    # calculate the compression ratio
    ascii_size = 8 * len(decoded_message)
    encoding_size = len(message)

    # Assume 16 bits for each node in huffman
    t = 16 * encoding_tree.countNodes()

    # Calculate r, the compression ratio, according to the formula
    # if r > 1, it was wasteful to encode past the ascii encoding.
    r = (encoding_size + t) / ascii_size

    # return the decoded_message, and the true compression ratio
    return decoded_message, r

encoded_string, encoding_tree, _ = compress("HELLO WORLD")
decompress(encoded_string, encoding_tree)

('HELLO WORLD', 3.090909090909091)

In [25]:
# Sample use case
def main(message):
    # Compress the message
    encoded_message, encoding_tree, r = compress(message)

    # Print out the compressed message, its equivalent tree, and the compression ratio
    print("===========================")
    print(f"Encoded message =  {encoded_message}")
    print(f"Huffman encoding tree root node  {encoding_tree} \nHuffman encoding tree postOrder traversal")
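    # postOrder() prints the tree itself and returns None, so call it
    # directly instead of formatting its return value.
    encoding_tree.postOrder()
    print()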
    print(f"compression ratio =  {r}")
    print("===========================")
    # Decompress the message, returning to its original form
    decompressed_m1, true_r = decompress(encoded_message, encoding_tree)
    print(f"decompressed message =  {decompressed_m1}")
    print(f"True compression ratio =  {true_r}")
    print("===========================\n")


# First use case
message_1 = "HELLO WORLD"
main(message_1)

# Second use case
message_2 = "A PERSON WHO THINKS ALL THE TIME HAS NOTHING TO THINK ABOUT EXCEPT THOUGHTS."\
            " SO, HE LOSES TOUCH WITH REALITY AND LIVES IN A WORLD OF ILLUSIONS."
main(message_2)

Encoded message =  11101111101011000000111001010011
Huffman encoding tree root node  <__main__.Huffman_Node object at 0x7c56103dbb80> 
Huffman encoding tree postOrder traversal
_=1 | W=1 | None=2 | R=1 | D=1 | None=2 | None=4 | L=3 | O=2 | H=1 | E=1 | None=2 | None=4 | None=7 | None=11 | None
compression ratio =  0.36363636363636365
decompressed message =  HELLO WORLD
True compression ratio =  3.090909090909091

Encoded message =  0011111010001100100100101011010101111001011100110111100011001011010101110010101110011011001101110001100100111100010111000010100111111000011101011101011101000110010110101011101111000110111100011001011010101110011100111000011110101001000111100110001000111101001010001000111000110011010100101110111000001010011111111101011011000101111110010011110110110110101001101011100011010100101111011001110010110110001100111001001001001101101011000100011011100110101100000111011010111000111100110101111011010111100111110010111010010001101000001111101010000111101101100110010011010

Regarding the question "Can you imagine/propose a way to ensure $r_\text{a} < 1$ even for very small values of $m$?":

I'm not sure of the best way to do this, but one option is to use predefined codes that are shorter than ASCII and known to both sender and receiver, so the encoding table does not need to be transmitted with the message.
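
One way to picture the predefined-values idea is a fixed code that sender and receiver agree on in advance, so no table travels with the message. The sketch below is an illustration, not a Huffman code: it assumes messages use only a hypothetical 32-symbol alphabet (space, `A` to `Z`, and five punctuation marks), which fits in 5 bits per symbol. The names `ALPHABET`, `encode_fixed`, and `decode_fixed` are made up for this example.

```python
# Hypothetical shared alphabet: exactly 32 symbols, so 5 bits per symbol.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ.,?!'"
BITS = 5


def encode_fixed(message: str) -> str:
    """Encode each symbol as its 5-bit index in the shared alphabet.

    Raises ValueError for symbols outside the agreed alphabet.
    """
    return "".join(format(ALPHABET.index(ch), "05b") for ch in message)


def decode_fixed(bits: str) -> str:
    """Decode a bit string produced by encode_fixed, 5 bits at a time."""
    return "".join(
        ALPHABET[int(bits[i:i + BITS], 2)] for i in range(0, len(bits), BITS)
    )


message = "HELLO WORLD"
bits = encode_fixed(message)
ratio = len(bits) / (8 * len(message))
print(len(bits), ratio)  # 55 0.625
```

With no table to send, $t = 0$ and the ratio stays at $5/8$ for any message length, at the cost of supporting only the agreed alphabet and ignoring symbol frequencies.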

---





### Appendix: Java equivalent of a binary tree node

If you wish to do your work in Java, the Huffman node is given below.


```java
public class HuffmanNode implements Comparable<HuffmanNode> {
  char symbol;
  int frequency;
  HuffmanNode left;
  HuffmanNode right;
  /** Basic constructor */
  public HuffmanNode(char symbol, int frequency) {
    this.symbol = symbol;
    this.frequency = frequency;
    this.left = null;
    this.right = null;
  }
  /** Even more basic constructor: an internal node with no symbol */
  public HuffmanNode(int frequency) {
    this('\0', frequency);
  }
  // Mutators
  public void setLeft(HuffmanNode left) { this.left = left; }
  public void setRight(HuffmanNode right) { this.right = right; }
  // Accessors
  public HuffmanNode getLeft() { return this.left; }
  public HuffmanNode getRight() { return this.right; }
  public int getFrequency() { return this.frequency; }
  public char getSymbol() { return this.symbol; }
  // Comparable
  @Override
  public int compareTo(HuffmanNode o) { return Integer.compare(this.frequency, o.getFrequency()); }
} // class HuffmanNode
```