# Assignment 06: Huffman

## Reading

- [Greedy algorithms](https://jeffe.cs.illinois.edu/teaching/algorithms/book/04-greedy.pdf) (chapter 4) in Jeff Erickson's book. <br/>
  Huffman encoding is a good example of a greedy algorithm that happens to work. As we discussed, greedy algorithms rarely work by themselves. The strategy often requires some fortification in the form of dynamic programming. For now, we are grateful that this particular technique works.

- Leo's [slide deck about Huffman encoding](https://docs.google.com/presentation/d/1kSXEB7mzumoUm4pw7dhtxJxX7xckzjpUDyWlGfAxzAI/edit?usp=sharing). (Ignore slides 47 and higher; still working on them).

- [Ken Huffman's tribute to his uncle David Huffman](https://www.huffmancoding.com/my-uncle) (blog entry).

- [Morse code](https://en.wikipedia.org/wiki/Morse_code) is a variable-length code based on the frequency of symbols. Unlike Huffman codes, Morse code is not prefix-free. (Wikipedia article).

- [Ham radio](https://en.wikipedia.org/wiki/Amateur_radio) is an endeavor where Morse code is used. It is also a platform for experimentation at the boundary of electronics and computing today. (Wikipedia article).


<center>

![Huffman tree](https://raw.githubusercontent.com/lgreco/images/refs/heads/main/huffman/huffman_tree.png)

_A Huffman tree for the message `HELLO WORLD`. Leaf nodes for the most frequent symbols (for example, L with frequency 3) are closer to the root of the tree than those for less frequent symbols. Any node that is not a leaf stores the combined frequency of its child nodes._

</center>

The Huffman code for each symbol node is the path leading to that node. All symbol nodes are leaf nodes, thus ensuring that no symbol's code is a prefix of another symbol's code. For example, the path from the root node to symbol `H` is `0000`. No other symbol has a code that begins with `0000`. If such a symbol existed, with a code like `0000`**`0`** or `0000`**`1`**, it would have to be a left or a right child of `H`. But since all symbol nodes are leaf nodes, they cannot have child nodes.


## Huffman's Encoding

Huffman encoding compresses an input message using a prefix-free, variable-length code that assigns shorter codes to more frequent symbols.

For example, the message _HELLO WORLD,_ encoded in ASCII is represented by the values<br/>`72 69 76 76 79 32 87 79 82 76 68`.<br/>Their binary representation, 8 bits per character, is:

```text
0100100001000101010011000100110001001111001000000101011101001111010100100100110001000100
```

In ASCII, every character has the same length: 8 bits. So the message above is 88 bits long because there are 11 characters in `HELLO WORLD`.
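The fixed-length representation above can be reproduced with a few lines of Python. This is only a sketch; the variable names `ascii_values` and `ascii_bits` are ours.

```python
message = "HELLO WORLD"

# ord() gives the ASCII value; format(..., "08b") pads it to 8 binary digits.
ascii_values = [ord(char) for char in message]
ascii_bits = "".join(format(value, "08b") for value in ascii_values)

print(ascii_values)
print(ascii_bits)
print(len(ascii_bits))  # 11 characters * 8 bits each
```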

Morse code, on the other hand, assigns shorter representations to more frequent characters. For example, the most frequent letter in English, `E`, is represented by the shortest possible Morse code, a single dot `.`; the next most frequent, `T`, by a single dash `-`; and so on. The message `HELLO WORLD` in Morse code is transmitted as

```text
......-...-..---.-----.-..-..-..
```

which has a length of 32 symbols.

The problem with the Morse encoding above is that we can't tell the letters apart. For all we know, the message could correspond to `EEEEEETEEETEETTTETTTTTETEETEETEE`, which doesn't make sense. In reality, however, when listening to Morse code, we detect brief pauses between letters and words. Using a third symbol, `/`, to separate letters and words, the message becomes

```text
.... / . / .-.. / .-.. / --- // .-- / --- / .-. / .-.. / -..
```

Now its length is 42 symbols, still fewer than the 88 bits of the ASCII representation.
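To double-check the two Morse lengths, here is a small sketch. The `MORSE` table holds only the letters of our message, and the helper `to_morse` is a name made up for this illustration. It emits the separators without the surrounding spaces, which were added above only for readability.

```python
# International Morse code for just the letters in HELLO WORLD.
MORSE = {
    "H": "....", "E": ".", "L": ".-..", "O": "---",
    "W": ".--", "R": ".-.", "D": "-..",
}


def to_morse(message: str, separated: bool) -> str:
    """Encode message; if separated, use / between letters and // between words."""
    letter_gap = "/" if separated else ""
    word_gap = "//" if separated else ""
    words = message.split(" ")
    return word_gap.join(
        letter_gap.join(MORSE[char] for char in word) for word in words
    )


unseparated = to_morse("HELLO WORLD", separated=False)
separated = to_morse("HELLO WORLD", separated=True)
print(unseparated, len(unseparated))
print(separated, len(separated))
```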


## Outline of Huffman's method

### Frequencies

The encoding process begins by counting the frequencies of the symbols in the message. For the example message, `HELLO WORLD`, the symbol frequencies are given below. (The space symbol is shown as `⎵`).

| H   | E   | L   | O   | ⎵  | W   | R   | D   |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | 1   | 3   | 2   | 1   | 1   | 1   | 1   |

There are several ways to implement the frequency table above. An array with 256 elements (one for each possible 8-bit character value) is the most direct one. In this implementation, `frequency[i]` is the frequency of the symbol with ASCII value `i`. Alternatively, a key-value pair structure may capture the frequency of a symbol, for example `frequency['L']=3`.

```python
# Plain list approach
frequencies = [0] * 256
for char in message:
    frequencies[ord(char)] += 1

# Key-value pair with dictionary
frequencies = {}
for char in message:
    if char in frequencies:
        frequencies[char] += 1
    else:
        frequencies[char] = 1
```
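The standard library also offers a shortcut for the key-value approach: `collections.Counter` does the counting in one call. This is an optional alternative, not something the assignment requires.

```python
from collections import Counter

message = "HELLO WORLD"
frequencies = Counter(message)  # maps each symbol to its number of occurrences

print(frequencies["L"])
print(frequencies.most_common(2))  # the two most frequent symbols
```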


### From frequencies to nodes

The next step is to convert all the symbol-frequency pairs into binary tree nodes. A simple UML diagram for the node object is given below. Class Node is also provided in the codebase for this assignment.

```text
+-----------------------+
| Node                  |
+-----------------------+
| symbol: str | None    |
| frequency: int        |
| left: Node | None     |
| right: Node | None    |
+-----------------------+
| __init__: None        |
| __lt__: bool          |
| get_frequency: int    |
| set_left(Node): None  |
| set_right(Node): None |
+-----------------------+
```

The code below converts the symbol-frequency data into leaf nodes for a binary tree.

```python
# Plain list approach
forest = []
for i in range(256):
    if frequencies[i] > 0:
        node = Node(chr(i), frequencies[i])
        forest.append(node)

# Key-value pair with dictionary
forest = []
for symbol, frequency in frequencies.items():
    node = Node(symbol, frequency)
    forest.append(node)
```


### From nodes to forest

Once we have all the symbols with their frequencies in a forest of leaf nodes, we can execute the core of Huffman's algorithm.

$$
\begin{align*}
& \textbf{generate huffman tree}(\text{forest}): \\
& \qquad \textbf{while}\ \text{forest has more than 1 node}: \\
& \qquad \quad n_1, n_2 \leftarrow \text{remove the two nodes with the least frequencies from forest} \\
& \qquad \quad n_{12}\leftarrow \text{new symbol-less node with sum of frequencies from}\ n_1, n_2\\
& \qquad \quad n_{12}.\textsf{left},\ n_{12}.\textsf{right}\leftarrow n_1,\ n_2\\
& \qquad \quad \text{add}\ n_{12}\ \text{to forest}\\
& \qquad \textbf{return}\ \text{forest}
\end{align*}
$$
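One possible way to realize this loop in Python is with a min-heap from the `heapq` module, which makes "remove the node with the least frequency" cheap. The sketch below is self-contained, so it uses `SketchNode`, a hypothetical stripped-down stand-in for the assignment's `Node` class. It assumes a non-empty forest, and it returns the root itself rather than a one-element list. Note that the merged node goes back into the forest on every pass.

```python
import heapq
from collections import Counter


class SketchNode:
    """Hypothetical, stripped-down stand-in for the assignment's Node class."""

    def __init__(self, symbol, frequency, left=None, right=None):
        self.symbol = symbol
        self.frequency = frequency
        self.left = left
        self.right = right

    def __lt__(self, other):
        # heapq compares items with <, so order nodes by frequency.
        return self.frequency < other.frequency


def generate_huffman_tree(forest):
    """Merge the two least frequent nodes until a single root remains."""
    heapq.heapify(forest)  # rearrange the list into a min-heap, in place
    while len(forest) > 1:
        n1 = heapq.heappop(forest)
        n2 = heapq.heappop(forest)
        n12 = SketchNode(None, n1.frequency + n2.frequency, left=n1, right=n2)
        heapq.heappush(forest, n12)  # the merged node rejoins the forest
    return forest[0]


forest = [SketchNode(s, f) for s, f in Counter("HELLO WORLD").items()]
root = generate_huffman_tree(forest)
print(root.frequency)  # total number of symbols in the message
```

When several nodes share the same frequency, the order in which they are merged depends on tie-breaking, so the resulting tree (and the individual codes) may differ from the figures in this document. The total length of the encoded message stays the same.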


### From forest to a tree

At the end of the `while` loop above, there is only one node in the `forest`. It is now the root node of a binary tree whose leaf nodes are all the symbol nodes that were initially in the `forest`. Let's call it the *Huffman tree.* The path from the root of the tree to each leaf node is a unique, prefix-free encoding for the symbol in the leaf node. The length of the path shrinks as the frequency of the symbol grows: a more frequent symbol never gets a longer code than a less frequent one, so frequent symbols sit closer to the root.

<center>

![Huffman tree](https://raw.githubusercontent.com/lgreco/images/refs/heads/main/huffman/huffman_tree_smaller.png)

</center>

The path to each leaf node in the binary tree represents a unique, prefix-free, variable-length code for the corresponding symbol. For example, the path to the leaf node for `L` is left of the root, then right. Using `0` to indicate the left child of a tree node and `1` for the right, `L` is encoded as `01`. Because it is a leaf node, there is no other symbol whose encoding will ever begin with `01`; the encoding is prefix-free.

By traversing the tree to each leaf node, we obtain the encoding for each symbol:

<center>

| H    | E    | L  | O   | ⎵   | W   | R   | D   |
|------|------|----|-----|-----|-----|-----|-----|
| 0000 | 0001 | 01 | 001 | 100 | 101 | 110 | 111 |

</center>
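The traversal itself can be sketched recursively. To keep the example self-contained, the tree below is hand-built from nested tuples so that it reproduces the table above; it is not the assignment's `Node` class, and the name `assign_codes` is ours.

```python
# Hand-built tree matching the table above. A leaf is a symbol (a str);
# an internal node is a (left, right) tuple. The space symbol is a plain " ".
TREE = (
    ((("H", "E"), "O"), "L"),
    ((" ", "W"), ("R", "D")),
)


def assign_codes(node, path="", codes=None):
    """Walk the tree, appending 0 for a left turn and 1 for a right turn."""
    if codes is None:
        codes = {}
    if isinstance(node, str):  # leaf: the path so far is the symbol's code
        codes[node] = path
    else:
        left, right = node
        assign_codes(left, path + "0", codes)
        assign_codes(right, path + "1", codes)
    return codes


codes = assign_codes(TREE)
print(codes)
```

With the assignment's `Node` objects, the same idea would rely on `is_leaf()`, `get_left()`, `get_right()`, and `get_symbol()` instead of tuple unpacking.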

Using these codes, the `HELLO WORLD` message is encoded as

```text
00000001010100110010100111001111
```
This encoding is 32 bits long, much shorter than the 88 bits of the ASCII representation and shorter than the 42 symbols of the separated Morse code.
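As a sanity check, the sketch below encodes the message with the code table from this section and verifies that no code is a prefix of another. The helper names `encode` and `is_prefix_free` are ours, chosen for illustration.

```python
# Code table copied from this section (the space symbol is a plain " ").
CODES = {
    "H": "0000", "E": "0001", "L": "01", "O": "001",
    " ": "100", "W": "101", "R": "110", "D": "111",
}


def is_prefix_free(codes):
    """True when no code is the beginning of a different code."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def encode(message, codes):
    """Replace every symbol with its code and concatenate the results."""
    return "".join(codes[symbol] for symbol in message)


encoded = encode("HELLO WORLD", CODES)
print(is_prefix_free(CODES))
print(encoded, len(encoded))
```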

### From encoded to decoded content

Given an encoded message like `00000001010100110010100111001111`, how do we unpack it to its original form `HELLO WORLD`? This is where things get super easy.

$$
\begin{align*}
& \textbf{decode}(\text{message, root}): \\
& \qquad n\leftarrow\text{root}\\
& \qquad \text{decoded} \leftarrow \texttt{""}\\
& \qquad \textbf{for each}\ \text{character}\ c\ \textbf{in}\ \text{message}:\\
& \qquad \quad \textbf{if}\ c=0 \\
& \qquad \quad\quad n \leftarrow n.\textsf{left} \\
& \qquad \quad \textbf{else}\\
& \qquad \quad\quad n \leftarrow n.\textsf{right} \\
& \qquad \quad \textbf{if}\ n\ \textbf{is}\ \text{leaf node} \\
& \qquad \quad\quad  \text{decoded} \leftarrow \text{decoded} + n.\textsf{symbol} \\
& \qquad \quad\quad n \leftarrow \text{root} \\
& \qquad \textbf{return}\ \text{decoded}
\end{align*}
$$
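The decoding loop translates almost line for line into Python. In the sketch below, nested dictionaries stand in for the tree so that the block runs on its own: the keys `"0"` and `"1"` lead to the left and right children, and a `"symbol"` key marks a leaf. The helper `build_tree` is a hypothetical convenience that rebuilds the tree from the code table; with the assignment's `Node` class you would walk the real tree instead.

```python
CODES = {
    "H": "0000", "E": "0001", "L": "01", "O": "001",
    " ": "100", "W": "101", "R": "110", "D": "111",
}


def build_tree(codes):
    """Rebuild the tree as nested dicts from a symbol-to-code table."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code:
            node = node.setdefault(bit, {})  # create the child if missing
        node["symbol"] = symbol
    return root


def decode(message, root):
    """Follow one bit at a time; emit a symbol whenever a leaf is reached."""
    node = root
    decoded = []
    for bit in message:
        node = node[bit]  # "0" goes left, "1" goes right
        if "symbol" in node:  # reached a leaf
            decoded.append(node["symbol"])
            node = root  # restart at the root for the next symbol
    return "".join(decoded)


print(decode("00000001010100110010100111001111", build_tree(CODES)))
```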

### Work to do

Assemble the functions necessary to encode and decode a message using Huffman's technique. Ideally, place them in a class; this minimizes the number of variables you need to pass as input arguments or return values. If you want to implement stand-alone functions, that's fine too.

The schematics of the encoder and decoder are shown below.

<center>

![](https://raw.githubusercontent.com/lgreco/images/refs/heads/main/huffman/huffman_encoder.png)

---

![](https://raw.githubusercontent.com/lgreco/images/refs/heads/main/huffman/huffman_decoder.png)


</center>

One *big* question here is *how* to pass the Huffman tree to the decoding process. The obvious answer is to pass the root of the tree (and therefore the entire tree) as an argument to the appropriate method, together with the encoded message that we would like to decode. This works fine for illustrative demonstrations of encoding/decoding. However, if we wish to encode a message and transmit it over a network, it is not clear how to also transmit a binary tree. In such cases we'll need to [serialize](https://en.wikipedia.org/wiki/Serialization) the object. For now, we don't have to worry about serialization.

It is important to assess the quantitative merits of the encoding. In our example of encoding `HELLO WORLD`, we went from 88 bits in plain ASCII to 32 bits of Huffman codes. That's roughly a 64% reduction in the space required for the message. **However,** it would take about 3,200 bytes to store the Huffman tree. In other words, we need an additional 25,600 bits to compress an 88-bit message down to 32 bits. That's not very efficient.
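The arithmetic behind this assessment fits in a short helper. The function name `savings_report` and its dictionary keys are made up for this sketch; the inputs are the figures quoted above (88 bits, 32 bits, and about 3,200 bytes for the tree).

```python
def savings_report(original_bits: int, encoded_bits: int, tree_bytes: int) -> dict:
    """Summarize compression with and without the cost of storing the tree."""
    tree_bits = tree_bytes * 8
    return {
        "saved_bits": original_bits - encoded_bits,
        "reduction": 1 - encoded_bits / original_bits,
        "tree_bits": tree_bits,
        # Negative when the tree costs more than the encoding saves.
        "net_saved_bits": original_bits - (encoded_bits + tree_bits),
    }


report = savings_report(original_bits=88, encoded_bits=32, tree_bytes=3200)
print(report)
```

For short messages the net saving is negative; it can only turn positive once the message is long enough to amortize the size of the tree.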

Your code should include functionality that measures how many bits we save with the Huffman encoding, but also how many bits we need to store the Huffman tree. To measure how much memory is required for an object (such as the root of a tree, and therefore the tree itself), you'll need a tool like [`pympler.asizeof.asizeof`](https://pympler.readthedocs.io/en/latest/#). To use it:

```python
from pympler import asizeof
# ... and later
memory_required = asizeof.asizeof(some_object)
```

`pympler` is not part of the basic Python distribution and you may have to install it before importing it. From the command line interface you can type

```text
pip install pympler
```

If you prefer to install directly from a Jupyter notebook, use

```text
!pip install pympler
```
in the first line of your code. Note the exclamation point at the beginning of the line: it is needed when installing from notebook code, but not when installing from the CLI.

To test the efficiency of encoding, you may use any of the **plain text** versions of the books in [Project Gutenberg](https://www.gutenberg.org/) as inputs. My favorite one is James Joyce's *Ulysses* because it's massive. For this testing, you may want to brush up on your Python [file reading skills](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files).
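A minimal sketch for loading a book and counting its symbols follows. It assumes the file is UTF-8 encoded, and the name `count_symbols` is ours. Keep in mind that a book may contain characters whose code is above 255; those would not fit in the 256-element list described earlier, but a dictionary-style table handles them without changes.

```python
from collections import Counter
from pathlib import Path


def count_symbols(path: str) -> Counter:
    """Read a whole plain-text file and count how often each character occurs."""
    text = Path(path).read_text(encoding="utf-8")
    return Counter(text)


# Usage (hypothetical file name, downloaded beforehand):
# frequencies = count_symbols("ulysses.txt")
```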

## What to submit?

A single **`.py`** file with your code, uploaded on Sakai. As always, your best coding skills should shine through this assignment.

---

# Codebase


```python
from __future__ import annotations  # for cool type hinting


class Node:
    """A simple binary tree node for a basic Huffman encoder."""

    def __init__(self, symbol: str | None, frequency: int):
        """Object constructor.

        Inputs
        ------
        symbol : str or None
          The symbol whose frequency we capture. If symbol is None, the node
          captures frequencies for subtrees under the node.
        frequency : int
          The frequency represented by this node. If the node also has a symbol,
          this is the frequency of the symbol. If no symbol is present, this is
          the sum of frequencies of the node's subtrees.

        Returns
        -------
        Instance of Node object with fields:
          frequency : as described above
          symbol : as described above
          left : pointer to left node child (default None)
          right : pointer to right node child (default None)
        """
        self.__frequency: int = frequency
        self.__symbol: str | None = symbol
        self.__left: None | Node = None
        self.__right: None | Node = None

    def __lt__(self, other: Node) -> bool:
        """Redefine < for node to be based on frequency value."""
        return self.__frequency < other.get_frequency()

    def set_left(self, left: Node | None) -> None:
        """Setter for left child."""
        self.__left = left

    def set_right(self, right: Node | None) -> None:
        """Setter for right child."""
        self.__right = right

    def has_left(self) -> bool:
        """Predicate accessor for left child."""
        return self.__left is not None

    def has_right(self) -> bool:
        """Predicate accessor for right child."""
        return self.__right is not None

    def get_left(self) -> Node | None:
        """Accessor for left child."""
        return self.__left

    def get_right(self) -> Node | None:
        """Accessor for right child."""
        return self.__right

    def get_symbol(self) -> str | None:
        """Accessor for the symbol in a leaf node (None for internal nodes)."""
        return self.__symbol

    def get_frequency(self) -> int:
        """Accessor for frequency."""
        return self.__frequency

    def is_leaf(self) -> bool:
        """Determines if node is leaf node, indicated by the
        absence of both child pointers."""
        return self.__left is None and self.__right is None

    def __str__(self) -> str:
        """String representation of object."""
        return f"[ {self.__symbol}: {self.__frequency} ]"
```