# Lab 11 Examples (Huffman Trees)

Press Shift+Enter in each code cell to run the code. Be sure to start with the cell containing the `#include` directives so that the required libraries are loaded.

In [25]:
// For Lab 11, we are limited to using only the following #include directives:

#include <iostream>
#include <string>
#include <iomanip>
#include <cmath>
#include <cstdio>

# Overview

In today's lab, we will look at Huffman Trees. A Huffman Tree is a data structure used for data compression. We will build a Huffman Tree and use it to decode an encoded string. Before we look specifically at Huffman Trees, though, let's review trees and binary trees in general.

## Trees and Binary Trees

A **tree** is a special type of graph that has the following properties:
- It is connected, meaning there is a path between any two nodes.
- It has no cycles, meaning there is no way to start at a node and follow a path that leads back to the same node.
- It has a designated root node, which is the topmost node in the tree.

Outside, a tree's roots are at the bottom, and they branch upward. Trees as a data structure are the opposite: the root is at the top, and the branches extend downward.

A **binary tree** is a type of tree where each node has at most two children, a left child and a right child.

Here is an example of a simple binary tree:

```text
        A
       / \
      B   C
     / \   \
    D   E   F
```

In a **binary search tree**, every value in a node's left subtree is less than the node's value, and every value in its right subtree is greater. Each comparison therefore rules out one whole subtree, so when the tree is reasonably balanced we can find a value in $\mathcal{O}(\log n)$ time on average.

```text
        5
       / \
      3   8
         / \
        7   9
```
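The lookup described above can be sketched as a short loop. This is a minimal illustration, not part of the lab code: `BstNode` and `bstContains` are hypothetical names, and the node is assumed to hold an `int` plus one pointer to each child.

```cpp
// Hypothetical node type used only for this sketch.
struct BstNode {
    int data;
    BstNode* left;
    BstNode* right;
};

// Returns true if target is stored in the tree rooted at node.
// At each step we compare once and then descend into only one subtree.
bool bstContains(const BstNode* node, int target) {
    while (node != nullptr) {
        if (target == node->data) return true;
        // Smaller values live on the left, larger values on the right.
        node = (target < node->data) ? node->left : node->right;
    }
    return false;  // fell off the bottom of the tree without a match
}
```

For the tree drawn above, searching for 7 visits 5, then 8, then 7, and never looks at the 3 or the 9.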

## Representing Trees in C++

Unlike arrays, where elements are stored in contiguous memory locations, trees are typically represented using nodes that contain data and pointers. The data part holds the value that the node stores, and the pointers link to other nodes in the tree. In a binary tree, each node will typically have two pointers, one to each child. This allows the tree to be traversed from the root down to the leaves.

A node can also store a pointer to its parent node, which allows traversal from child to parent. In our Huffman Tree implementation, we will start with child nodes, and we will connect them by creating a common parent node.
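As a rough sketch of that bottom-up idea, the code below joins two existing nodes under a newly created parent. The names `LinkNode`, `makeLeaf`, and `joinUnderParent` are hypothetical, and the choice to pass the parent's value in as a parameter is an assumption; the lab's actual node layout may differ.

```cpp
// Hypothetical node type with child pointers and a parent pointer.
struct LinkNode {
    int data;
    LinkNode* left;
    LinkNode* right;
    LinkNode* parent;
};

// Creates a node with no children and no parent.
LinkNode* makeLeaf(int val) {
    return new LinkNode{val, nullptr, nullptr, nullptr};
}

// Creates a new parent holding parentVal, attaches a as its left child
// and b as its right child, and returns the new parent.
LinkNode* joinUnderParent(LinkNode* a, LinkNode* b, int parentVal) {
    LinkNode* p = new LinkNode{parentVal, a, b, nullptr};
    a->parent = p;  // child-to-parent links allow upward traversal
    b->parent = p;
    return p;
}
```

Calling `joinUnderParent` repeatedly on the roots of smaller trees grows a single tree upward from its leaves. Every node created here comes from `new`, so each one eventually needs a matching `delete`.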

The following example is not a Huffman Tree, but it is a simple binary tree that illustrates how nodes can be connected and traversed.

In [37]:
struct Node {
    int data;
    Node* left;
    Node* right;

    Node(int val) : data(val), left(nullptr), right(nullptr) {}
};

Node* root = new Node(5);
Node* current = root;

current->left = new Node(3);
current->right = new Node(8);

current = root;
printf("Current Node Data: %d\n", current->data);

current = current->left;
printf("Current Node Data: %d\n", current->data);

current = root->right;
printf("Current Node Data: %d\n", current->data);

Current Node Data: 5
Current Node Data: 3
Current Node Data: 8


In [38]:
void printSimple(Node* node, int depth = 0) {
    if (node == nullptr) return;
    printSimple(node->right, depth + 1);
    for (int i = 0; i < depth; ++i) printf("    " );
    printf("%d\n", node->data);
    printSimple(node->left, depth + 1);
}

printSimple(root);

    8
5
    3


In [39]:
// Let's next add a couple children to node 8 and then print the tree again.
root->right->left = new Node(7);
root->right->right = new Node(9);
printSimple(root);

        9
    8
        7
5
    3


In [40]:
// And now we'll set our current pointer to the root,
// and traverse all the way down the right side of the tree.
current = root;
while (current != nullptr) {
    printf("Current Node Data: %d\n", current->data);
    current = current->right;
}

Current Node Data: 5
Current Node Data: 8
Current Node Data: 9


In [None]:
// Now that we're finished with our simple example tree,
// let's delete our dynamically allocated nodes and set
// our pointers to nullptr.

void deleteTree(Node* ptr) {
    if (ptr == nullptr)
        return;

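    // Delete both subtrees first, then the node itself (post-order),
    // so we never follow a pointer stored in a node that is already freed.
    deleteTree(ptr->left);
    deleteTree(ptr->right);
    delete ptr;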
}

deleteTree(root);
root = nullptr;
current = nullptr;

## Huffman Trees

A **Huffman Tree** is a specific type of binary tree used for data compression. It is built based on the frequency of characters in a given dataset. The idea is to assign shorter binary codes to more frequent characters and longer codes to less frequent characters, which helps reduce the overall size of the data when encoded.

In C++, each `char` value corresponds to a character code (an ASCII code for the characters we use here). A `char` occupies 1 byte, which is 8 bits on typical platforms and allows for 256 different values (from 0 to 255 in decimal). If the alphabet of the data we want to encode has fewer than 256 characters, we can use fewer bits per character.

In today's lab, we'll be working with an alphabet of only 6 characters: A, B, C, D, E, and F. With 6 characters, we only need 3 bits to represent each character using a fixed-length encoding scheme, because 2 bits give only $2^2 = 4$ distinct codes while 3 bits give $2^3 = 8$:

```text
A: 000
B: 001
C: 010
D: 011
E: 100
F: 101
```

If the data that uses this alphabet is 128 characters long, using a fixed-length encoding scheme such as this one would require $128 \times 3 = 384$ bits to encode the message.
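The arithmetic above can be checked with a few lines of code. This is a small sketch; `bitsPerSymbol` and `fixedLengthTotalBits` are hypothetical helper names, not functions from the lab.

```cpp
// Smallest number of bits b such that 2^b >= alphabetSize.
int bitsPerSymbol(int alphabetSize) {
    int bits = 0;
    int capacity = 1;  // number of distinct codewords available with `bits` bits
    while (capacity < alphabetSize) {
        capacity *= 2;
        ++bits;
    }
    return bits;
}

// Total bits needed to encode messageLength symbols at a fixed width.
int fixedLengthTotalBits(int alphabetSize, int messageLength) {
    return bitsPerSymbol(alphabetSize) * messageLength;
}
```

Here `bitsPerSymbol(6)` evaluates to 3 and `fixedLengthTotalBits(6, 128)` evaluates to 384, matching the figures above.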

Can we do better? Do we need three bits (a fixed-length scheme) for every character? What if some characters are more common than other characters? Can we give those more frequent characters shorter codes?

How about this? Do you see a problem with this encoding scheme?

```text
A: 10
B: 101
C: 100
D: 11
E: 1
F: 110
```

With the fixed-length scheme, we always knew where one character ended and the next one began. Each codeword was exactly 3 bits long.
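Because every codeword has the same width, a fixed-length decoder can simply read 3 bits at a time. The sketch below assumes the `A: 000` through `F: 101` table shown earlier. `decodeFixed3` is a hypothetical name, and returning an empty string for malformed input is an assumed convention.

```cpp
#include <string>

// Decodes a string of '0'/'1' characters made of 3-bit codewords
// (A = 000, B = 001, ..., F = 101).
// Returns "" if the input is malformed (assumed error convention).
std::string decodeFixed3(const std::string& bits) {
    if (bits.size() % 3 != 0) return "";
    std::string out;
    for (std::size_t i = 0; i < bits.size(); i += 3) {
        int value = 0;
        for (std::size_t j = i; j < i + 3; ++j) {
            if (bits[j] != '0' && bits[j] != '1') return "";
            value = value * 2 + (bits[j] - '0');  // build the 3-bit number
        }
        if (value > 5) return "";  // 110 and 111 are unused codewords
        out += static_cast<char>('A' + value);
    }
    return out;
}
```

For example, `decodeFixed3("000001101")` splits the input into `000`, `001`, `101` and yields `"ABF"`.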

With a variable-length scheme like this one, if we read a `1`, how do we know whether we're reading the letter `E` (codeword `1`), the start of the letter `A` (codeword `10`), or the start of the letter `B` (codeword `101`)? This is the **prefix problem**. We must ensure that no codeword is the first part (or prefix) of another codeword.
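One way to test a proposed scheme for the prefix problem is to compare every pair of codewords. This is a minimal sketch; `isPrefix` and `isPrefixFree` are hypothetical helper names, and the codewords are assumed to be stored as strings of `'0'` and `'1'` characters.

```cpp
#include <string>

// True if a is a prefix of b (a matches the first a.size() characters of b).
bool isPrefix(const std::string& a, const std::string& b) {
    return a.size() <= b.size() && b.compare(0, a.size(), a) == 0;
}

// True if no codeword in codes[0..count-1] is a prefix of a different one.
bool isPrefixFree(const std::string codes[], int count) {
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i != j && isPrefix(codes[i], codes[j])) {
                return false;  // codes[i] begins codes[j]: ambiguous decoding
            }
        }
    }
    return true;
}
```

Applied to the variable-length scheme above, `isPrefixFree` returns `false`, since `1` is a prefix of `10`. Applied to the 3-bit fixed-length table, it returns `true`, because distinct codewords of equal length can never be prefixes of one another.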

We can achieve a variable length encoding scheme without the prefix problem by using the Huffman algorithm.

## The Huffman Algorithm

