## Overview - Data Compression

In general, a data compression algorithm reduces the amount of memory (bits) required to represent a message (data). The compressed data, in turn, helps to reduce the transmission time from a sender to receiver. The sender encodes the data, and the receiver decodes the encoded data. As part of this problem, you have to implement the logic for both encoding and decoding.

A data compression algorithm can be either **lossy** or **lossless**: compressing the data either loses some information (lossy) or loses none (lossless). **Huffman coding** is a *lossless* data compression algorithm. Let us understand its two phases, encoding and decoding, with the help of an example.

### A. Huffman Encoding
Assume that we have a string message `AAAAAAABBBCCCCCCCDDEEEEEE` comprising 25 characters to be encoded. The characters of the message need not be sorted. Encoding has two phases: building the Huffman tree (a binary tree), and generating the encoded data. The following steps illustrate the Huffman encoding:

#### $\color{blue}{\text{Phase I - Build the Huffman Tree}}$
A Huffman tree is built in a bottom-up approach.

1. First, determine the frequency of each character in the message. In our example, the following table presents the frequency of each character.

    | (Unique) Character  | Frequency |
    | --- | --- |    
    | A | 7 |
    | B | 3 |
    | C | 7 |
    | D | 2 |
    | E | 6 |                     
                            
2. Each row in the table above can be represented as a *node* having a character, frequency, left child, and right child. In the next step, we will repeatedly need to pop the node with the lowest frequency. Therefore, build a *list* of nodes and sort it from lowest to highest frequency. Remember that a *list* preserves the order in which its elements are appended.

We need our list to work as a [priority queue](https://en.wikipedia.org/wiki/Priority_queue), where a node with a lower frequency has a higher priority to be popped. The following snapshot will help you visualize the example considered above:

<img src="images/huffmancoding1.png">

*Can you come up with other data structures to create a priority queue? How about using a min-heap instead of a list? You are free to choose either one.*
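
Steps 1 and 2 can be sketched with the standard library, using `collections.Counter` for the frequencies and `heapq` for the min-heap. This is only a minimal sketch, not the required solution; the `(frequency, character)` tuple layout is an assumption made for illustration.

```python
import heapq
from collections import Counter

message = "AAAAAAABBBCCCCCCCDDEEEEEE"

# Step 1: count how often each character occurs.
frequencies = Counter(message)

# Step 2: build a min-heap of (frequency, character) tuples.
# Tuples compare element by element, so the lowest frequency is always
# popped first (ties are broken by the character itself).
queue = [(freq, char) for char, freq in frequencies.items()]
heapq.heapify(queue)

lowest = heapq.heappop(queue)
print(lowest)  # (2, 'D')
```

Popping from the heap returns `D`, the character with the lowest frequency in the table above.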

3. Pop the two nodes with the lowest frequencies from the priority queue created in the step above.

4. Create a new node with a frequency equal to the sum of the frequencies of the two nodes picked in the step above. This new node becomes an internal node in the Huffman tree, and the two nodes become its children. The lower-frequency node becomes the left child, and the higher-frequency node becomes the right child. Reinsert the newly created node into the priority queue.

**Do you think that this reinsertion requires sorting the priority queue again?** If yes, then a min-heap could be a better choice: inserting into a binary min-heap takes O(log n) time, whereas re-sorting a list after every insertion takes O(n log n).
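
To make the trade-off concrete, here is a small sketch of one merge step (steps 3 and 4) done both ways. The frequencies are the ones from the example; representing nodes as plain integers is a simplification made only for illustration.

```python
import bisect
import heapq

frequencies = [2, 3, 6, 7, 7]

# Option 1: keep a sorted list. bisect.insort finds the slot with a binary
# search, but shifting the later elements makes each insertion O(n), and
# pop(0) is O(n) as well.
sorted_queue = sorted(frequencies)
first, second = sorted_queue.pop(0), sorted_queue.pop(0)
bisect.insort(sorted_queue, first + second)

# Option 2: keep a binary min-heap. Both pop and push are O(log n).
heap_queue = list(frequencies)
heapq.heapify(heap_queue)
merged = heapq.heappop(heap_queue) + heapq.heappop(heap_queue)
heapq.heappush(heap_queue, merged)

print(sorted_queue)   # [5, 6, 7, 7]
print(heap_queue[0])  # 5
```

Either way the merged node with frequency 5 ends up at the front of the queue; only the cost of getting it there differs.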

5. Repeat steps #3 and #4 until there is a single element left in the priority queue. The snapshots below present the building of a Huffman tree.

<img src="images/huffman-tree-1.png">

6. For each node in the Huffman tree, assign the bit `0` to its left child and the bit `1` to its right child. See the final Huffman tree for our example:

<img src="images/huffman-tree-3.png">
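
Phase I as a whole can be sketched in a few lines with `heapq`. This is one possible approach under stated assumptions, not the reference solution: the tree is stored as nested `(left, right)` tuples with characters as leaves, the helper name `build_tree` is hypothetical, and the message is assumed to be non-empty. A running counter breaks ties between equal frequencies so that the heap never has to compare two subtrees.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree(message):
    """Return the Huffman tree as nested (left, right) tuples; leaves are characters."""
    tie_breaker = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tie_breaker), char)
            for char, freq in Counter(message).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        low_freq, _, low = heapq.heappop(heap)    # becomes the left child (bit 0)
        high_freq, _, high = heapq.heappop(heap)  # becomes the right child (bit 1)
        heapq.heappush(heap, (low_freq + high_freq, next(tie_breaker), (low, high)))
    return heap[0][2]

tree = build_tree("AAAAAAABBBCCCCCCCDDEEEEEE")
print(tree)  # ((('D', 'B'), 'E'), ('A', 'C'))
```

In this tuple form, position 0 of every pair is the left child and position 1 is the right child, so the bit assignment of step 6 is implicit in the structure.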

#### $\color{blue}{\text{Phase II - Generate the Encoded Data}}$

7. Based on the Huffman tree, generate a unique binary code for each character of our string message. To do so, traverse the path from the root to each leaf node, collecting the bits along the way.

   | (Unique) Character  | Frequency | Huffman Code |
   | --- | --- | --- |   
   | D | 2 | 000 |
   | B | 3 | 001 |
   | E | 6 | 01  | 
   | A | 7 | 10  |
   | C | 7 | 11  | 
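
The root-to-leaf traversal of step 7 can be sketched recursively. The example tree is written out by hand as nested `(left, right)` tuples, which is an assumed representation for illustration only; `assign_codes` is a hypothetical helper name.

```python
def assign_codes(node, prefix="", codes=None):
    """Walk the tree, appending "0" when going left and "1" when going right."""
    if codes is None:
        codes = {}
    if isinstance(node, str):   # leaf: the path walked so far is the code
        codes[node] = prefix
    else:
        left, right = node
        assign_codes(left, prefix + "0", codes)
        assign_codes(right, prefix + "1", codes)
    return codes

# Tuple form of the example tree: (left, right), leaves are characters.
example_tree = ((("D", "B"), "E"), ("A", "C"))
codes = assign_codes(example_tree)
print(codes)  # {'D': '000', 'B': '001', 'E': '01', 'A': '10', 'C': '11'}
```

The resulting codes agree with the table above.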

$\textbf{\textit{Points to notice}}$

- *Notice that the whole code for any character is not a prefix of any other code. Hence, the Huffman code is called a [Prefix code](https://en.wikipedia.org/wiki/Prefix_code).*
- *Notice that the binary code is shorter for the more frequent character, and vice-versa.*
- *The Huffman code is generated in such a way that the entire string message requires much less memory in binary form: 55 bits for our example, compared with 200 bits at 8 bits per character.*
- *Notice that each node present in the original priority queue has become a leaf node in the final Huffman tree.*
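
The first and third points can be checked mechanically. The sketch below copies the codes and frequencies from the tables above; the helper `is_prefix_free` is a hypothetical name, and the 8-bits-per-character baseline is an assumption about the uncompressed representation.

```python
codes = {"D": "000", "B": "001", "E": "01", "A": "10", "C": "11"}
frequencies = {"A": 7, "B": 3, "C": 7, "D": 2, "E": 6}

def is_prefix_free(code_table):
    """Return True if no code is a prefix of a different code."""
    words = list(code_table.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Total encoded length: each character contributes frequency * code length.
encoded_bits = sum(frequencies[ch] * len(codes[ch]) for ch in codes)
fixed_width_bits = 8 * sum(frequencies.values())  # assuming 8 bits per character

print(is_prefix_free(codes))           # True
print(encoded_bits, fixed_width_bits)  # 55 200
```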

This way, our encoded data would be<br>
`1010101010101000100100111111111111111000000010101010101`
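
Generating the encoded data is then a lookup per character, in the order the characters appear in the message. A minimal sketch, with the code table copied from the table above:

```python
codes = {"D": "000", "B": "001", "E": "01", "A": "10", "C": "11"}
message = "AAAAAAABBBCCCCCCCDDEEEEEE"

# Replace every character with its code and concatenate the results.
encoded = "".join(codes[ch] for ch in message)

print(len(encoded))  # 55
```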

### B. Huffman Decoding
Once we have the encoded data and a pointer to the root of the Huffman tree, decoding becomes a piece of cake. We can do it with the following steps:

1. Declare an empty decoded string.
2. Pick a bit from the encoded data, traversing from left to right.
3. Start traversing the Huffman tree from the root.
    - If the current bit of the encoded data is `0`, move to the left child; if it is `1`, move to the right child.
    - If a leaf node is encountered, append the character of the leaf node to the decoded string, and return to the root for the next bit.

4. Repeat steps #2 and #3 until the whole encoded data is traversed.
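
The decoding steps can be sketched as follows. As an assumption made for illustration, the tree is written as nested `(left, right)` tuples with characters as leaves, and it is assumed to have at least two leaves; `decode` is a hypothetical helper name.

```python
def decode(bits, tree):
    """Decode a bit string by walking the tree; assumes at least two leaves."""
    decoded = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 = left, 1 = right
        if isinstance(node, str):   # reached a leaf
            decoded.append(node)
            node = tree             # restart from the root
    return "".join(decoded)

example_tree = ((("D", "B"), "E"), ("A", "C"))
print(decode("1000111", example_tree))  # ABC
```

Because no code is a prefix of another, the decoder never needs a separator between codes: reaching a leaf is an unambiguous signal that one character has ended.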



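class Node:
    """A Huffman tree node.

    `key` is the character (for an internal node, the frequency sum as a
    string) and `value` is the frequency.
    """

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.left = None   # left child
        self.right = None  # right child
        self.bit = None    # 0 if this node is a left child, 1 if a right child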
        
    def set_value(self,value):
        self.value = value
        
    def get_value(self):
        return self.value
        
    def set_left_child(self,left):
        self.left = left
        
    def set_right_child(self, right):
        self.right = right
        
    def get_left_child(self):
        return self.left
    
    def get_right_child(self):
        return self.right

    def has_left_child(self):
        return self.left is not None
    
    def has_right_child(self):
        return self.right is not None

    def set_bit(self,value):
        self.bit = value

    def get_bit(self):
        return self.bit

class Minheap:
    """Priority queue of Node objects ordered by ascending value (frequency).

    Note: despite the name, the nodes are kept in a sorted list rather than
    in a binary heap, so push and pop each take O(n) time.
    """

    def __init__(self):
        self.arr = []

    def find_min(self):
        """Return the lowest-frequency node without removing it (peek)."""
        return self.arr[0]

    def push(self, key=None, value=None, node=None):
        """
        Accepts either a key-value pair,
        or a Node object.
        """
        if key is not None and value is not None:
            new_node = Node(key, value)
        else:
            new_node = node

        # Append at the end, then shift the new node towards the front
        # until the list is sorted again (one insertion-sort pass).
        self.arr.append(new_node)
        for i in range(len(self.arr) - 2, -1, -1):
            if new_node.value < self.arr[i].value:
                self.arr[i + 1] = self.arr[i]
                self.arr[i] = new_node
            else:
                break  # everything further left is already <= new_node

    
    def pop(self):
        #remember the removal is for the minimum value i.e root in this case.
        return self.arr.pop(0)

    # return number of elements in heap
    def size(self):
        return len(self.arr)

    # return True if the queue is empty, False otherwise
    def is_empty(self):
        return len(self.arr) == 0



import sys

def det_frequency(message):
    """Return a dict mapping each character of `message` to its number of occurrences."""
    freq_map = {}  # character -> count

    for char in message:
        if char not in freq_map:  # first occurrence: initialize the count
            freq_map[char] = 1
        else:
            freq_map[char] += 1  # later occurrences: increment the count
    return freq_map

def generate_code(tree):
    """Return a dict mapping each character to its Huffman code string."""
    encoded_dict = {}  # maps character -> binary code string
    generate_code_recursively(tree, encoded_dict, "")
    return encoded_dict

def generate_code_recursively(huffmantree, code_dict, str_code):
    node = huffmantree
    # base case: a leaf holds a character; the path walked so far is its code
    if not node.has_left_child() and not node.has_right_child():
        code_dict[node.key] = str_code
        return
    # recursive case: every internal node of a Huffman tree has two children;
    # extend the path with each child's bit (0 = left, 1 = right)
    generate_code_recursively(node.left, code_dict, str_code + str(node.left.get_bit()))
    generate_code_recursively(node.right, code_dict, str_code + str(node.right.get_bit()))

def huffman_encoding(data):
    """Return (encoded bit string, root of the Huffman tree) for `data`."""
    # Edge case: nothing to encode, so there is no tree either
    if not data:
        return "", None

    # Determine the frequency of each character in the message
    freq_dict = det_frequency(data)

    # Build a priority queue of leaf nodes from the frequency dictionary
    min_heap = Minheap()
    for key, value in freq_dict.items():
        min_heap.push(key, value)

    # Edge case: with a single distinct character the loop below would never
    # run and the lone leaf would receive an empty code. Adding a
    # zero-frequency placeholder leaf gives that character a one-bit code.
    if min_heap.size() == 1:
        min_heap.push(node=Node("", 0))

    # Build the Huffman tree: merge the two lowest-frequency nodes until one remains
    while min_heap.size() != 1:
        first_node = min_heap.pop()
        first_node.set_bit(0)   # lower frequency -> left child, bit 0
        second_node = min_heap.pop()
        second_node.set_bit(1)  # higher frequency -> right child, bit 1
        # the new internal node carries the sum of the two frequencies
        sum_nodes = first_node.value + second_node.value
        internal_node = Node(str(sum_nodes), sum_nodes)
        internal_node.left = first_node
        internal_node.right = second_node
        # reinsert the merged node into the priority queue
        min_heap.push(node=internal_node)

    tree = min_heap.find_min()  # the single remaining node is the root

    # Map each character to its code, then encode the message in the order
    # in which its characters appear
    encoded_dict = generate_code(tree)
    encoded_data = "".join(encoded_dict[char] for char in data)

    return encoded_data, tree

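def huffman_decoding(data, tree):
    """Decode the bit string `data` by walking the Huffman tree from its root."""
    decoded_message = ""
    node = tree
    for bit in data:
        # follow the edge labelled with the current bit
        if bit == '1':
            node = node.right
        elif bit == '0':
            node = node.left

        # a leaf holds a character: emit it and restart from the root
        if not node.has_left_child() and not node.has_right_child():
            decoded_message += node.key
            node = tree

    return decoded_message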

if __name__ == "__main__":
    codes = {}

    a_great_sentence = "The bird is the word"

    print ("The size of the data is: {}\n".format(sys.getsizeof(a_great_sentence)))
    print ("The content of the data is: {}\n".format(a_great_sentence))

    encoded_data, tree = huffman_encoding(a_great_sentence)

    print ("The size of the encoded data is: {}\n".format(sys.getsizeof(int(encoded_data, base=2))))
    print ("The content of the encoded data is: {}\n".format(encoded_data))

    decoded_data = huffman_decoding(encoded_data, tree)

    print ("The size of the decoded data is: {}\n".format(sys.getsizeof(decoded_data)))
    print ("The content of the decoded data is: {}\n".format(decoded_data))

The size of the data is: 69

The content of the data is: The bird is the word

The size of the encoded data is: 36

The content of the encoded data is: 0110111011111100111000001010110000100011010011110111111010101011001010

The size of the decoded data is: 69

The content of the decoded data is: The bird is the word
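
The sizes printed above come from `sys.getsizeof`, which reports the size of a Python object including interpreter overhead rather than the number of payload bits. To see how many bytes the encoded bits would actually occupy, the bit string can be packed into a `bytes` object. The sketch below is one way to do this and is not part of the original solution: `pack_bits` and `unpack_bits` are hypothetical helper names, and the bit string is assumed to be non-empty. The bit count is stored alongside the payload because leading zeros are lost when the string is converted to an integer.

```python
def pack_bits(bits):
    """Pack a non-empty string of '0'/'1' characters into bytes; also return the bit count."""
    n_bytes = (len(bits) + 7) // 8  # round up to whole bytes
    return int(bits, 2).to_bytes(n_bytes, "big"), len(bits)

def unpack_bits(payload, n_bits):
    """Inverse of pack_bits: restore the bit string, including leading zeros."""
    return format(int.from_bytes(payload, "big"), "b").zfill(n_bits)

bits = "000" + "01" + "10"  # codes for D, E, A in the earlier example
payload, n_bits = pack_bits(bits)

print(payload, n_bits)               # b'\x06' 7
print(unpack_bits(payload, n_bits))  # 0000110
```

Packed this way, the 55-bit encoding of the 25-character example message fits in 7 bytes.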

