In [2]:
%matplotlib inline

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import heapq
import os

# Huffman Coding
## Author: Tsvetan Dimitrov


### Abstract
TODO

### Lossy vs Lossless Compression
In this section we will look at lossy and lossless compression and the advantages and disadvantages of each. Neither method is right or wrong; the choice depends on the kind of data being compressed and on how much loss, if any, is acceptable.

#### Lossy Compression
The first type is lossy compression, in which some of the data in the original file is discarded during compression. The process is irreversible: once a file has been converted to a lossy format, the discarded data cannot be recovered, and the more you compress, the more degradation occurs. JPEG is a widely used lossy image format. The biggest benefit of lossy compression is a significantly smaller file size than lossless methods typically achieve, at the cost of some quality. Most tools, plugins, and software let you choose the degree of compression you want to use.

#### Lossless Compression
The other type is lossless compression, which refers to compression without any data or quality loss. All of the original data can be recovered when the file is uncompressed. RAW, BMP, GIF, and PNG all store image data without loss (GIF within its 256-color palette). The big benefit of lossless compression is that you retain the original quality of your data or images and still achieve a smaller file size. This is generally the technique of choice for text or spreadsheet files, where losing words or financial data could pose a problem. In this article we will explore and implement a lossless technique called Huffman Coding.

### Information Entropy
In information theory, the major goal is for one person (a transmitter) to convey some message (over a channel) to another person (the receiver). To do so, the transmitter sends a series of partial messages (possibly just one) that give clues towards the original message. The information content of one of these partial messages is a measure of how much uncertainty it resolves for the receiver. Let us try a simple experiment. We will assume that the weather is equally likely to be in any one of 4 possible states at any given moment, so each state occurs with probability 1/4.

|             | Sunny | Cloudy | Rainy | Foggy |
|-------------|-------|--------|-------|-------|
| probability |  1/4  |  1/4   |  1/4  |  1/4  |

Now the question is what is the minimal number of bits we can use to store each of these states based on their probability? Our probability can be expressed as follows: 
$$\frac{1}{4} = \frac{1}{2^2} = 2^{-2}$$

$$\text{minimum number of bits} = -\log_2 (p) = -\log_2 \frac{1}{4} = 2$$
So we can now add codes to our table for each weather state:

|             | Sunny | Cloudy | Rainy | Foggy |
|-------------|-------|--------|-------|-------|
| probability |  1/4  |  1/4   |  1/4  |  1/4  |
| code        |  00   |  01    |   10  |   11  |

Each code has to uniquely identify the data that it is encoding, otherwise we will not be able to decode it correctly. For codes of different lengths this also means that no code may be the beginning of another code. Let us now try to change the probabilities and calculate their corresponding codes:

|             | Sunny | Cloudy | Rainy | Foggy |
|-------------|-------|--------|-------|-------|
| probability |  1/2  |  1/4   |  1/8  |  1/8  |
| code        |  0    |  10    |  110  |  111  |
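The code lengths in the table can be checked against $-\log_2 (p)$ with a few lines of Python. This is only a sketch: the names `probabilities`, `codes`, and `is_prefix_free` are illustrative and simply mirror the table, they are not part of the implementation later in this notebook.

```python
import math

# Probabilities and codes copied from the table above (names are illustrative).
probabilities = {"Sunny": 1/2, "Cloudy": 1/4, "Rainy": 1/8, "Foggy": 1/8}
codes = {"Sunny": "0", "Cloudy": "10", "Rainy": "110", "Foggy": "111"}

# Ideal code length in bits for each state: -log2(p).
ideal_bits = {state: -math.log2(p) for state, p in probabilities.items()}

for state in probabilities:
    print(state, ideal_bits[state], len(codes[state]))


def is_prefix_free(code_words):
    """Return True if no code word is the beginning of a different code word."""
    words = list(code_words)
    return not any(a != b and b.startswith(a) for a in words for b in words)


print(is_prefix_free(codes.values()))  # True
```

For these probabilities, which are all powers of 1/2, each code length equals the ideal length exactly.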


Our experiment shows that the less probable a state is, the more bits its code needs: a message that occurs with probability $p$ carries $-\log_2 (p)$ bits of information. A partial message that cuts the number of possibilities in half transmits one bit of information about the message. In essence, the "information content" can be viewed as how much useful information the message actually contains. The entropy, in this context, is the expected number of bits of information contained in each message, taken over all possibilities for the transmitted message. In information theory entropy is denoted with $H$ and, for a source $X$ whose states occur with probabilities $p_i$, has the following definition:
$$ H(X) = -\sum_{i} p_i \log_2 (p_i)$$
This gives us the probability-weighted average number of bits per partial message or state.
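As a quick check, entropy can be computed for the two weather distributions used earlier by summing $-p \log_2 (p)$ over all states. The helper name `entropy` is an assumption made for this sketch.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [1/4, 1/4, 1/4, 1/4]   # first weather table
skewed = [1/2, 1/4, 1/8, 1/8]    # second weather table

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # 1.75
```

The skewed distribution has lower entropy, which is why its states can be stored in fewer bits on average than the 2 bits per state of the uniform case.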

### Algorithm
Huffman coding is a way to encode information using variable-length bit strings to represent symbols, depending on how frequently they appear. The idea is that symbols that are used more frequently should get shorter codes, while symbols that appear more rarely can get longer ones. This way, the number of bits it takes to encode a given message will be smaller, on average, than if a fixed-length code were used. For a message that consists mostly of the symbols with the longest codes, however, the string produced by variable-length encoding may be longer than one produced by a fixed-length encoding.
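The trade-off can be illustrated with the two weather code tables from the previous section. In this sketch the one-letter symbols `S`, `C`, `R`, `F` and the two sample messages are made up for illustration.

```python
# Fixed-length and variable-length codes from the weather tables
# (S = Sunny, C = Cloudy, R = Rainy, F = Foggy; abbreviations are illustrative).
fixed_codes = {"S": "00", "C": "01", "R": "10", "F": "11"}
variable_codes = {"S": "0", "C": "10", "R": "110", "F": "111"}


def encoded_length(message, codes):
    """Total number of bits needed to encode the message with the given codes."""
    return sum(len(codes[symbol]) for symbol in message)


mostly_sunny = "SSSSSSCR"  # dominated by the symbol with the shortest code
mostly_foggy = "FFFFFFRS"  # dominated by a symbol with the longest code

print(encoded_length(mostly_sunny, fixed_codes))     # 16
print(encoded_length(mostly_sunny, variable_codes))  # 11
print(encoded_length(mostly_foggy, fixed_codes))     # 16
print(encoded_length(mostly_foggy, variable_codes))  # 22
```

The variable-length code wins when the message matches the frequencies the code was designed for and loses when it does not, which is why Huffman coding derives the codes from the symbol frequencies of the text itself.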



In [6]:
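class HeapNode:
    """A Huffman tree node; leaves hold a character, merged nodes hold None."""
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    # heapq compares items with "<". Python 3 ignores __cmp__, so we define
    # __lt__ to order nodes by frequency (lowest frequency first).
    def __lt__(self, other):
        if not isinstance(other, HeapNode):
            return NotImplemented
        return self.freq < other.freq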

In [None]:
class HuffmanTree:
    def __init__(self):
        self.heap = []
        self.frequency_dict = {}

    def __make_frequency_dict(self, text):
        # Count how many times each character occurs in the text.
        for char in text:
            if char not in self.frequency_dict:
                self.frequency_dict[char] = 0
            self.frequency_dict[char] += 1

    def __make_heap(self):
        # Push one leaf node per distinct character onto the min-heap.
        for key in self.frequency_dict:
            node = HeapNode(key, self.frequency_dict[key])
            heapq.heappush(self.heap, node)

    def __merge_nodes(self):
        # Repeatedly merge the two least frequent nodes until one root remains.
        while len(self.heap) > 1:
            left_node = heapq.heappop(self.heap)
            right_node = heapq.heappop(self.heap)

            merged_node = HeapNode(None, left_node.freq + right_node.freq)
            merged_node.left = left_node
            merged_node.right = right_node

            heapq.heappush(self.heap, merged_node)

    def construct_tree(self, text):
        self.__make_frequency_dict(text)
        self.__make_heap()
        self.__merge_nodes()

        # The heap now holds a single node: the root of the Huffman tree.
        return self.heap

In [None]:
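class HuffmanCodeMap:
    def __init__(self, tree):
        # tree is the heap returned by HuffmanTree.construct_tree
        self.tree = tree
        self.code_map = {}
        self.reverse_code_map = {}

    def __make_codes(self, root, current_code):
        if root is None:
            return

        if root.char is not None:
            # Leaf node: record its code. A text with a single distinct
            # character has a leaf as root, so fall back to the code '0'.
            code = current_code or '0'
            self.code_map[root.char] = code
            self.reverse_code_map[code] = root.char
            return
        # Internal node: append 0 for the left branch and 1 for the right.
        self.__make_codes(root.left, f'{current_code}0')
        self.__make_codes(root.right, f'{current_code}1')

    def construct_code_maps(self):
        # Note: heappop removes the root from the heap that was passed in.
        root = heapq.heappop(self.tree)
        current_code = ''
        self.__make_codes(root, current_code)
        return {
            'code_map': self.code_map,
            'reverse_code_map': self.reverse_code_map
        }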
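The two maps are what encoding and decoding need. The following is a minimal sketch of how they might be used. The functions `encode` and `decode` are hypothetical helpers, not part of the classes in this notebook, and the maps are hard-coded here so the example runs on its own.

```python
# Hand-written maps in the same shape as 'code_map' and 'reverse_code_map'
# (values chosen for illustration, not produced by the classes above).
code_map = {"a": "0", "b": "10", "c": "110", "d": "111"}
reverse_code_map = {code: char for char, code in code_map.items()}


def encode(text, code_map):
    """Replace every character with its code and join the codes into one bit string."""
    return "".join(code_map[char] for char in text)


def decode(bits, reverse_code_map):
    """Read bits left to right, emitting a character whenever a full code is matched."""
    decoded = []
    current = ""
    for bit in bits:
        current += bit
        if current in reverse_code_map:
            decoded.append(reverse_code_map[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(decoded)


encoded = encode("abacad", code_map)
print(encoded)                            # 01001100111
print(decode(encoded, reverse_code_map))  # abacad
```

Decoding bit by bit works only because no code is the beginning of another code, so the first match is always the right one.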

### Conclusion

### Future Work

### References