# HW 11
___

In [288]:
from collections import Counter
import math

### Counting Frequencies
An inefficient way to count the letter frequencies in a string is to call `.count()` for each letter of the alphabet.

A more efficient method is to use a `Counter`, which is a subclass of `dict` and tallies every character in a single pass. Documentation: https://docs.python.org/3/library/collections.html#collections.Counter

Example:
```
from collections import Counter
ct = Counter()
ct.update('banana')
ct.update('bun')
ct
```
returns 
```
Counter({'b': 2, 'a': 3, 'n': 3, 'u': 1})
```
which can be used like a dictionary. To list the entries from most frequent to least frequent, call
```
ct.most_common()
```
which will return the list
```
[('a', 3), ('n', 3), ('b', 2), ('u', 1)]
```
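
As a quick check of the efficiency claim above, the sketch below counts letters both ways on a small sample string and confirms that the two approaches agree. The sample text and the names `slow` and `fast` are illustrative assumptions, not part of the assignment.

```python
from collections import Counter
import string

text = 'banana bun'  # illustrative sample, not the assignment's input

# Inefficient: every .count() call scans the whole text again,
# so this makes at least one full pass per letter of the alphabet.
slow = {ch: text.count(ch) for ch in string.ascii_lowercase if text.count(ch)}

# Efficient: a single pass over the text.
fast = Counter(ch for ch in text if ch.isalpha())

print(slow == dict(fast))   # the two tallies should match
```

The difference only matters for long inputs: the `.count()` version rescans the text once per letter, while the `Counter` version reads it once.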

### Flatland

The file `'flatland.txt'` contains the text of the book *Flatland* by Edwin A. Abbott, which is a satire about Victorian England. Its main characters are geometric shapes. **Calculate the frequencies** of the 26 letters of the alphabet in the text using a `Counter`. Save the result in **`flatland_freq`**.

* For space efficiency, **read the file line by line**. For example:
```
with open('flatland.txt') as fp:
    for line in fp:
        ...
```
* Use `.isalpha()` to distinguish letters from non-alphabetic characters.
* Use `.lower()` to convert upper case characters to lower case.
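
The three hints above can be combined as in the following minimal sketch. It uses `io.StringIO` with made-up text as a stand-in for the real file, so it runs without `flatland.txt`; the sample text and the name `freq` are assumptions for illustration only.

```python
import io
from collections import Counter

# Stand-in for open('flatland.txt'); the sample text is made up.
fp = io.StringIO("Flat Land!\nA 2-D world.\n")

freq = Counter()
for line in fp:                     # iterating a file object yields one line at a time
    for ch in line:
        if ch.isalpha():            # skip digits, spaces, and punctuation
            freq[ch.lower()] += 1   # fold case before counting

print(freq)
```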

In [289]:
alphabet = 'abcdefghijklmnopqrstuvwxyz' 

In [290]:
flatland_freq = Counter()
with open('flatland.txt') as fp:
    for line in fp:
        # collect this line's letters, lowercased (non-letters are skipped below)
        string = ''
        for char in line:
            if char.isalpha():
                string += char.lower()
        flatland_freq.update(string)

In [291]:
flatland_freq

Counter({'f': 3694,
         'l': 6402,
         'a': 11460,
         't': 13582,
         'n': 10814,
         'd': 5686,
         'r': 8706,
         'o': 11790,
         'm': 3986,
         'c': 4438,
         'e': 18666,
         'y': 3125,
         'i': 11154,
         's': 9785,
         'w': 2665,
         'b': 2148,
         'g': 2810,
         'h': 7955,
         'u': 4488,
         'p': 2588,
         'q': 193,
         'k': 502,
         'v': 1418,
         'z': 85,
         'x': 379,
         'j': 125})

### Flatland Fixed-Length Encodings
Suppose the letters in the alphabet are represented using fixed-length ternary (base 3) codes. **Calculate the total number of ternary digits needed** to encode the 26 letters in *Flatland* (converting upper case letters to lower case). Store the result in `flatland_digit_ct_fixed`.

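For example, the first 5 letters of the alphabet can be represented as two-digit base 3 numbers: `a=00`, `b=01`, `c=02`, `d=10`, and `e=11`. Then the encoding for the word `aced` would require 8 digits: `00021110`.

For all 26 letters, two digits give only 3² = 9 distinct codes, so three digits (3³ = 27 codes) are required: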

a = 000  
b = 001  
c = 002  
d = 010  
e = 011  
f = 012  
g = 020  
h = 021  
i = 022  
j = 100  
k = 101  
l = 102  
m = 110  
n = 111  
o = 112  
p = 120  
q = 121  
r = 122  
s = 200  
t = 201  
u = 202  
v = 210  
w = 211  
x = 212  
y = 220  
z = 221  
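
The table above can be regenerated programmatically. The sketch below is one possible way to do it; the helper names `fixed_width` and `to_base` are hypothetical. It uses integer arithmetic rather than `math.log`, on the assumption that floating-point rounding could be an issue for exact powers of 3.

```python
def fixed_width(n_symbols, base=3):
    """Smallest number of digits d such that base**d >= n_symbols."""
    digits, capacity = 1, base
    while capacity < n_symbols:
        digits += 1
        capacity *= base
    return digits

def to_base(value, width, base=3):
    """Write value in the given base, left-padded with zeros to width digits."""
    out = ''
    for _ in range(width):
        value, digit = divmod(value, base)
        out = str(digit) + out
    return out

alphabet = 'abcdefghijklmnopqrstuvwxyz'
width = fixed_width(len(alphabet))
codes = {ch: to_base(i, width) for i, ch in enumerate(alphabet)}

# The five-letter example from the text: two digits per letter.
aced = ''.join(to_base('abcde'.index(ch), 2) for ch in 'aced')
```

Because every letter costs the same `width` digits, the total for a text is simply `width` times the number of letters.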

In [292]:
total = 0
for lett_count in flatland_freq.values():
    total += 3*lett_count

total

445932

In [293]:
flatland_digit_ct_fixed = total  # 445932, computed in the previous cell

### Huffman Code

Write a function **`huffman(char_freq)`** that takes a Counter containing `ch: freq` key-value pairs representing letter frequencies, and **returns a dictionary** containing the ternary encodings for the characters. The dictionary keys will be the characters, and the values will be the base 3 encodings in string format. Assume that there are at least 3 characters in `char_freq`.

The algorithm will use a **ternary tree** (instead of a binary tree) composed of `HuffNode`s (defined below) with each node having up to 3 children. The children should be arranged from left to right in order of increasing frequency. (It is not necessary for the function to implement an efficient min-priority queue; it may call `sorted()`.)

**Note**: An optimal encoding can be found if the number of characters is odd. If there is an even number of characters, add a dummy character `'@'` with frequency 0. This will ensure that the root will have 3 children.
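
One way to see why the parity matters: each merge replaces three nodes with one, so the node count drops by two per step and reaches exactly 1 only when it starts odd. The sketch below generalizes this observation to k-ary trees; the helper name `dummies_needed` is hypothetical and not part of the assignment.

```python
def dummies_needed(n_symbols, k=3):
    """Number of zero-frequency dummy symbols to add so that every merge
    combines exactly k nodes.

    Each merge replaces k nodes with 1, removing k - 1 nodes, so the count
    ends at exactly 1 only when (n_symbols - 1) is a multiple of (k - 1).
    Assumes n_symbols >= 1.
    """
    return (-(n_symbols - 1)) % (k - 1)

print(dummies_needed(6))    # the six-character example needs one dummy
print(dummies_needed(26))   # so does the 26-letter alphabet
```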

**Example**: 
```
char_freq = Counter({'a': 45, 'b': 10, 'c': 18, 
                     'd': 48, 'e': 22, 'f': 33})
huffman(char_freq)

```
returns (in some order)
```
{'b': '211', 'f': '22', 'a': '0', 'd': '1', 'c': '212', 'e': '20'}

```

In [294]:
class HuffNode:
    def __init__(self, ch, freq):
        self.char = ch  # set to '' if internal node
        self.freq = freq
        self.parent = None
        self.left = None
        self.middle = None
        self.right = None

In [295]:
def merge(counter, node_dict):
    '''Merge the three lowest-frequency nodes into one internal node.'''
    # most_common() sorts by descending frequency, so the last three
    # entries are the least frequent keys (the very last is the lowest).
    freq_lst = counter.most_common()
    left, mid, right = freq_lst[-1][0], freq_lst[-2][0], freq_lst[-3][0]
    freq = counter[left] + counter[mid] + counter[right]

    # The internal node's key is the concatenation of its children's keys.
    char = left + mid + right

    # Make the new node, link it to its children, and add it to node_dict.
    z = HuffNode(char, freq)
    z.left, z.middle, z.right = node_dict[left], node_dict[mid], node_dict[right]
    for child in (z.left, z.middle, z.right):
        child.parent = z
    node_dict[char] = z

    # Replace the three merged entries in the counter with the new node.
    del counter[left], counter[mid], counter[right]
    counter[char] = freq

Now that the tree is built, we need to find the ternary code for each character by walking from its leaf up to the root.

In [296]:
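def get_code(char, node_dict):
    '''Return the ternary code for char by walking from its leaf to the root.'''
    node = node_dict[char]
    digits = []
    while node.parent is not None:
        # The digit records which child slot this node occupies.
        if node is node.parent.left:
            digits.append('0')
        elif node is node.parent.middle:
            digits.append('1')
        else:
            digits.append('2')
        node = node.parent
    # Digits were collected leaf-to-root, so reverse them.
    return ''.join(reversed(digits))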

In [297]:
def huffman(char_freq):
    count = char_freq.copy()
    
    og_chars = list(count.keys())
    
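    # An even number of symbols needs a zero-frequency dummy so that
    # every merge, including the last one at the root, takes three nodes.
    if len(count) % 2 == 0:
        count['@'] = 0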
        
    nodes = {}
    for char in count:
        nodes[char] = HuffNode(char, count[char])
        
    while len(count) > 1:
        merge(count, nodes)
    
    result = {}
    for char in og_chars:      # each lookup walks leaf-to-root: one step per code digit
        result[char] = get_code(char, nodes)
    
    return result
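
The assignment allows calling `sorted()` instead of maintaining a priority queue. For comparison, here is a sketch of an alternative that uses the standard `heapq` module, so the nodes are not re-sorted on every merge. It is not the solution above: the function name `huffman_heap` is hypothetical, it tracks partial codes in dictionaries instead of building `HuffNode` objects, and it assumes `'@'` is not a real key in the input. When frequencies tie, it may assign different (equally long) codes than `huffman`.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_heap(char_freq):
    """Ternary Huffman codes using a min-heap (sketch, not the notebook's method)."""
    items = list(char_freq.items())
    padded = len(items) % 2 == 0
    if padded:
        items.append(('@', 0))      # dummy symbol, as in the assignment note
    tie = count()                   # unique tie-breaker so dicts are never compared
    # Each heap entry: (frequency, tie-breaker, {char: code built so far})
    heap = [(freq, next(tie), {ch: ''}) for ch, freq in items]
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        # Pop the three lowest-frequency subtrees; the lowest gets digit '0'.
        for digit in '012':
            freq, _, codes = heapq.heappop(heap)
            total += freq
            for ch, code in codes.items():
                merged[ch] = digit + code   # prepend the digit for this level
        heapq.heappush(heap, (total, next(tie), merged))
    codes = heap[0][2]
    if padded:
        del codes['@']              # drop the dummy before returning
    return codes

example = Counter({'a': 45, 'b': 10, 'c': 18,
                   'd': 48, 'e': 22, 'f': 33})
print(huffman_heap(example))
```

Prepending a digit to every code in a subtree keeps the sketch short, but it still does work proportional to the subtree size on each merge; a tree-based version like the one above avoids that.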

In [298]:
char_freq = Counter({'a': 45, 'b': 10, 'c': 18, 
                     'd': 48, 'e': 22, 'f': 33})

In [299]:
huffman(char_freq)

{'a': '0', 'b': '211', 'c': '212', 'd': '1', 'e': '20', 'f': '22'}

In [300]:
char_freq

Counter({'a': 45, 'b': 10, 'c': 18, 'd': 48, 'e': 22, 'f': 33})
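
To illustrate why these variable-length codes are usable, the sketch below encodes a word with the example codes and decodes it again. No code is a prefix of another, so the decoder can emit a character as soon as its buffer matches a complete code. The helper names `encode` and `decode` are hypothetical and not required by the assignment.

```python
# Codes from the six-character example above.
codes = {'a': '0', 'b': '211', 'c': '212', 'd': '1', 'e': '20', 'f': '22'}

def encode(text, codes):
    """Concatenate the code of each character."""
    return ''.join(codes[ch] for ch in text)

def decode(digits, codes):
    """Read digits left to right, emitting a character whenever the
    buffer matches a complete code."""
    lookup = {code: ch for ch, code in codes.items()}
    out, buffer = [], ''
    for d in digits:
        buffer += d
        if buffer in lookup:
            out.append(lookup[buffer])
            buffer = ''
    if buffer:
        raise ValueError('trailing digits do not form a complete code')
    return ''.join(out)

message = encode('faced', codes)
print(message, decode(message, codes))
```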

### Flatland Encodings
Call `huffman(flatland_freq)` and store the result in `flatland_huffman_codes`.

In [301]:
flatland_huffman_codes = huffman(flatland_freq)

In [302]:
flatland_huffman_codes

{'f': '100',
 'l': '122',
 'a': '01',
 't': '11',
 'n': '222',
 'd': '121',
 'r': '220',
 'o': '02',
 'm': '101',
 'c': '102',
 'e': '20',
 'y': '2122',
 'i': '00',
 's': '221',
 'w': '2102',
 'b': '2100',
 'g': '2121',
 'h': '211',
 'u': '120',
 'p': '2101',
 'q': '212010',
 'k': '21200',
 'v': '21202',
 'z': '2120111',
 'x': '212012',
 'j': '2120112'}

**Calculate the number of ternary digits needed** to encode the letters in *Flatland* (converted to lower case) using `flatland_huffman_codes`. Store the result in `flatland_digit_ct_huffman`.

In [305]:
total = 0
for char in flatland_freq:
    total += flatland_freq[char] * len(flatland_huffman_codes[char])
total

399012

In [306]:
flatland_digit_ct_huffman = total
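
The same weighted sum can be wrapped in a small helper and checked on the six-character example, where the numbers are easy to verify by hand. This is a sketch; the name `encoded_length` is hypothetical.

```python
from collections import Counter

def encoded_length(freq, codes):
    """Total digits = sum over characters of (occurrences * code length)."""
    return sum(n * len(codes[ch]) for ch, n in freq.items())

freq = Counter({'a': 45, 'b': 10, 'c': 18, 'd': 48, 'e': 22, 'f': 33})
huff = {'a': '0', 'b': '211', 'c': '212', 'd': '1', 'e': '20', 'f': '22'}

huffman_digits = encoded_length(freq, huff)
fixed_digits = 2 * sum(freq.values())   # six symbols fit in two ternary digits

print(huffman_digits, fixed_digits)
```

For *Flatland* itself, the totals computed in this notebook are 399012 digits with the Huffman codes versus 445932 with the fixed-length codes, a saving of 46920 digits, or roughly 10.5%.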