# About Huffman Trees and Codes
## Divide Pair Conquer
### Due: Monday, 1 March 2021, 11:59 pm

*Matthew Reed*

In collaboration with:
- Bretton Steiner
(I'm pretty sure this is right, but I forgot to write this down)

## Goal

Review Huffman Trees and Codes from DM1 to get ready for your Ponder and Prove assignment.

In [1]:
from math import ceil, log
from collections import Counter

def show_results(message, code_tuples):
  total_characters = len(message)
  total_unique_characters = len(code_tuples)
  total_bits = 0
  for char, count, code in code_tuples:
    total_bits += count * len(code)
  average_bits_per_character = total_bits / total_characters
  fixed_bits_per_character = ceil(log(total_unique_characters, 2))
  total_fixed_bits = total_characters * fixed_bits_per_character
  compression_ratio = (total_fixed_bits - total_bits) / total_fixed_bits
  print(f'          Total Characters: {total_characters}')
  print(f'   Total Unique Characters: {total_unique_characters}')
  print(f'                Total Bits: {total_bits}')
  print(f'Average Bits per Character: {average_bits_per_character:.2f}')
  print(f'  Fixed Bits per Character: {fixed_bits_per_character}')
  print(f'          Total Fixed Bits: {total_fixed_bits}')
  print(f'         Compression Ratio: {compression_ratio:.3f}')

message1 = 'thebookofmormon'
counter1 = Counter(message1)

print(message1, '-->', counter1)

message2 = 'therestoration'

counter2 = Counter(message2)

print(message2, '-->', counter2)

thebookofmormon --> Counter({'o': 5, 'm': 2, 't': 1, 'h': 1, 'e': 1, 'b': 1, 'k': 1, 'f': 1, 'r': 1, 'n': 1})
therestoration --> Counter({'t': 3, 'e': 2, 'r': 2, 'o': 2, 'h': 1, 's': 1, 'a': 1, 'i': 1, 'n': 1})


### Which message has the lower compression ratio?

#### Message 1

Do all the steps, like the examples in the book, first sorting the counted occurrences:

| Char | # |
|------|---|
|   b  | 1 |
|   e  | 1 |
|   f  | 1 |
|   h  | 1 |
|   k  | 1 |
|   n  | 1 |
|   r  | 1 |
|   t  | 1 |
|   m  | 2 |
|   o  | 5 |
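
The sorted table above can be reproduced from the `Counter` with a two-part sort key. This is a minimal sketch; the helper name `sorted_counts` is hypothetical, and breaking ties alphabetically is an assumption read off the table.

```python
from collections import Counter

def sorted_counts(message):
    """Return (char, count) pairs sorted by count, then alphabetically."""
    return sorted(Counter(message).items(), key=lambda item: (item[1], item[0]))

# Print the pairs as rows of the markdown table.
for char, count in sorted_counts('thebookofmormon'):
    print(f'|   {char}  | {count} |')
```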

##### The ever-shrinking queue:

* b1 e1 f1 h1 k1 n1 r1 t1 m2 o5
* f1 h1 k1 n1 r1 t1 m2 be2 o5
* k1 n1 r1 t1 m2 be2 fh2 o5
* r1 t1 m2 be2 fh2 kn2 o5
* m2 be2 fh2 kn2 rt2 o5
* fh2 kn2 rt2 mbe4 o5
* rt2 mbe4 fhkn4 o5
* fhkn4 o5 rtmbe6
* rtmbe6 fhkno9
* rtmbefhkno15
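
The hand-worked queue can be checked with a short simulation. This is only a sketch: the names `shrinking_queue` and `format_queue` are hypothetical, and the tie rule (a merged node is placed after every node of equal or smaller weight) is an assumption inferred from the steps above, not something the book is known to prescribe.

```python
from collections import Counter

def format_queue(queue):
    """Render (weight, label) nodes as a string such as 'm2 be2 o5'."""
    return ' '.join(f'{label}{weight}' for weight, label in queue)

def shrinking_queue(message):
    """Return every state of the queue while the two lightest nodes are merged."""
    # Leaves start sorted by count, then alphabetically.
    queue = sorted((count, char) for char, count in Counter(message).items())
    states = [format_queue(queue)]
    while len(queue) > 1:
        (w1, left), (w2, right) = queue[0], queue[1]
        merged = (w1 + w2, left + right)
        rest = queue[2:]
        # Assumed tie rule: the merged node goes after every node whose
        # weight is less than or equal to its own.
        position = sum(1 for weight, _ in rest if weight <= merged[0])
        queue = rest[:position] + [merged] + rest[position:]
        states.append(format_queue(queue))
    return states

for state in shrinking_queue('thebookofmormon'):
    print('*', state)
```

A different tie rule would give a differently shaped tree, but the total number of encoded bits would stay the same.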

##### The Huffman Tree:

In [2]:
# Raw string: without the r prefix, a backslash at the end of a line joins it to the next line.
print(r'''
       rtmbefhkno15
         /        \
     rtmbe6      fhkno9
     /   \        /    \
  rt2   mbe4   fhkn4   o5
  /\    / \     /   \
r1 t1 m2 be2  fh2   kn2
         / \  / \   / \
       b1 e1 f1 h1 k1 n1
''')

##### The Code Tuples

Read the codes from the tree:

In [3]:
message1_code_tuples = \
[('b', 1, '0110'),
 ('e', 1, '0111'),
 ('f', 1, '1000'),
 ('h', 1, '1001'),
 ('k', 1, '1010'),
 ('m', 2, '010'),
 ('n', 1, '1011'),
 ('o', 5, '11'),
 ('r', 1, '000'),
 ('t', 1, '001'),
]

show_results(message1, message1_code_tuples)

          Total Characters: 15
   Total Unique Characters: 10
                Total Bits: 46
Average Bits per Character: 3.07
  Fixed Bits per Character: 4
          Total Fixed Bits: 60
         Compression Ratio: 0.233
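
Codes read off a tree by hand are easy to get wrong, so two quick sanity checks may help. The sketch below uses the message 1 codes; the names `codes`, `is_prefix_free`, and `kraft_sum` are hypothetical. It checks that no codeword is a prefix of another, and that the Kraft sum equals 1, which holds when every internal node of the tree has exactly two children.

```python
from fractions import Fraction

# Codes for message 1, copied from the code tuples above.
codes = {'b': '0110', 'e': '0111', 'f': '1000', 'h': '1001', 'k': '1010',
         'm': '010', 'n': '1011', 'o': '11', 'r': '000', 't': '001'}

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    ordered = sorted(codewords)
    # After sorting, a prefix is always immediately followed by a word it prefixes.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

# Kraft sum: the sum of 2^(-length) over all codewords, computed exactly.
kraft_sum = sum(Fraction(1, 2 ** len(code)) for code in codes.values())

print(is_prefix_free(codes.values()), kraft_sum)
```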


#### Message 2

Do all the steps, like the examples in the book, first sorting the counted occurrences:

| Char | # |
|------|---|
|   a  | 1 |
|   h  | 1 |
|   i  | 1 |
|   n  | 1 |
|   s  | 1 |
|   e  | 2 |
|   o  | 2 |
|   r  | 2 |
|   t  | 3 |

##### The ever-shrinking queue:

* a1 h1 i1 n1 s1 e2 o2 r2 t3
* i1 n1 s1 e2 o2 r2 ah2 t3
* s1 e2 o2 r2 ah2 in2 t3
* o2 r2 ah2 in2 t3 se3
* ah2 in2 t3 se3 or4
* t3 se3 or4 ahin4
* or4 ahin4 tse6
* tse6 orahin8
* tseorahin14

##### The Huffman Tree:

In [4]:
# Raw string: without the r prefix, a backslash at the end of a line joins it to the next line.
print(r'''
    tseorahin14
    /        \
 tse6     orahin8
  / \      /    \
t3 se3   or4   ahin4
   / \   / \    /   \
  s1 e2 o2 r2 ah2   in2
              / \   / \
             a1 h1 i1 n1
''')

##### The Code Tuples

Read the codes from the tree:

In [5]:
message2_code_tuples = \
[('a', 1, '1100'),
 ('e', 2, '011'),
 ('h', 1, '1101'),
 ('i', 1, '1110'),
 ('n', 1, '1111'),
 ('o', 2, '100'),
 ('r', 2, '101'),
 ('s', 1, '010'),
 ('t', 3, '00'),
]

show_results(message2, message2_code_tuples)

          Total Characters: 14
   Total Unique Characters: 9
                Total Bits: 43
Average Bits per Character: 3.07
  Fixed Bits per Character: 4
          Total Fixed Bits: 56
         Compression Ratio: 0.232
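
To answer the question in the heading, the two ratios can be written out side by side. This takes the ratio as `show_results` computes it: bits saved relative to the fixed-width encoding.

```latex
\[
\text{Message 1: } \frac{60 - 46}{60} = \frac{14}{60} \approx 0.233,
\qquad
\text{Message 2: } \frac{56 - 43}{56} = \frac{13}{56} \approx 0.232
\]
```

On these numbers message 2 has the lower compression ratio, though only by about 0.001.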


### Create Data Tree and Code

More warmup for your Ponder and Prove assignment this week:

Create a Huffman Tree and codes for the gaps between the first few primes (except for the gap of size 1 between 2 and 3). Your goal is to find how many is "few" enough to have a compression ratio **better than 24%**.


### Gaining Understanding / Setting Things Up

In [6]:
from sympy import primerange

list_of_gaps = []
prev = 3
gap = 0
for prime in list(primerange(4, 201)):
    gap = prime - prev
    #print(gap)
    prev = prime
    list_of_gaps.append(gap)

print(list_of_gaps)

[2, 2, 4, 2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2, 6, 4, 6, 8, 4, 2, 4, 2, 4, 14, 4, 6, 2, 10, 2, 6, 6, 4, 6, 6, 2, 10, 2, 4, 2]
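
As a cross-check that does not depend on sympy, the same gaps can be sketched with a small sieve of Eratosthenes. The function name `prime_gaps` is hypothetical and not part of the assignment.

```python
def prime_gaps(limit):
    """Gaps between consecutive odd primes up to and including limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n squared.
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    # Start at 3 so the gap of size 1 between 2 and 3 is skipped.
    odd_primes = [n for n in range(3, limit + 1) if is_prime[n]]
    return [b - a for a, b in zip(odd_primes, odd_primes[1:])]

print(prime_gaps(200))
```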


In [7]:
# https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item
gap_count = [[list_of_gaps.count(x),x] for x in set(list_of_gaps)]
gap_count

[[15, 2], [13, 4], [12, 6], [1, 8], [2, 10], [1, 14]]
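
The `list.count` approach rescans the whole list once per distinct gap. A possible alternative, sketched here with the `Counter` class already used for the messages, counts everything in one pass. The variable name `gap_counter` is an assumption for illustration.

```python
from collections import Counter

# The gaps printed above, repeated so that this sketch runs on its own.
list_of_gaps = [2, 2, 4, 2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2, 6, 4, 6,
                8, 4, 2, 4, 2, 4, 14, 4, 6, 2, 10, 2, 6, 6, 4, 6, 6, 2, 10, 2, 4, 2]

gap_counter = Counter(list_of_gaps)

# Same [count, gap] layout that es_queue expects.
gap_count = [[count, gap] for gap, count in gap_counter.items()]
print(gap_count)
```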

In [8]:
gap_count.sort(key=lambda x: x[0])
gap_count

[[1, 8], [1, 14], [2, 10], [12, 6], [13, 4], [15, 2]]

In [9]:
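def es_queue(l):
  # Each node is [weight, payload]; the payload is a gap (leaf) or [left, right] (internal node).
  l.sort(key=lambda x: x[0])
  if len(l) == 1:
    return l[0]
  else:
    # Merge the two lightest nodes and recurse on the shorter queue.
    return es_queue([[l[0][0] + l[1][0], [l[0],l[1]]]] + l[2:])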

tree = es_queue(gap_count)

In [10]:
def get_tuples(l, path=''):
  # Internal node: [weight, [left, right]]; leaf: [count, gap].
  if isinstance(l[1], list):
    # Append '0' for the left child and '1' for the right child.
    return get_tuples(l[1][0], path + '0') + get_tuples(l[1][1], path + '1')
  # Leaf: emit (gap, count, code), the order show_results expects.
  return [(l[1], l[0], path)]

In [11]:
gap_tuples = get_tuples(tree)
gap_tuples

[(8, 1, '0000'),
 (14, 1, '0001'),
 (10, 2, '001'),
 (6, 12, '01'),
 (4, 13, '10'),
 (2, 15, '11')]

In [12]:
show_results(list_of_gaps, gap_tuples)

          Total Characters: 44
   Total Unique Characters: 6
                Total Bits: 94
Average Bits per Character: 2.14
  Fixed Bits per Character: 3
          Total Fixed Bits: 132
         Compression Ratio: 0.288


In [13]:
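# Build the gap -> code lookup once instead of once per gap.
code_for_gap = {gap: code for gap, count, code in gap_tuples}
str_bits = ''.join(code_for_gap[gap] for gap in list_of_gaps)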
str_bits

'1111101110111001110110111001011101101101100100001011101110000110011100111010110010111001111011'

In [14]:
len(str_bits)

94
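
Because the code is prefix-free, the bit string can be decoded greedily, one bit at a time, and the primes rebuilt from the gaps. The following is a sketch under that assumption; `encode`, `decode`, and `primes_from_gaps` are hypothetical helper names, and the gap list and codes are copied from the outputs above.

```python
list_of_gaps = [2, 2, 4, 2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2, 6, 4, 6,
                8, 4, 2, 4, 2, 4, 14, 4, 6, 2, 10, 2, 6, 6, 4, 6, 6, 2, 10, 2, 4, 2]
gap_codes = {8: '0000', 14: '0001', 10: '001', 6: '01', 4: '10', 2: '11'}

def encode(gaps, codes):
    """Concatenate the codeword of every gap."""
    return ''.join(codes[gap] for gap in gaps)

def decode(bits, codes):
    """Greedy decoding: emit a gap as soon as the buffer matches a codeword."""
    lookup = {code: gap for gap, code in codes.items()}
    gaps, current = [], ''
    for bit in bits:
        current += bit
        if current in lookup:
            gaps.append(lookup[current])
            current = ''
    if current:
        raise ValueError('trailing bits do not form a codeword')
    return gaps

def primes_from_gaps(gaps, start=3):
    """Rebuild the primes by accumulating the gaps from the starting prime."""
    primes = [start]
    for gap in gaps:
        primes.append(primes[-1] + gap)
    return primes

bits = encode(list_of_gaps, gap_codes)
print(len(bits), primes_from_gaps(decode(bits, gap_codes))[-1])
```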

### Get the Number

In [15]:
def test_compression_on_range(max_prime):
  # Generate List of Gaps
  list_of_gaps = []
  prev = 3
  gap = 0
  for prime in list(primerange(4, max_prime + 1)):
      gap = prime - prev
      #print(gap)
      prev = prime
      list_of_gaps.append(gap)

  # Get Tree
  gap_count = [[list_of_gaps.count(x),x] for x in set(list_of_gaps)]
  tree = es_queue(gap_count)
  gap_tuples = get_tuples(tree)

  # Get Ratio
  total_characters = len(list_of_gaps)
  total_unique_characters = len(gap_tuples)
  total_bits = 0
  for char, count, code in gap_tuples:
    total_bits += count * len(code)
  average_bits_per_character = total_bits / total_characters
  fixed_bits_per_character = ceil(log(total_unique_characters, 2))
  total_fixed_bits = total_characters * fixed_bits_per_character
  compression_ratio = (total_fixed_bits - total_bits) / total_fixed_bits

  return compression_ratio
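
The ratio arithmetic in this function repeats what `show_results` does, and it would divide by zero for a bound such as 7, where only one distinct gap exists and the fixed width comes out as 0 bits. One possible way to factor it out is sketched below. The helper name `compression_ratio` and the convention of returning 0.0 in the single-symbol case are assumptions.

```python
from math import ceil, log2

def compression_ratio(code_tuples):
    """Ratio of bits saved versus a fixed-width code, from (symbol, count, code) tuples."""
    total_symbols = sum(count for _, count, _ in code_tuples)
    total_bits = sum(count * len(code) for _, count, code in code_tuples)
    fixed_bits_per_symbol = ceil(log2(len(code_tuples)))
    total_fixed_bits = total_symbols * fixed_bits_per_symbol
    if total_fixed_bits == 0:
        # Assumed convention: a single distinct symbol needs no bits either way.
        return 0.0
    return (total_fixed_bits - total_bits) / total_fixed_bits

# The gap tuples from the output above.
gap_tuples = [(8, 1, '0000'), (14, 1, '0001'), (10, 2, '001'),
              (6, 12, '01'), (4, 13, '10'), (2, 15, '11')]
print(f'{compression_ratio(gap_tuples):.3f}')
```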

In [16]:
for i in list(primerange(8, 201)):
  compression = test_compression_on_range(i)
  if compression >= 0.24:
    print(f'i={i}, compression={compression:.3f}')
    break

i=29, compression=0.250


In [18]:
for i in list(primerange(8, 301)):
  compression = test_compression_on_range(i)
  if compression < 0.24:
    print('---', end='')
  print(f'i={i}, compression={compression:.3f}')

---i=11, compression=0.000
---i=13, compression=0.000
---i=17, compression=0.000
---i=19, compression=0.000
---i=23, compression=0.000
i=29, compression=0.250
i=31, compression=0.278
i=37, compression=0.250
---i=41, compression=0.227
i=43, compression=0.250
---i=47, compression=0.231
---i=53, compression=0.214
---i=59, compression=0.200
---i=61, compression=0.219
---i=67, compression=0.206
---i=71, compression=0.194
---i=73, compression=0.211
---i=79, compression=0.200
---i=83, compression=0.190
---i=89, compression=0.182
---i=97, compression=0.000
---i=101, compression=0.000
---i=103, compression=0.020
---i=107, compression=0.019
---i=109, compression=0.037
---i=113, compression=0.036
i=127, compression=0.322
i=131, compression=0.333
i=137, compression=0.323
i=139, compression=0.323
i=149, compression=0.283
i=151, compression=0.294
i=157, compression=0.286
i=163, compression=0.287
i=167, compression=0.288
i=173, compression=0.289
i=179, compression=0.291
i=181, compression=0.292
i=191

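29 is the smallest prime bound whose gap list reaches a compression ratio of at least 24%. The ratio then falls back below 24% at 41 and, apart from 43, stays below it through 113. **127** is the first bound from which every larger bound listed above keeps the ratio at or above 24%.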
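
Both numbers can be read off the printed ratios mechanically. The sketch below uses only the complete (bound, ratio) pairs from the output above, so it says nothing about bounds beyond 181; the names `ratios`, `first_reaching`, and `stable_from` are hypothetical.

```python
# (bound, ratio) pairs copied from the printed output above.
ratios = [
    (11, 0.000), (13, 0.000), (17, 0.000), (19, 0.000), (23, 0.000),
    (29, 0.250), (31, 0.278), (37, 0.250), (41, 0.227), (43, 0.250),
    (47, 0.231), (53, 0.214), (59, 0.200), (61, 0.219), (67, 0.206),
    (71, 0.194), (73, 0.211), (79, 0.200), (83, 0.190), (89, 0.182),
    (97, 0.000), (101, 0.000), (103, 0.020), (107, 0.019), (109, 0.037),
    (113, 0.036), (127, 0.322), (131, 0.333), (137, 0.323), (139, 0.323),
    (149, 0.283), (151, 0.294), (157, 0.286), (163, 0.287), (167, 0.288),
    (173, 0.289), (179, 0.291), (181, 0.292),
]

def first_reaching(pairs, threshold):
    """First bound whose ratio is at least the threshold."""
    return next(bound for bound, ratio in pairs if ratio >= threshold)

def stable_from(pairs, threshold):
    """First bound after which no listed ratio drops below the threshold."""
    stable = None
    for bound, ratio in pairs:
        if ratio < threshold:
            stable = None      # a later failure resets the candidate
        elif stable is None:
            stable = bound     # first success after the latest failure
    return stable

print(first_reaching(ratios, 0.24), stable_from(ratios, 0.24))
```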