What are all the ways to translate English into Binary? <br>What is the best way? <br>What would be a good measure of "best-ness"?
-----

(Brian came up with 4 ways)

1-hot encoding
------

A 00000000000000000000000001  
B 00000000000000000000000010  
C 00000000000000000000000100  
…  
Z 10000000000000000000000000  
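
The scheme above can be sketched in a few lines. This is a minimal sketch, assuming a true 1-hot code with 26 positions, so that every letter (including A) sets exactly one bit; the function name `one_hot` is made up for illustration.

```python
from string import ascii_uppercase

def one_hot(letter):
    """Return a 26-character bit string with a single '1' marking `letter`."""
    index = ascii_uppercase.index(letter.upper())
    bits = ["0"] * len(ascii_uppercase)
    bits[-1 - index] = "1"  # A sets the rightmost bit, Z the leftmost
    return "".join(bits)

for letter in "ABCZ":
    print(letter, one_hot(letter))
```

Every code word has the same length and contains exactly one `1`, which makes the encoding easy to read but very wasteful.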

<center><img src="http://littlebinsforlittlehands.com/wp-content/uploads/2015/12/ASCII-Binary-Alphabet-Coding-for-Kids-Activity-4-680x880.jpg" height="500"/></center>

In [1]:
# Convert ASCII to binary
bin(int.from_bytes('a'.encode(), 'big'))

'0b1100001'
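
`bin` drops leading zeros, so the output above has 7 digits rather than a full byte. For a whole string it is usually easier to pad every character to 8 digits. A minimal sketch, assuming plain ASCII input; the helper names `to_binary` and `from_binary` are made up for illustration.

```python
def to_binary(text):
    """Encode `text` as ASCII and render each byte as 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_binary(bits):
    """Invert `to_binary`: parse space-separated 8-digit groups back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")

encoded = to_binary("hi")
print(encoded)
print(from_binary(encoded))
```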

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/International_Morse_Code.svg/1200px-International_Morse_Code.svg.png" height="500"/></center>

What is the efficiency?
----

- 1-hot
- ASCII
- Morse
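
One rough way to compare these is to count bits per letter for a 26-letter alphabet. The sketch below makes that assumption and covers only the fixed-length schemes. Morse is left out because its code words vary in length and rely on pauses between letters, so it is not a pure two-symbol code.

```python
from math import ceil, log2
from string import ascii_lowercase

n_symbols = len(ascii_lowercase)

bits_per_letter = {
    "1-hot": n_symbols,                              # one position per letter
    "ASCII (7-bit)": 7,                              # standard ASCII code width
    "shortest fixed-length": ceil(log2(n_symbols)),  # fewest bits that can label 26 letters
}

for name, bits in bits_per_letter.items():
    print(f"{name}\t{bits}")
```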

<br>
__Is there a more efficient encoding of letters into binary?__

Huffman encoding
------

<center><img src="http://lh5.ggpht.com/-SXfsAFiJTJM/UKk4yVwewoI/AAAAAAAAB9I/AAlY_q1bRk4/clip_image007_thumb%25255B1%25255D.gif?imgmax=800" height="500"/></center>

[Huffman encoding animation](https://people.ok.ubc.ca/ylucet/DS/Huffman.html)

What data structures do we need to implement a Huffman encoding?

A Counter (to tally symbol frequencies) and a binary tree built with a Priority Queue (a min-heap)
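
A minimal sketch of how those two structures work together, using the standard library's `Counter` and `heapq` on a toy message. Huffman's algorithm repeatedly needs the two lightest items, which is exactly what a min-heap hands back first.

```python
from collections import Counter
from heapq import heapify, heappop

# Tally how often each symbol occurs
counts = Counter("abcc")

# A list of (weight, symbol) tuples becomes a min-heap after heapify
heap = [(weight, symbol) for symbol, weight in counts.items()]
heapify(heap)

# The two lightest entries come out first; Huffman merges them into one subtree
first = heappop(heap)
second = heappop(heap)
print(first, second)
```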

In [33]:
reset -fs

In [2]:
# HT: https://rosettacode.org/wiki/Huffman_coding#Python
from heapq import heappush, heappop, heapify

def encode(symbol_frequency_dict):
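    "Huffman encode the given dictionary mapping symbols to weights."
    # Each heap entry is [total_weight, [symbol, code], [symbol, code], ...]
    heap = [[weight, [symbol, ""]] for symbol, weight in symbol_frequency_dict.items()]
    heapify(heap)
    while len(heap) > 1:
        # Pop the two lightest subtrees
        low = heappop(heap)
        high = heappop(heap)
        # Prefix every code in the lighter subtree with 0, in the heavier with 1
        for pair in low[1:]:
            pair[1] = '0' + pair[1]
        for pair in high[1:]:
            pair[1] = '1' + pair[1]
        # Push the merged subtree back with the combined weight
        heappush(heap, [low[0] + high[0]] + low[1:] + high[1:])
    # Sort the [symbol, code] pairs by code length, then alphabetically
    return sorted(heappop(heap)[1:], key=lambda p: (len(p[-1]), p))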

In [6]:
from collections import Counter
from string import ascii_lowercase as letters 
    
# message = "ab"
# message = "abc"
# message = "abcc"
message = letters
# message = open('/Users/brian/Desktop/shakespeare_all.txt', encoding='utf-8').read().lower().replace("\n", "")
symbol_frequency_dict = Counter(message)

huffman = encode(symbol_frequency_dict)

print("Symbol\tWeight\tHuffman Code")
for symbol, encoding in huffman:
    print(f"{symbol}\t{symbol_frequency_dict[symbol]}\t{encoding}")

Symbol	Weight	Huffman Code
u	1	0000
v	1	0001
w	1	0010
x	1	0011
y	1	0100
z	1	0101
a	1	01100
b	1	01101
c	1	01110
d	1	01111
e	1	10000
f	1	10001
g	1	10010
h	1	10011
i	1	10100
j	1	10101
k	1	10110
l	1	10111
m	1	11000
n	1	11001
o	1	11010
p	1	11011
q	1	11100
r	1	11101
s	1	11110
t	1	11111
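
No code word in a table like this is a prefix of another, so a bit stream can be decoded without separators. The sketch below illustrates that with a small hand-written prefix code for the message "abcc"; the helpers `encode_message` and `decode_message` are made up for illustration.

```python
def encode_message(message, codes):
    """Concatenate the code word of every symbol in `message`."""
    return "".join(codes[symbol] for symbol in message)

def decode_message(bits, codes):
    """Read bits left to right, emitting a symbol whenever a code word completes."""
    lookup = {code: symbol for symbol, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:  # safe only because the code is prefix-free
            decoded.append(lookup[buffer])
            buffer = ""
    return "".join(decoded)

codes = {"a": "00", "b": "01", "c": "1"}  # hand-written prefix code (assumption)
bits = encode_message("abcc", codes)
print(bits)
print(decode_message(bits, codes))
```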


Shakespeare encoding
-----
```
Symbol	Weight	Huffman Code
 	1281600	10
a	288594	0011
e	446147	1110
i	253327	0001
o	313890	0101
s	248518	0000
t	328987	0111
d	149127	01001
h	236585	11011
l	169658	01101
n	242749	11111
r	237250	11110
u	128697	00101
,	83064	110000
.	77922	011000
b	61786	001000
c	87839	110001
f	80333	011001
g	68054	010000
m	111222	110101
w	89286	110010
y	94173	110011
'	31067	0010010
k	35362	0100010
p	58249	1101001
v	37496	0100011
;	17194	00100111
!	8827	110100000
-	8058	001001101
?	10475	110100010
j	4752	1101000011
q	3577	0010011000
x	5217	1101000110
:	1810	00100110010
[	2071	11010000101
]	2063	11010000100
1	891	001001100110
9	921	001001100111
z	1626	110100011110
(	599	1101000111011
)	598	1101000111010
<	462	1101000111000
"	450	11010001111111
0	261	11010001110010
2	340	11010001111100
3	316	11010001110011
>	434	11010001111110
4	88	1101000111110101
5	78	1101000111110100
6	58	11010001111101100
_	68	11010001111101110
7	37	110100011111011111
8	33	110100011111011011
|	33	110100011111011110
&	21	1101000111110110101
}	2	11010001111101101001
`	1	110100011111011010000
﻿	1	110100011111011010001
```
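
One possible measure of "best-ness" for a table like the one above is the weighted average number of bits per symbol, compared against the Shannon entropy of the symbol frequencies, which is a lower bound for any symbol-by-symbol code. The sketch below is an illustration on a toy message with a hand-written code table, not a computation on the Shakespeare data; the function names are made up.

```python
from collections import Counter
from math import log2

def average_code_length(frequencies, codes):
    """Weighted mean number of bits per symbol under `codes`."""
    total = sum(frequencies.values())
    return sum(frequencies[s] * len(codes[s]) for s in frequencies) / total

def entropy(frequencies):
    """Shannon entropy of the symbol distribution, in bits per symbol."""
    total = sum(frequencies.values())
    return -sum(w / total * log2(w / total) for w in frequencies.values())

frequencies = Counter("abcc")
codes = {"a": "00", "b": "01", "c": "1"}  # hand-written prefix code (assumption)

print(average_code_length(frequencies, codes))
print(entropy(frequencies))
```

The closer the average code length is to the entropy, the less room is left for a better symbol-by-symbol code.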

Why are efficient encodings useful?
------

Symbols take up a lot of space. "Strings are heavy". 

(Weighted) numerical representations are better.

[Learn more about word2vec and Huffman encodings](http://www.trevorsimonton.com/blog/2016/12/15/huffman-tree-in-word2vec.html)