# GPT-2

---
### Bytes to Unicode

An interesting video on Unicode is [here](https://www.youtube.com/watch?v=MijmeoH9LT4), recommended on the [Python Unicode HOWTO page](https://docs.python.org/3/howto/unicode.html).

In [56]:
from collections import OrderedDict
from functools import lru_cache 
import pprint
pp = pprint.PrettyPrinter(indent=2)

In [45]:
help(lru_cache)

Help on function lru_cache in module functools:

lru_cache(maxsize=128, typed=False)
    Least-recently-used cache decorator.
    
    If *maxsize* is set to None, the LRU features are disabled and the cache
    can grow without bound.
    
    If *typed* is True, arguments of different types will be cached separately.
    For example, f(3.0) and f(3) will be treated as distinct calls with
    distinct results.
    
    Arguments to the cached function must be hashable.
    
    View the cache statistics named tuple (hits, misses, maxsize, currsize)
    with f.cache_info().  Clear the cache and statistics with f.cache_clear().
    Access the underlying function with f.__wrapped__.
    
    See:  http://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)
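
The decorator is presumably used below because the byte table only needs to be built once. A minimal sketch of that behaviour, using a hypothetical zero-argument function `build_table` (not part of the GPT-2 code):

```python
from functools import lru_cache

calls = []  # records how many times the function body actually runs


@lru_cache()
def build_table():
    # Hypothetical stand-in for an expensive table construction.
    calls.append(1)
    return {i: chr(i) for i in range(3)}


first = build_table()
second = build_table()   # served from the cache, the body does not run again

print(first is second)   # True: the very same dict object is returned
print(len(calls))        # 1
info = build_table.cache_info()
print(info.hits, info.misses)  # 1 1
```

Since every caller receives the same dict object, mutating the result would affect all later callers.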



In [46]:
@lru_cache()
def bytes_to_unicode():
    """
    Returns a dict mapping each utf-8 byte (0-255) to a unicode string of length one.
    The reversible bpe (Byte Pair Encoding) codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    (N.B.: <UNK> is used in many datasets as a placeholder for 'unknown' (e.g. words).)
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    The tables also avoid mapping to whitespace/control characters the bpe code barfs on.
    """
    
    # ord: returns the integer code point of a Unicode character
    # bs: the byte values that are already printable, non-whitespace characters
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]

    n = 0
    for b in range(2**8):  # 2**8 == 256: loop over every possible byte value
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1

    # chr: returns the one-character string for an integer code point
    # replace the integer code points in cs by their characters
    cs = [chr(n) for n in cs]

    # return the dict { 33: '!', 34: '"', ... }
    return dict(zip(bs, cs))
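
To make the docstring concrete, here is a minimal sketch of how such a table might be applied: the text is encoded to UTF-8 and each byte is swapped for its stand-in character. The helper `to_printable` is a name made up for this illustration, and the table function is restated in condensed form only so the snippet runs on its own.

```python
def bytes_to_unicode():
    # Condensed restatement of the function above.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def to_printable(text, table):
    """Encode text as UTF-8 and swap each byte for its stand-in character."""
    return "".join(table[b] for b in text.encode("utf-8"))


table = bytes_to_unicode()
word = "héllo"
mapped = to_printable(word, table)

print(mapped)  # hÃ©llo
# 5 characters, 6 UTF-8 bytes, hence 6 stand-in characters
print(len(word), len(word.encode("utf-8")), len(mapped))  # 5 6 6
```

Any string, whatever its script, ends up as a sequence drawn from the same 256 symbols, which is why no `<UNK>` is needed at this level.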

Quick recap: `ord` gives you the Unicode code point (an integer) of a character, `chr` the character for a given code point.

In [4]:
print(ord("!"), 'is', chr(33))

33 is !


## The Ranges

In [5]:
range1 = list(range(ord("!"), ord("~")+1))
chars1 = [chr(c) for c in range1]
range2 = list(range(ord("¡"), ord("¬")+1))
chars2 = [chr(c) for c in range2]
range3 =list(range(ord("®"), ord("ÿ")+1))
chars3 = [chr(c) for c in range3]

In [6]:
max(len(range1), len(range2), len(range3))

94

In [7]:
print(*[f'{x:<3}: {y}' for x, y in zip(range1, chars1)], sep='\t')
print()

33 : !	34 : "	35 : #	36 : $	37 : %	38 : &	39 : '	40 : (	41 : )	42 : *	43 : +	44 : ,	45 : -	46 : .	47 : /	48 : 0	49 : 1	50 : 2	51 : 3	52 : 4	53 : 5	54 : 6	55 : 7	56 : 8	57 : 9	58 : :	59 : ;	60 : <	61 : =	62 : >	63 : ?	64 : @	65 : A	66 : B	67 : C	68 : D	69 : E	70 : F	71 : G	72 : H	73 : I	74 : J	75 : K	76 : L	77 : M	78 : N	79 : O	80 : P	81 : Q	82 : R	83 : S	84 : T	85 : U	86 : V	87 : W	88 : X	89 : Y	90 : Z	91 : [	92 : \	93 : ]	94 : ^	95 : _	96 : `	97 : a	98 : b	99 : c	100: d	101: e	102: f	103: g	104: h	105: i	106: j	107: k	108: l	109: m	110: n	111: o	112: p	113: q	114: r	115: s	116: t	117: u	118: v	119: w	120: x	121: y	122: z	123: {	124: |	125: }	126: ~



In [8]:
print(*[f'{x:<3}: {y}' for x, y in zip(range2, chars2)], sep='\t')
print()

161: ¡	162: ¢	163: £	164: ¤	165: ¥	166: ¦	167: §	168: ¨	169: ©	170: ª	171: «	172: ¬



In [9]:
print(*[f'{x:<3}: {y}' for x, y in zip(range3, chars3)], sep='\t')
print()

174: ®	175: ¯	176: °	177: ±	178: ²	179: ³	180: ´	181: µ	182: ¶	183: ·	184: ¸	185: ¹	186: º	187: »	188: ¼	189: ½	190: ¾	191: ¿	192: À	193: Á	194: Â	195: Ã	196: Ä	197: Å	198: Æ	199: Ç	200: È	201: É	202: Ê	203: Ë	204: Ì	205: Í	206: Î	207: Ï	208: Ð	209: Ñ	210: Ò	211: Ó	212: Ô	213: Õ	214: Ö	215: ×	216: Ø	217: Ù	218: Ú	219: Û	220: Ü	221: Ý	222: Þ	223: ß	224: à	225: á	226: â	227: ã	228: ä	229: å	230: æ	231: ç	232: è	233: é	234: ê	235: ë	236: ì	237: í	238: î	239: ï	240: ð	241: ñ	242: ò	243: ó	244: ô	245: õ	246: ö	247: ÷	248: ø	249: ù	250: ú	251: û	252: ü	253: ý	254: þ	255: ÿ



---
## Space avoidance
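The idea is to skip the bytes that render as whitespace, control characters or other invisible characters.  
The endpoints of the three ranges (`!` to `~`, `¡` to `¬`, `®` to `ÿ`) are kept as they are, but the bytes in the gaps between them (127 to 160, and 173), as well as 0 to 32 before the first range, are not.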
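
As a quick check of which bytes fall outside the three ranges, the sketch below recomputes them and groups them into consecutive runs. The helper `runs` is only an illustration and not part of the GPT-2 code.

```python
# The three kept ranges, exactly as in bytes_to_unicode.
kept = (list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1)))
excluded = [b for b in range(256) if b not in kept]


def runs(values):
    """Group sorted integers into (start, end) runs of consecutive values."""
    out = []
    for v in values:
        if out and v == out[-1][1] + 1:
            out[-1] = (out[-1][0], v)   # extend the current run
        else:
            out.append((v, v))          # start a new run
    return out


print(len(kept), len(excluded))  # 188 68
print(runs(excluded))            # [(0, 32), (127, 160), (173, 173)]
```

The three runs are the C0 control characters plus the space (0 to 32), DEL, the C1 control characters and the no-break space (127 to 160), and the soft hyphen (173).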

In [10]:
for i in range(126, 162):
    print(f'{i:<3}: {chr(i)}', end='\t')
print()
print()
for i in range(172, 175):
    print(f'{i:<3}: {chr(i)}', end='\t')
print()

126: ~	127: 	128: 	129: 	130: 	131: 	132: 	133: 	134: 	135: 	136: 	137: 	138: 	139: 	140: 	141: 	142: 	143: 	144: 	145: 	146: 	147: 	148: 	149: 	150: 	151: 	152: 	153: 	154: 	155: 	156: 	157: 	158: 	159: 	160:  	161: ¡	

172: ¬	173: ­	174: ®	


In [11]:
# beyond the one-byte range 0-255
for i in range(255, 300):
    print(f'{i:<3}: {chr(i)}', end='\t')

255: ÿ	256: Ā	257: ā	258: Ă	259: ă	260: Ą	261: ą	262: Ć	263: ć	264: Ĉ	265: ĉ	266: Ċ	267: ċ	268: Č	269: č	270: Ď	271: ď	272: Đ	273: đ	274: Ē	275: ē	276: Ĕ	277: ĕ	278: Ė	279: ė	280: Ę	281: ę	282: Ě	283: ě	284: Ĝ	285: ĝ	286: Ğ	287: ğ	288: Ġ	289: ġ	290: Ģ	291: ģ	292: Ĥ	293: ĥ	294: Ħ	295: ħ	296: Ĩ	297: ĩ	298: Ī	299: ī	

---
## Exploring Unicode
Just for fun:

In [12]:
bigrange = list(range(0,3000))
bigrchars = [chr(x) for x in bigrange]
print(*[f'{x}: {y} | ' for x,y in zip(bigrange, bigrchars)], sep='\t')

0:   | 	1:  | 	2:  | 	3:  | 	4:  | 	5:  | 	6:  | 	7:  | 	8: | 	9: 	 | 	10: 
 | 	14:  | 	15:  | 	16:  | 	17:  | 	18:  | 	19:  | 	20:  | 	21:  | 	22:  | 	23:  | 	24:  | 	25:  | 	26:  | 	27:  | 	28:  | 	29:  | 	30:  | 	31:  | 	32:   | 	33: ! | 	34: " | 	35: # | 	36: $ | 	37: % | 	38: & | 	39: ' | 	40: ( | 	41: ) | 	42: * | 	43: + | 	44: , | 	45: - | 	46: . | 	47: / | 	48: 0 | 	49: 1 | 	50: 2 | 	51: 3 | 	52: 4 | 	53: 5 | 	54: 6 | 	55: 7 | 	56: 8 | 	57: 9 | 	58: : | 	59: ; | 	60: < | 	61: = | 	62: > | 	63: ? | 	64: @ | 	65: A | 	66: B | 	67: C | 	68: D | 	69: E | 	70: F | 	71: G | 	72: H | 	73: I | 	74: J | 	75: K | 	76: L | 	77: M | 	78: N | 	79: O | 	80: P | 	81: Q | 	82: R | 	83: S | 	84: T | 	85: U | 	86: V | 	87: W | 	88: X | 	89: Y | 	90: Z | 	91: [ | 	92: \ | 	93: ] | 	94: ^ | 	95: _ | 	96: ` | 	97: a | 	98: b | 	99: c | 	100: d | 	101: e | 	102: f | 	103: g | 	104: h | 	105: i | 	106: j | 	107: k | 	108: l | 	109: m | 	110: n | 	111: o | 	112: p | 	113: q | 

---

## Mechanism

In [13]:
chr(0)

'\x00'

In [41]:
bb = bytearray([i for i in range(256)])
for b in bb:
    print(f"{b:3} | {b:>8b} | {repr(chr(b)).rjust(6):>5} | (unicode for {b}: {chr(b)})")

  0 |        0 | '\x00' | (unicode for 0:  )
  1 |        1 | '\x01' | (unicode for 1: )
  2 |       10 | '\x02' | (unicode for 2: )
  3 |       11 | '\x03' | (unicode for 3: )
  4 |      100 | '\x04' | (unicode for 4: )
  5 |      101 | '\x05' | (unicode for 5: )
  6 |      110 | '\x06' | (unicode for 6: )
  7 |      111 | '\x07' | (unicode for 7: )
  8 |     1000 | '\x08' | (unicode for 8:)
  9 |     1001 |   '\t' | (unicode for 9: 	)
 10 |     1010 |   '\n' | (unicode for 10: 
)
 11 |     1011 | '\x0b' | (unicode for 11: )
 12 |     1100 | '\x0c' | (unicode for 12: )
 13 |     1101 |   '\r' | (unicode for 13: )
 14 |     1110 | '\x0e' | (unicode for 14: )
 15 |     1111 | '\x0f' | (unicode for 15: )
 16 |    10000 | '\x10' | (unicode for 16: )
 17 |    10001 | '\x11' | (unicode for 17: )
 18 |    10010 | '\x12' | (unicode for 18: )
 19 |    10011 | '\x13' | (unicode for 19: )
 20 |    10100 | '\x14' | (unicode for 20: )
 21 |    10101 | '\x15' | (unicode for 21: )
 2

In [14]:
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:] 

n = 0
for b in range(2**8):  # 2**8 == 256: loop over every possible byte value
    if b not in bs:
        print(f'discarding unicode {b:<3} ({repr(chr(b)):6}), and replacing it by unicode: {2**8 + n} ({chr(2**8 + n)})')
        bs.append(b)
        cs.append(2**8+n)
        n += 1

discarding unicode 0   ('\x00'), and replacing it by unicode: 256 (Ā)
discarding unicode 1   ('\x01'), and replacing it by unicode: 257 (ā)
discarding unicode 2   ('\x02'), and replacing it by unicode: 258 (Ă)
discarding unicode 3   ('\x03'), and replacing it by unicode: 259 (ă)
discarding unicode 4   ('\x04'), and replacing it by unicode: 260 (Ą)
discarding unicode 5   ('\x05'), and replacing it by unicode: 261 (ą)
discarding unicode 6   ('\x06'), and replacing it by unicode: 262 (Ć)
discarding unicode 7   ('\x07'), and replacing it by unicode: 263 (ć)
discarding unicode 8   ('\x08'), and replacing it by unicode: 264 (Ĉ)
discarding unicode 9   ('\t'  ), and replacing it by unicode: 265 (ĉ)
discarding unicode 10  ('\n'  ), and replacing it by unicode: 266 (Ċ)
discarding unicode 11  ('\x0b'), and replacing it by unicode: 267 (ċ)
discarding unicode 12  ('\x0c'), and replacing it by unicode: 268 (Č)
discarding unicode 13  ('\r'  ), and replacing it by unicode: 269 (č)
discarding unicode 1

In [15]:
print(bs)

[33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 127,

In [16]:
cs = [chr(n) for n in cs]
print(cs)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '¡', '¢', '£', '¤', '¥', '¦', '§', '¨', '©', 'ª', '«', '¬', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ'

---
Now the actual result (a dictionary):

In [62]:
btu = bytes_to_unicode()
print(len(btu))
print(btu)
# pp.pprint(btu)
# print(*sorted(list(btu.items())), sep='\n')

256
{33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 1

In [81]:
print(*[f"{k:3}: {v} | unicode: {ord(v):3}, same ? {k == ord(v)}" for k,v in btu.items()], sep='\n')

 33: ! | unicode:  33, same ? True
 34: " | unicode:  34, same ? True
 35: # | unicode:  35, same ? True
 36: $ | unicode:  36, same ? True
 37: % | unicode:  37, same ? True
 38: & | unicode:  38, same ? True
 39: ' | unicode:  39, same ? True
 40: ( | unicode:  40, same ? True
 41: ) | unicode:  41, same ? True
 42: * | unicode:  42, same ? True
 43: + | unicode:  43, same ? True
 44: , | unicode:  44, same ? True
 45: - | unicode:  45, same ? True
 46: . | unicode:  46, same ? True
 47: / | unicode:  47, same ? True
 48: 0 | unicode:  48, same ? True
 49: 1 | unicode:  49, same ? True
 50: 2 | unicode:  50, same ? True
 51: 3 | unicode:  51, same ? True
 52: 4 | unicode:  52, same ? True
 53: 5 | unicode:  53, same ? True
 54: 6 | unicode:  54, same ? True
 55: 7 | unicode:  55, same ? True
 56: 8 | unicode:  56, same ? True
 57: 9 | unicode:  57, same ? True
 58: : | unicode:  58, same ? True
 59: ; | unicode:  59, same ? True
 60: < | unicode:  60, same ? True
 61: = | unicode:  6

Now with keys in order:

In [103]:
print(*[f"{k:3}: {v} | binary: {k:8b} | as byte: {repr(bytes([k])):7} | unicode: ({ord(v):3}) | same? {k == ord(v)}" \
        for k,v in sorted(btu.items(), key=lambda t: t[0])], sep='\n')

  0: Ā | binary:        0 | as byte: b'\x00' | unicode: (256) | same? False
  1: ā | binary:        1 | as byte: b'\x01' | unicode: (257) | same? False
  2: Ă | binary:       10 | as byte: b'\x02' | unicode: (258) | same? False
  3: ă | binary:       11 | as byte: b'\x03' | unicode: (259) | same? False
  4: Ą | binary:      100 | as byte: b'\x04' | unicode: (260) | same? False
  5: ą | binary:      101 | as byte: b'\x05' | unicode: (261) | same? False
  6: Ć | binary:      110 | as byte: b'\x06' | unicode: (262) | same? False
  7: ć | binary:      111 | as byte: b'\x07' | unicode: (263) | same? False
  8: Ĉ | binary:     1000 | as byte: b'\x08' | unicode: (264) | same? False
  9: ĉ | binary:     1001 | as byte: b'\t'   | unicode: (265) | same? False
 10: Ċ | binary:     1010 | as byte: b'\n'   | unicode: (266) | same? False
 11: ċ | binary:     1011 | as byte: b'\x0b' | unicode: (267) | same? False
 12: Č | binary:     1100 | as byte: b'\x0c' | unicode: (268) | same? False
 13: č | bin
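
Because the 256 stand-in characters are all distinct, the table can be inverted, which is what makes the scheme reversible. A minimal sketch, assuming the names `byte_encoder` and `byte_decoder` for the two directions (chosen here for illustration), with the table function restated so the snippet runs on its own:

```python
def bytes_to_unicode():
    # Condensed restatement of the function above.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
# Invert the table: stand-in character -> original byte value.
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

# 256 entries survive the inversion, so no two bytes share a stand-in.
print(len(byte_decoder))  # 256

text = "naïve café\n"
printable = "".join(byte_encoder[b] for b in text.encode("utf-8"))
restored = bytes(byte_decoder[ch] for ch in printable).decode("utf-8")

print(printable)         # naÃ¯veĠcafÃ©Ċ
print(restored == text)  # True
```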

---

## BTU Modified

(Produces the same beginning of vocab as in `encoder.json`.)

In [18]:
@lru_cache()
def b_t_u():
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]

    n = 0

    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1

    cs = [chr(n) for n in cs]
    # renumber the keys 0..255 in insertion order instead of using the byte values
    bs = list(range(len(bs)))
    return dict(zip(bs, cs))

In [19]:
btu_test = b_t_u()
pp.pprint(btu_test)

{ 0: '!',
  1: '"',
  2: '#',
  3: '$',
  4: '%',
  5: '&',
  6: "'",
  7: '(',
  8: ')',
  9: '*',
  10: '+',
  11: ',',
  12: '-',
  13: '.',
  14: '/',
  15: '0',
  16: '1',
  17: '2',
  18: '3',
  19: '4',
  20: '5',
  21: '6',
  22: '7',
  23: '8',
  24: '9',
  25: ':',
  26: ';',
  27: '<',
  28: '=',
  29: '>',
  30: '?',
  31: '@',
  32: 'A',
  33: 'B',
  34: 'C',
  35: 'D',
  36: 'E',
  37: 'F',
  38: 'G',
  39: 'H',
  40: 'I',
  41: 'J',
  42: 'K',
  43: 'L',
  44: 'M',
  45: 'N',
  46: 'O',
  47: 'P',
  48: 'Q',
  49: 'R',
  50: 'S',
  51: 'T',
  52: 'U',
  53: 'V',
  54: 'W',
  55: 'X',
  56: 'Y',
  57: 'Z',
  58: '[',
  59: '\\',
  60: ']',
  61: '^',
  62: '_',
  63: '`',
  64: 'a',
  65: 'b',
  66: 'c',
  67: 'd',
  68: 'e',
  69: 'f',
  70: 'g',
  71: 'h',
  72: 'i',
  73: 'j',
  74: 'k',
  75: 'l',
  76: 'm',
  77: 'n',
  78: 'o',
  79: 'p',
  80: 'q',
  81: 'r',
  82: 's',
  83: 't',
  84: 'u',
  85: 'v',
  86: 'w',
  87: 'x',
  88: 'y',
  89: 'z',
  90: '{',
  91: '|
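
The renumbering can also be sketched without touching the original function, by enumerating its values. This relies on dicts preserving insertion order (guaranteed since Python 3.7); the name `renumbered` is only illustrative.

```python
def bytes_to_unicode():
    # Condensed restatement of the original function.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


# Keys become 0..255 in the order the characters were inserted.
renumbered = dict(enumerate(bytes_to_unicode().values()))

print(list(renumbered.items())[:3])  # [(0, '!'), (1, '"'), (2, '#')]
# The 188 printable bytes come first, so the relocated ones start at index 188.
print(renumbered[188])               # Ā  (stand-in for byte 0)
print(renumbered[220])               # Ġ  (stand-in for the space, 188 + 32)
```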

---
## N.B. The Ġ

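One line explains the `Ġ` present everywhere: space is code point 32 (U+0020). Shifted by 256, it becomes 288 (U+0120), which is Ġ. The plain +256 shift only holds for bytes 0 to 32; later excluded bytes (127 onward) simply get the next free code points.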

In [20]:
print(f'space, i.e. {repr(chr(32))}, is unicode number {ord(" ")}')
print(f'32 + 256 = {32+256}')
print(f'the weird g, i.e. {repr(chr(288))}, is unicode number {ord("Ġ")}')

space, i.e. ' ', is unicode number 32
32 + 256 = 288
the weird g, i.e. 'Ġ', is unicode number 288
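
The same lookup explains the other odd characters that show up for whitespace. A small sketch, with the table function restated so it runs on its own, that also checks where the +256 shift stops being exact:

```python
def bytes_to_unicode():
    # Condensed restatement of the function defined earlier.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


table = bytes_to_unicode()

# Stand-ins for the most common whitespace characters.
for ch in (" ", "\n", "\t"):
    stand_in = table[ord(ch)]
    print(f"{ch!r:6} -> {stand_in} (code point {ord(stand_in)})")

# Bytes 0..32 are the first 33 excluded bytes, so for them n == b.
print(all(ord(table[b]) == b + 256 for b in range(33)))  # True
# Byte 127 is the 34th excluded byte: it gets 256 + 33, not 127 + 256.
print(ord(table[127]), 127 + 256)                        # 289 383
```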
