### Tokenization:
Tokenization is the process of breaking input text down into smaller units called tokens: characters, words, or subwords.
Each token is then mapped to an integer ID; the set of all possible tokens forms the model's vocabulary.
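
As a minimal sketch of the token-to-ID mapping, the snippet below builds a character-level vocabulary from a short string. The sample string and the names `stoi` and `itos` are illustrative assumptions, not part of the tokenizer built in this notebook.

```python
# Character-level tokenization: every distinct character is one token.
sample = "hello world"

vocab = sorted(set(sample))                    # the set of all possible tokens
stoi = {ch: i for i, ch in enumerate(vocab)}   # token -> ID
itos = {i: ch for ch, i in stoi.items()}       # ID -> token

ids = [stoi[ch] for ch in sample]              # encode
decoded = "".join(itos[i] for i in ids)        # decode

print(vocab)
print(ids)
print(decoded)
```

A character-level vocabulary is tiny but produces one token per character; subword schemes such as BPE trade a larger vocabulary for shorter sequences.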

This notebook implements a tokenizer. Afterwards, I plan to develop an embedding model that maps these tokens to vectors in a semantic space, which are then fed to a model.

A few important observations: 
* The same concept may map to two or more different tokens depending on context: lowercase, uppercase, at the beginning of a sentence, at the end of a sentence, etc.

* Non-English languages tend to end up with shorter tokens because there is less training data for them, so the same concepts are split into more tokens. This is particularly costly for the context window in self-attention, since even a simple sentence consumes more tokens.

* For code, indentation causes inefficiencies: spending one token per space wastes the context window.

A good tokenizer can decrease the number of tokens per sentence while not increasing the vocabulary too much.
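
The first and third observations can already be seen at the byte level, before any merges are learned. The sketch below is only an illustration: the strings are made up, and a trained tokenizer may group these bytes differently.

```python
# Case and leading whitespace change the underlying bytes, so a tokenizer
# sees four different inputs for the "same" word.
variants = ["hello", "Hello", "HELLO", " hello"]
encoded = {v: list(v.encode("utf-8")) for v in variants}
for v, b in encoded.items():
    print(repr(v), b)

# Indentation: each space is its own byte until some merge groups them.
line = "        return x"  # 8 spaces of indentation
indent_bytes = len(line) - len(line.lstrip(" "))
total_bytes = len(line.encode("utf-8"))
print(indent_bytes, "of", total_bytes, "bytes are indentation")
```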

In [4]:
# Accessing the Unicode code points:
ord('a'), ord('あ'), ord('🤔'), ord('青')

(97, 12354, 129300, 38738)

In [7]:
# Encodings translate Unicode code points into byte strings; UTF-8 uses 1 to 4 bytes per code point.
list("What stands in the way becomes the way.".encode('utf-8'))
# Using raw UTF-8 bytes as tokens is inadequate: every word maps to many bytes, so sequences would consume too much context,
# and byte-wise prediction would be too myopic.

[87,
 104,
 97,
 116,
 32,
 115,
 116,
 97,
 110,
 100,
 115,
 32,
 105,
 110,
 32,
 116,
 104,
 101,
 32,
 119,
 97,
 121,
 32,
 98,
 101,
 99,
 111,
 109,
 101,
 115,
 32,
 116,
 104,
 101,
 32,
 119,
 97,
 121,
 46]
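
To make the variable width of UTF-8 concrete, here is a small sketch that reuses the four characters from the `ord` cell and checks that encoding and decoding round-trip. The variable names are illustrative.

```python
# UTF-8 is variable-width: ASCII takes 1 byte, many CJK characters take 3,
# and emoji outside the Basic Multilingual Plane take 4.
chars = ["a", "あ", "🤔", "青"]
widths = {ch: len(ch.encode("utf-8")) for ch in chars}
for ch in chars:
    print(ch, ord(ch), list(ch.encode("utf-8")), widths[ch])

# Encoding is lossless: decoding the bytes gives back the original string.
sentence = "What stands in the way becomes the way."
raw = sentence.encode("utf-8")
print(len(sentence), len(raw), raw.decode("utf-8") == sentence)
```

For pure ASCII text the byte count equals the character count, which is why the sentence above produces exactly one byte per character.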

#### BPE: Byte-Pair Encoding

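BPE iteratively finds the most frequent pair of adjacent tokens, replaces every occurrence of it with a new token, and appends that token to the vocabulary. For example, start with:

aaabdaaabac

The pair aa is the most frequent, so it is mapped to Z.
Then the sequence becomes ZabdZabac. Now ab is among the most frequent pairs (it ties with Za at two occurrences), so it is mapped to Y:

ZYdZYac

ZY now appears most frequently, so it is mapped to X:

XdXac

Every remaining pair appears only once, so the algorithm terminates with the following merge rules: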
* X=ZY
* Y=ab
* Z=aa

The resulting sequence is compressed.
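
The walkthrough can be reproduced with a few lines of Python. This is a sketch on single-character symbols, with hypothetical helper names (`count_pairs`, `replace_pair`). One assumption matters: ties are broken by taking the first pair encountered, so the second merge here is Za rather than ab. The intermediate rules therefore differ from the ones listed, but the final sequence is the same.

```python
def count_pairs(seq):
    """Count every adjacent pair of symbols."""
    counts = {}
    for pair in zip(seq, seq[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def replace_pair(seq, pair, new_symbol):
    """Replace each left-to-right occurrence of pair with new_symbol."""
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")
rules = {}
for new_symbol in "ZYX":
    counts = count_pairs(seq)
    best = max(counts, key=counts.get)  # first pair wins on ties
    if counts[best] < 2:                # nothing left worth merging
        break
    rules[new_symbol] = best
    seq = replace_pair(seq, best, new_symbol)
    print(new_symbol, "=", "".join(best), "->", "".join(seq))
```

The sequence shrinks from 11 symbols to 5, at the cost of three new vocabulary entries.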


In [20]:
# Implementation:
text= """I tell you: one must still have chaos in one, to give birth to a dancing star. I tell you: ye have still chaos in you.
Alas! There cometh the time when man will no longer give birth to any star. Alas! There cometh the time of the most despicable man, who can no longer despise himself.
Lo! I show you THE LAST MAN.
“What is love? What is creation? What is longing? What is a star?”—so asketh the last man and blinketh.🔥 """
tokens = text.encode("utf-8") # get the bytes
tokens = list(map(int,tokens)) # convert them into a list of 0..255
print(tokens)

[73, 32, 116, 101, 108, 108, 32, 121, 111, 117, 58, 32, 111, 110, 101, 32, 109, 117, 115, 116, 32, 115, 116, 105, 108, 108, 32, 104, 97, 118, 101, 32, 99, 104, 97, 111, 115, 32, 105, 110, 32, 111, 110, 101, 44, 32, 116, 111, 32, 103, 105, 118, 101, 32, 98, 105, 114, 116, 104, 32, 116, 111, 32, 97, 32, 100, 97, 110, 99, 105, 110, 103, 32, 115, 116, 97, 114, 46, 32, 73, 32, 116, 101, 108, 108, 32, 121, 111, 117, 58, 32, 121, 101, 32, 104, 97, 118, 101, 32, 115, 116, 105, 108, 108, 32, 99, 104, 97, 111, 115, 32, 105, 110, 32, 121, 111, 117, 46, 10, 65, 108, 97, 115, 33, 32, 84, 104, 101, 114, 101, 32, 99, 111, 109, 101, 116, 104, 32, 116, 104, 101, 32, 116, 105, 109, 101, 32, 119, 104, 101, 110, 32, 109, 97, 110, 32, 119, 105, 108, 108, 32, 110, 111, 32, 108, 111, 110, 103, 101, 114, 32, 103, 105, 118, 101, 32, 98, 105, 114, 116, 104, 32, 116, 111, 32, 97, 110, 121, 32, 115, 116, 97, 114, 46, 32, 65, 108, 97, 115, 33, 32, 84, 104, 101, 114, 101, 32, 99, 111, 109, 101, 116, 104, 32, 116, 1

In [21]:
len(text), len(tokens) # some characters are mapped into more than one byte, like the emoji

(420, 429)

In [22]:
def get_stats(ids):
    """Count how often each pair of adjacent tokens occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
print(sorted(((v, k) for k, v in stats.items()), reverse=True))

[(16, (101, 32)), (11, (32, 116)), (10, (116, 104)), (8, (115, 116)), (8, (104, 97)), (7, (116, 32)), (7, (111, 32)), (7, (104, 101)), (7, (97, 110)), (6, (115, 32)), (6, (111, 110)), (6, (110, 32)), (6, (32, 115)), (6, (32, 105)), (6, (32, 99)), (5, (118, 101)), (5, (116, 105)), (5, (110, 103)), (5, (108, 108)), (5, (108, 32)), (5, (105, 115)), (5, (105, 110)), (5, (104, 32)), (5, (97, 116)), (5, (32, 121)), (5, (32, 109)), (5, (32, 108)), (5, (32, 97)), (4, (121, 111)), (4, (111, 117)), (4, (109, 101)), (4, (108, 111)), (4, (101, 116)), (4, (101, 114)), (4, (97, 115)), (4, (87, 104)), (3, (226, 128)), (3, (116, 111)), (3, (116, 97)), (3, (114, 101)), (3, (111, 115)), (3, (109, 97)), (3, (108, 97)), (3, (105, 109)), (3, (105, 108)), (3, (103, 105)), (3, (101, 108)), (3, (97, 114)), (3, (73, 32)), (3, (63, 32)), (3, (46, 10)), (3, (33, 32)), (3, (32, 119)), (3, (32, 111)), (3, (32, 104)), (3, (32, 100)), (3, (32, 98)), (3, (32, 87)), (3, (32, 84)), (2, (119, 104)), (2, (117, 58)), (2, 
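
The same pair counting can also be written with `collections.Counter` from the standard library. This is an optional alternative rather than the approach used in the rest of this notebook; `get_stats_counter` and the sample string are hypothetical.

```python
from collections import Counter

def get_stats_counter(ids):
    """Count adjacent pairs; Counter does the bookkeeping."""
    return Counter(zip(ids, ids[1:]))

sample_ids = list("the thin".encode("utf-8"))
pair_counts = get_stats_counter(sample_ids)

# most_common(1) returns a list holding the top (pair, count) entry.
best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count, bytes(best_pair))
```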

In [23]:
chr(101), chr(32) # the most common pair

('e', ' ')

In [25]:
# creating new tokens, iterating
top_pair = max(stats, key=stats.get)
top_pair

(101, 32)

In [30]:
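def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # merge when the current and the next token match the pair
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([1, 2, 2, 4, 5, 6], (2, 4), 120))  # replaces (2, 4) with 120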

[1, 2, 120, 5, 6]
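
One detail worth checking is how the merge handles overlapping occurrences, such as the pair (5, 5) inside a run of 5s. The sketch below repeats the `merge` logic so that it runs on its own; the inputs are made up.

```python
def merge(ids, pair, idx):
    """Copy of the merge logic above: left-to-right, non-overlapping."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2  # skip both members of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

odd_run = merge([5, 5, 5], (5, 5), 99)      # the third 5 has no partner left
even_run = merge([5, 5, 5, 5], (5, 5), 99)  # two clean pairs
no_match = merge([1, 2, 3], (7, 8), 99)     # pair absent: unchanged
print(odd_run, even_run, no_match)
```

Because the scan jumps two positions after each match, a token is never consumed by two merges at once.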


In [31]:
tokens2 = merge(tokens, top_pair, 256)
print(tokens2)
len(tokens), len(tokens2) # reduced size

[73, 32, 116, 101, 108, 108, 32, 121, 111, 117, 58, 32, 111, 110, 256, 109, 117, 115, 116, 32, 115, 116, 105, 108, 108, 32, 104, 97, 118, 256, 99, 104, 97, 111, 115, 32, 105, 110, 32, 111, 110, 101, 44, 32, 116, 111, 32, 103, 105, 118, 256, 98, 105, 114, 116, 104, 32, 116, 111, 32, 97, 32, 100, 97, 110, 99, 105, 110, 103, 32, 115, 116, 97, 114, 46, 32, 73, 32, 116, 101, 108, 108, 32, 121, 111, 117, 58, 32, 121, 256, 104, 97, 118, 256, 115, 116, 105, 108, 108, 32, 99, 104, 97, 111, 115, 32, 105, 110, 32, 121, 111, 117, 46, 10, 65, 108, 97, 115, 33, 32, 84, 104, 101, 114, 256, 99, 111, 109, 101, 116, 104, 32, 116, 104, 256, 116, 105, 109, 256, 119, 104, 101, 110, 32, 109, 97, 110, 32, 119, 105, 108, 108, 32, 110, 111, 32, 108, 111, 110, 103, 101, 114, 32, 103, 105, 118, 256, 98, 105, 114, 116, 104, 32, 116, 111, 32, 97, 110, 121, 32, 115, 116, 97, 114, 46, 32, 65, 108, 97, 115, 33, 32, 84, 104, 101, 114, 256, 99, 111, 109, 101, 116, 104, 32, 116, 104, 256, 116, 105, 109, 256, 111, 102, 3

(429, 413)
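
A single merge is one step of BPE; repeating it gives the full training loop. The following is a sketch of how that iteration might look, not a finished tokenizer. `train_bpe`, `decode`, and the choice of `num_merges=5` are assumptions for illustration; `get_stats` and `merge` are repeated so the block runs on its own.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train_bpe(ids, num_merges):
    """Apply up to num_merges merges; return the new ids and {pair: new_id}."""
    ids = list(ids)
    merges = {}
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:               # fewer than two tokens left
            break
        pair = max(stats, key=stats.get)
        idx = 256 + step            # new IDs start after the 256 byte values
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

def decode(ids, merges):
    """Expand merged tokens back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():  # insertion order = training order
        vocab[idx] = vocab[p0] + vocab[p1]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

sample = "What stands in the way becomes the way."
raw = list(sample.encode("utf-8"))
ids, merges = train_bpe(raw, num_merges=5)
print(len(raw), "->", len(ids))
print(decode(ids, merges) == sample)
```

The number of merges is the knob that trades sequence length against vocabulary size: each merge adds one vocabulary entry and shortens the sequence by the number of times the pair was replaced.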