# Tokenization

Karpathy Video
https://youtu.be/zduSFxRajkE

Tokenizer Webapp
https://tiktokenizer.vercel.app/

- Non-English text tends to need more tokens per sentence than English, which uses up the context window faster.
- Tokenizing every whitespace character separately has a similar "bloating the context" effect, e.g. for leading indentation in code.
- The GPT-4 tokenizer groups runs of whitespace into single tokens, effectively "densifying" Python.

In [1]:
# Unicode code point of a character
ord('a')

97

In [2]:
for c in "Sentence with an emoji 😊":
    print(c, ord(c))

S 83
e 101
n 110
t 116
e 101
n 110
c 99
e 101
  32
w 119
i 105
t 116
h 104
  32
a 97
n 110
  32
e 101
m 109
o 111
j 106
i 105
  32
😊 128522


- UTF-8 is the preferred encoding, and of UTF-8, UTF-16, and UTF-32 it is the only one that is backwards compatible with ASCII.
- UTF-8 encodes each code point in one to four bytes.
- UTF-16 and UTF-32 are wasteful for mostly-ASCII text, because they spend two or four bytes on characters that UTF-8 stores in one.
- Using raw UTF-8 bytes as tokens gives a very small vocabulary (256 entries), which "stretches" sequences so far that the model can't attend to a sufficiently large context.
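
To make the size comparison concrete, here is a small sketch that encodes the sample sentence used in these notes in all three encodings. The `-le` codec variants are an assumption of this example, chosen only so that Python does not prepend a byte order mark to the output.

```python
text = "sentence with an emoji 😊"

# The "-le" variants skip the byte order mark, so only the payload is counted
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, size in sizes.items():
    print(f"{enc:10s} {size} bytes for {len(text)} characters")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8
ascii_compatible = "emoji".encode("utf-8") == "emoji".encode("ascii")
print(ascii_compatible)
```

For this sentence the three encodings need 27, 50, and 96 bytes respectively.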

In [4]:
list("sentence with an emoji 😊".encode('utf-8'))

[115,
 101,
 110,
 116,
 101,
 110,
 99,
 101,
 32,
 119,
 105,
 116,
 104,
 32,
 97,
 110,
 32,
 101,
 109,
 111,
 106,
 105,
 32,
 240,
 159,
 152,
 138]

- BPE (byte pair encoding) repeatedly finds the most frequent pair of adjacent tokens and merges it into a single new token, until the target vocabulary size is reached.

In [6]:
text = "this is an example text, here is an emoji 😊"
tokens = [int(t) for t in text.encode('utf-8')]
print(f"raw text of size {len(text)}")
print(text)
print(f"encoded text of size {len(tokens)}")
print(tokens)


raw text of size 43
this is an example text, here is an emoji 😊
encoded text of size 46
[116, 104, 105, 115, 32, 105, 115, 32, 97, 110, 32, 101, 120, 97, 109, 112, 108, 101, 32, 116, 101, 120, 116, 44, 32, 104, 101, 114, 101, 32, 105, 115, 32, 97, 110, 32, 101, 109, 111, 106, 105, 32, 240, 159, 152, 138]


In [8]:

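def pair_frequency(tokens: list):
    """Count adjacent token pairs and return (pair, count) items, most frequent first."""
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        pairs[pair] = pairs.get(pair, 0) + 1
    # sorted() is stable, so pairs with equal counts stay in first-seen order
    sorted_pairs = sorted(pairs.items(), key=lambda x: x[1], reverse=True)
    return sorted_pairs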

print(pair_frequency(tokens))

[((105, 115), 3), ((115, 32), 3), ((32, 105), 2), ((32, 97), 2), ((97, 110), 2), ((110, 32), 2), ((32, 101), 2), ((101, 120), 2), ((101, 32), 2), ((116, 104), 1), ((104, 105), 1), ((120, 97), 1), ((97, 109), 1), ((109, 112), 1), ((112, 108), 1), ((108, 101), 1), ((32, 116), 1), ((116, 101), 1), ((120, 116), 1), ((116, 44), 1), ((44, 32), 1), ((32, 104), 1), ((104, 101), 1), ((101, 114), 1), ((114, 101), 1), ((101, 109), 1), ((109, 111), 1), ((111, 106), 1), ((106, 105), 1), ((105, 32), 1), ((32, 240), 1), ((240, 159), 1), ((159, 152), 1), ((152, 138), 1)]
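
The pair counts are the only ingredient the BPE training loop needs. Below is a minimal sketch of that loop, not a reference implementation. The helper names `count_pairs`, `merge`, and `train_bpe` are made up for this example. So are the choices to number new tokens from 256 upward and to break ties in favour of the first-seen pair.

```python
def count_pairs(ids):
    """Count adjacent pairs; dicts keep insertion order, so ties resolve to the first-seen pair."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, in the order the merges were learned
    for step in range(num_merges):
        counts = count_pairs(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        best = max(counts, key=counts.get)  # most frequent pair
        new_id = 256 + step
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return ids, merges

text = "this is an example text, here is an emoji 😊"
ids, merges = train_bpe(text, num_merges=3)
print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} tokens")
print(merges)
```

On the example text the first merge is `(105, 115)`, i.e. "is", which is the top entry of the frequency list above. Each further merge shortens the sequence and grows the vocabulary by one.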


- There is a sweet spot in the trade-off between vocabulary size and sequence length.
- GPT-4 uses a vocabulary of roughly 100k tokens.
- The tokenizer has its own training set, which is used to determine the vocabulary.
- The tokenizer training set can have a different data mixture than the model training set.
- Unicode apostrophes cause issues: a typographic ’ is not treated the same as the ASCII '.
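
The apostrophe issue can be reproduced with a toy pre-tokenization pattern. The pattern below is a simplified stand-in written for Python's built-in `re` module, not the pattern any OpenAI tokenizer actually ships. It only mimics the idea of special-casing contractions that use the ASCII apostrophe.

```python
import re

# Simplified, hypothetical split pattern: ASCII-apostrophe contractions first,
# then letter runs, digit runs, other symbols, and whitespace.
SPLIT = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+")

ascii_chunks = SPLIT.findall("I don't know")
curly_chunks = SPLIT.findall("I don’t know")
print(ascii_chunks)  # the contraction "'t" stays together as one chunk
print(curly_chunks)  # the typographic apostrophe is split off on its own
```

Because BPE merges only happen inside a chunk, the two spellings of the same word can end up as different token sequences.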

In [None]:
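def decode(token_ids: list, vocab: dict):
    # vocab maps token id -> bytes
    tokens = [vocab[t] for t in token_ids]
    text = b"".join(tokens).decode('utf-8', errors='replace') # openai also uses error replace
    return text

def encode(text: str, merges: dict):
    # merges maps (id, id) -> new id and is assumed to be in the order the merges were learned
    tokens = list(text.encode('utf-8'))
    # have to re-apply the merges in the order they were learned
    for pair, new_id in merges.items():
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens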
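
Decoding needs a vocabulary that maps every token id to its bytes. One way to build it, sketched here under the assumption of a tiny hand-written merge table, is to start from the 256 single bytes and concatenate the two halves of each merge. The same snippet shows why `errors='replace'` matters: a token sequence does not have to form valid UTF-8.

```python
# Hypothetical merge table: (id, id) -> new id, in the order learned
merges = {(105, 115): 256, (256, 32): 257}  # "is", then "is "

# Start from the 256 single-byte tokens and build merged tokens by concatenation
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges.items():
    vocab[new_id] = vocab[left] + vocab[right]

def decode_ids(token_ids, errors="replace"):
    """Join the bytes of each token and decode them as UTF-8."""
    return b"".join(vocab[t] for t in token_ids).decode("utf-8", errors=errors)

print(decode_ids([116, 104, 257]))  # bytes for "t", "h", "is "
print(repr(decode_ids([128])))      # a lone continuation byte is not valid UTF-8

# With strict error handling the same input raises instead of being replaced
try:
    decode_ids([128], errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

With `errors="replace"` the invalid byte becomes the replacement character U+FFFD instead of raising an exception.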

In [None]:
import tiktoken # the OpenAI tokenizer library

# The GPT-4 tokenizer changed the regex used to chunk the text before BPE
# Digit runs are split into chunks of at most 3 digits, so longer numbers are never merged into one token
# byte-encode > encode > decode > byte-decode
# The tiktoken library is implemented in Rust

# The special tokens <|im_start|> and <|im_end|> stand for "imaginary monologue start" and "imaginary monologue end" (according to the video)
# Adding special tokens requires "model surgery": you add an extra row to the embedding table and extend the classification head at the very end by one output per new token

# The two layers whose shape changes with vocab size (as defined inside the model's __init__)
self.token_embedding_table = nn.Embedding(vocab_size, emb_size)
self.lm_head = nn.Linear(emb_size, vocab_size)

# When you want to add new tokens, e.g. for instruction tuning or "using the browser",
# you can freeze the base model and push gradients only into the newly added parameters of the two layers above

# A trailing whitespace in the prompt heavily limits next-token prediction, because most tokens already include a leading whitespace
# SolidGoldMagikarp is a Reddit username that was in the tokenizer training set, so it got merged into a single token
# But because it doesn't appear in the model training set, the token never receives gradients and its embedding stays a randomly initialized vector
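
The "model surgery" for a new special token comes down to adding one row to each of the two vocab-sized weight tables. The sketch below is framework-free on purpose: nested lists stand in for the weight tensors, and the helper `add_token`, the toy sizes, and the 0.02 initialisation scale are all assumptions made for illustration. The output layer's bias, if there is one, would also need one more entry.

```python
import random

def add_token(embedding_table, lm_head_weight, rng):
    """Append one randomly initialised row for a new token to both tables."""
    emb_size = len(embedding_table[0])
    embedding_table.append([rng.gauss(0.0, 0.02) for _ in range(emb_size)])
    lm_head_weight.append([rng.gauss(0.0, 0.02) for _ in range(emb_size)])
    return len(embedding_table) - 1  # id of the new token

rng = random.Random(0)
vocab_size, emb_size = 8, 4  # toy sizes, far smaller than any real model

# Both tables hold one row of emb_size weights per token
embedding_table = [[rng.gauss(0.0, 0.02) for _ in range(emb_size)] for _ in range(vocab_size)]
lm_head_weight = [[rng.gauss(0.0, 0.02) for _ in range(emb_size)] for _ in range(vocab_size)]

new_id = add_token(embedding_table, lm_head_weight, rng)

# When the base model is frozen, only the rows of the new token would be trained
trainable_rows = {new_id}
print(new_id, len(embedding_table), len(lm_head_weight))
```

The new row starts out random, just like the embedding of a token that never appears in the model training set, so it is only useful after some fine-tuning.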

In [None]:
# SentencePiece is used for Llama and Mistral
# The unk token stands for "unknown"
# SentencePiece has a lot of "legacy baggage"