<a href="https://colab.research.google.com/github/nkthiebaut/guanaco/blob/main/notebooks/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

## Character encodings: Unicode and UTF-8

From https://docs.python.org/3/howto/unicode.html

“[Unicode is a] [...] specification that aims to list every character used by human languages and give each character its own unique code. [...] characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values [of which roughly 150k are currently assigned])”

Converting a code point to a glyph (the "drawing" of a character) is handled by the GUI toolkit or the terminal’s font renderer, using the system's typefaces and fonts.

Unicode represents a string as a sequence of code points:
```
“MSDS” → [U+004D, U+0053, U+0044, U+0053]
```
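
The mapping above can be reproduced in Python with the built-in `ord`. This is a minimal sketch; the variable names are illustrative only.

```python
text = "MSDS"
# ord() returns the integer code point; format it in the usual U+XXXX notation
code_points = [f"U+{ord(char):04X}" for char in text]
print(code_points)
```
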
Unicode Transformation Format (UTF) defines how to represent these code points in memory as code units. It comes in three flavors: UTF-8, UTF-16, and UTF-32, which use 8-, 16-, and 32-bit code units respectively. UTF-8 is by far the most common.
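
To make the difference between the three flavors concrete, here is a small sketch comparing how many bytes each one needs for the same string. The `-le` codec variants (little-endian, no byte order mark) are chosen here as an assumption so that the byte counts are not inflated by a BOM.

```python
text = "MSDS"
byte_lengths = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(codec)
    byte_lengths[codec] = len(encoded)
    # bytes.hex(" ") prints the bytes as space-separated hex pairs (Python 3.8+)
    print(f"{codec:>9}: {len(encoded):2d} bytes -> {encoded.hex(' ')}")
```

For this ASCII-only string, UTF-8 needs one byte per character, UTF-16 two, and UTF-32 four.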

For ASCII characters (code point < 128), the ASCII code, the Unicode code point, and the UTF-8 representation are all the same. For example, "a" is 97 (decimal) / 0x61 (hexadecimal) / 0b1100001 (binary).

In [None]:
char = "a"
code_point = ord(char)
print(f"Code point for {char}: {code_point} (dec) / {hex(code_point)} (hex)")

Code point for a: 97 (dec) / 0x61 (hex)


In [None]:
code_units = char.encode("utf-8")
print(f"UTF-8 code units: {code_units} -> {list(map(bin, code_units))}")

UTF-8 code units: b'a' -> ['0b1100001']


For every character beyond the ASCII table (accented letters, non-Latin scripts, emojis, math symbols, ...), UTF-8 uses a **variable-length encoding** of 2 to 4 code units (16 to 32 bits). For example, the Unicode code point of the "😉" character is larger than 127, so it is encoded with several code units.


In [None]:
char = "😉"
code_point = ord(char)
print(f"Code point for {char}: {code_point} (dec) / {hex(code_point)} (hex)")

Code point for 😉: 128521 (dec) / 0x1f609 (hex)


In [None]:
code_units = char.encode("utf-8")
print(f"UTF-8 code units: {code_units} -> {list(map(bin, code_units))}")

UTF-8 code units: b'\xf0\x9f\x98\x89' -> ['0b11110000', '0b10011111', '0b10011000', '0b10001001']


In [None]:
bin(128521)

'0b11111011000001001'

In the code unit sequence, bytes starting with:
- `11110` indicate the beginning of a 4-byte sequence
- `1110` indicate the beginning of a 3-byte sequence
- `110` indicate the beginning of a 2-byte sequence
- `0` indicate a single-byte encoding (ASCII character)

Bytes starting with `10` are follow-up bytes in a longer sequence.
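
The prefix rules above can be sketched as a small classifier. The helper names `sequence_length` and `count_characters` are hypothetical and not part of any library; the sketch also assumes the input is well-formed UTF-8 and does not validate it.

```python
def sequence_length(byte: int) -> int:
    """Return the length announced by a leading byte, or 0 for a follow-up byte."""
    if byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if byte >> 5 == 0b110:
        return 2  # 110xxxxx: start of a 2-byte sequence
    if byte >> 4 == 0b1110:
        return 3  # 1110xxxx: start of a 3-byte sequence
    if byte >> 3 == 0b11110:
        return 4  # 11110xxx: start of a 4-byte sequence
    if byte >> 6 == 0b10:
        return 0  # 10xxxxxx: follow-up byte
    raise ValueError(f"{byte:#x} cannot appear in UTF-8")


def count_characters(data: bytes) -> int:
    """Count characters by counting the bytes that are not follow-up bytes."""
    return sum(1 for byte in data if sequence_length(byte) > 0)


data = "a😉".encode("utf-8")
lengths = [sequence_length(byte) for byte in data]
print(lengths, count_characters(data))
```
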

⚠️ The UTF-8 code unit sequence (`0xf0 0x9f 0x98 0x89 = 11110000 10011111 10011000 10001001` in the last example) differs from the corresponding Unicode code point (`0x1f609 = 11111011000001001` for the last example), because the code units also carry the length and follow-up prefix bits. Also, not every valid UTF-8 sequence corresponds to an assigned Unicode code point.
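
To see how the two representations relate, the code point can be rebuilt by hand: strip the prefix bits from each byte and concatenate the remaining payload bits. This is a minimal sketch that only handles the 4-byte case of the last example.

```python
code_units = "😉".encode("utf-8")  # b'\xf0\x9f\x98\x89'
lead, *follow_ups = code_units

# A 4-byte leading byte looks like 11110xxx: keep the 3 payload bits
code_point = lead & 0b00000111
for byte in follow_ups:
    # A follow-up byte looks like 10xxxxxx: append its 6 payload bits
    code_point = (code_point << 6) | (byte & 0b00111111)

print(hex(code_point), chr(code_point))
```
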

📝 _Exercise_: how many characters are encoded by the following sequence of code units:

```01000011 01100001 01100110 11000011 10101001```

and this one:

```11110000 10011111 10001100 10001101```?

## Tokenization

In [None]:
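%pip install -q tiktoken
import tiktoken

# Load the tokenizer used by GPT-4 (the cl100k_base encoding)
encoding = tiktoken.encoding_for_model("gpt-4")
# Convert the text into a list of integer token IDs
token_ids = encoding.encode("hello world aaaaaaaaaaaa")
token_ids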

[15339, 1917, 264, 70540, 33746]

In [None]:
encoding.decode(token_ids)

'hello world aaaaaaaaaaaa'

In [None]:
# Decode the token IDs to get the tokens
tokens = [encoding.decode_single_token_bytes(token_id) for token_id in token_ids]

# Display the token IDs and their corresponding tokens
for token_id, token in zip(token_ids, tokens):
    print(f"Token ID: {token_id}, Token: {token.decode('utf-8', errors='replace')}")



Token ID: 15339, Token: hello
Token ID: 1917, Token:  world
Token ID: 264, Token:  a
Token ID: 70540, Token: aaaaaaaa
Token ID: 33746, Token: aaa
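
The loop above decodes each token with `errors='replace'` because tokens are byte sequences, and a single token is not guaranteed to be valid UTF-8 on its own. The sketch below illustrates the effect without tiktoken: the split of the emoji into two chunks is a made-up assumption for illustration, not the actual tokenization of any model.

```python
emoji_bytes = "😉".encode("utf-8")
# Hypothetical split of one character across two "tokens"
first_chunk, second_chunk = emoji_bytes[:2], emoji_bytes[2:]

# Each chunk alone is invalid UTF-8, so it decodes to replacement characters
decoded_first = first_chunk.decode("utf-8", errors="replace")
decoded_second = second_chunk.decode("utf-8", errors="replace")
# Concatenating the bytes before decoding recovers the original character
decoded_joined = (first_chunk + second_chunk).decode("utf-8")

print(repr(decoded_first), repr(decoded_second), decoded_joined)
```
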


See https://tiktokenizer.vercel.app/ for an interactive tiktoken visualization.