# Tokenization

Tokenization is the process of breaking down text data into smaller, manageable units called tokens, which can be words, phrases, subwords, or even individual characters. This is typically the first step in preprocessing text for machine learning (ML) and natural language processing (NLP) tasks, as it transforms raw text into a format that algorithms can analyze and understand.


Interactive tokenizer playground: https://tiktokenizer.vercel.app/


Example string:

```
Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

127 + 677 = 804
1275 + 6773 = 8041

Egg.
I have an Egg.
egg.
EGG.

만나서 반가워요. 저는 OpenAI에서 개발한 대규모 언어 모델인 ChatGPT입니다. 궁금한 것이 있으시면 무엇이든 물어보세요.

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

---

## Why is Tokenization Important?


- Text to Numbers: Machine learning models operate on numerical data, not raw text. Tokenization converts text into tokens, which are then mapped to integer IDs so that models can process them.

- Pattern Recognition: By splitting text into tokens, algorithms can more easily identify patterns, relationships, and context within the data.

- Efficiency: Tokenization represents text with fewer units than raw characters or bytes, which shortens sequences and reduces memory use and compute. This is especially important in large language models.


- Generalization: Good tokenization strategies, such as subword or character tokenization, allow models to handle new or rare words by breaking them into familiar components.
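
The text-to-numbers step above can be sketched in a few lines. This is a minimal illustration, assuming a plain whitespace split and a toy vocabulary built from the sentence itself; the sample sentence and the names `vocab` and `ids` are made up for the example and do not come from any library.

```python
# Toy example: split on whitespace, then map each distinct token to an integer id.
text = "the cat sat on the mat"
tokens = text.split()

# Assumption: ids are assigned in alphabetical order of the distinct tokens.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(vocab)   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)     # [4, 0, 3, 2, 4, 1]
```

Both occurrences of "the" map to the same id, which is what lets a model treat them as the same symbol.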


## Types of Tokenization:


- Word Tokenization: Splits text into individual words.
Example: "Chatbots are helpful." → ["Chatbots", "are", "helpful"]

- Character Tokenization: Breaks text into individual characters.
Example: "Chatbots" → ["C", "h", "a", "t", "b", "o", "t", "s"]

- Subword Tokenization: Splits words into smaller units (subwords), useful for handling rare or unknown words.
Example: "unhappiness" → ["un", "happiness"] or ["un", "hap", "pi", "ness"]

- Sentence Tokenization: Divides text into sentences, often used for tasks like summarization or translation.
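
The word, character, and sentence variants above can be approximated with the standard library alone. This is a rough sketch using simple regular expressions, not how NLTK or SpaCy implement it; the second and third sentences in `text` are invented for the example.

```python
import re

text = "Chatbots are helpful. They answer questions! Do you use one?"

# Word tokenization: keep runs of word characters, drop punctuation.
words = re.findall(r"\w+", "Chatbots are helpful.")

# Character tokenization: every character becomes its own token.
chars = list("Chatbots")

# Sentence tokenization: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)      # ['Chatbots', 'are', 'helpful']
print(chars)      # ['C', 'h', 'a', 't', 'b', 'o', 't', 's']
print(sentences)  # ['Chatbots are helpful.', 'They answer questions!', 'Do you use one?']
```

Rules this simple break on abbreviations such as "Dr." or on text without spaces between words, which is one reason dedicated tokenizers exist.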


## Tokenization for LLMs:


- Tokenization is the gateway for all LLM operations: text is tokenized into token IDs, the IDs are converted to embeddings, the model processes them, and the output token IDs are detokenized back into text.


- LLMs often use subword tokenization (like Byte-Pair Encoding or WordPiece) to balance vocabulary size, efficiency, and the ability to handle new words.


- Tokenization quality directly impacts the model’s ability to understand context, manage multilingual data, and generate accurate responses.
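
The tokenize, embed, process, detokenize flow can be illustrated with a deliberately tiny stand-in. The sketch below assumes the simplest possible tokenizer (one token per UTF-8 byte) and a randomly initialised embedding table; `embedding_table`, `embedding_dim`, and the "model" that simply echoes its input are all placeholders, not a real LLM.

```python
import random

text = "Egg. 👋"

# 1. Tokenize: here every UTF-8 byte is one token id in the range 0..255.
ids = list(text.encode("utf-8"))

# 2. Embed: look up one vector per token id in a (random) embedding table.
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(256)
]
embeddings = [embedding_table[i] for i in ids]

# 3. Process: a real model would consume `embeddings` and predict new ids.
#    This placeholder just echoes the input ids.
output_ids = ids

# 4. Detokenize: turn ids back into text.
decoded = bytes(output_ids).decode("utf-8")

print(len(ids), len(embeddings), len(embeddings[0]))  # 9 9 4
print(decoded == text)                                # True
```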

  

## Popular Tokenization Algorithms


| **Algorithm / Tool**             | **Description & Approach**                                                                            | **Typical Use Cases & Models**                                                                  |
|----------------------------------|------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| **Whitespace/Regex**             | Splits text based on spaces or regular expressions.                                                   | Simple NLP tasks, preprocessing, rule-based systems.                                           |
| **Word Tokenizers**              | Divides text into words, often using libraries like NLTK, SpaCy, or Keras.                           | Text classification, sentiment analysis, topic modeling (Gensim), general NLP pipelines.       |
| **Character Tokenizers**        | Splits text into individual characters.                                                               | Handling misspellings, rare words, languages without clear word boundaries, deep learning models.|
| **Byte-Pair Encoding (BPE)**     | Iteratively merges the most frequent adjacent pair of bytes/characters into a new subword token.      | Neural machine translation, GPT-2, RoBERTa, multilingual models.                             |
| **WordPiece**                    | Similar to BPE, but merges pairs that maximize likelihood of training data.                           | BERT, DistilBERT, Electra, other transformer models.                                            |
| **Unigram**                      | Starts with a large vocabulary, then prunes to optimize likelihood.                                  | Used with SentencePiece in models like T5, XLNet, ALBERT.                                      |
| **SentencePiece**                | Unsupervised, language-independent tokenizer supporting BPE and Unigram.                             | Neural machine translation, text generation, T5, XLNet; supports multiple languages.          |
| **Hugging Face Transformers**    | Library providing fast, model-specific tokenizers (BPE, WordPiece, Unigram).                         | BERT, GPT, RoBERTa, T5, and other transformer-based models.                                     |
| **Gensim Tokenizer**             | Tokenization for large-scale topic modeling and document similarity.                                 | Topic modeling, information retrieval.                                                        |
| **Keras Tokenizer**              | Converts text to sequences for deep learning pipelines.                                               | Text classification, sequence modeling in Keras and TensorFlow.                               |
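
To make the subword rows more concrete, here is a minimal sketch of greedy longest-match-first segmentation, the strategy BERT's WordPiece tokenizer uses at inference time. The vocabulary below is a hand-written toy (real vocabularies are learned from data), and the function name `wordpiece_like` is made up for this example. The `##` prefix marks a piece that continues a word.

```python
def wordpiece_like(word, vocab, unk="[UNK]"):
    """Split `word` greedily into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to reproduce the "unhappiness" example.
toy_vocab = {"un", "##hap", "##pi", "##ness", "chat", "##bots"}

print(wordpiece_like("unhappiness", toy_vocab))  # ['un', '##hap', '##pi', '##ness']
print(wordpiece_like("chatbots", toy_vocab))     # ['chat', '##bots']
print(wordpiece_like("xyz", toy_vocab))          # ['[UNK]']
```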





---

1. WhiteSpaceTokenizer -> https://github.com/iamprasadraju/Natural_Language_Processing/blob/main/code/whitespace_tokenizer.py

2. WordTokenizer (TreebankWordTokenizer) -> https://github.com/iamprasadraju/Natural_Language_Processing/blob/main/code/Word_Tokenizer.py

In [4]:
"안녕하세요 👋 (hello in Korean!)"



'안녕하세요 👋 (hello in Korean!)'

In [5]:
ord("h")

104

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points. To store or transmit a string, its code points are encoded as bytes using an encoding such as UTF-8, UTF-16, or UTF-32.

Unicode - https://en.wikipedia.org/wiki/Unicode
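
The difference between code points and encoded bytes can be checked directly. A small sketch, using a shortened version of the greeting above as an assumed sample string:

```python
s = "안녕 👋"

code_points = [ord(ch) for ch in s]  # one integer per code point
utf8_bytes = s.encode("utf-8")       # the same text as a bytes object

print(len(s))            # 4 code points
print(code_points)       # [50504, 45397, 32, 128075]
print(len(utf8_bytes))   # 11 bytes: 3 + 3 + 1 + 4
print(list(utf8_bytes))  # [236, 149, 136, 235, 133, 149, 32, 240, 159, 145, 139]

# ord() and chr() are inverses, and decoding restores the original string.
print(chr(104))                       # h
print(utf8_bytes.decode("utf-8") == s)  # True
```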

**Why use UTF-8 encoding in tokenization?**

ref-1 : https://www.reedbeta.com/blog/programmers-intro-to-unicode

ref-2 : https://utf8everywhere.org/
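
One commonly cited reason is that UTF-8 is ASCII-compatible and does not pad ASCII text with zero bytes. The sketch below compares encoded sizes; it uses the `-le` codec variants so that no byte-order mark is added, and the two sample strings are chosen only for illustration.

```python
ascii_text = "hello"
korean_text = "안녕하세요"

encodings = ("utf-8", "utf-16-le", "utf-32-le")
ascii_sizes = {name: len(ascii_text.encode(name)) for name in encodings}
korean_sizes = {name: len(korean_text.encode(name)) for name in encodings}

print(ascii_sizes)   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(korean_sizes)  # {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True

# UTF-16 stores each ASCII character with an extra zero byte.
print(ascii_text.encode("utf-16-le").count(0))  # 5
```

UTF-8 is not always the smallest (the Korean sample is shorter in UTF-16), but it is variable-width, byte-oriented, and has no endianness variants, which makes it a convenient base for byte-level tokenizers.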

In [6]:
[ord(x) for x in "안녕하세요 👋 (hello in Korean!)"]

[50504, 45397, 54616, 49464, 50836, 32, 128075, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33, 41]

In [7]:
list("안녕하세요 👋 (hello in Korean!".encode("utf-8"))

[236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148, 32, 240, 159, 145, 139, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33]

In [8]:
list("안녕하세요 👋 (hello in Korean!".encode("utf-16"))


[255, 254, 72, 197, 85, 177, 88, 213, 56, 193, 148, 198, 32, 0, 61, 216, 75, 220, 32, 0, 40, 0, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 32, 0, 105, 0, 110, 0, 32, 0, 75, 0, 111, 0, 114, 0, 101, 0, 97, 0, 110, 0, 33, 0]

In [9]:
list("안녕하세요 👋 (hello in Korean!".encode("utf-32"))


[255, 254, 0, 0, 72, 197, 0, 0, 85, 177, 0, 0, 88, 213, 0, 0, 56, 193, 0, 0, 148, 198, 0, 0, 32, 0, 0, 0, 75, 244, 1, 0, 32, 0, 0, 0, 40, 0, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 32, 0, 0, 0, 105, 0, 0, 0, 110, 0, 0, 0, 32, 0, 0, 0, 75, 0, 0, 0, 111, 0, 0, 0, 114, 0, 0, 0, 101, 0, 0, 0, 97, 0, 0, 0, 110, 0, 0, 0, 33, 0, 0, 0]

In [10]:
# text from https://www.reedbeta.com/blog/programmers-intro-to-unicode/
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokens = text.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(text)
print("length:", len(text))
print('---')
print(tokens)
print("length:", len(tokens))

---
Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
length: 533
---
[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140

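**Why not feed raw bytes directly into language models?**

Raw UTF-8 bytes give a tiny vocabulary (256 symbols) but very long sequences, and the cost of standard self-attention grows quadratically with sequence length, so the available context is used up quickly.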

paper :  https://arxiv.org/pdf/2305.07185
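
The sequence-length concern can be made concrete with a rough count. This sketch compares characters and UTF-8 bytes for two short samples taken from the example string at the top of these notes; treating attention cost as simply `n ** 2` is a simplification used only to show the trend.

```python
samples = {
    "english": "I have an Egg.",
    "korean": "만나서 반가워요.",
}

report = {}
for name, s in samples.items():
    n_chars = len(s)
    n_bytes = len(s.encode("utf-8"))
    # Simplified pairwise cost of attending over a byte-level sequence.
    report[name] = (n_chars, n_bytes, n_bytes ** 2)

print(report["english"])  # (14, 14, 196)
print(report["korean"])   # (9, 23, 529)
```

The Korean sample has fewer characters than the English one but needs more bytes, so a byte-level model sees a longer sequence for it.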

---

# Byte-pair encoding algorithm

wiki : https://en.wikipedia.org/wiki/Byte_pair_encoding



In [11]:
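def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs in `ids`."""
    counts = {}
    for pair in zip(ids, ids[1:]):  # consecutive (left, right) pairs
        counts[pair] = counts.get(pair, 0) + 1
    return counts


stats = get_stats(tokens)
# Print (count, pair) tuples, most frequent pair first.
print(sorted(((v, k) for k, v in stats.items()), reverse=True))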

[(20, (101, 32)), (15, (240, 159)), (12, (226, 128)), (12, (105, 110)), (10, (115, 32)), (10, (97, 110)), (10, (32, 97)), (9, (32, 116)), (8, (116, 104)), (7, (159, 135)), (7, (159, 133)), (7, (97, 114)), (6, (239, 189)), (6, (140, 240)), (6, (128, 140)), (6, (116, 32)), (6, (114, 32)), (6, (111, 114)), (6, (110, 103)), (6, (110, 100)), (6, (109, 101)), (6, (104, 101)), (6, (101, 114)), (6, (32, 105)), (5, (117, 115)), (5, (115, 116)), (5, (110, 32)), (5, (100, 101)), (5, (44, 32)), (5, (32, 115)), (4, (116, 105)), (4, (116, 101)), (4, (115, 44)), (4, (114, 105)), (4, (111, 117)), (4, (111, 100)), (4, (110, 116)), (4, (110, 105)), (4, (105, 99)), (4, (104, 97)), (4, (103, 32)), (4, (101, 97)), (4, (100, 32)), (4, (99, 111)), (4, (97, 109)), (4, (85, 110)), (4, (32, 119)), (4, (32, 111)), (4, (32, 102)), (4, (32, 85)), (3, (118, 101)), (3, (116, 115)), (3, (116, 114)), (3, (116, 111)), (3, (114, 116)), (3, (114, 115)), (3, (114, 101)), (3, (111, 102)), (3, (111, 32)), (3, (108, 108)), (

In [12]:
top_pair = max(stats, key=stats.get)  # pair with the highest count
top_pair

(101, 32)
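
Once the most frequent pair is known, the next step in byte-pair encoding is to replace it with a new token id and repeat. The following is a minimal sketch of that loop on a short toy string rather than the `tokens` list above; the helper `merge`, the choice of `num_merges = 3`, and starting new ids at 256 (the first value after the byte range) are assumptions of this example.

```python
def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `idx`."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2  # skip both members of the merged pair
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids


text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # 11 byte-level tokens

num_merges = 3
merges = {}  # (left, right) -> new token id
for step in range(num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # ties go to the pair seen first
    new_id = 256 + step
    ids = merge(ids, top_pair, new_id)
    merges[top_pair] = new_id

print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}

# Decoding: rebuild the bytes behind every token id, then decode as UTF-8.
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges.items():
    vocab[new_id] = vocab[left] + vocab[right]
decoded = b"".join(vocab[i] for i in ids).decode("utf-8")
print(decoded)  # aaabdaaabac
```

After three merges the 11 bytes shrink to 5 tokens, and the `merges` table is enough to reverse the process.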