# Tokenizer

Ok. A tokenizer can basically work at 3 levels: <br/>
- Word-level
- Character-level
- Byte-level

## Word-level
- What does it do?
    - Takes each whole word as one token; unseen words become <|UNKNOWN|> (or something else if you customize it)
- Pros
    - Short sequences on clean text, since every whole word is a single token
- Cons
    - Huge vocab 
        - English has over one million words by some counts, meaning the vocab needs an entry for each of them... (that's huge)
        
    - Weak for multilingual text 
        - Say, for example, we also need to tokenize Korean ("안녕하세요"). Then every Korean word needs its own entry as well
        - and that's gonna be one million (from English) + 100,000 (from Korean)
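
To make the unknown-word problem concrete, here is a minimal word-level sketch. It assumes plain whitespace splitting, and the helper names `build_word_vocab` and `encode_words` are made up for this illustration (they are not used anywhere else in the project).

```python
def build_word_vocab(texts, unk="<|UNKNOWN|>"):
    """Give every distinct whitespace-separated word its own id (id 0 is reserved for unknowns)."""
    vocab = {unk: 0}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode_words(text, vocab, unk="<|UNKNOWN|>"):
    """Map each word to its id; anything not in the vocab falls back to the unknown id."""
    return [vocab.get(word, vocab[unk]) for word in text.split()]


word_vocab = build_word_vocab(["i got a call from mom"])
print(encode_words("i got a call from dad", word_vocab))  # [1, 2, 3, 4, 5, 0]
```

"dad" was never seen, so it collapses into id 0 and the model can no longer tell it apart from any other unseen word.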

## Character-level
- What does it do?
    - Splits the text into Unicode characters; each character is one token
- Pros
    - Vocab is much smaller than word-level
- Cons
    - VERY long sequences
        - Think of the sentence "Hello, World! I got a call from mom..."
        - We tokenize every single character (repeats like the "ll" in "Hello" and "call" are not grouped together)
        - So even a short sentence turns into a lot of tokens
        - Which results in slow training
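
A quick sketch of the length blow-up on that same sentence, comparing whitespace-split words against single characters (the variable names are just for this example):

```python
sentence = "Hello, World! I got a call from mom..."

word_tokens = sentence.split()   # word-level: split on whitespace
char_tokens = list(sentence)     # character-level: one token per character
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word-level tokens")
print(len(char_tokens), "character-level tokens")
print(len(char_vocab), "distinct characters in the vocab")
```

The vocab stays tiny, but the sequence the model has to read gets several times longer.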

## Byte-level
- What does it do?
    - Encodes the text as UTF-8 bytes; each byte (0-255) is one token
- Pros
    - Tiny fixed vocab (256 entries) covers any text, so nothing is ever unknown
- Cons
    - Long sequences; needs bigger models to match subword efficiency
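
As a small illustration, this is what byte-level tokens look like for the Korean greeting from earlier. It is only a sketch using Python's built-in UTF-8 encoding:

```python
text = "안녕하세요"

byte_tokens = list(text.encode("utf-8"))  # every token is an int in 0..255
print(len(text), "characters ->", len(byte_tokens), "byte tokens")  # 5 characters -> 15 byte tokens
print(byte_tokens)

# The bytes carry everything needed to get the text back.
decoded = bytes(byte_tokens).decode("utf-8")
print(decoded)
```

No Korean-specific vocab is needed, but the 5 characters became 15 tokens, which is the "long sequences" con in action.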

# For this project, we'll be using byte-level with subword merging (BPE)

## Idea
Think of the months from January to December. Each name can be split into a few reusable pieces:
- january = j + an + uary
- february = febr + uary
- march = m + a + rch
- april = a + pril
- may = m + a + y
- june = j + une
- july = j + uly
- august = a + ugust
- september = sept + em + ber
- october = oct + o + ber
- november = nov + em + ber
- december = dec + em + ber

Then this set of pieces: <br/>
j, nov, dec, febr, m, sept, oct, o, em, uly, a, une, an, ber, y, uary, rch, pril, ugust<br/>
can cover all of those
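
To double-check that claim, here is a small sketch that splits every month using only the pieces from the splits above. The greedy longest-match helper `segment` is a made-up illustration, not the BPE algorithm itself (greedy matching can get stuck on other piece sets, it just happens to work here).

```python
months = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Every piece that appears in the splits listed above
pieces = ["j", "an", "uary", "febr", "m", "a", "rch", "pril", "y", "une",
          "uly", "ugust", "sept", "em", "ber", "oct", "o", "nov", "dec"]


def segment(word, pieces):
    """Greedy longest-match split of `word` into known pieces; returns None if it gets stuck."""
    out, i = [], 0
    while i < len(word):
        match = max((p for p in pieces if word.startswith(p, i)), key=len, default=None)
        if match is None:
            return None
        out.append(match)
        i += len(match)
    return out


for month in months:
    print(month, "=", " + ".join(segment(month, pieces)))
```

So 19 pieces are enough for all 12 words, and pieces like "em", "ber" and "uary" get reused across several months.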

Then we're gonna assign a unique id to each of those pieces and save the mapping into a .json file.
Then we're gonna save the list of merges (which pairs to merge, in which order) into a .txt file.
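
A rough sketch of what those two files could look like. The file names `vocab.json` and `merges.txt`, the "one pair per line" layout, and the toy `pieces` / `merges` values are assumptions for illustration; the library used later writes its own files.

```python
import json
import tempfile
from pathlib import Path

pieces = ["j", "a", "n", "an", "uary"]   # toy pieces, just for the example
merges = [("a", "n"), ("u", "a")]        # toy merge rules, in the order they would be applied

token_to_id = {piece: idx for idx, piece in enumerate(pieces)}

out_dir = Path(tempfile.mkdtemp())       # throwaway directory so nothing real gets overwritten
vocab_path = out_dir / "vocab.json"
merges_path = out_dir / "merges.txt"

# piece -> id mapping goes into the .json
vocab_path.write_text(json.dumps(token_to_id, ensure_ascii=False, indent=2), encoding="utf-8")
# one merge per line, the two symbols separated by a space
merges_path.write_text("\n".join(f"{a} {b}" for a, b in merges) + "\n", encoding="utf-8")

print(vocab_path.read_text(encoding="utf-8"))
print(merges_path.read_text(encoding="utf-8"))
```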

Below is a small example:

In [19]:
from collections import Counter, defaultdict

# Example small corpus
corpus = [
    "january",
    "february",
    "march",
    "april",
    "may",
    "june",
    "july",
    "august",
    "september",
    "october",
    "november",
    "december",
]

# Build vocabulary: symbol sequence of each word → frequency
# We append <|EOW|> to mark end-of-word
def build_vocab(corpus):
    vocab = Counter()
    for word in corpus:
        symbols = list(word) + ['<|EOW|>']
        vocab[tuple(symbols)] += 1
    return vocab

vocab = build_vocab(corpus)


In [20]:
def get_pair_stats(vocab):
    """Count all adjacent symbol-pair frequencies."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i+1])
            pairs[pair] += freq
    return pairs

stats = get_pair_stats(vocab)
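
One thing worth seeing on this tiny corpus: the most frequent pair is not unique. The sketch below recounts the pairs from scratch so it runs on its own (`months`, `pair_counts` and `tied` are names made up here) and lists every pair that ties for first place. Which of them gets merged first then depends on how ties are broken.

```python
from collections import Counter

months = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

pair_counts = Counter()
for word in months:
    symbols = list(word) + ["<|EOW|>"]
    pair_counts.update(zip(symbols, symbols[1:]))  # count every adjacent symbol pair

top_count = max(pair_counts.values())
tied = [pair for pair, count in pair_counts.items() if count == top_count]

print("highest pair count:", top_count)
print("pairs with that count:", tied)
```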


In [21]:
def merge_pair(pair, vocab):
    """
    Merge all occurrences of `pair` in the vocab into a single symbol.
    E.g. for pair ('j','a'), the adjacent symbols 'j', 'a' become the single symbol 'ja'
    """
    merged_vocab = {}
    bigram = pair
    replacement = ''.join(pair)
    for symbols, freq in vocab.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            # if the pair matches at this position, merge it
            if i < len(symbols)-1 and (symbols[i], symbols[i+1]) == bigram:
                new_symbols.append(replacement)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged_vocab[tuple(new_symbols)] = freq
    return merged_vocab

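# merge the top pair
# NOTE: this overwrites `vocab`, so learn_bpe in the next cell starts from the
# already-merged vocab and its merge list will not contain this first pair
# (that is why 'y<|EOW|>' shows up as a ready-made symbol in the learned merges).
best_pair = stats.most_common(1)[0][0]
vocab = merge_pair(best_pair, vocab)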


In [22]:
def learn_bpe(vocab, num_merges):
    merges = []
    for _ in range(num_merges):
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best_pair, _ = stats.most_common(1)[0]
        merges.append(best_pair)
        vocab = merge_pair(best_pair, vocab)
    return merges

# Learn, say, 10 merges on our tiny corpus:
bpe_merges = learn_bpe(vocab, num_merges=10)
print("Learned merges:", bpe_merges)


Learned merges: [('b', 'e'), ('be', 'r'), ('ber', '<|EOW|>'), ('a', 'r'), ('e', 'm'), ('em', 'ber<|EOW|>'), ('u', 'ar'), ('uar', 'y<|EOW|>'), ('j', 'u'), ('j', 'a')]


In [None]:
def apply_bpe(token, merges):
    # start from character sequence + <|EOW|>
    symbols = list(token) + ['<|EOW|>']
    # repeatedly apply merges in order
    for pair in merges:
        i = 0
        new_symbols = []
        replacement = ''.join(pair)
        while i < len(symbols):
            if i < len(symbols)-1 and (symbols[i], symbols[i+1]) == pair:
                new_symbols.append(replacement)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        symbols = new_symbols
    return symbols

# Example:
print(apply_bpe("january", bpe_merges))


['ja', 'n', 'uar', 'y', '<|EOW|>']
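
Going the other way is simple, since the pieces just concatenate back into the word. A minimal sketch (the helper name `detokenize` is made up, and it assumes <|EOW|> only ever marks the end of a word):

```python
def detokenize(tokens, eow="<|EOW|>"):
    """Glue the pieces back together and turn each end-of-word marker into a space."""
    return "".join(tokens).replace(eow, " ").strip()


print(detokenize(['ja', 'n', 'uar', 'y', '<|EOW|>']))  # january
```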


In [26]:
if __name__ == "__main__":
    corpus = [
    "january",
    "february",
    "march",
    "april",
    "may",
    "june",
    "july",
    "august",
    "september",
    "october",
    "november",
    "december",
    ]
    vocab = build_vocab(corpus)
    merges = learn_bpe(vocab, num_merges=10)
    
    for word in corpus[8:]: # print last 4 for example
        tokens = apply_bpe(word, merges)
        print(word, "→", tokens)


september → ['s', 'e', 'p', 't', 'ember<|EOW|>']
october → ['o', 'c', 't', 'o', 'ber<|EOW|>']
november → ['n', 'o', 'v', 'ember<|EOW|>']
december → ['d', 'e', 'c', 'ember<|EOW|>']


# Using ByteLevelBPETokenizer from tokenizers library

To make our life easier, we'll use ByteLevelBPETokenizer from the tokenizers library. <br/>
The main differences from the toy version above are: <br/>
- Units are bytes, not Unicode characters
- Instead of an end-of-word marker like <|EOW|>, it uses Ġ to mark the space in front of a word
- It doesn't normalize text; the leading space is consistently captured
- It's written in Rust (fast)
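
Where does Ġ come from? GPT-2 style byte-level tokenizers map each of the 256 byte values to a printable character, so that spaces and control bytes stay visible in the vocab file. The sketch below re-implements that mapping in plain Python; that the tokenizers library uses this exact same table is my assumption here, not something this notebook verifies.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable Unicode character (GPT-2 style table)."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Everything else (space, control bytes, ...) is shifted up past code point 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


mapping = bytes_to_unicode()
visible = "".join(mapping[b] for b in " hello".encode("utf-8"))
print(visible)  # Ġhello
```

The space byte (32) lands on Ġ, so a token like "Ġhello" reads as "hello with a space in front of it".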

First, we need books for text. <br/>
I'm gonna use books from Project Gutenberg (gutenberg.org), since they're licence free. <br/>
For the real one, I fed it 75,000 books. But that's like 35 GB. <br/>
So, I'm gonna be using 100 books instead. <br/>
(You can download the full version here: https://www.kaggle.com/datasets/lokeshparab/gutenberg-books-and-metadata-2025?resource=download&select=books)


## **If you're running the full dataset, set this to True**

In [1]:
isfull=True

First, we're gonna merge all books into one text file called "all_books.txt"

In [2]:
from tqdm import tqdm
import glob, os
from pathlib import Path

if isfull:
    out_dir = Path("../materials") # use this for full dataset
else:
    out_dir = Path("../materials_small")

out_dir.mkdir(parents=True, exist_ok=True)

if isfull:
    files = [f for f in glob.glob("../books/*") if os.path.isfile(f)] # use this for full dataset
else: 
    files = [f for f in glob.glob("../books_small/*") if os.path.isfile(f)]

with open(out_dir / "all_books.txt", "w", encoding="utf-8", errors="ignore") as out:
    for fn in tqdm(files, desc="Concatenating books"):
        with open(fn, "r", encoding="utf-8", errors="ignore") as f:
            out.write(f.read())
            out.write("\n")

# You'll get all_books.txt

Concatenating books: 100%|██████████| 72082/72082 [16:10<00:00, 74.29it/s] 


The real one took 17m 10s

Then use ByteLevelBPETokenizer from tokenizers to generate bpe_model-vocab.json and bpe_model-merges.txt

In [3]:
from tokenizers import ByteLevelBPETokenizer
from tqdm import tqdm
from pathlib import Path

print("Counting total lines in the file for tqdm progress bar")
if isfull:
    file_path = "../materials/all_books.txt" # use this for full dataset
else:
    file_path = "../materials_small/all_books.txt"

with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    print("Counting lines in the file...")
    total_lines = sum(1 for _ in f)

def line_iterator(path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in tqdm(f, total=total_lines, desc="Feeding lines"):
            yield line

print("start training BPE tokenizer")
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    line_iterator(file_path),
    vocab_size=60000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<|PAD|>", "<|UNKNOWN|>", "<|START|>", "<|END|>", "<|SYSTEM|>", "<|USER|>", "<|ASSISTANT|>", "<|EOT|>","<|INFOSTART|>","<|INFOEND|>"]
)
print("BPE tokenizer training complete")

out_dir = Path("../bpe") if isfull else Path("../bpe_small")  # "../bpe" is for the full dataset
out_dir.mkdir(parents=True, exist_ok=True)

# Writes bpe_model-vocab.json and bpe_model-merges.txt into out_dir
tokenizer.save_model(str(out_dir), "bpe_model")
print("BPE model saved to bpe(_small) directory")


Counting total lines in the file for tqdm progress bar
Counting lines in the file...
start training BPE tokenizer


Feeding lines: 100%|██████████| 583802534/583802534 [26:59<00:00, 360589.68it/s]


BPE tokenizer training complete
BPE model saved to bpe(_small) directory
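
A back-of-the-envelope look at what vocab_size=60000 buys us. This assumes the special tokens and the 256 base byte symbols both count toward vocab_size, which is my understanding of how the library's BPE trainer fills the vocab; treat the number as an upper bound, not a measured result.

```python
vocab_size = 60000
num_special_tokens = 10   # the entries in the special_tokens list passed to training
byte_alphabet = 256       # one base symbol per possible byte value

# Whatever room is left after specials and base bytes is available for learned merges.
max_merges = vocab_size - num_special_tokens - byte_alphabet
print(max_merges)  # 59734
```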


Ok. The small one took around 7m 30s <br/>
6 minutes for the feeding phase <br/>
the rest for the merging phase

The real one took: <br/>
<br/>
RTX3080: <br/>
total: 43m 30.6s<br/>
feeding phase: 28m 3s<br/>
merging phase: around 15m<br/>
<br/>
RTX3060: <br/>
total: around 2 hours <br/>
feeding phase: 1h 27m 15s <br/>
merging phase: around 33m <br/>

From the next one on, we're gonna be using the tokenizer I trained on the RTX3080 machine <br/>
(saving time. ⭐ Life Good ⭐)

## DONE! If you see the bpe(_small) folder with bpe_model-vocab.json and bpe_model-merges.txt in it, plus materials(_small)/all_books.txt, you're good to go
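
If you'd rather not eyeball the folders, here is a small sketch that checks for the expected files. The helper `missing_outputs` is made up for this example, and it assumes the same relative paths and the "bpe_model" prefix used in the cells above.

```python
from pathlib import Path


def missing_outputs(isfull, root=Path("..")):
    """Return the expected output files that do not exist yet (empty list means all good)."""
    suffix = "" if isfull else "_small"
    expected = [
        root / f"bpe{suffix}" / "bpe_model-vocab.json",
        root / f"bpe{suffix}" / "bpe_model-merges.txt",
        root / f"materials{suffix}" / "all_books.txt",
    ]
    return [path for path in expected if not path.is_file()]


missing = missing_outputs(isfull=True)
print("good to go" if not missing else f"missing: {[str(p) for p in missing]}")
```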