<h1>Tokenization in Modern Language Models</h1>

<h3>Overview</h3>

This post introduces two prominent tokenization methods, Byte Pair Encoding (BPE) and SentencePiece, which are widely used in state-of-the-art (SOTA) large language models such as the GPT series, ALBERT, and T5.

First, I give an overview of traditional tokenization techniques and their limitations, and explain how BPE and SentencePiece overcome them. Next, I cover the concepts behind BPE and SentencePiece so you can see how they work. Finally, I walk through a simple example of using them in Python. To be clear, we won't code them from scratch; we'll rely on existing libraries instead.

For those interested in exploring further, the SentencePiece paper can be found <a href='https://arxiv.org/abs/1808.06226'>here</a>.

<h3>Tokenization in a General Sense</h3>


Tokenization is the process of breaking text down into a sequence of tokens. For example, the sentence "The sky is blue" is broken down into ['The', 'sky', 'is', 'blue']. A tokenizer is the tool that implements this tokenization process. It is important to note that the correct term is 'token' rather than 'word,' since tokens may not always correspond to clean English (or any other language) words, such as 'sky' or 'blue'. Instead, you might encounter tokens like '##ve' or 'lo', which may seem meaningless but are quite useful for language models and other NLP applications.
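As a minimal illustration, plain whitespace splitting reproduces the example above. The function name `whitespace_tokenize` is hypothetical, and real tokenizers also deal with punctuation, casing, and subword units.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace: the simplest possible tokenizer."""
    return text.split()


tokens = whitespace_tokenize("The sky is blue")
print(tokens)  # ['The', 'sky', 'is', 'blue']
```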

The tokenization process often overlaps with preprocessing tasks like converting all tokens to lowercase, lemmatization or stemming, and removing stop words, among others. As a result of tokenization, one can create a vocabulary, which is the set of all possible tokens in a text corpus. In addition, we will have a tool—the tokenizer—that can break any given text into a sequence of tokens from the vocabulary.

One immediate question is how to tokenize a sentence containing tokens that are not in the vocabulary. A common heuristic is to replace unseen tokens with a predefined symbol such as &lt;UNKNOWN&gt;. However, this discards whatever information those tokens carried, which can hurt the performance of NLP models.
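The sketch below shows this heuristic on a toy corpus. The helper names `build_vocab` and `tokenize_with_unk` and the two-sentence corpus are assumptions made for illustration, and tokens are simply split on whitespace.

```python
UNK = "<UNKNOWN>"  # placeholder symbol for out-of-vocabulary tokens


def build_vocab(corpus):
    """Collect the set of whitespace tokens seen in a list of sentences."""
    vocab = set()
    for sentence in corpus:
        vocab.update(sentence.split())
    return vocab


def tokenize_with_unk(text, vocab):
    """Replace any token missing from the vocabulary with the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in text.split()]


corpus = ["The sky is blue", "The sea is deep"]
vocab = build_vocab(corpus)
print(tokenize_with_unk("The sky is turquoise", vocab))
# ['The', 'sky', 'is', '<UNKNOWN>']
```

Note that the word "turquoise" is lost entirely: the model only sees that something unknown was there.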

More recently proposed tokenization methods address this limitation by splitting out-of-vocabulary words into smaller subword units that are in the vocabulary, so no unknown token has to be removed or replaced.
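To make the idea concrete, here is a simplified sketch that breaks an unseen word into known pieces using greedy longest-match, falling back to single characters. This is not BPE or SentencePiece itself; the `segment` function and the small `subwords` set are hypothetical, and the sketch assumes every single character counts as a valid piece.

```python
def segment(word, subwords):
    """Greedy longest-match segmentation with a single-character fallback."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a known piece matches.
        for j in range(len(word), i, -1):
            # A single character (j == i + 1) is always accepted as a fallback.
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


subwords = {"token", "ization", "iz", "un", "s"}
print(segment("tokenization", subwords))  # ['token', 'ization']
print(segment("untokenized", subwords))   # ['un', 'token', 'iz', 'e', 'd']
```

Every input can be segmented this way, so nothing has to be mapped to an unknown symbol.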

<h3>Byte Pair Encoding</h3>

<h3>SentencePiece Tokenization</h3>
Regularization During Training: Subword regularization introduces randomness in the tokenization process during training. Unlike conventional static tokenization where a text is always tokenized the same way, subword regularization randomly chooses from multiple valid subword segmentations for the same sentence.
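The toy sketch below illustrates the sampling idea in plain Python. It enumerates every valid segmentation of a word and samples one with probability proportional to its unigram probability raised to a smoothing exponent `alpha`. The vocabulary, its probabilities, and the function names are all hypothetical, and brute-force enumeration is only practical for short strings.

```python
import random


def all_segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in all_segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(word, vocab, alpha=1.0, rng=random):
    """Sample one segmentation with probability proportional to P(seg) ** alpha."""
    candidates = all_segmentations(word, vocab)
    weights = []
    for seg in candidates:
        prob = 1.0
        for piece in seg:
            prob *= vocab[piece]  # unigram model: pieces are independent
        weights.append(prob ** alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical unigram probabilities, chosen only for illustration.
vocab = {"h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
         "he": 0.1, "ll": 0.1, "lo": 0.1, "hell": 0.15, "hello": 0.35}

rng = random.Random(0)
for _ in range(3):
    # The same word may be segmented differently on each call.
    print(sample_segmentation("hello", vocab, alpha=0.5, rng=rng))
```

A smaller `alpha` flattens the distribution and yields more varied segmentations, while a larger one concentrates on the most likely segmentation. In the `sentencepiece` Python package, comparable behaviour is available through the sampling options of `encode` (`enable_sampling`, `alpha`, `nbest_size`).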

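import os

import sentencepiece as spm
from torchtext.datasets import PennTreebank

# Load the three Penn Treebank splits (train, validation, test).
train_data, valid_data, test_data = PennTreebank()

# Make sure the output directories exist before writing to them.
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)

# Write every sentence to a single text file, one sentence per line.
with open('data/PennTreeBank.txt', 'w', encoding='utf-8') as f:
    for dataset in [train_data, valid_data, test_data]:
        for sentence in dataset:
            # Depending on the torchtext version, items may be raw strings or token lists.
            line = sentence if isinstance(sentence, str) else ' '.join(sentence)
            f.write(line.strip() + '\n')

# Train a SentencePiece model on the combined corpus.
spm.SentencePieceTrainer.Train('--input=data/PennTreeBank.txt --model_prefix=models/SentencePiecePennTree --vocab_size=67')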
