<h1>Tokenization in Modern Language Models</h1>

<h3>Overview</h3>

This post introduces two prominent tokenization methods, Byte Pair Encoding (BPE) and SentencePiece, which are commonly used in state-of-the-art (SOTA) large language models such as the GPT series, ALBERT, and T5.

First, I will review traditional tokenization techniques and their limitations, and explain how the more recent methods of BPE and SentencePiece overcome them. Next, I will cover the concepts behind BPE and SentencePiece so that you understand how they work. Finally, I'll give a simple example of how to use these methods in Python. To be clear, we won't code them from scratch; we will rely on existing libraries instead.

For those interested in exploring further, the SentencePiece paper can be found <a href='https://arxiv.org/abs/1808.06226'>here</a>.

<h3>Tokenization in a General Sense</h3>


Tokenization is the process of breaking text down into a sequence of tokens. For example, the sentence "The sky is blue" is broken down into ['The', 'sky', 'is', 'blue']. A tokenizer is the tool that implements this tokenization process. Please note that the commonly used term is 'token' rather than 'word,' since tokens may not always correspond to clean English (or any other language) words, such as 'sky' or 'blue'. Instead, you might encounter tokens like '##ve' or 'lo', which may seem meaningless but are quite useful for language models and other NLP applications.

The tokenization process often overlaps with preprocessing tasks like converting all tokens to lowercase, lemmatization or stemming, and removing stop words, among others. As a result of tokenization, one can create a vocabulary, which is the set of all possible tokens in a text corpus. In addition, we will have a tool—the tokenizer—that can break any given text into a sequence of tokens from the vocabulary.



One trivial way to tokenize a text is to split it into its constituent words on the spaces between them:

In [1]:
text = 'The sky is blue'
tokenized_text = text.split(' ')
print(tokenized_text)

['The', 'sky', 'is', 'blue']


Obviously, splitting a text on spaces alone is not optimal, since a text usually includes punctuation, words come in different forms, and so on. For example, given the sentence 'The fact is: the sky has not been blue in the past three days,' should we treat 'is' and 'been' as the same token? Should we convert all letters to lowercase (or uppercase)? Should we remove the ':' after the word 'is,' or keep it? These are some of the questions we should answer before tokenizing our texts. Additionally, do not forget that after designing our tokenizer, we should go over all the texts in our corpus and tokenize them to build our vocabulary, which is the set of all tokens in our corpus.
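The questions above can be made concrete with a small sketch. It assumes one possible set of answers: lowercase everything and keep each punctuation mark as its own token. The helper names `simple_tokenize` and `build_vocabulary` are illustrative and do not come from any library.

```python
import re

def simple_tokenize(text, lowercase=True):
    """Split text into word tokens and single-character punctuation tokens."""
    if lowercase:
        text = text.lower()
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocabulary(corpus):
    """Collect every distinct token that appears in the corpus."""
    vocabulary = set()
    for text in corpus:
        vocabulary.update(simple_tokenize(text))
    return sorted(vocabulary)

corpus = [
    "The fact is: the sky has not been blue in the past three days.",
    "The sky is blue",
]
tokens = simple_tokenize(corpus[0])
vocabulary = build_vocabulary(corpus)
print(tokens)
print(vocabulary)
```

Different answers to the same questions (for example, keeping the original case) would yield a different vocabulary from the same corpus, which is why these decisions need to be fixed before the vocabulary is built.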

A few libraries can tokenize text better than simple splitting. Below, I give two examples: one using spaCy, and the other using NLTK.

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Hello, world! The sky is blue."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)



['Hello', ',', 'world', '!', 'The', 'sky', 'is', 'blue', '.']


If spaCy prints a version warning when loading the model, you can ignore it. In my case it simply explains that I am using spaCy version 3.6.1, while the model 'en_core_web_sm' was trained using spaCy version 3.0.0.

The model 'en_core_web_sm' is a versatile pre-trained model that can perform various NLP tasks beyond tokenization. Some of these tasks include Part-of-Speech Tagging, Named Entity Recognition, and Dependency Parsing, to name just three.

Here is another example of tokenization using NLTK:

In [3]:
import nltk
from nltk.tokenize import word_tokenize
text = "Hello, world! The sun is shining."
tokens = word_tokenize(text)
print(tokens)

['Hello', ',', 'world', '!', 'The', 'sun', 'is', 'shining', '.']


You may need to download the Punkt tokenizer models first using the following line of code (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
<i><b>nltk.download('punkt')</b></i>

<h3>How about out-of-vocabulary tokens?</h3>

One immediate question that may arise is how to handle a sentence containing tokens not found in the vocabulary. How should we tokenize it? A common heuristic solution is to replace unseen or unknown tokens with a pre-defined symbol like <UNKNOWN>. However, this approach might overlook tokens that carry significant information, potentially affecting the performance of NLP models. Additionally, dealing with punctuation and special characters often relies on heuristic methods and may vary according to the preferences or standards set by individual developers.
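A minimal sketch of this heuristic is shown below, assuming a toy vocabulary built from the sentence 'The sky is blue'. The names `build_token_ids` and `encode` are hypothetical helpers written for illustration.

```python
UNKNOWN = "<UNKNOWN>"

def build_token_ids(vocabulary):
    """Map each token to an integer id, reserving id 0 for unknown tokens."""
    token_to_id = {UNKNOWN: 0}
    for token in sorted(vocabulary):
        token_to_id[token] = len(token_to_id)
    return token_to_id

def encode(tokens, token_to_id):
    """Replace every token that is not in the vocabulary with the unknown id."""
    return [token_to_id.get(token, token_to_id[UNKNOWN]) for token in tokens]

vocabulary = {"The", "sky", "is", "blue"}
token_to_id = build_token_ids(vocabulary)
ids = encode("The sea is turquoise".split(" "), token_to_id)
print(ids)  # → [1, 0, 3, 0]
```

Both 'sea' and 'turquoise' collapse to the same id, so a model that sees only the ids cannot tell them apart. This is the information loss described above.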

More recently proposed tokenization methods aim to address these limitations by handling out-of-vocabulary words and special tokens or characters in a more standardized and consistent manner.

In what follows, I will delve into BPE and SentencePiece, two tokenization techniques that are widely employed with state-of-the-art language models.

<h3>Byte Pair Encoding</h3>

<h3>SentencePiece Tokenization</h3>
Regularization During Training: Subword regularization introduces randomness in the tokenization process during training. Unlike conventional static tokenization where a text is always tokenized the same way, subword regularization randomly chooses from multiple valid subword segmentations for the same sentence.
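The idea can be illustrated with a toy sketch in pure Python. This is not how the SentencePiece library implements it. The piece scores in `PIECE_LOGPROB` are made-up values, and the functions `segmentations` and `sample_segmentation` are hypothetical helpers. The sketch enumerates every valid segmentation of a word and samples one with probability proportional to its unigram probability raised to a smoothing power `alpha`.

```python
import math
import random

# Made-up unigram log-probabilities for a handful of subword pieces (assumption).
PIECE_LOGPROB = {
    "un": -2.0, "related": -3.0, "relat": -4.0, "ed": -2.5,
    "u": -5.0, "n": -5.0, "unrelated": -6.0,
}

def segmentations(word):
    """Enumerate every way to split `word` into pieces found in PIECE_LOGPROB."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in PIECE_LOGPROB:
            for rest in segmentations(word[end:]):
                results.append([piece] + rest)
    return results

def sample_segmentation(word, alpha=0.5, rng=random):
    """Sample one segmentation with weight exp(alpha * log-probability)."""
    candidates = segmentations(word)
    scores = [sum(PIECE_LOGPROB[p] for p in seg) for seg in candidates]
    # A smaller alpha flattens the distribution; a larger alpha favors the best split.
    weights = [math.exp(alpha * score) for score in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

candidates = segmentations("unrelated")
rng = random.Random(0)
samples = [sample_segmentation("unrelated", rng=rng) for _ in range(5)]
print(candidates)
print(samples)
```

Every sampled segmentation spells the same word, so the model sees several different token sequences for identical text during training.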

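In [None]:
import os
from torchtext.datasets import PennTreebank
import sentencepiece as spm

# Create the output directories if they do not exist yet
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)

train_data, valid_data, test_data = PennTreebank()

# Write every split into one plain-text file, one sentence per line
with open('data/PennTreeBank.txt', 'w') as f:
    for dataset in [train_data, valid_data, test_data]:
        for sentence in dataset:
            # torchtext may yield each sentence as a raw string; join only if it is a token list
            line = sentence if isinstance(sentence, str) else ' '.join(sentence)
            f.write(line.strip() + '\n')

# Train a SentencePiece model on the combined file
spm.SentencePieceTrainer.Train('--input=data/PennTreeBank.txt --model_prefix=models/SentencePiecePennTree --vocab_size=67')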
