# Tokenization: The Power of Byte Pair Encoding (BPE)

Tokenization is the step that breaks raw text into smaller units (tokens) that a model can process. This notebook focuses on Byte Pair Encoding (BPE), a subword tokenization method used by many Transformer models.

## Why Use Tokenizers?

Text is complex, but tokenizers simplify it by splitting it into smaller parts. They're crucial because they:

1. **Prepare Text:** Take text like "The cat jumps" and turn it into tokens: ["The", "cat", "jumps"]. 
2. **Manage Words:** Handle words effectively, like breaking down "unpredictability" into smaller parts for easier understanding: ["un", "p", "red", "ict", "ability"].
3. **Create Features:** Tokens become the features that machines use to understand text, like identifying common phrases or terms.
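
As a minimal sketch of the first and third points, the snippet below splits a sentence on whitespace and maps each token to an integer ID. The whitespace split and the `build_vocab` helper are illustrative assumptions; they are not the tokenizer used later in this notebook.

```python
def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


text = "The cat jumps over the cat"
tokens = text.split()                     # naive whitespace tokenization
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]    # the numeric features a model would see

print(tokens)
print(token_ids)
```

Note that "The" and "the" receive different IDs here, which is one of the reasons real tokenizers work on smaller subword units instead of whole words.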


## (Optional) Mount folder in Colab

Remove the triple quotes around the code in the following cell to mount your Google Drive if you are running this notebook in Google Colab:

In [None]:
# Use the following lines if you want to use Google Colab
# We presume you created a folder "i2dl" within your main drive folder, and put the exercise there.
# NOTE: terminate all other colab sessions that use GPU!
# NOTE 2: Make sure the correct exercise folder (e.g. exercise_11) is given.

"""
from google.colab import drive
import os

gdrive_path='/content/gdrive/MyDrive/i2dl/exercise_11'

# This will mount your google drive under 'MyDrive'
drive.mount('/content/gdrive', force_remount=True)
# In order to access the files in this notebook we have to navigate to the correct folder
os.chdir(gdrive_path)
# Check manually if all files are present
print(sorted(os.listdir()))
"""

### Install new packages

In [None]:
!python -m pip install transformers
!python -m pip install tokenizers

## 0. Setup

First, let's download the required datasets as well as the pretrained models!

In [None]:
from exercise_code.util.download_util import download_pretrainedModels, download_datasets

download_datasets()
download_pretrainedModels()

And now, we can import all of the required packages and get started on this notebook.

In [None]:
from exercise_code.data.BytePairTokenizer import *
from tokenizers import Tokenizer
import os

%load_ext autoreload
%autoreload 2

In [None]:
root_path = os.path.dirname(os.path.abspath(os.getcwd()))
model_path = os.path.join(os.getcwd(), 'models')
pretrained_model_path = os.path.join(model_path, 'pretrainedModels')
dataset_path = os.path.join(root_path, 'datasets', 'transformerDatasets')

## 1. Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) was originally developed as a text compression algorithm and was later adopted by OpenAI for tokenization in its GPT model. It remains a foundational technique used in many Transformer models such as GPT, GPT-2, RoBERTa, BART, and DeBERTa. BPE builds its tokens by repeatedly merging the most frequent pairs of adjacent symbols, starting from single characters. For example:

In [None]:
file_path = os.path.join(pretrained_model_path, 'pretrained_tokenizer')
tokenizer = Tokenizer.from_file(file_path)

sentence = "Hi, Introduction to Deep Learning is class IN2346!"
encodings = tokenizer.encode(sentence)
tokens = encodings.tokens
print(tokens)

Note: The character Ġ is used to mark the location of whitespaces.
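
To see where Ġ comes from, here is a sketch of the byte-to-character table published with GPT-2. We assume the pretrained tokenizer follows the same convention; the variable names are our own. Every byte gets a printable stand-in character, and the space byte (32) ends up as Ġ.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character.

    Bytes that are already printable keep their own character; the rest
    (including space and newline) are shifted to code points from 256 upwards.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

text = "Deep Learning"
visible = "".join(byte_encoder[b] for b in text.encode("utf-8"))
restored = bytes(byte_decoder[ch] for ch in visible).decode("utf-8")

print(visible)    # the space shows up as Ġ
print(restored)
```

Because the table is a one-to-one mapping, it can be inverted to recover the original text, which is the job of a byte-level decoder.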

From here we can convert the individual tokens into a list of IDs that we can feed into a model:

In [None]:
token_ids = encodings.ids
print(token_ids)

And we can also go back to the original sentence:

In [None]:
tokenizer.decode(token_ids)

### 1.1 Training Algorithm
Let's create our own BPE tokenizer from scratch! You can see the entire implementation in <code>exercise_code/data/BytePairTokenizer.py</code>. Note: While these algorithms are often called training algorithms, they usually do not perform training as we've seen it so far, i.e. by minimizing a loss function. It's really more of an algorithm that creates the individual tokens step by step. With that said, let's dive in!

BPE training starts by computing the unique set of words used in the corpus, then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:

In [None]:
words = ["hug", "pug", "pun", "bun", "hugs"]

base_vocabulary = create_alphabet_from_list(words)
print(base_vocabulary)

For real-world cases, that base vocabulary will contain all the ASCII characters, at the very least, and probably some Unicode characters as well. If a character that was not in the training corpus is passed on to the tokenizer, that character will be converted to the unknown token. That’s one reason why lots of NLP models are very bad at analyzing content with emojis, for instance.
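
The following sketch illustrates both ideas on the toy corpus: collecting the base alphabet and replacing unseen characters with an unknown token. `build_alphabet`, `to_base_tokens` and the `"[UNK]"` string are hypothetical names chosen for this example; the helper in <code>exercise_code</code> may be written differently.

```python
def build_alphabet(words):
    """Collect the sorted set of characters that appear in the given words."""
    return sorted({ch for word in words for ch in word})


def to_base_tokens(word, alphabet, unk_token="[UNK]"):
    """Split a word into characters, replacing unseen characters with unk_token."""
    known = set(alphabet)
    return [ch if ch in known else unk_token for ch in word]


alphabet = build_alphabet(["hug", "pug", "pun", "bun", "hugs"])
print(alphabet)

# The emoji never occurred in the corpus, so it becomes the unknown token.
print(to_base_tokens("hug🙂", alphabet))
```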

The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.
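
A quick way to convince yourself that 256 base symbols are enough is to look at the UTF-8 bytes of a few strings. This is only a sketch of the idea using Python's built-in `str.encode`; `to_byte_tokens` is a hypothetical helper, not part of any tokenizer library.

```python
def to_byte_tokens(text):
    """Return the UTF-8 byte values of a string; each value lies in 0..255."""
    return list(text.encode("utf-8"))


# ASCII letters take one byte each, other characters take several bytes,
# but every value still fits into the 256-entry base vocabulary.
for sample in ["hug", "ü", "🙂"]:
    print(sample, to_byte_tokens(sample))
```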

After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.

At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by “pair,” we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.
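
The search for the most frequent pair could be sketched as follows. `count_pairs` is a hypothetical stand-in, not necessarily how `compute_pair_freq` in <code>exercise_code</code> is written, and the word counts are assumed for illustration.

```python
from collections import Counter


def count_pairs(word_freq, splits):
    """Count adjacent token pairs, weighting each word by its corpus frequency."""
    pair_counts = Counter()
    for word, freq in word_freq.items():
        tokens = splits[word]
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts


word_freq = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freq}   # start from single characters

pair_counts = count_pairs(word_freq, splits)
best_pair = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair

print(dict(pair_counts))
print(best_pair, pair_counts[best_pair])
```

Pairs are only counted inside a word, never across word boundaries, which is why the corpus is first reduced to a dictionary of words and their counts.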

Going back to our previous example, let’s assume the words in our corpus had the following frequencies:

In [None]:
word_freq = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

Meaning "hug" was present 10 times in the corpus, "pug" 5 times, "pun" 12 times, "bun" 4 times, and "hugs" 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:

In [None]:
splits = create_splits(word_freq.keys())
print('Words split into characters: {}'.format(splits))

Then we look at pairs. The pair ("h", "u") is present in the words "hug" and "hugs", so 15 times total in the corpus:

In [None]:
pair_freq = compute_pair_freq(word_freq, splits)

print('Pair frequencies in corpus: {}'.format(dict(pair_freq)))

It’s not the most frequent pair, though: that honor belongs to ("u", "g"), which is present in "hug", "pug", and "hugs", for a grand total of 20 times in the corpus:

In [None]:
best_pair = compute_best_pair(pair_freq)

print("The best pair is: {}".format(best_pair))

Thus, the first merge rule learned by the tokenizer is ("u", "g") -> "ug", which means that "ug" will be added to the vocabulary, and the pair should be merged in all the words of the corpus. At the end of this stage, the vocabulary and corpus look like this:

In [None]:
merges = {}
splits = merge_pair(word_freq, *best_pair, splits)
merges[best_pair] = best_pair[0] + best_pair[1]
base_vocabulary.append(best_pair[0] + best_pair[1])

print('The new splits are: {}'.format(splits))
print('New vocabulary: {}'.format(base_vocabulary))
print('Dictionary with all merges: {}'.format(merges))

Now we have some pairs that result in a token longer than two characters: the pair ("h", "ug"), for instance (present 15 times in the corpus). However, the most frequent pair at this stage is ("u", "n"), present 16 times in the corpus, so the second merge rule learned is ("u", "n") -> "un".
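
The merge step itself can be sketched like this. `apply_merge` is a hypothetical helper written for this example and may differ from `merge_pair` in <code>exercise_code</code>.

```python
def apply_merge(left, right, splits):
    """Replace every adjacent (left, right) pair with the merged token left + right."""
    merged_splits = {}
    for word, tokens in splits.items():
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2          # skip both halves of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_splits[word] = merged
    return merged_splits


splits = {word: list(word) for word in ["hug", "pug", "pun", "bun", "hugs"]}
splits = apply_merge("u", "g", splits)
print(splits)
```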

In [None]:
pair_freq = compute_pair_freq(word_freq, splits)
best_pair = compute_best_pair(pair_freq)

print('Pair frequencies in corpus: {}'.format(dict(pair_freq)))
print("The best pair is: {}".format(best_pair))

Adding that to the vocabulary and merging all existing occurrences leads us to:

In [None]:
splits = merge_pair(word_freq, *best_pair, splits)
merges[best_pair] = best_pair[0] + best_pair[1]
base_vocabulary.append(best_pair[0] + best_pair[1])

print('The new splits are: {}'.format(splits))
print('New vocabulary: {}'.format(base_vocabulary))
print('Dictionary with all merges: {}'.format(merges))

Again, let's compute the most frequent pair:

In [None]:
pair_freq = compute_pair_freq(word_freq, splits)
best_pair = compute_best_pair(pair_freq)

print('Pair frequencies in corpus: {}'.format(dict(pair_freq)))
print("The best pair is: {}".format(best_pair))

Now the most frequent pair is ("h", "ug"), so we learn the merge rule ("h", "ug") -> "hug", which gives us our first three-letter token. After the merge, the corpus looks like this:

In [None]:
splits = merge_pair(word_freq, *best_pair, splits)
merges[best_pair] = best_pair[0] + best_pair[1]
base_vocabulary.append(best_pair[0] + best_pair[1])

print('The new splits are: {}'.format(splits))
print('New vocabulary: {}'.format(base_vocabulary))
print('Dictionary with all merges: {}'.format(merges))

And we continue like this until we reach the desired vocabulary size.
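
Putting the steps together, the whole loop might look like the sketch below. It is a simplified illustration under our own assumptions (function names, a fixed number of merges instead of a target vocabulary size), not the implementation in <code>exercise_code</code>. The final vocabulary size is the size of the base alphabet plus the number of learned merges.

```python
from collections import Counter


def merge_tokens(tokens, pair):
    """Merge each adjacent occurrence of pair in a token list."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(word_freq, num_merges):
    """Learn up to num_merges merge rules from a {word: count} dictionary."""
    splits = {word: list(word) for word in word_freq}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freq.items():
            tokens = splits[word]
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break               # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        splits = {w: merge_tokens(t, best) for w, t in splits.items()}
    return merges


def encode_word(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_tokens(tokens, pair)
    return tokens


word_freq = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(word_freq, num_merges=3)
print(merges)
print(encode_word("bug", merges))    # a word that was not in the corpus
print(encode_word("hugs", merges))
```

Tokenizing a new word means replaying the merges in the order they were learned, so "bug" can still be split into known pieces even though it never appeared during training.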

Feel free to have a look at the tokenizer implementation in BytePairTokenizer! Note that we will be using a different implementation from Hugging Face for the following notebooks of this exercise. It is implemented in Rust and is a lot faster than this Python code, but the algorithm remains the same! In fact, let's train one right now!

First we have to initialize our model as a Byte Pair Encoder:

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())

Next we have to initialize our Trainer. We will have one special token - '<[EOS]>' - which will mark the beginning and the end of a sentence and will also be used for padding, more on that later.

In [None]:
from tokenizers import trainers
vocab_size = 300
eos_token = '<[EOS]>'
trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=[eos_token])

Next, we have to define our Pretokenizer, which splits the sentences into individual words. We will in fact be using a sequence of predefined models:

1. **ByteLevel**: Replaces all whitespaces with the special character Ġ and splits the sentences into words
2. **Digits**: Splits all sequences of digits into individual digits. That way we don't waste any vocabulary entries on frequently occurring numbers
3. **Punctuation**: Splits sentences at punctuation marks
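
To get a feeling for the Digits and Punctuation steps, here is a rough approximation using only Python's `re` module. It is a sketch under our own assumptions: `simple_pre_tokenize` is hypothetical, it drops whitespace instead of marking it with Ġ, and it does not reproduce the exact behavior of the library's pre-tokenizers.

```python
import re

# One token per digit, per run of letters, or per punctuation mark.
# Whitespace and underscores are dropped in this simplified version.
TOKEN_PATTERN = re.compile(r"\d|[^\W\d_]+|[^\w\s]")


def simple_pre_tokenize(text):
    """Split text into letter runs, single digits and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


print(simple_pre_tokenize("Hi, class IN2346!"))
```

Splitting numbers into single digits means that only the ten digit symbols need a place in the vocabulary, no matter which numbers occur in the text.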

In [None]:
from tokenizers.pre_tokenizers import ByteLevel, Digits, Sequence, Punctuation
pre_tokenizer = Sequence([ByteLevel(add_prefix_space=False), Digits(individual_digits=True), Punctuation()]) 
output = pre_tokenizer.pre_tokenize_str(sentence)
print(output)

In [None]:
tokenizer.pre_tokenizer = pre_tokenizer

And finally we can declare the list of files to train on and start the actual training process. Depending on the size of the vocabulary and your hardware, this might take a couple of minutes.

Note: If this doesn't work or takes way too long (>10 min), don't worry about it! Just read through the following cells and try to understand what is happening! We have a pretrained version of this tokenizer that you can use in the following exercises!

In [None]:
files = [os.path.join(dataset_path, 'europarlOpusDatasets', 'corpus_english.txt'),
         os.path.join(dataset_path, 'europarlOpusDatasets', 'corpus_german.txt')]
tokenizer.train(files, trainer)

Alright, training is done and the last thing we have to do is define the template of our output. We want each token sequence to start and end with an end-of-sequence token. We will discuss why later in the actual transformer notebook.
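
Conceptually, the template simply wraps every sequence of IDs in the end-of-sequence token. A minimal sketch of that idea in plain Python, where `add_eos` and the ID values are made up for illustration:

```python
def add_eos(token_ids, eos_id):
    """Return a new list with eos_id placed before and after the sequence."""
    return [eos_id] + list(token_ids) + [eos_id]


eos_id = 0                      # assumed ID; the real one comes from the vocabulary
print(add_eos([17, 42, 5], eos_id))
```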

In [None]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single=eos_token + " $0 " + eos_token,
    pair=None,
    special_tokens=[
        (eos_token, tokenizer.token_to_id(eos_token))
    ],
)

Now let's test it!

In [None]:
output = tokenizer.encode(sentence)
print(output.tokens)

In [None]:
print(output.ids)

Let's try to decode it!

In [None]:
tokenizer.decode(output.ids)

Oops, something doesn't look right... That's because we still have to configure the decoder! Otherwise it doesn't know what to do with the Ġ character!

In [None]:
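# Import under an alias so we don't shadow the ByteLevel pre-tokenizer imported earlier
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
tokenizer.decoder = ByteLevelDecoder()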

Let's try that again:

In [None]:
tokenizer.decode(output.ids)

Perfect, everything seems to work! Let's save this model and reuse it later in the transformer!

In [None]:
file_path = os.path.join(model_path, "custom_tokenizer")
tokenizer.save(file_path)

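So far, our tokenizer is a plain <code>tokenizers.Tokenizer</code> object. To use it through the <code>transformers</code> API (for example with the <code>eos_token</code> and <code>pad_token</code> settings below), we load it as a Fast Tokenizer, which wraps the same Rust-backed implementation!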

In [None]:
from transformers import PreTrainedTokenizerFast

tokenizer_fast = PreTrainedTokenizerFast(
    tokenizer_file=file_path,
    # tokenizer_object=tokenizer, # This also works!
    eos_token=eos_token,
    pad_token=eos_token
)

In [None]:
output = tokenizer_fast.encode(sentence)
print(output)

In [None]:
tokenizer_fast.decode(output, skip_special_tokens=True)

Perfect, that's all you have to know about tokenizers for now! We will now have a closer look at the dataset and dataloader used in the last exercise! So see you in <code>2_dataset_and_collator.ipynb</code>!