## Summarizing different techniques in Tokenization

Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. In this notebook, I consider the three main types of tokenizers used in Transformers: **Byte-Pair Encoding (BPE)**, **WordPiece**, and **SentencePiece**. The examples below show how a common set of sentences is tokenized by each method, and each description notes which models use that tokenizer type.
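
As a minimal sketch of the look-up step, the snippet below maps tokens to ids with a plain dictionary. The vocabulary `toy_vocab` and the helper `tokens_to_ids` are hypothetical names made up for illustration; real tokenizers ship vocabularies with tens of thousands of entries.

```python
# Toy vocabulary (an assumption for illustration, not taken from any real model).
toy_vocab = {"[UNK]": 0, "the": 1, "air": 2, "##craft": 3, "flies": 4}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

ids = tokens_to_ids(["the", "air", "##craft", "flies", "higher"], toy_vocab)
print(ids)  # [1, 2, 3, 4, 0]  ("higher" is not in the toy vocabulary)
```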

**Byte-Pair Encoding (BPE)**: BPE is a subword tokenization method that iteratively merges the most frequent pair of adjacent characters or subwords in the training corpus to form a new token. Merging continues until the desired vocabulary size is reached.
It is commonly used in models like GPT-2, GPT-3, and RoBERTa.
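
The merge loop can be sketched in a few lines of plain Python. This is a simplified illustration, assuming a toy word-frequency table and character-level starting symbols; `learn_bpe_merges` is a hypothetical helper, and production tokenizers (for example the byte-level BPE in GPT-2) add pre-tokenization and byte handling that are omitted here.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: count} mapping."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Toy word frequencies (made up for illustration).
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges the word "hugs" is represented as the two symbols `hug` and `s`, which shows how frequent fragments become single tokens while rarer endings stay separate.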

**WordPiece**: WordPiece tokenization, popularized by BERT, works similarly to BPE but uses a different criterion to decide which subword pairs to merge: it picks the pair that most increases the likelihood of the training data rather than the most frequent pair. It is commonly used in models like BERT, DistilBERT, and ELECTRA.
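
The difference between the two selection criteria can be sketched on a toy corpus. The score used below, the pair count divided by the product of the two individual symbol counts, follows the description in the Hugging Face tokenizer summary; it is an approximation for illustration rather than the exact training procedure behind BERT's vocabulary. The helper names are hypothetical, and the `##` continuation prefix is left out to keep the sketch short.

```python
from collections import Counter

def pair_statistics(word_counts):
    """Count single symbols and adjacent symbol pairs, weighted by word frequency."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, count in word_counts.items():
        symbols = list(word)
        for symbol in symbols:
            symbol_counts[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += count
    return symbol_counts, pair_counts

def best_pair_by_frequency(word_counts):
    """BPE-style choice: the most frequent adjacent pair."""
    _, pair_counts = pair_statistics(word_counts)
    return max(pair_counts, key=pair_counts.get)

def best_pair_by_wordpiece_score(word_counts):
    """WordPiece-style choice: count(pair) / (count(first) * count(second))."""
    symbol_counts, pair_counts = pair_statistics(word_counts)

    def score(pair):
        first, second = pair
        return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

    return max(pair_counts, key=score)

# Toy word frequencies (made up for illustration).
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(best_pair_by_frequency(toy_counts))        # ('u', 'g')
print(best_pair_by_wordpiece_score(toy_counts))  # ('g', 's')
```

The pair `('u', 'g')` is the most frequent, but `u` is common everywhere, so its score is diluted; `('g', 's')` wins under the likelihood-style score because `s` almost never appears without a preceding `g` in this corpus.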


**SentencePiece**: SentencePiece is an unsupervised text tokenizer and detokenizer mainly used for neural network-based text generation systems. It treats the input as a raw sequence of Unicode characters, with whitespace handled as an ordinary symbol, so it does not depend on language-specific pre-tokenization. It is commonly used in models like T5, MarianMT, and Mistral-7B.
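
The whitespace handling can be illustrated without the library itself. The sketch below only mimics the idea of encoding spaces as the `▁` meta symbol so that detokenization is a simple concatenation; `to_symbols` and `detokenize` are hypothetical helpers, and the real SentencePiece library additionally normalizes the text and learns a subword vocabulary on top of these symbols.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" meta symbol that stands in for a space

def to_symbols(text):
    """Replace spaces with the meta symbol and split the text into characters."""
    return list(text.replace(" ", WORD_BOUNDARY))

def detokenize(pieces):
    """Concatenate pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ")

print(to_symbols("the ice"))
# ['t', 'h', 'e', '▁', 'i', 'c', 'e']

# Any segmentation of the symbol stream decodes back to the original text.
pieces = ["When", "▁the", "▁ice", "▁melt", "s"]
print(detokenize(pieces))  # When the ice melts
```

Because the space is part of the symbol stream, no language-specific rule is needed to decide where words begin when decoding.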

Reference:
https://huggingface.co/docs/transformers/main/en/tokenizer_summary

In [39]:
%pip install transformers sentencepiece bertviz matplotlib torch




In [40]:
from transformers import GPT2Tokenizer, BertTokenizer, T5Tokenizer
import torch
import matplotlib.pyplot as plt
from bertviz import head_view, model_view

# Loading pre-trained BPE tokenizer
bpe_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token by default, so register one for the manual padding below
bpe_tokenizer.add_special_tokens({'pad_token': '<PAD>'})

# BPE Tokenization and Padding Function
def bpe_tokenize_and_pad(sentences, vocab, n):
    tokenized_sentences = []
    for sentence in sentences:
        tokens = bpe_tokenizer.encode(sentence, add_special_tokens=True, max_length=n, truncation=True)
        if len(tokens) < n:
            tokens += [bpe_tokenizer.pad_token_id] * (n - len(tokens))
        tokenized_sentences.append(tokens)
    return tokenized_sentences

def bpe_decode_tokens(token_ids):
    tokens = bpe_tokenizer.convert_ids_to_tokens(token_ids)
    return tokens

# Loading pre-trained WordPiece tokenizer
wordpiece_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# WordPiece Tokenization and Padding Function
def wordpiece_tokenize_and_pad(sentences, vocab, n):
    tokenized_sentences = []
    for sentence in sentences:
        tokens = wordpiece_tokenizer.encode(sentence, add_special_tokens=True, max_length=n, truncation=True)
        if len(tokens) < n:
            tokens += [wordpiece_tokenizer.pad_token_id] * (n - len(tokens))
        tokenized_sentences.append(tokens)
    return tokenized_sentences

def wordpiece_decode_tokens(token_ids):
    tokens = wordpiece_tokenizer.convert_ids_to_tokens(token_ids)
    return tokens

# Loading pre-trained SentencePiece tokenizer
sentencepiece_tokenizer = T5Tokenizer.from_pretrained('t5-base')

# SentencePiece Tokenization and Padding Function
def sentencepiece_tokenize_and_pad(sentences, vocab, n):
    tokenized_sentences = []
    for sentence in sentences:
        tokens = sentencepiece_tokenizer.encode(sentence, add_special_tokens=True, max_length=n, truncation=True)
        if len(tokens) < n:
            tokens += [sentencepiece_tokenizer.pad_token_id] * (n - len(tokens))
        tokenized_sentences.append(tokens)
    return tokenized_sentences

def sentencepiece_decode_tokens(token_ids):
    tokens = sentencepiece_tokenizer.convert_ids_to_tokens(token_ids)
    return tokens

# Example sentences
if __name__ == "__main__":
    sentences = [
        "As the aircraft becomes lighter, it flies higher in air of lower density to maintain the same airspeed.",
        " When the engine heats up, it operates more efficiently, consuming less fuel to maintain speed.",
        "As the sun sets, the temperature drops, causing the lake to cool down and lose its warm surface layer.",
        "When the ice melts, it becomes water, occupying more volume in its liquid state.",
        "As the river flows downstream, it slows down in wider sections to maintain a constant volume of water."
    ]
    n = 20

    # BPE Tokenization and Decoding
    print("BPE Tokenization and Decoding")
    bpe_vocab = bpe_tokenizer.get_vocab()
    bpe_tokenized_sentences = bpe_tokenize_and_pad(sentences, bpe_vocab, n)
    for ts in bpe_tokenized_sentences:
        print(ts)
        print(bpe_decode_tokens(ts))
        print("\n")
    print("\n")

    # WordPiece Tokenization and Decoding
    print("WordPiece Tokenization and Decoding")
    wordpiece_vocab = wordpiece_tokenizer.get_vocab()
    wordpiece_tokenized_sentences = wordpiece_tokenize_and_pad(sentences, wordpiece_vocab, n)
    for ts in wordpiece_tokenized_sentences:
        print(ts)
        print(wordpiece_decode_tokens(ts))
        print("\n")
    print("\n")

    # SentencePiece Tokenization and Decoding
    print("SentencePiece Tokenization and Decoding")
    sentencepiece_vocab = sentencepiece_tokenizer.get_vocab()
    sentencepiece_tokenized_sentences = sentencepiece_tokenize_and_pad(sentences, sentencepiece_vocab, n)
    for ts in sentencepiece_tokenized_sentences:
        print(ts)
        print(sentencepiece_decode_tokens(ts))
        print("\n")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


BPE Tokenization and Decoding
[1722, 262, 6215, 4329, 14871, 11, 340, 17607, 2440, 287, 1633, 286, 2793, 12109, 284, 5529, 262, 976, 1633, 12287]
['As', 'Ġthe', 'Ġaircraft', 'Ġbecomes', 'Ġlighter', ',', 'Ġit', 'Ġflies', 'Ġhigher', 'Ġin', 'Ġair', 'Ġof', 'Ġlower', 'Ġdensity', 'Ġto', 'Ġmaintain', 'Ġthe', 'Ġsame', 'Ġair', 'speed']


[1649, 262, 3113, 37876, 510, 11, 340, 14051, 517, 18306, 11, 18587, 1342, 5252, 284, 5529, 2866, 13, 50257, 50257]
['ĠWhen', 'Ġthe', 'Ġengine', 'Ġheats', 'Ġup', ',', 'Ġit', 'Ġoperates', 'Ġmore', 'Ġefficiently', ',', 'Ġconsuming', 'Ġless', 'Ġfuel', 'Ġto', 'Ġmaintain', 'Ġspeed', '.', '<PAD>', '<PAD>']


[1722, 262, 4252, 5621, 11, 262, 5951, 10532, 11, 6666, 262, 13546, 284, 3608, 866, 290, 4425, 663, 5814, 4417]
['As', 'Ġthe', 'Ġsun', 'Ġsets', ',', 'Ġthe', 'Ġtemperature', 'Ġdrops', ',', 'Ġcausing', 'Ġthe', 'Ġlake', 'Ġto', 'Ġcool', 'Ġdown', 'Ġand', 'Ġlose', 'Ġits', 'Ġwarm', 'Ġsurface']


[2215, 262, 4771, 48813, 11, 340, 4329, 1660, 11, 30876, 517, 6115, 287, 66

**model_view** is a function from the bertviz library that visualizes attention weights across all layers and heads of a model.

## What the Visualization Shows
**Attention Heads**: Each layer in the BERT model consists of multiple attention heads (12 heads per layer in BERT base).
The visualization shows, for every layer and head, how strongly each token attends to the other tokens in the sentence.

**Layer-wise Attention**: The visualization covers all layers of the BERT model, showing how attention patterns evolve across layers.
Early layers might focus more on local syntactic structures, while later layers might capture more semantic relationships.

**Token Interactions**: The attention weights indicate how much focus one token has on every other token in the sentence.
This helps in understanding relationships like subject-object interactions, modifiers, and contextual dependencies.
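
To make the notion of an attention weight concrete, here is a small sketch of scaled dot-product attention weights in plain Python. The function `attention_weights` and the toy vectors are assumptions for illustration; in BERT the queries and keys come from learned projections of the token representations, and each head has its own projections.

```python
import math

def attention_weights(queries, keys):
    """Return softmax(q . k / sqrt(d)) for every query, as a list of rows."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Subtract the maximum before exponentiating for numerical stability.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three toy token vectors (made up), used as both queries and keys.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention_weights(vectors, vectors):
    print([round(w, 3) for w in row])
```

Each row sums to 1 and describes how one token distributes its attention over all tokens; this is the kind of matrix that bertviz draws for each head.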

## Example Sentence Analysis
Example sentence: "As the aircraft becomes lighter, it flies higher in air of lower density to maintain the same airspeed."

**Tokenization**:

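The sentence is tokenized into WordPiece tokens along the following lines (the exact split depends on the vocabulary; a rare word such as `airspeed` may be broken into several `##`-prefixed pieces):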

```['[CLS]', 'as', 'the', 'aircraft', 'becomes', 'lighter', ',', 'it', 'flies', 'higher', 'in', 'air', 'of', 'lower', 'density', 'to', 'maintain', 'the', 'same', 'airspeed', '.', '[SEP]']```

**Visualization**:

**Attention Heads**: Each head's visualization shows how it distributes its attention across all tokens.
For instance, the token ```"aircraft"``` might receive attention from tokens like ```"becomes", "lighter", and "flies"```.

**Layer-wise Attention**: Early layers might show attention on adjacent words (e.g., ```"aircraft" -> "the"```), while later layers might capture longer-range dependencies (e.g., ```"flies" -> "airspeed"```).
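
One way to put a number on "local" versus "long-range" attention is the average distance between a token and the tokens it attends to. The metric and the name `mean_attention_distance` are assumptions introduced here for illustration; they are not part of bertviz. The toy matrices stand in for a single head's attention.

```python
def mean_attention_distance(attention):
    """Average |i - j| weighted by attention[i][j], averaged over query positions."""
    n = len(attention)
    total = 0.0
    for i, row in enumerate(attention):
        total += sum(weight * abs(i - j) for j, weight in enumerate(row))
    return total / n

# Toy 3-token heads (made up): one attends to neighbours, one to distant tokens.
local_head = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
long_range_head = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
print(mean_attention_distance(local_head))       # 1.0
print(mean_attention_distance(long_range_head))  # about 1.67
```

With a real model, a single head's matrix could be obtained as `outputs.attentions[layer][0, head].tolist()` and passed to the same function, which would allow comparing early and late layers.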


In [41]:
import torch
from transformers import BertModel, BertConfig

# Load pre-trained BERT with output_attentions=True so the forward pass returns attention weights
config = BertConfig.from_pretrained('bert-base-uncased', output_attentions=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
model.eval()  # make sure dropout is disabled

# Visualization function (despite its name, it displays attention weights, not embeddings)
def visualize_embeddings(tokenizer, sentence):
    # Calling the tokenizer returns input_ids and attention_mask as PyTorch tensors
    inputs = tokenizer(sentence, return_tensors='pt')

    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Gradients are not needed for visualization
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    attentions = outputs.attentions  # one tensor per layer: (batch, heads, seq_len, seq_len)

    # Display attention using bertviz
    model_view(attentions, tokenizer.convert_ids_to_tokens(input_ids[0]), display_mode='light')



In [42]:
# Visualization
sentence_to_visualize = sentences[0]  # First sentence as the example; change the index to visualize another sentence
print("Visualizing embeddings and attention for the sentence:")
print(sentence_to_visualize)
visualize_embeddings(wordpiece_tokenizer, sentence_to_visualize)

Visualizing embeddings and attention for the sentence:
As the aircraft becomes lighter, it flies higher in air of lower density to maintain the same airspeed.

