Tokenization and detokenization are the first and last steps of an LLM pipeline. This notebook explores those two steps.

For context, an LLM pipeline takes text as input and predicts the next word. It works roughly like this:
1. **Tokenize**: break the input text into tokens. The tokenization algorithm is independent of how the LLM itself is trained. Tokens are often sub-words that carry meaning. For example, "dogs" might become "dog" and "s".
2. **Embed**: each token is converted into a high-dimensional vector called an embedding. Embeddings are produced by the LLM's embedding layer, which is trained together with the rest of the model. The embedding layer itself is a per-token lookup; the transformer layers downstream add context awareness, so that a given token (like "wind") ends up with a different representation depending on its context (e.g., whether it is used as a verb or a noun).
3. **Predict embedding**: based on the preceding tokens' embeddings, the transformer-based neural network predicts the embedding for the next token.
4. **Predict token**: a token is selected based on the predicted embedding, typically by sampling from a probability distribution over the vocabulary rather than always taking the single most likely token, which creates diversity in the generated outputs.
5. **Predict word**: the next word is assembled from the predicted token(s). The mapping is not one-to-one: the tokens "dog" and "s" may be output as the single word "dogs".

This process is repeated, with each prediction appended to the input, until an "end" token is output (or a length limit is reached).
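
The loop described above can be sketched with a toy example. Everything in it is hypothetical and chosen only for illustration: the five-token vocabulary, the `##` marker for pieces that attach to the previous token, and the hand-written score table that stands in for the embedding layer and transformer. It also picks the highest-scoring token (greedy selection) instead of sampling, and looks only at the last token, whereas a real model conditions on all preceding tokens.

```python
# Toy illustration of the generation loop; this is NOT a real language model.
# All names and numbers below are hypothetical.

# Tiny vocabulary; "##" marks a piece that attaches to the previous token.
VOCAB = ["the", "dog", "##s", "bark", "<end>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Stand-in for steps 2-3: scores for the next token, keyed by the last token ID.
NEXT_TOKEN_SCORES = {
    0: [0.0, 0.9, 0.0, 0.1, 0.0],  # after "the"  -> "dog"
    1: [0.0, 0.0, 0.8, 0.1, 0.1],  # after "dog"  -> "##s"
    2: [0.0, 0.0, 0.0, 0.7, 0.3],  # after "##s"  -> "bark"
    3: [0.0, 0.0, 0.0, 0.0, 1.0],  # after "bark" -> "<end>"
}

def tokenize(text):
    """Step 1: whitespace split; only works for words present in VOCAB."""
    return [TOKEN_TO_ID[word] for word in text.split()]

def predict_next(token_ids):
    """Steps 2-4 (greatly simplified): pick the highest-scoring next token."""
    scores = NEXT_TOKEN_SCORES[token_ids[-1]]
    return max(range(len(scores)), key=scores.__getitem__)

def detokenize(token_ids):
    """Step 5: merge continuation pieces back into whole words."""
    words = []
    for token_id in token_ids:
        token = VOCAB[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

def generate(prompt, max_new_tokens=10):
    """Repeat prediction until the end token or the length limit is reached."""
    token_ids = tokenize(prompt)
    end_id = TOKEN_TO_ID["<end>"]
    for _ in range(max_new_tokens):
        next_id = predict_next(token_ids)
        if next_id == end_id:
            break
        token_ids.append(next_id)
    return detokenize(token_ids)

print(generate("the"))  # the dogs bark
```

Note that the "dog" and "##s" tokens are predicted in two separate iterations but appear as one word in the output, which is the non-one-to-one mapping mentioned in step 5.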

For more information about the architecture and training of transformers (steps 2-4), see the <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a> paper, or consider taking the <a href="https://www.coursera.org/learn/generative-ai-with-llms/home/welcome">Generative AI with Large Language Models</a> Coursera class.

This notebook is for exploring the tokenization/detokenization steps (1 and 5).

In [None]:
# Tokenization/detokenization using various tokenizers from huggingface/transformers
#!pip install transformers

# Import and initialize a few tokenizers
from transformers import (
    BertTokenizer, GPT2Tokenizer, RobertaTokenizer
)
tokenizers = {
    'BERT': BertTokenizer.from_pretrained('bert-base-uncased'),
    'GPT-2': GPT2Tokenizer.from_pretrained('gpt2'),
    'RoBERTa': RobertaTokenizer.from_pretrained('roberta-base'),
}

print("Tokenizers from Hugging Face")

# Input text
text = "Huh, the cat's toy is"
print("Input Text:", text)

# Tokenization
for tokenizer_name, tokenizer in tokenizers.items():
    tokens = tokenizer.tokenize(text)
    print(tokenizer_name)
    print("\tTokens:", tokens)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("\tToken IDs:", token_ids)
    decoded_text = tokenizer.decode(token_ids)
    print("\tDecoded Text:", decoded_text)
    print()
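
For words that are missing from its vocabulary, BERT's WordPiece tokenizer falls back to sub-word pieces and marks continuation pieces with `##`. The block below is a minimal sketch of that greedy longest-match-first idea, not the library's implementation. The vocabulary `TOY_VOCAB` and the function `wordpiece_tokenize` are hypothetical; a real vocabulary is learned from data and is far larger, and the real tokenizer also handles normalization and punctuation.

```python
# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"cat", "dog", "toy", "play", "##s", "##ing", "[UNK]"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

for word in ["cats", "playing", "toy", "zebra"]:
    print(word, "->", wordpiece_tokenize(word))
```

With this toy vocabulary, "cats" splits into "cat" and "##s", while "zebra" has no matching pieces and maps to the unknown token.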

In [None]:
# Simple hand-crafted tokenization

print("Simple hand-crafted tokenizers")

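import re

# Define a simple tokenizer: split on whitespace, so punctuation stays attached
# to the neighboring word (e.g., "Huh," and "cat's")
def simple_tokenizer(text):
    return text.split()

# Define a more sophisticated tokenizer: extract runs of word characters, which
# drops punctuation and splits "cat's" into "cat" and "s"
def more_complex_tokenizer(text):
    return re.findall(r'\b\w+\b', text)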


tokenizers = {
    'Words': simple_tokenizer,
    'Split off possessives': more_complex_tokenizer,
}

# Input text
text = "Huh, the cat's toy is"
print("Input Text:", text)

for tokenizer_name, tokenizer in tokenizers.items():
    tokens = tokenizer(text)
    print(tokenizer_name)
    print("\tTokens:", tokens)
    print()
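
Neither hand-crafted tokenizer above can be reversed exactly: `split()` discards the original spacing, and the `\w+` pattern drops punctuation. As a sketch of one way around that, the hypothetical `lossless_tokenizer` below keeps punctuation marks and whitespace runs as tokens of their own, so joining the tokens reproduces the input. This is an illustrative variant, not how the library tokenizers above work.

```python
import re

def lossless_tokenizer(text):
    """Split into words, single punctuation marks, and whitespace runs."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)

def join_tokens(tokens):
    """Detokenize by concatenating the tokens unchanged."""
    return "".join(tokens)

text = "Huh, the cat's toy is"
tokens = lossless_tokenizer(text)
print("Tokens:", tokens)
print("Round trip matches input:", join_tokens(tokens) == text)
```

The trade-off is a longer token sequence, since every space becomes a token.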