# 24 - Byte Pair Encoding (BPE) Tokenizer

Byte Pair Encoding (BPE) is a subword tokenization algorithm used in many LLMs (e.g., GPT, RoBERTa). It lets models handle rare words and an open vocabulary by splitting words into frequently occurring subword units.

In this notebook, you'll scaffold the steps to build a BPE tokenizer from scratch, as used in modern LLMs.

## 📚 What is BPE Tokenization?

Starting from individual characters, BPE repeatedly merges the most frequent pair of adjacent symbols in a corpus and adds each merged pair to the vocabulary as a new subword unit.

**LLM/Transformer Context:**
- BPE lets LLMs represent rare and unseen words as sequences of known subword units, which improves generalization while keeping the vocabulary much smaller than a word-level one.

### Task:
- Scaffold code to initialize a vocabulary of characters from a text corpus.
- Add comments explaining each step.

In [None]:
# TODO: Initialize vocabulary of characters from a text corpus
pass
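
One way to approach this step is sketched below. It assumes whitespace-separated words and appends an end-of-word marker `</w>` to each word, a common convention that the notebook itself does not prescribe. The function name `init_char_vocab` is hypothetical.

```python
def init_char_vocab(text):
    """Split text into words, then split each word into characters.

    Assumption: words are separated by whitespace, and "</w>" marks the
    end of each word so that later merges cannot cross word boundaries.
    """
    # Each word becomes a list of single-character symbols plus the marker
    corpus = [list(word) + ["</w>"] for word in text.split()]
    # The initial vocabulary is the set of all symbols seen in the corpus
    vocab = {symbol for word in corpus for symbol in word}
    return corpus, vocab


corpus, vocab = init_char_vocab("low lower lowest")
print(corpus[0])      # ['l', 'o', 'w', '</w>']
print(sorted(vocab))  # ['</w>', 'e', 'l', 'o', 'r', 's', 't', 'w']
```

The corpus is kept as a list of symbol lists because that is the shape the later scaffolded functions expect.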

## 🔁 BPE Merge Operations

At each step, BPE counts every pair of adjacent symbols in the corpus, picks the most frequent pair, and replaces each occurrence of it with a single new symbol.

**LLM/Transformer Context:**
- This process builds a vocabulary in which frequent words tend to become single tokens, while rare words are split into smaller, reusable pieces.

### Task:
- Scaffold a function to count symbol pairs in the corpus.
- Scaffold a function to merge the most frequent pair.
- Add docstrings explaining their roles.

In [None]:
def get_stats(corpus):
    """
    Count frequency of symbol pairs in the corpus.
    Args:
        corpus (list of list): Corpus as list of tokenized words (list of symbols)
    Returns:
        dict: Mapping from symbol pair to frequency
    """
    # TODO: Count symbol pairs
    pass
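
A minimal sketch of how the pair counting might be filled in, assuming the corpus is a list of symbol lists as described in the docstring. It returns a `collections.Counter`, which is a `dict` subclass.

```python
from collections import Counter


def get_stats(corpus):
    """Count how often each pair of adjacent symbols occurs in the corpus."""
    pairs = Counter()
    for word in corpus:
        # zip(word, word[1:]) yields every adjacent (left, right) pair
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs


corpus = [list("low"), list("lower"), list("lowest")]
stats = get_stats(corpus)
print(stats[("l", "o")])  # 3
print(stats[("w", "e")])  # 2
print(stats[("e", "r")])  # 1
```

Pairs are counted within each word only, so a merge never joins the end of one word to the start of the next.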

In [None]:
def merge_pair(pair, corpus):
    """
    Merge every occurrence of the given symbol pair in the corpus.
    Args:
        pair (tuple): Symbol pair to merge
        corpus (list of list): Corpus as list of tokenized words
    Returns:
        list of list: Updated corpus with merged pair
    """
    # TODO: Merge the given pair in the corpus
    pass
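
The merge step could look roughly like this. It is a sketch under the same assumption as before (corpus as a list of symbol lists), and it scans each word left to right so that overlapping occurrences are merged only once.

```python
def merge_pair(pair, corpus):
    """Return a new corpus in which each occurrence of `pair` is one symbol."""
    merged = pair[0] + pair[1]
    new_corpus = []
    for word in corpus:
        new_word = []
        i = 0
        while i < len(word):
            # If the pair starts at position i, emit the merged symbol
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus.append(new_word)
    return new_corpus


corpus = [list("low"), list("lower")]
print(merge_pair(("l", "o"), corpus))
# [['lo', 'w'], ['lo', 'w', 'e', 'r']]
```

The function builds a new corpus rather than modifying the input, which makes it easier to inspect the state before and after each merge.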

## 🧮 Building the BPE Vocabulary

Repeat the merge operation until the vocabulary reaches the desired size or no pairs are left to merge.

**LLM/Transformer Context:**
- The final BPE vocabulary is used to tokenize text for LLM training and inference.

### Task:
- Scaffold a function to build the BPE vocabulary by iteratively merging pairs.
- Add a docstring explaining the process.

In [None]:
def build_bpe_vocab(corpus, vocab_size):
    """
    Build a BPE vocabulary by iteratively merging symbol pairs.
    Args:
        corpus (list of list): Corpus as list of tokenized words
        vocab_size (int): Desired vocabulary size
    Returns:
        set: Final BPE vocabulary
    """
    # TODO: Build BPE vocabulary
    pass
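
The training loop can be sketched as follows. This version is self-contained (it inlines the counting and merging steps) and uses the hypothetical name `train_bpe`. Unlike the scaffold, it returns both the vocabulary and the ordered list of merges, on the assumption that the merge order will be needed later for tokenization. Ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def train_bpe(corpus, vocab_size):
    """Learn BPE merges until the vocabulary has `vocab_size` symbols."""
    corpus = [list(word) for word in corpus]  # work on a copy
    vocab = {symbol for word in corpus for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent symbol pairs in the current corpus
        pairs = Counter(p for word in corpus for p in zip(word, word[1:]))
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        # Replace each occurrence of the best pair with the merged symbol
        new_corpus = []
        for word in corpus:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
        vocab.add(merged)
        merges.append(best)
    return vocab, merges


words = [list("low"), list("lower"), list("lowest")]
vocab, merges = train_bpe(words, vocab_size=10)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The toy corpus starts with 7 distinct characters, so a target size of 10 allows exactly three merges.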

## 🔗 Tokenizing Text with BPE

Once the BPE vocabulary is built, tokenize new text by splitting each word into characters and re-applying the learned merges in the order they were learned.

**LLM/Transformer Context:**
- The same BPE tokenizer (same vocabulary and same merges) is used for both training and inference, so that text is split into the tokens the model was trained on.

### Task:
- Scaffold a function to tokenize text using the BPE vocabulary.
- Add a docstring explaining its use.

In [None]:
def bpe_tokenize(word, bpe_vocab):
    """
    Tokenize a word using the BPE vocabulary.
    Args:
        word (str): Input word
        bpe_vocab (set): Set of BPE tokens
    Returns:
        list: List of BPE tokens
    """
    # TODO: Tokenize word using BPE merges
    pass
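
A vocabulary set alone does not record the order in which merges were learned, so the sketch below takes a different input from the scaffold: an ordered list of merge pairs. The name `apply_merges` and the `merges` argument are assumptions for illustration. The sketch also has no handling for characters that were never seen during training; they simply pass through as single-character tokens.

```python
def apply_merges(word, merges):
    """Tokenize a word by replaying learned merges in order.

    Assumption: `merges` is a list of (left, right) symbol pairs in the
    order they were learned during training.
    """
    symbols = list(word)  # start from individual characters
    for pair in merges:
        merged = pair[0] + pair[1]
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


merges = [("l", "o"), ("lo", "w"), ("low", "e")]
print(apply_merges("lowly", merges))   # ['low', 'l', 'y']
print(apply_merges("slower", merges))  # ['s', 'lowe', 'r']
```

Neither "lowly" nor "slower" needs to appear in the training corpus: both are still broken into pieces built from the learned merges.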

## 🧠 Final Summary: BPE Tokenization in LLMs

- BPE tokenization enables LLMs to efficiently handle rare words and open vocabularies.
- Understanding BPE is key to building and interpreting modern LLMs.

In the next notebook, you'll explore sampling and decoding strategies for language model inference!