# 🔤 Implement Byte Pair Encoding from Scratch

### Problem Statement

Implement **Byte Pair Encoding (BPE)**, the subword tokenization algorithm used in GPT-2, RoBERTa, and many modern NLP models. BPE learns a vocabulary by iteratively merging the most frequent pair of adjacent symbols (single characters at first, then previously merged subwords), building a compact subword representation.

By the end of this notebook, you'll understand how to build a tokenizer that handles unknown words gracefully while keeping vocabulary size manageable.

---

### Background: The Tokenization Problem

**Why do we need tokenization?**

Neural networks don't understand text directly - they need numbers. Tokenization converts text into discrete units (tokens) that can be mapped to numbers.

**Three main approaches:**

1. **Character-level tokenization**
   - Split: `"hello"` → `['h', 'e', 'l', 'l', 'o']`
   - ✅ Pros: Tiny vocabulary (~100 chars), handles any word
   - ❌ Cons: Long sequences; single characters carry little meaning, so the model has to learn on its own that "unhappy" is related to "happy"

2. **Word-level tokenization**
   - Split: `"hello world"` → `['hello', 'world']`
   - ✅ Pros: Preserves word meanings
   - ❌ Cons: Huge vocabulary (~100K+ words), can't handle unknown words ("asdfghjkl" → ???)

3. **Subword tokenization (BPE, WordPiece, SentencePiece)**
   - Split: `"unhappiness"` → `['un', 'happiness']` or `['un', 'happi', 'ness']`
   - ✅ Pros: Medium vocabulary (~32K-50K), handles unknown words by breaking into pieces
   - ✅ Example: Unknown word "Transformerization" might split into `['Transform', 'er', 'ization']`

**BPE is the sweet spot!** Used in GPT-2, GPT-3, RoBERTa, BART, and more.
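
To make the trade-off concrete, here is a minimal sketch of the first two approaches. The toy vocabulary and the `<unk>` placeholder are assumptions chosen for illustration; they do not come from any real tokenizer.

```python
# Toy comparison of character-level and word-level tokenization.
# The vocabulary and the "<unk>" placeholder are illustrative assumptions.

def char_tokenize(text):
    """Every character becomes a token, so any string can be represented."""
    return list(text)

def word_tokenize(text, known_words):
    """Split on whitespace; words outside the vocabulary collapse to '<unk>'."""
    return [w if w in known_words else "<unk>" for w in text.split()]

known_words = {"hello", "world"}

char_tokens = char_tokenize("hello")
word_tokens = word_tokenize("hello asdfghjkl", known_words)

print(char_tokens)  # ['h', 'e', 'l', 'l', 'o']
print(word_tokens)  # ['hello', '<unk>']
```

The character tokenizer never fails but produces one token per character; the word tokenizer loses the unknown word entirely. Subword methods aim to sit between these two behaviours.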

---

### How BPE Works: Step-by-Step Example

Let's walk through BPE on a tiny corpus: `["low", "lower", "newest", "widest"]`

#### Step 0: Initialization
Split each word into characters + end-of-word marker `</w>`:
```
low    → ('l', 'o', 'w', '</w>')
lower  → ('l', 'o', 'w', 'e', 'r', '</w>')
newest → ('n', 'e', 'w', 'e', 's', 't', '</w>')
widest → ('w', 'i', 'd', 'e', 's', 't', '</w>')
```

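**Why `</w>`?** It marks where a word ends, so a word-final subword (the "er" in "lower" can become `er</w>`) stays distinct from the same characters inside a word (the "er" in "series"). It also lets you restore word boundaries when turning tokens back into text.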
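
A small sketch of how the marker is attached and how it can be used to recover word boundaries. The helper names `to_symbols` and `detokenize` and the example token stream are hypothetical, introduced only for this illustration.

```python
END = "</w>"  # end-of-word marker used throughout this notebook

def to_symbols(word):
    """Split a word into characters and append the end-of-word marker."""
    return tuple(word) + (END,)

def detokenize(tokens):
    """Concatenate tokens and turn each end-of-word marker into a space."""
    return "".join(tokens).replace(END, " ").strip()

symbols = to_symbols("low")
# Hypothetical token stream: "er</w>" closes a word, a bare "low" does not.
restored = detokenize(["low", "er</w>", "low</w>"])

print(symbols)   # ('l', 'o', 'w', '</w>')
print(restored)  # lower low
```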

#### Step 1: Count all adjacent pairs
```
('e', 's'): 2  ← appears in "newest" and "widest"
('e', 'r'): 1  ← appears in "lower"
('s', 't'): 2  ← appears in "newest" and "widest"
...
```
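
The counts in Step 1 can be reproduced with a short sketch. Every word in this corpus occurs once, so no frequency weighting is needed here; the variable names are illustrative.

```python
from collections import Counter

corpus = ["low", "lower", "newest", "widest"]

pair_counts = Counter()
for word in corpus:
    symbols = tuple(word) + ("</w>",)
    # zip pairs each symbol with its right-hand neighbour
    pair_counts.update(zip(symbols, symbols[1:]))

top_count = max(pair_counts.values())
tied = [pair for pair, count in pair_counts.items() if count == top_count]

print(top_count)  # 2
print(tied)
# [('l', 'o'), ('o', 'w'), ('w', 'e'), ('e', 's'), ('s', 't'), ('t', '</w>')]
```

Six pairs share the top count of 2 on this corpus, so which one gets merged first depends on how ties are broken.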

#### Step 2: Merge the most frequent pair
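Several pairs tie at count 2 here, including `('l', 'o')`, `('e', 's')`, and `('s', 't')`. This walkthrough merges `('e', 's')`; the implementation below breaks ties by taking the first pair it counted, so on this corpus it merges `('l', 'o')` first.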
```
Before: ('n', 'e', 'w', 'e', 's', 't', '</w>')
After:  ('n', 'e', 'w', 'es', 't', '</w>')     ← 'e' and 's' merged into 'es'
```
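
The merge in Step 2 can be sketched on a single word by walking the symbols from left to right. `merge_pair_in_word` is a hypothetical helper for this illustration, not one of the functions required by the exercise.

```python
def merge_pair_in_word(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out = []
    i = 0
    while i < len(symbols):
        # Compare two whole symbols at a time, never partial strings
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

before = ('n', 'e', 'w', 'e', 's', 't', '</w>')
after = merge_pair_in_word(before, ('e', 's'))

print(after)  # ('n', 'e', 'w', 'es', 't', '</w>')
```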

#### Step 3: Repeat
Count pairs again, merge most frequent, repeat for N merges.

Each merge creates a new subword token. After 10 merges, you might have tokens like:
```
['l', 'o', 'w', 'e', 'r', 's', 't', 'i', 'd', 'n', 'es', 'est', 'est</w>', ...]
```

---

### Algorithm Pseudocode

```python
def byte_pair_encoding(corpus, num_merges):
    # 1. Initialize: map each word, split into characters plus '</w>', to its frequency
    vocab = get_vocab(corpus)
    merges = []

    for i in range(num_merges):
        # 2. Count all adjacent pairs across all words
        pairs = get_stats(vocab)

        if not pairs:
            break

        # 3. Find most frequent pair
        best_pair = max(pairs, key=pairs.get)

        # 4. Merge that pair in all words
        vocab = merge_vocab(best_pair, vocab)

        # 5. Record this merge operation
        merges.append(best_pair)

    return vocab, merges
```

---

### Learning Objectives

By completing this exercise, you will:

1. ✅ Understand why subword tokenization is often preferred over character-level or word-level tokenization
2. ✅ Implement the core BPE algorithm from scratch
3. ✅ Learn to manage vocabulary as frequency-weighted tuples
4. ✅ Handle the end-of-word marker convention correctly
5. ✅ Understand how merge operations create a learned vocabulary

---

### Requirements

Implement the following 4 functions:

1. **`get_vocab(corpus)`** - Initialize vocabulary from a list of words
   - Input: `["low", "lower"]`
   - Output: `{('l','o','w','</w>'): 1, ('l','o','w','e','r','</w>'): 1}`

2. **`get_stats(vocab)`** - Count frequency of all adjacent character pairs
   - Input: `{('l','o','w','</w>'): 1}`
   - Output: `{('l','o'): 1, ('o','w'): 1, ('w','</w>'): 1}`

3. **`merge_vocab(pair, vocab)`** - Merge a specific pair across all words
   - Input: `pair=('o','w')`, `vocab={('l','o','w','</w>'): 1}`
   - Output: `{('l','ow','</w>'): 1}`

4. **`byte_pair_encoding(corpus, num_merges)`** - Main BPE algorithm
   - Orchestrates the above functions to perform N merge operations

---

### Hints

<details>
  <summary>üí° Hint 1: Vocabulary Structure</summary>
  
  The vocabulary is a dictionary mapping **tuples** of characters to their **frequency**:
  ```python
  vocab = {
      ('l', 'o', 'w', '</w>'): 1,      # "low" appears once
      ('l', 'o', 'w', 'e', 'r', '</w>'): 1  # "lower" appears once
  }
  ```
  
  Use `Counter()` to count word frequencies, then convert each word to a tuple of characters.
</details>

<details>
  <summary>üí° Hint 2: Counting Pairs</summary>
  
  To count adjacent pairs in a word tuple `('l', 'o', 'w', '</w>')`, iterate with a sliding window:
  ```python
  for i in range(len(word) - 1):
      pair = (word[i], word[i+1])  # ('l','o'), ('o','w'), ('w','</w>')
      pairs[pair] += freq  # Weight by word frequency!
  ```
</details>

<details>
  <summary>üí° Hint 3: Merging Pairs</summary>
  
  To merge `('o', 'w')` in `('l', 'o', 'w', '</w>')` ‚Üí `('l', 'ow', '</w>')`:
  
  1. Convert tuple to space-separated string: `"l o w </w>"`
  2. Replace bigram: `"l o w </w>".replace("o w", "ow")` ‚Üí `"l ow </w>"`
  3. Split back to tuple: `('l', 'ow', '</w>')`
  
  This handles all occurrences of the pair in a single operation!
</details>
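
The string-based trick in Hint 3 has a subtle failure mode once multi-character symbols exist. The sketch below contrasts a plain `str.replace` with a whitespace-anchored regular expression; both helper names and the example symbols are hypothetical.

```python
import re

def merge_with_replace(symbols, pair):
    """Plain str.replace: can match across symbol boundaries."""
    joined = " ".join(symbols)
    return tuple(joined.replace(" ".join(pair), "".join(pair)).split())

def merge_with_regex(symbols, pair):
    """Match the bigram only when whitespace (or a string end) surrounds it."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    # A function replacement avoids backslash-escape processing of `merged`
    return tuple(pattern.sub(lambda m: merged, " ".join(symbols)).split())

# Hypothetical state after ('l', 'o') was merged: 'o' is no longer a symbol here
symbols = ("lo", "w", "e", "r", "</w>")
pair = ("o", "w")

print(merge_with_replace(symbols, pair))  # ('low', 'e', 'r', '</w>')  wrong merge
print(merge_with_regex(symbols, pair))    # ('lo', 'w', 'e', 'r', '</w>')  unchanged
```

Walking the tuple and comparing whole symbols sidesteps the problem without any regular expression, at the cost of a slightly longer loop.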

<details>
  <summary>üí° Hint 4: Main Loop</summary>
  
  The main BPE loop:
  ```python
  for iteration in range(num_merges):
      pairs = get_stats(vocab)
      if not pairs:  # No more pairs to merge
          break
      best = max(pairs, key=pairs.get)  # Most frequent pair
      vocab = merge_vocab(best, vocab)
      merges.append(best)
  ```
</details>

---

### Implementation

In [1]:
from collections import defaultdict, Counter

In [2]:
def get_vocab(corpus):
    """
    Initialize vocabulary from a corpus of words.
    
    Each word is split into characters with an end-of-word marker '</w>'.
    Returns a dictionary mapping character tuples to their frequency.
    
    Args:
        corpus: List of words (strings), e.g., ["low", "lower", "newest"]
    
    Returns:
        vocab: Dict mapping tuples to counts
               e.g., {('l','o','w','</w>'): 1, ('l','o','w','e','r','</w>'): 1}
    
    Example:
        >>> get_vocab(["low", "low"])
        {('l', 'o', 'w', '</w>'): 2}
    """
    vocab = Counter()
    for word in corpus:
        # Convert word to tuple of characters + end marker
        tokens = tuple(list(word) + ['</w>'])
        vocab[tokens] += 1
    return vocab

In [3]:
def get_stats(vocab):
    """
    Count the frequency of all adjacent character pairs in the vocabulary.
    
    Args:
        vocab: Dict mapping character tuples to frequencies
               e.g., {('l','o','w','</w>'): 1}
    
    Returns:
        pairs: Dict mapping character pairs to their total frequency
               e.g., {('l','o'): 1, ('o','w'): 1, ('w','</w>'): 1}
    
    Example:
        >>> get_stats({('t','e','s','t','</w>'): 1})
        {('t','e'): 1, ('e','s'): 1, ('s','t'): 1, ('t','</w>'): 1}
    """
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        # Count all adjacent pairs in this word
        for i in range(len(word) - 1):
            pair = (word[i], word[i + 1])
            pairs[pair] += freq  # Weight by word frequency
    return pairs

In [4]:
def merge_vocab(pair, vocab):
    """
    Merge a specific character pair across all words in the vocabulary.
    
    Replaces all occurrences of the pair with a single merged token.
    
    Args:
        pair: Tuple of two characters to merge, e.g., ('e', 's')
        vocab: Current vocabulary dict
    
    Returns:
        new_vocab: Updated vocabulary with merged pairs
    
    Example:
        >>> merge_vocab(('e','s'), {('t','e','s','t','</w>'): 1})
        {('t', 'es', 't', '</w>'): 1}
    """
    new_vocab = {}
    pair = tuple(pair)
    merged = pair[0] + pair[1]  # e.g., "es"

    for word, freq in vocab.items():
        new_word = []
        i = 0
        while i < len(word):
            # Merge only when two whole adjacent symbols match the pair.
            # A plain str.replace on the joined string could also match
            # across symbol boundaries (e.g., "o w" inside "lo w").
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        key = tuple(new_word)
        new_vocab[key] = new_vocab.get(key, 0) + freq

    return new_vocab

In [5]:
def byte_pair_encoding(corpus, num_merges=10):
    """
    Perform Byte Pair Encoding on a corpus.
    
    Iteratively merges the most frequent character pairs to build a vocabulary.
    
    Args:
        corpus: List of words
        num_merges: Number of merge operations to perform (vocab size growth)
    
    Returns:
        vocab: Final vocabulary after all merges
        merges: List of merge operations (pairs) in order performed
    
    Example:
        >>> vocab, merges = byte_pair_encoding(["low", "lower"], num_merges=2)
        >>> len(merges)
        2
    """
    vocab = get_vocab(corpus)
    merges = []
    
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        
        # Find most frequent pair
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
        print(f"Merge {i + 1}: {best}")
    
    return vocab, merges

### Example Usage

Test the implementation on a sample corpus:

In [6]:
# Example corpus
corpus = ["low", "lowest", "newer", "wider"]

# Run BPE
final_vocab, merge_operations = byte_pair_encoding(corpus, num_merges=10)

print("\nFinal Vocabulary:")
for word, freq in final_vocab.items():
    print(f"  {' '.join(word)} : {freq}")

Merge 1: ('l', 'o')
Merge 2: ('lo', 'w')
Merge 3: ('e', 'r')
Merge 4: ('er', '</w>')
Merge 5: ('low', '</w>')
Merge 6: ('low', 'e')
Merge 7: ('lowe', 's')
Merge 8: ('lowes', 't')
Merge 9: ('lowest', '</w>')
Merge 10: ('n', 'e')

Final Vocabulary:
  low</w> : 1
  lowest</w> : 1
  ne w er</w> : 1
  w i d er</w> : 1


### Testing Your Implementation

Run these tests to verify correctness:

In [7]:
def test_get_vocab():
    """Test vocabulary initialization."""
    corpus = ["test"]
    vocab = get_vocab(corpus)
    expected = {('t', 'e', 's', 't', '</w>'): 1}
    assert vocab == expected, f"Expected {expected}, got {vocab}"
    print("✓ test_get_vocab passed")

def test_get_stats():
    """Test pair counting."""
    vocab = {('t', 'e', 's', 't', '</w>'): 1}
    stats = get_stats(vocab)
    expected = {
        ('t', 'e'): 1,
        ('e', 's'): 1,
        ('s', 't'): 1,
        ('t', '</w>'): 1
    }
    assert stats == expected, f"Expected {expected}, got {stats}"
    print("✓ test_get_stats passed")

def test_merge_vocab():
    """Test pair merging."""
    vocab = {('t', 'e', 's', 't', '</w>'): 1}
    merged = merge_vocab(('e', 's'), vocab)
    expected = {('t', 'es', 't', '</w>'): 1}
    assert merged == expected, f"Expected {expected}, got {merged}"
    print("✓ test_merge_vocab passed")

def test_bpe_sequence():
    """Test full BPE algorithm."""
    corpus = ["low", "lower", "newest", "widest"]
    final_vocab, merges = byte_pair_encoding(corpus, num_merges=5)
    
    # Check that we got 5 merge operations
    assert len(merges) == 5, f"Expected 5 merges, got {len(merges)}"
    
    # Check that all merges are tuples of length 2
    assert all(isinstance(pair, tuple) and len(pair) == 2 for pair in merges), \
        "All merges should be tuples of length 2"
    
    # Check that vocabulary is a dict
    assert isinstance(final_vocab, dict), "Vocabulary should be a dictionary"
    
    print("✓ test_bpe_sequence passed")

# Run all tests
test_get_vocab()
test_get_stats()
test_merge_vocab()
test_bpe_sequence()

print("\n" + "="*60)
print("✓ All tests passed! Your BPE implementation is correct.")
print("="*60)

✓ test_get_vocab passed
✓ test_get_stats passed
✓ test_merge_vocab passed
Merge 1: ('l', 'o')
Merge 2: ('lo', 'w')
Merge 3: ('e', 's')
Merge 4: ('es', 't')
Merge 5: ('est', '</w>')
✓ test_bpe_sequence passed

✓ All tests passed! Your BPE implementation is correct.


### Understanding Your Results

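After running BPE on the test corpus `["low", "lower", "newest", "widest"]`, you should see these merge operations:

```
Merge 1: ('l', 'o')      ← tied at count 2 with several pairs; counted first
Merge 2: ('lo', 'w')     ← 'low' starts both "low" and "lower"
Merge 3: ('e', 's')      ← 'es' appears in "newest" and "widest"
Merge 4: ('es', 't')     ← 'est' is common
Merge 5: ('est', '</w>') ← 'est</w>' is a word ending pattern
```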

Notice how BPE:
- Discovers common subwords automatically ("est", "er")
- Learns word endings ("est</w>", "er</w>")
- Builds from frequent patterns first

GPT-2's tokenizer was trained on web text with the same merge procedure, although it works on bytes rather than characters and does not use a `</w>` marker.

---

## Summary

### Key Concepts

- **Subword Tokenization**: Splits words into meaningful units smaller than words but larger than characters
- **BPE Algorithm**: Iteratively merges most frequent character pairs to build vocabulary
- **End-of-Word Marker**: `</w>` marks word boundaries, so a word-final "er" (`er</w>`) stays distinct from an "er" inside a word
- **Vocabulary Control**: Number of merges directly controls vocabulary size
- **OOV Handling**: Unknown words can be broken down into known subwords, as long as their characters appear in the base vocabulary
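
The OOV point can be sketched by replaying learned merges on a word that was not in the training corpus. `apply_merges` is a hypothetical helper, not part of the exercise; the merge list is copied from the test run shown earlier in this notebook.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merges learned from ["low", "lower", "newest", "widest"] with num_merges=5
merges = [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('est', '</w>')]

print(apply_merges("lowest", merges))  # ['low', 'est</w>']
print(apply_merges("slow", merges))    # ['s', 'low', '</w>']
```

Neither "lowest" nor "slow" appears in the training corpus, yet both are covered by learned subwords plus single characters. A word containing a character never seen in training would still need an unknown-token fallback, which this sketch does not implement.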

### Why BPE Works

1. **Frequency-based**: Common words stay whole, rare words get split
2. **Data-driven**: Learns patterns from your corpus automatically
3. **Compresses well**: Efficient representation of text
4. **Generalizes**: Handles morphology ("unhappiness" → "un" + "happiness")

---

## Interview Tips

Be ready to answer:

**Q: Why use BPE instead of word-level tokenization?**
- A: BPE handles out-of-vocabulary (OOV) words by breaking them into known subwords. Word-level fails on rare/new words. BPE also has a much smaller vocabulary (32K vs 100K+ words).

**Q: Why use BPE instead of character-level tokenization?**
- A: Character-level creates very long sequences (inefficient) and loses semantic information. BPE preserves common subword units like "ing", "tion", "un" which carry meaning.

**Q: What's the time complexity of BPE?**
- A: O(N × M) where N = number of merges and M = corpus size. Each merge requires scanning the corpus to count pairs and update vocabulary.

**Q: Why do we need the `</w>` end-of-word marker?**
- A: To mark word boundaries. Without it, the "er" at the end of "lower" would be indistinguishable from an "er" in the middle of a word, and word boundaries could not be restored when decoding. This matters for proper tokenization during inference.

**Q: How does BPE handle a completely unknown word?**
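- A: It splits the word into characters and replays the learned merges in the order they were learned. Wherever no merge applies, the individual characters remain as tokens. This covers any word whose characters were seen in training; byte-level variants such as GPT-2's avoid unseen symbols by starting from bytes.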

**Q: What's the difference between BPE and WordPiece?**
- A: BPE merges based on frequency. WordPiece (used in BERT) merges based on likelihood - chooses pairs that maximize probability of the training data. BPE is simpler and works well in practice.

**Q: How do you choose the number of merges?**
- A: It's a hyperparameter balancing vocabulary size vs sequence length. Common vocabulary sizes are roughly 30K to 50K; GPT-2 and RoBERTa both use about 50K. More merges = larger vocab but shorter sequences.
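
The trade-off in the last answer can be observed on the notebook's tiny corpus. This is a compact, self-contained sketch under the same tie-breaking assumption as the implementation above (first pair counted wins); `train_bpe` is a hypothetical name, and the vocabulary is taken to be the base characters plus every merged symbol.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Compact BPE trainer; returns (segmented words with counts, merges)."""
    words = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
        merges.append(best)
    return words, merges

corpus = ["low", "lower", "newest", "widest"]
base = {ch for w in corpus for ch in w} | {"</w>"}

results = []
for n in (0, 2, 5):
    words, merges = train_bpe(corpus, n)
    vocab = base | {a + b for a, b in merges}
    total_tokens = sum(len(symbols) * freq for symbols, freq in words.items())
    results.append((n, len(vocab), total_tokens))
    print(f"merges={n}  vocab size={len(vocab)}  tokens in corpus={total_tokens}")

# merges=0  vocab size=11  tokens in corpus=24
# merges=2  vocab size=13  tokens in corpus=20
# merges=5  vocab size=16  tokens in corpus=14
```

Each merge adds a symbol to the vocabulary and shortens the tokenized corpus, which is the balance the number of merges controls.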

---

## References

- [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016)](https://arxiv.org/abs/1508.07909) - Original BPE paper
- [Language Models are Unsupervised Multitask Learners (Radford et al., 2019)](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) - GPT-2 paper using BPE
- [SentencePiece: A simple and language independent approach to subword tokenization](https://github.com/google/sentencepiece) - Modern implementation
- [HuggingFace Tokenizers](https://huggingface.co/docs/tokenizers/) - Fast BPE implementation
- [Practical BPE Tutorial](https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0)