# AIG 230 – BPE Demo Notebook  
## Byte Pair Encoding (BPE) with the State of the Union Corpus

**Purpose of this notebook**  
This notebook demonstrates, end-to-end, how **Byte Pair Encoding (BPE)** works using a *real corpus*: the U.S. State of the Union addresses.

This notebook is intentionally simple and conceptual. It exists to build correct mental models about subword tokenization.


## 1. Setup

We use:
- NLTK to access the State of the Union corpus
- Hugging Face `tokenizers` to train a simple BPE tokenizer


In [2]:
# Install if needed
# !pip install nltk tokenizers

import nltk
from nltk.corpus import state_union

nltk.download("state_union")


[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\W1tcher\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

## 2. Load and Inspect the Corpus

In [3]:
fileids = state_union.fileids()
len(fileids), fileids[:5]


(65,
 ['1945-Truman.txt',
  '1946-Truman.txt',
  '1947-Truman.txt',
  '1948-Truman.txt',
  '1949-Truman.txt'])

In [4]:
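# Load the raw text of the first 10 addresses and join them into one training string
texts = [state_union.raw(fid) for fid in fileids[:10]]
corpus_text = "\n".join(texts)

# Preview the first 500 characters
corpus_text[:500]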


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when worl"

## 3. Why BPE?

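BPE learns **subword units** from data instead of relying on a predefined word list. Starting from single characters, it repeatedly merges the most frequent adjacent pair of symbols, so common fragments become single tokens while rare words can still be spelled out from smaller pieces.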
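
To make the learning step concrete, here is a minimal pure-Python sketch of BPE training on a toy word list: count adjacent symbol pairs, merge the most frequent one, and repeat. The function name `learn_bpe_merges`, the toy words, and the tie-breaking rule are assumptions made for illustration; this is not how the Hugging Face `tokenizers` library is implemented internally.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole word list.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an arbitrary choice that only keeps this sketch deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the winning pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy word list (hypothetical data, not taken from the State of the Union corpus).
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
toy_merges = learn_bpe_merges(toy_words, num_merges=3)
print(toy_merges)
```

Each merge adds one new symbol to the vocabulary, so the number of merges controls how coarse or fine the final tokens are.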


## 4. Train a Simple BPE Tokenizer

In [5]:
%pip install tokenizers


In [6]:
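from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE model; pieces that cannot be built from the learned vocabulary map to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation first, so merges never cross word boundaries
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=200,  # a small vocabulary keeps the learned merges easy to inspect
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

# Train on the concatenated addresses loaded in Section 2
tokenizer.train_from_iterator([corpus_text], trainer=trainer)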
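
The `vocab_size=200` setting is easiest to read as a budget. The sketch below assumes that the budget is shared by the special tokens, the single characters seen in the data, and the learned merges, so whatever remains after the first two is the number of merges. That accounting is an assumption to check against the `tokenizers` documentation, and `sample_text` is only a short stand-in for the full corpus.

```python
# Assumption: vocab_size covers special tokens + single characters + merges.
vocab_size = 200
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Stand-in text: one line from the corpus preview, not the full training corpus.
sample_text = "Mr. Speaker, Mr. President, Members of the Congress:"

# The pre-tokenizer drops whitespace, so only non-space characters are counted.
alphabet = {ch for ch in sample_text if not ch.isspace()}

# Whatever is left over is the room available for merged tokens.
merge_budget = vocab_size - len(special_tokens) - len(alphabet)
print(f"{len(alphabet)} characters leave room for {merge_budget} merges")
```

On the real corpus the alphabet is larger than in this stand-in, so fewer merges fit. This is one reason the tokens in this notebook stay short.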


## 5. Inspect the Learned Vocabulary

In [7]:
vocab = tokenizer.get_vocab()
list(vocab.items())[:30]


[('I', 37),
 ('un', 136),
 ('ec', 114),
 ('ld', 142),
 ('4', 20),
 ('[MASK]', 4),
 ('"', 6),
 ('uc', 141),
 ('in', 85),
 ('op', 140),
 (')', 11),
 ('(', 10),
 ('qu', 195),
 ('ment', 124),
 ('gr', 187),
 ('k', 66),
 ('pro', 116),
 ('it', 103),
 ('8', 24),
 (']', 55),
 ('ul', 159),
 ('ut', 186),
 ('ag', 138),
 ('d', 59),
 ('ong', 183),
 ('ion', 97),
 ('S', 47),
 ('ol', 135),
 ('W', 51),
 ('ain', 169)]
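
The listing above comes out in dictionary order, which hides any pattern in the ids. Sorting by id makes the layout easier to see. The sketch below uses a handful of `(token, id)` pairs copied from the output above, so it runs without the trained tokenizer; the name `vocab_sample` is introduced here for illustration.

```python
# A few (token, id) pairs copied from the vocabulary listing above.
vocab_sample = {
    "I": 37, "un": 136, "[MASK]": 4, '"': 6, "in": 85,
    "ment": 124, "pro": 116, "d": 59, "ion": 97, "k": 66,
}

# Sort by id instead of relying on dictionary order.
by_id = sorted(vocab_sample.items(), key=lambda item: item[1])
for token, token_id in by_id:
    print(f"{token_id:>3}  {token!r}")

# Split the sample into single-character and multi-character tokens
# (ignoring the special token) to compare their id ranges.
single = [i for t, i in by_id if len(t) == 1]
multi = [i for t, i in by_id if len(t) > 1 and not t.startswith("[")]
print(max(single) < min(multi))
```

In this sample the special token has the lowest id, single characters come next, and multi-character pieces have the highest ids. That is consistent with ids being assigned in the order special tokens, characters, then merges, although the notebook itself does not state that ordering. With the trained tokenizer in scope, the same `sorted(...)` call can be applied to `vocab.items()`.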

## 6. BPE Tokenization Example

In [8]:
sentence = "Democracy and democratic institutions must be protected."
tokenizer.encode(sentence).tokens


['D',
 'e',
 'mo',
 'c',
 'r',
 'ac',
 'y',
 'and',
 'de',
 'mo',
 'c',
 'r',
 'at',
 'ic',
 'in',
 'st',
 'it',
 'u',
 'tion',
 's',
 'm',
 'ust',
 'be',
 'pro',
 't',
 'ec',
 't',
 'ed',
 '.']
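
One way to see how a word ends up as these pieces is to replay a list of merges, in order, over its characters. The sketch below does that in pure Python. The merge list `example_merges` is hypothetical: it was chosen so that the result matches the tokens shown above and is not the merge table the tokenizer actually learned. The helper name `apply_merges` is also introduced here for illustration.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying an ordered list of merges (toy sketch)."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it appears side by side.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges, NOT the table learned from the corpus.
example_merges = [("m", "o"), ("d", "e"), ("a", "t"), ("i", "c"), ("a", "c")]

print(apply_merges("democratic", example_merges))
print(apply_merges("Democracy", example_merges))
```

Under these assumed merges, "democratic" starts with `de` while "Democracy" starts with `D`, `e`, because a merge of lowercase `d` and `e` does not apply to uppercase `D`. This is a plausible explanation for the same difference in the output above, since the tokenizer was trained without lowercasing.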

## 7. Key Takeaways

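- BPE tokenization is **learned from data**: the vocabulary depends on the training corpus and the chosen `vocab_size`
- It captures shared structure across related words: in Section 6, "Democracy" and "democratic" share the pieces `mo`, `c`, and `r`
- BPE and its variants are used by many modern language models (for example, GPT-2 uses byte-level BPE)
- It sits outside NLTK and spaCy, whose built-in tokenizers are rule-based and word-level, which is why this notebook uses Hugging Face `tokenizers`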
