# WordPiece and Byte Pair Encoding

- 📺 **Video:** [https://youtu.be/WA16JelEkkg](https://youtu.be/WA16JelEkkg)

## Overview
- Learn subword tokenization algorithms that balance vocabulary size and coverage.
- Compare byte-pair encoding (BPE) and WordPiece for building vocabularies.

## Key ideas
- **Subwords:** break rare words into frequent units to handle open vocabulary.
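- **BPE merges:** repeatedly merge the most frequent pair of adjacent symbols into a new symbol.
- **WordPiece scoring:** choose the merge that most increases the likelihood of the training corpus, rather than simply the most frequent pair.
- **Trade-offs:** a larger vocabulary shortens token sequences but enlarges the embedding table, increasing parameter count and memory.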
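
The difference between the two merge criteria can be sketched on a toy corpus. The snippet assumes the commonly described approximation of WordPiece scoring, score(a, b) = count(ab) / (count(a) × count(b)), which favors pairs whose parts rarely occur apart. It is not a reproduction of the original WordPiece training procedure, and the word counts and helper names (`pair_stats`, `wp_score`) are illustrative.

```python
from collections import Counter

# Toy word frequencies (hypothetical), each word pre-split into characters.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

def pair_stats(word_freqs):
    """Return (symbol counts, adjacent-pair counts), weighted by word frequency."""
    symbols, pairs = Counter(), Counter()
    for tokens, freq in word_freqs.items():
        for tok in tokens:
            symbols[tok] += freq
        for left, right in zip(tokens, tokens[1:]):
            pairs[(left, right)] += freq
    return symbols, pairs

symbols, pairs = pair_stats(word_freqs)

# BPE criterion: the most frequent adjacent pair.
bpe_best = max(pairs, key=pairs.get)

def wp_score(pair):
    """Assumed WordPiece-style score: pair count normalized by the counts of its parts."""
    left, right = pair
    return pairs[pair] / (symbols[left] * symbols[right])

# WordPiece-style criterion: the pair with the highest normalized score.
wordpiece_best = max(pairs, key=wp_score)

print("BPE picks:", bpe_best)
print("WordPiece-style picks:", wordpiece_best)
```

On this toy data BPE selects `('u', 'g')`, the most frequent pair, while the normalized score selects `('g', 's')`, because `s` only ever appears directly after `g`.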

## Demo
Run a handful of BPE merges on a toy corpus and tokenize held-out words, following the lecture (https://youtu.be/b7kWQxSNJi0).

```python
from collections import Counter

corpus = ['low', 'lowest', 'newer', 'wider']

def tokenize(word):
    # Split a word into characters and append an end-of-word marker.
    return list(word) + ['</w>']

# Map each word, stored as a tuple of symbols, to its corpus frequency.
vocab = {tuple(tokenize(word)): count for word, count in Counter(corpus).items()}

def merge_vocab(vocab):
    # Count every adjacent symbol pair, weighted by word frequency.
    pair_counts = Counter()
    for tokens, freq in vocab.items():
        for i in range(len(tokens) - 1):
            pair_counts[(tokens[i], tokens[i+1])] += freq
    if not pair_counts:
        return vocab, None
    # Pick the most frequent pair; on ties, the pair counted first wins.
    best_pair = pair_counts.most_common(1)[0][0]
    # Rewrite each word, replacing every occurrence of best_pair with one merged symbol.
    new_vocab = {}
    for tokens, freq in vocab.items():
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == best_pair:
                merged.append(tokens[i] + tokens[i+1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        new_vocab[tuple(merged)] = freq
    return new_vocab, best_pair

# Learn three merges and record them in order.
merges = []
for _ in range(3):
    vocab, merge = merge_vocab(vocab)
    if merge:
        merges.append(merge)
        print('Merged pair:', merge)

print()
print('Final vocabulary tokens:')
for tokens in vocab:
    print(tokens)
```

```
Merged pair: ('l', 'o')
Merged pair: ('lo', 'w')
Merged pair: ('e', 'r')

Final vocabulary tokens:
('low', '</w>')
('low', 'e', 's', 't', '</w>')
('n', 'e', 'w', 'er', '</w>')
('w', 'i', 'd', 'er', '</w>')
```
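
The demo learns merges but stops short of segmenting new words. A minimal sketch of that step follows, assuming the three merges printed above and a hypothetical helper `apply_merges` that replays them in the order they were learned:

```python
# Merges copied from the demo output above, in the order they were learned.
merges = [('l', 'o'), ('lo', 'w'), ('e', 'r')]

def apply_merges(word, merges):
    """Segment a word by replaying learned merges in order (illustrative helper)."""
    tokens = list(word) + ['</w>']
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge the pair wherever it occurs; otherwise copy the symbol through.
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Held-out words that do not appear in the toy corpus.
for word in ['lower', 'newest']:
    print(word, '->', apply_merges(word, merges))
```

`lower` never appears in the toy corpus, yet it segments into `low` + `er` + `</w>`. `newest` stays at the character level because none of the three merges applies to it.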


## Try it
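- Modify the demo: change the number of merges in `range(3)` or add words to `corpus`, then compare the resulting vocabulary.
- Add a tiny dataset or counter-example, such as a word that none of the learned merges applies to.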


## References
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://www.aclweb.org/anthology/W19-4302/)
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/pdf/1804.07461.pdf)
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)
- [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/pdf/2004.03720.pdf)
- [Eisenstein 8.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [TnT - A Statistical Part-of-Speech Tagger](https://arxiv.org/abs/cs/0003055)
- [Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger](https://www.aclweb.org/anthology/W00-1308/)
- [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://link.springer.com/chapter/10.1007/978-3-642-19400-9_14)
- [Natural Language Processing with Small Feed-Forward Networks](https://www.aclweb.org/anthology/D17-1309.pdf)
- [Eisenstein 10.1-10.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3-10.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Accurate Unlexicalized Parsing](https://www.aclweb.org/anthology/P03-1054/)
- [Eisenstein 10.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 11.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Finding Optimal 1-Endpoint-Crossing Trees](https://www.aclweb.org/anthology/Q13-1002/)
- [Eisenstein 11.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)


*Links only; we do not redistribute slides or papers.*