# BERT: Masked Language Modeling

- 📺 **Video:** [https://youtu.be/dya_QNFvtiQ](https://youtu.be/dya_QNFvtiQ)

## Overview
- Learn how BERT trains with masked language modeling (MLM) by predicting masked tokens.
- See why masking 15% of tokens and using bidirectional context yields rich representations.

## Key ideas
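- **Input corruption:** select about 15% of token positions; of those, replace 80% with [MASK], replace 10% with a random token, and leave 10% unchanged.
- **Bidirectional context:** self-attention is not causally masked, so each position attends to tokens on both its left and its right.
- **Training objective:** cross-entropy computed over the selected (corrupted) positions only; all other positions are ignored by the loss.
- **Pre-training:** MLM plus next sentence prediction (or a replacement objective in later variants) prepares the encoder for fine-tuning.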
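
The corruption rule and the masked-only loss can be sketched together. This is a minimal illustration, not BERT's actual preprocessing code: the helper name `corrupt_for_mlm`, the token ids, and the random logits standing in for an encoder are all assumptions made for the example. The 15% selection rate and the 80/10/10 split follow the BERT paper.

```python
import torch
import torch.nn.functional as F


def corrupt_for_mlm(input_ids, mask_id, vocab_size, special_ids=(),
                    select_prob=0.15, generator=None):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is scored."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select roughly 15% of positions, never special tokens such as [CLS]/[SEP].
    probs = torch.full(input_ids.shape, select_prob)
    for sid in special_ids:
        probs[input_ids == sid] = 0.0
    selected = torch.bernoulli(probs, generator=generator).bool()
    labels[~selected] = -100  # default ignore_index of cross_entropy

    # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% unchanged.
    roll = torch.rand(input_ids.shape, generator=generator)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)
    input_ids[to_mask] = mask_id
    # Simplification: the random replacement may be any id, including special ones.
    random_ids = torch.randint(vocab_size, input_ids.shape, generator=generator)
    input_ids[to_random] = random_ids[to_random]
    return input_ids, labels


g = torch.Generator().manual_seed(0)
# Hypothetical vocabulary: ids 0 and 1 are special tokens, 4 is [MASK].
ids = torch.randint(5, 100, (4, 32), generator=g)
corrupted, labels = corrupt_for_mlm(ids, mask_id=4, vocab_size=100,
                                    special_ids=(0, 1), generator=g)

# Random logits stand in for the encoder output, shape (batch, seq, vocab).
logits = torch.randn(4, 32, 100, generator=g)
loss = F.cross_entropy(logits.view(-1, 100), labels.view(-1), ignore_index=-100)
print("scored positions:", int((labels != -100).sum()), "| loss:", round(loss.item(), 4))
```

Setting unselected labels to `-100` lets the standard cross-entropy loss skip them, so the gradient comes only from the corrupted positions.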

## Demo
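Train a toy masked language model with PyTorch on a three-sentence corpus, mirroring the mechanics from the lecture (https://youtu.be/6l-XtQV2iC0). Position 2 of each sentence is replaced with [MASK], and the model is trained to recover the original token.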

```python
import torch
from torch import nn

# Tiny corpus: three tokenized sentences with BERT-style special tokens.
sentences = [
    ['[CLS]', 'the', 'cat', 'sleeps', '[SEP]'],
    ['[CLS]', 'the', 'dog', 'runs', '[SEP]'],
    ['[CLS]', 'a', 'dog', 'sleeps', '[SEP]']
]

# Build the vocabulary (plus [MASK]) and map every token to an integer id.
vocab = {token for sent in sentences for token in sent} | {'[MASK]'}
word_to_id = {w: i for i, w in enumerate(sorted(vocab))}
ids = [[word_to_id[w] for w in sent] for sent in sentences]
tensor_inputs = torch.tensor(ids)
mask_token = word_to_id['[MASK]']

# Position-wise network: each token is embedded and transformed on its own.
# There is no attention, so a position never sees its neighbors.
model = nn.Sequential(
    nn.Embedding(len(vocab), 16),
    nn.Linear(16, 16),
    nn.ReLU(),
)
output_layer = nn.Linear(16, len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(output_layer.parameters()), lr=1e-2)

for epoch in range(1, 61):
    total_loss = 0.0
    for row in tensor_inputs:
        # Mask position 2 and keep the original token as the target.
        masked_row = row.clone()
        mask_idx = 2
        target = row[mask_idx].unsqueeze(0)
        masked_row[mask_idx] = mask_token
        embeddings = model(masked_row)
        logits = output_layer(embeddings)
        # Cross-entropy at the masked position only.
        loss = criterion(logits[mask_idx].unsqueeze(0), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch % 15 == 0:
        print(f"epoch {epoch:2d} | loss {total_loss/len(sentences):.4f}")

# Predict the masked token in the first sentence ("the [MASK] sleeps").
with torch.no_grad():
    test = tensor_inputs[0].clone()
    test[2] = mask_token
    logits = output_layer(model(test))[2]
    predicted = logits.argmax().item()
    inv_vocab = {idx: word for word, idx in word_to_id.items()}
    print()
    print('Predicted token for masked position:', inv_vocab[predicted])
```


```
epoch 15 | loss 0.6522
epoch 30 | loss 0.6466
epoch 45 | loss 0.6466
epoch 60 | loss 0.6463

Predicted token for masked position: dog
```
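
The demo above predicts "dog" for "the [MASK] sleeps", where the original token was "cat". Its network transforms each position independently, so the masked slot only ever sees the [MASK] embedding and can do no better than the marginal target distribution (two "dog", one "cat"). That is why the loss plateaus near the entropy of that distribution, about 0.637. The sketch below is one possible way to let the masked position use both neighbors: it adds position embeddings and a single unmasked self-attention layer. The class name `TinyContextMLM`, the layer sizes, the seed, and the step count are assumptions for illustration, and this is far simpler than a real BERT encoder.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

sentences = [
    ["[CLS]", "the", "cat", "sleeps", "[SEP]"],
    ["[CLS]", "the", "dog", "runs", "[SEP]"],
    ["[CLS]", "a", "dog", "sleeps", "[SEP]"],
]
vocab = sorted({tok for sent in sentences for tok in sent} | {"[MASK]"})
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([[word_to_id[w] for w in sent] for sent in sentences])
mask_idx = 2


class TinyContextMLM(nn.Module):
    """Token + position embeddings, one bidirectional attention layer, vocab head."""

    def __init__(self, vocab_size, seq_len, dim=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(seq_len, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        h = self.tok(x) + self.pos(positions)
        # No attention mask is passed, so every position sees left and right context.
        ctx, _ = self.attn(h, h, h)
        return self.out(h + ctx)  # residual connection, then project to vocab logits


model = TinyContextMLM(len(vocab), ids.size(1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Mask the same position in every sentence, as in the demo above.
masked = ids.clone()
masked[:, mask_idx] = word_to_id["[MASK]"]
targets = ids[:, mask_idx]

losses = []
for step in range(200):
    logits = model(masked)[:, mask_idx]  # score the masked position only
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    predictions = model(masked)[:, mask_idx].argmax(dim=-1)
for sent, pred in zip(sentences, predictions.tolist()):
    print(" ".join(sent), "->", vocab[pred])
print(f"first loss {losses[0]:.4f} | last loss {losses[-1]:.4f}")
```

Because the three contexts differ ("the ... sleeps", "the ... runs", "a ... sleeps"), a model that reads its neighbors can in principle separate "cat" from "dog", which the position-wise demo cannot.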


## Try it
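- Modify the demo: mask a different position, or pick the masked position at random at each step.
- Add a tiny dataset or a counter-example, such as a sentence whose masked word cannot be guessed without its neighbors.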


## References
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://www.aclweb.org/anthology/W19-4302/)
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/pdf/1804.07461.pdf)
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)
- [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/pdf/2004.03720.pdf)
- [Eisenstein 8.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [TnT - A Statistical Part-of-Speech Tagger](https://arxiv.org/abs/cs/0003055)
- [Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger](https://www.aclweb.org/anthology/W00-1308/)
- [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://link.springer.com/chapter/10.1007/978-3-642-19400-9_14)
- [Natural Language Processing with Small Feed-Forward Networks](https://www.aclweb.org/anthology/D17-1309.pdf)
- [Eisenstein 10.1-10.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3-10.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Accurate Unlexicalized Parsing](https://www.aclweb.org/anthology/P03-1054/)
- [Eisenstein 10.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 11.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Finding Optimal 1-Endpoint-Crossing Trees](https://www.aclweb.org/anthology/Q13-1002/)
- [Eisenstein 11.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)


*Links only; we do not redistribute slides or papers.*