# BERT: Model and Applications

- 📺 **Video:** [https://youtu.be/g96oi4ihc_E](https://youtu.be/g96oi4ihc_E)

## Overview
- Understand BERT's encoder architecture and how to fine-tune it for classification, QA, or tagging.
- Explore the [CLS] pooling strategy for sentence-level tasks.
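The three task types above differ mainly in which encoder outputs feed the head. A minimal sketch, assuming a random tensor as a stand-in for the encoder output and hypothetical head sizes (2 classes, 5 tag types) that are not from the lecture:

```python
import torch
from torch import nn

torch.manual_seed(0)

d_model, seq_len, batch = 24, 4, 2
# Stand-in for encoder output with shape (batch, seq_len, d_model).
# Random values for illustration only, not real BERT hidden states.
encoded = torch.randn(batch, seq_len, d_model)

cls_head = nn.Linear(d_model, 2)  # sentence classification (2 classes, assumed)
tag_head = nn.Linear(d_model, 5)  # token tagging (5 tag types, assumed)
qa_head = nn.Linear(d_model, 2)   # extractive QA: one start and one end logit per token

# Classification reads only position 0, where [CLS] sits.
cls_logits = cls_head(encoded[:, 0])  # (batch, 2)
# Tagging scores every token position independently.
tag_logits = tag_head(encoded)        # (batch, seq_len, 5)
# QA splits the last dimension into start and end scores over positions.
start_logits, end_logits = qa_head(encoded).unbind(dim=-1)  # each (batch, seq_len)

print(cls_logits.shape, tag_logits.shape, start_logits.shape, end_logits.shape)
```

In each case the encoder is shared and only the small linear head is task-specific.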

## Key ideas
- **Stacked self-attention:** bidirectional layers contextualize every token.
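- **[CLS] token:** its final hidden state serves as the aggregate sequence representation fed to classification heads.
- **Fine-tuning:** attach a small task-specific head and update all weights, encoder included, on the labeled task data.
- **Transfer:** because the encoder is pre-trained on unlabeled text, fine-tuning needs far less labeled data than training from scratch.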
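To make "update all weights" concrete, the sketch below compares the number of trainable parameters when the whole model is fine-tuned with the number left when the encoder is frozen and only the head is trained. The encoder sizes mirror the demo; the helper `count_trainable` is a hypothetical name introduced here, and the frozen-encoder variant is shown only for contrast.

```python
import torch
from torch import nn

def count_trainable(*modules):
    """Count parameters that will receive gradient updates (hypothetical helper)."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Small randomly initialized encoder; a real setup would load pre-trained weights.
layer = nn.TransformerEncoderLayer(d_model=24, nhead=4, dim_feedforward=48, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(24, 2)  # task-specific classification head

# Fine-tuning: encoder and head are both trainable.
full_count = count_trainable(encoder, head)

# Contrast: freeze the encoder so only the head is trainable.
for p in encoder.parameters():
    p.requires_grad = False
frozen_count = count_trainable(encoder, head)

print(f"fine-tuning: {full_count} trainable parameters")
print(f"frozen encoder: {frozen_count} trainable parameters")
```

With the encoder frozen, only the head's weight matrix and bias remain trainable.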

## Demo
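Train a miniature transformer encoder with a classifier head on the [CLS] position to separate positive from negative sentences, echoing the lecture (https://youtu.be/oD2jPreLz1o). The encoder here is randomly initialized rather than pre-trained, so the demo illustrates the fine-tuning loop and [CLS] pooling, not the benefit of transfer.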

In [1]:
import torch
from torch import nn

# Toy sentiment data: each sentence starts with [CLS] and ends with [SEP].
sentences = [
    ['[CLS]', 'great', 'film', '[SEP]'],
    ['[CLS]', 'boring', 'plot', '[SEP]'],
    ['[CLS]', 'wonderful', 'acting', '[SEP]'],
    ['[CLS]', 'dull', 'story', '[SEP]']
]
labels = torch.tensor([1, 0, 1, 0])  # 1 = positive, 0 = negative

# Build a vocabulary and map every token to an integer id.
vocab = {token for sent in sentences for token in sent}
word_to_id = {w: i for i, w in enumerate(sorted(vocab))}
inputs = torch.tensor([[word_to_id[w] for w in sent] for sent in sentences])

# Token embeddings only; unlike BERT, no positional embeddings are added in this toy model.
embedding = nn.Embedding(len(vocab), 24)
encoder_layer = nn.TransformerEncoderLayer(d_model=24, nhead=4, dim_feedforward=48, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
classifier = nn.Sequential(nn.LayerNorm(24), nn.Linear(24, 2))
criterion = nn.CrossEntropyLoss()
# Optimize embeddings, encoder, and head together, as in full fine-tuning.
params = list(embedding.parameters()) + list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=5e-3)

for epoch in range(1, 61):
    hidden = embedding(inputs)
    encoded = encoder(hidden)
    logits = classifier(encoded[:, 0])  # classify from the [CLS] position (index 0)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 15 == 0:
        # Accuracy on the training batch itself, computed in training mode (dropout active).
        preds = logits.argmax(dim=-1)
        acc = (preds == labels).float().mean().item()
        print(f"epoch {epoch:2d} | loss {loss.item():.4f} | acc {acc:.2f}")


epoch 15 | loss 0.0120 | acc 1.00
epoch 30 | loss 0.0034 | acc 1.00
epoch 45 | loss 0.0020 | acc 1.00
epoch 60 | loss 0.0012 | acc 1.00


## Try it
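- Modify the demo: change `num_layers` or `nhead`, or pool over all tokens instead of using `encoded[:, 0]`, and compare the loss curves.
- Add a tiny dataset or counter-example, such as held-out sentences with unseen words or a negated phrase, and check whether the classifier still separates the classes.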
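One possible modification is to replace [CLS] pooling with masked mean pooling over tokens. The sketch below compares the two on a random stand-in for the encoder output. The padding mask is hypothetical (the demo sentences all have the same length), and nothing here implies that one strategy outperforms the other.

```python
import torch

torch.manual_seed(0)

# Stand-in for encoder output: (batch, seq_len, d_model), random values for illustration.
encoded = torch.randn(4, 4, 24)
# Hypothetical padding mask: 1 marks a real token, 0 marks padding.
mask = torch.tensor([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
])

# [CLS] pooling: take the hidden state at position 0.
cls_pooled = encoded[:, 0]  # (batch, d_model)

# Masked mean pooling: average hidden states over real tokens only.
m = mask.unsqueeze(-1).float()                         # (batch, seq_len, 1)
mean_pooled = (encoded * m).sum(dim=1) / m.sum(dim=1)  # (batch, d_model)

print(cls_pooled.shape, mean_pooled.shape)
```

Either pooled tensor can be passed to the demo's `classifier` in place of `encoded[:, 0]`, since both have shape (batch, d_model).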


## References
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://www.aclweb.org/anthology/W19-4302/)
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/pdf/1804.07461.pdf)
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)
- [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/pdf/2004.03720.pdf)
- [Eisenstein 8.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [TnT - A Statistical Part-of-Speech Tagger](https://arxiv.org/abs/cs/0003055)
- [Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger](https://www.aclweb.org/anthology/W00-1308/)
- [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://link.springer.com/chapter/10.1007/978-3-642-19400-9_14)
- [Natural Language Processing with Small Feed-Forward Networks](https://www.aclweb.org/anthology/D17-1309.pdf)
- [Eisenstein 10.1-10.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3-10.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Accurate Unlexicalized Parsing](https://www.aclweb.org/anthology/P03-1054/)
- [Eisenstein 10.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 11.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Finding Optimal 1-Endpoint-Crossing Trees](https://www.aclweb.org/anthology/Q13-1002/)
- [Eisenstein 11.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)


*Links only; we do not redistribute slides or papers.*