# Using Transformers

- 📺 **Video:** [https://youtu.be/1Efx04lHa7w](https://youtu.be/1Efx04lHa7w)

## Overview
- Discuss how to fine-tune transformer encoders for downstream NLP tasks.
- Outline data preprocessing, classifier heads, and optimization details.

## Key ideas
- **Tokenization:** convert text to subword ids compatible with the model.
- **CLS representation:** the encoder output at the `[CLS]` position (the first token) serves as a pooled summary of the sequence and feeds a task-specific head.
- **Fine-tuning loop:** backpropagate through the entire model with a small learning rate.
- **Regularization:** dropout, weight decay, and early stopping prevent overfitting.
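
The tokenization and CLS bullets above can be made concrete with a toy example. The sketch below is a minimal greedy longest-match subword tokenizer in the WordPiece style. The vocabulary, the `wordpiece` and `encode` helpers, and the `max_len` parameter are all invented for illustration; they do not come from the lecture or from any real model. It prepends a `[CLS]` id so that position 0 of the encoder output can act as the sequence representation, then pads to a fixed length.

```python
# Hypothetical toy vocabulary; real models ship their own vocabulary files.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[UNK]": 2,
         "fine": 3, "tun": 4, "##ing": 5, "##e": 6, "work": 7, "##s": 8}


def wordpiece(word, vocab):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab=VOCAB, max_len=8):
    """Return (tokens, ids): [CLS] first, truncated and padded to max_len ids."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_len]
    ids = [vocab[t] for t in tokens]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return tokens, ids


tokens, ids = encode("Fine tuning works")
print(tokens)
print(ids)
```

With this toy vocabulary, "tuning" splits into `tun` and `##ing`, and the id list starts with the `[CLS]` id 1 and ends with two `[PAD]` ids. A real fine-tuning setup would use the tokenizer that ships with the pre-trained model, so that the ids match its embedding table.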

## Demo
Mock a transformer encoder with PyTorch and attach a classifier head to show the mechanics of fine-tuning described in the lecture (https://youtu.be/9HRtJNLa-HI).

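```python
import torch
from torch import nn


class MiniTransformerClassifier(nn.Module):
    """Toy encoder plus classification head, standing in for a pre-trained model."""

    def __init__(self, vocab_size=50, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        # No positional encoding is added, so this mock is insensitive to
        # token order apart from which token sits at position 0.
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Task-specific head: normalize the pooled vector, then project to 2 classes.
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 2)
        )

    def forward(self, input_ids):
        x = self.embedding(input_ids)    # (batch, seq, d_model)
        encoded = self.encoder(x)        # (batch, seq, d_model)
        cls_rep = encoded[:, 0]          # position 0 plays the role of [CLS]
        return self.classifier(cls_rep)  # (batch, 2) logits


model = MiniTransformerClassifier()
# The mock trains from scratch, so lr=1e-3 is used here; fine-tuning a
# pre-trained encoder typically calls for a much smaller learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Two toy sequences of token ids with binary labels.
inputs = torch.tensor([[1, 5, 2, 9], [3, 7, 8, 0]])
labels = torch.tensor([1, 0])

for epoch in range(1, 6):
    logits = model(inputs)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the head and the whole encoder
    optimizer.step()
    # Accuracy is computed from the pre-update logits, with dropout still active.
    preds = logits.argmax(dim=-1)
    acc = (preds == labels).float().mean().item()
    print(f"epoch {epoch} | loss {loss.item():.4f} | acc {acc:.3f}")
```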


```text
epoch 1 | loss 0.9924 | acc 0.500
epoch 2 | loss 0.9254 | acc 0.500
epoch 3 | loss 0.8271 | acc 0.500
epoch 4 | loss 0.5911 | acc 0.500
epoch 5 | loss 0.4961 | acc 0.500
```
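
The accuracy printed by the training loop is computed in training mode, where the dropout inside `nn.TransformerEncoderLayer` (0.1 by default) is still active, so it is a noisy estimate. A minimal sketch of the difference between training and evaluation mode follows. The single stand-alone layer and the random input are assumptions made to keep the example self-contained; it is not the demo model itself.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the demo encoder: one layer with the default dropout of 0.1.
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
x = torch.randn(2, 4, 32)  # (batch, sequence, d_model)

layer.train()  # dropout active: repeated forward passes differ
train_a, train_b = layer(x), layer(x)

layer.eval()  # dropout disabled: forward passes are repeatable
with torch.no_grad():  # no gradient bookkeeping is needed for evaluation
    eval_a, eval_b = layer(x), layer(x)

print("train passes identical:", torch.equal(train_a, train_b))
print("eval passes identical: ", torch.allclose(eval_a, eval_b))
```

In practice this means calling `model.eval()` and wrapping the forward pass in `torch.no_grad()` before measuring accuracy, then calling `model.train()` again before the next optimizer step.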


## Try it
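- Modify the demo: change `lr`, `num_layers`, or `d_model` and watch how the loss curve responds.
- Add a tiny dataset or counter-example: hold out a few sequences and check whether the model generalizes or only memorizes the two training examples.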
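
One possible starting point for these exercises is sketched below. It combines a tiny synthetic dataset with two of the regularizers named under Key ideas: weight decay (via `torch.optim.AdamW`) and early stopping on a validation split. The task (label is 1 when most tokens have a "high" id), the names `make_split` and `TinyClassifier`, and every hyperparameter are assumptions chosen for illustration, not values from the lecture. The label depends only on which tokens are present, not on their order, because the mock encoder has no positional encoding.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

CLS_ID, VOCAB_SIZE, SEQ_LEN = 0, 19, 5  # hypothetical toy settings


def make_split(n):
    """Synthetic task: label is 1 when most tokens are 'high' ids (>= 10)."""
    tokens = torch.randint(1, VOCAB_SIZE, (n, SEQ_LEN))  # ids 1..18
    labels = ((tokens >= 10).sum(dim=1) > SEQ_LEN // 2).long()
    cls = torch.full((n, 1), CLS_ID, dtype=torch.long)   # reserved id at position 0
    return torch.cat([cls, tokens], dim=1), labels


class TinyClassifier(nn.Module):
    def __init__(self, d_model=32):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=64, dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, 2)

    def forward(self, input_ids):
        # Classify from the encoder output at the CLS position.
        return self.head(self.encoder(self.embedding(input_ids))[:, 0])


train_x, train_y = make_split(64)
val_x, val_y = make_split(32)

model = TinyClassifier()
# AdamW applies decoupled weight decay to the parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

max_epochs, patience = 50, 5
best_val, best_state, bad_epochs = float("inf"), None, 0

for epoch in range(1, max_epochs + 1):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(train_x), train_y)  # one full-batch step per epoch
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_logits = model(val_x)
        val_loss = criterion(val_logits, val_y).item()
        val_acc = (val_logits.argmax(dim=-1) == val_y).float().mean().item()

    if val_loss < best_val:
        # New best: remember the weights and reset the patience counter.
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping: no improvement for `patience` epochs

model.load_state_dict(best_state)  # restore the best checkpoint
print(f"stopped after epoch {epoch} | best val loss {best_val:.4f} | last val acc {val_acc:.3f}")
```

Varying `weight_decay`, `patience`, or the size of the training split shows how each regularizer changes the gap between training and validation loss.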


## References
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
- [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)
- [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751)


*Links only; we do not redistribute slides or papers.*