# Using Transformers

- 📺 **Video:** [https://youtu.be/1Efx04lHa7w](https://youtu.be/1Efx04lHa7w)

## Overview
Practical considerations for using Transformers in NLP tasks, such as fine-tuning a pre-trained Transformer for a downstream task or setting up training for one from scratch.

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- Libraries and tools, such as Hugging Face's Transformers library, make these models easier to apply.
- A typical worked example is a Transformer encoder used for text classification (as BERT does) or a Transformer decoder used for language modeling.
- Practical questions include how to handle very long inputs (segmentation or hierarchical processing when the input exceeds the model's limit) and how much compute and memory these models require.
- Transfer learning with Transformers: BERT is essentially a Transformer encoder pre-trained with a masked language modeling objective and then fine-tuned for downstream tasks.
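The long-input and memory questions above can be made concrete with a small sketch. It assumes a model limit of 512 tokens and an overlap of 128 tokens between consecutive windows. Both numbers, and the helper names `chunk_tokens` and `attention_score_bytes`, are illustrative choices and not taken from the video. The second helper counts only the attention score matrices of one layer for one sequence, ignoring activations and parameters.

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens, so context that
    straddles a boundary appears whole in at least one window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def attention_score_bytes(seq_len, n_heads=12, bytes_per_float=4):
    """Bytes for one layer's attention scores: n_heads * seq_len^2 floats."""
    return n_heads * seq_len * seq_len * bytes_per_float


tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = chunk_tokens(tokens, max_len=512, stride=128)
print("chunk lengths:", [len(c) for c in chunks])

# Quadratic growth: 4x the sequence length costs 16x the score memory.
for n in (512, 2048):
    print(n, "tokens ->", attention_score_bytes(n), "bytes per layer")
```

Each chunk can then be passed through the model separately, with the per-chunk outputs combined afterwards (for example by averaging). How best to combine them depends on the task.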

## Demo

In [None]:
# Scaled dot-product attention (toy)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim 4
K = np.random.randn(5, 4)  # 5 keys, dim 4
V = np.random.randn(5, 6)  # 5 values, dim 6

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (3,5)
weights = softmax(scores, axis=-1)       # (3,5)
out = weights @ V                        # (3,6)
print("weights.shape:", weights.shape, "out.shape:", out.shape)
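For the decoder-style language modeling case mentioned in the key ideas, the same attention computation needs a causal mask so that position i attends only to positions at or before i. The following is a minimal sketch under simplifying assumptions: a single head, and the input `X` used directly as queries, keys, and values with no learned projections. The function name `causal_self_attention` is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """Single-head causal self-attention over X of shape (n, d)."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                      # (n, n)
    # True strictly above the diagonal marks "future" positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)         # block the future
    weights = softmax(scores, axis=-1)                 # masked entries become 0
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # 4 positions, dim 8
out, weights = causal_self_attention(X)
print(np.round(weights, 2))      # lower-triangular, each row sums to 1
```

The first position can only attend to itself, so its output equals its input row. Every later row mixes the current position with the ones before it.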


## Try it
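- Modify the demo: change the number of queries or keys, or the value dimension, and check that the output shapes change as expected.
- Add a tiny dataset or counter-example: for instance, make two rows of `K` identical and confirm that they receive equal attention weights.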


## References
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
- [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)
- [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751)


*Links only; we do not redistribute slides or papers.*