# Transformer Language Modeling

- 📺 **Video:** [https://youtu.be/htyspM3FrMg](https://youtu.be/htyspM3FrMg)

## Overview
This section covers using a Transformer for language modeling, which is directly relevant to Assignment 3. The standard setup is a decoder-only model: a stack of Transformer blocks with causal masking, trained to model text sequences one token at a time.

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- GPT (Generative Pre-trained Transformer) is the best-known example: essentially a many-layer Transformer decoder trained on massive text data to predict the next token.
- Unlike sequence-to-sequence tasks, plain language modeling does not need an encoder-decoder structure; the decoder stack alone is enough, with self-attention masked so that no position can see future tokens.
- During training, you feed in a long sequence and train the model to predict the next token at every position (the same objective used for RNN LMs, now computed by a Transformer over all positions at once).
- The advantages carry over from earlier material: self-attention gives each position direct access to far-back context, and stacking multiple layers allows complex conditioning.
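The next-token training objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the assignment's code: the token ids, `vocab_size`, and the random `logits` standing in for a model's output are all made-up assumptions. It shows how inputs and targets are the same sequence offset by one position, and how the loss is the average negative log-probability of each correct next token.

```python
import numpy as np

# Hypothetical toy data: a 6-token sequence over a vocabulary of 10 ids.
tokens = np.array([3, 1, 4, 1, 5, 9])
vocab_size = 10

# Inputs drop the last token; targets are the same sequence shifted left by one.
inputs, targets = tokens[:-1], tokens[1:]

# Stand-in for a model's output (assumption): random logits, one row per input position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))

# Numerically stable log-softmax over the vocabulary axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Cross-entropy: mean negative log-probability of the correct next token.
loss = -log_probs[np.arange(len(targets)), targets].mean()

print("inputs :", inputs)
print("targets:", targets)
print("loss:", loss, "perplexity:", np.exp(loss))
```

With a real model, `logits` would come from the Transformer applied to `inputs`, and the causal mask is what makes it valid to score all positions in a single pass.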

## Demo

In [None]:
# Scaled dot-product attention (toy)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim 4
K = np.random.randn(5, 4)  # 5 keys, dim 4
V = np.random.randn(5, 6)  # 5 values, dim 6

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (3,5)
weights = softmax(scores, axis=-1)       # (3,5)
out = weights @ V                        # (3,6)
print("weights.shape:", weights.shape, "out.shape:", out.shape)
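The toy above has no notion of token order, so here is a hedged sketch of the causally masked variant used for language modeling. The function name `causal_self_attention`, the projection matrices `Wq`, `Wk`, `Wv`, and the sizes are illustrative assumptions; this is a single head with no output projection, not a full Transformer block. Queries, keys, and values all come from the same sequence `X`, and scores for future positions are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention where position i attends only to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T = X.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])             # (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to future tokens
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_head = 5, 8, 4  # illustrative sizes
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = causal_self_attention(X, Wq, Wk, Wv)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
print("out.shape:", out.shape)
```

Because the diagonal is never masked, every row keeps at least one finite score, so the softmax stays well defined. A quick sanity check is to change the last row of `X` and confirm that the outputs at earlier positions do not move.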


## Try it
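- Modify the demo: give `Q` and `K` the same number of rows, mask out scores above the diagonal before the softmax, and check that each position puts zero weight on later positions.
- Add a tiny dataset or counter-example: for instance, drop the `np.sqrt(Q.shape[-1])` scaling and increase the dimension to see how the attention weights change.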


## References
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
- [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)
- [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751)


*Links only; we do not redistribute slides or papers.*