# Transformer Language Modeling

- 📺 **Video:** [https://youtu.be/htyspM3FrMg](https://youtu.be/htyspM3FrMg)

## Overview
- Use decoder-only transformers for autoregressive next-token prediction.
- Understand causal masking and iterative generation.
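
For reference, autoregressive next-token prediction is usually written with the chain-rule factorization below. The symbols $x_t$ (token at position $t$), $T$ (sequence length), and $\theta$ (model parameters) are notation introduced here, not taken from the lecture.

$$
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})
$$

Training typically minimizes the negative log-likelihood of this factorization:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
$$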

## Key ideas
- **Causal mask:** prevents the model from attending to future positions.
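- **Token embedding + positional encoding:** the two are summed to form the input representation at each position.
- **Decoding loop:** append each generated token to the input and run the model again to predict the next one.
- **Decoding strategies:** greedy decoding, beam search, or nucleus (top-p) sampling trade determinism against diversity in the generated text.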
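
To make the causal mask concrete, here is a minimal sketch of how it acts on attention scores for a single head. The toy length of 4 and the random scores are assumptions chosen for illustration; only the masking pattern matters.

```python
import torch

seq_len = 4  # hypothetical toy length

# Raw attention scores for one head: row i is the query position,
# column j is the key position.
torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)

# True above the diagonal marks "future" keys (j > i) that must be hidden.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Setting masked scores to -inf gives them exactly zero weight after softmax.
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(causal_mask.int())
print(weights)
```

Each row of `weights` still sums to 1, but all entries above the diagonal are zero, so position `i` only mixes information from positions `0..i`.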

## Demo
Build a tiny decoder-only transformer in PyTorch and greedily generate a short sequence from randomly initialized weights. The model is untrained, so the generated token ids are arbitrary; the goal is to illustrate the mechanics described in the lecture (https://youtu.be/5o3pm2pIq5E).

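```python
import torch
from torch import nn

vocab_size = 20
seq_len = 6
d_model = 16
nhead = 4
num_layers = 2

# Randomly initialized (untrained) modules. No seed is set, so outputs vary between runs.
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
emb = nn.Embedding(vocab_size, d_model)
pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learned positions, zero-initialized here
output_proj = nn.Linear(d_model, vocab_size)

decoder.eval()  # disable dropout so generation is deterministic for fixed weights

tokens = torch.tensor([[1]])  # start from a single token id, shape (batch=1, time=1)
with torch.no_grad():  # inference only, no gradients needed
    for _ in range(seq_len - 1):
        length = tokens.shape[1]
        # Token embeddings plus the positional parameters for the current prefix.
        inputs = emb(tokens) + pos[:, :length]
        # True above the diagonal: position i may not attend to positions j > i.
        causal_mask = torch.triu(torch.ones(length, length), diagonal=1).bool()
        # nn.TransformerDecoder also expects encoder `memory` for cross-attention;
        # a zero tensor is passed as a placeholder since there is no encoder here.
        memory = torch.zeros_like(inputs)
        decoded = decoder(inputs, memory, tgt_mask=causal_mask)
        # Only the last position is needed to predict the next token.
        logits = output_proj(decoded[:, -1])
        next_token = torch.argmax(logits, dim=-1, keepdim=True)  # greedy choice
        tokens = torch.cat([tokens, next_token], dim=1)

print('Generated token ids:', tokens.tolist())
```

Sample output from one run (the ids depend on the random initialization):

```
Generated token ids: [[1, 0, 13, 0, 15, 0]]
```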
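
The demo picks each token greedily with `torch.argmax`. As a sketch of the nucleus (top-p) alternative listed under Key ideas, the helper below samples from the smallest set of tokens whose cumulative probability reaches `top_p`. The function name `sample_top_p` and its default arguments are hypothetical, introduced here for illustration only.

```python
import torch

def sample_top_p(logits: torch.Tensor, top_p: float = 0.9, temperature: float = 1.0) -> torch.Tensor:
    """Sample one token id per row. logits has shape (batch, vocab)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token if the probability mass *before* it already reaches top_p.
    # The most likely token has zero mass before it, so it is always kept.
    drop = (cumulative - sorted_probs) >= top_p
    sorted_probs = sorted_probs.masked_fill(drop, 0.0)
    # Renormalize the surviving probabilities so they sum to 1.
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)  # index into sorted order
    return torch.gather(sorted_ids, -1, choice)  # vocabulary ids, shape (batch, 1)
```

Because the result has shape `(batch, 1)`, it could stand in for the greedy line in the loop, for example `next_token = sample_top_p(logits, top_p=0.9)`. With untrained weights the distribution is close to uniform, so the samples will look random.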


## Try it
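- Modify the demo, for example by changing `seq_len` or by replacing the greedy `argmax` with sampling.
- Add a tiny dataset or a counter-example, such as removing the causal mask and checking whether the generated ids change.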
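
One possible starting point for the dataset exercise is sketched below: it shows how inputs and targets are shifted by one position for next-token prediction and how the loss is computed. The toy corpus, the vocabulary size of 5, and the random stand-in logits are all assumptions made for illustration; in practice the logits would come from the model applied to every position.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy corpus: each row is a sequence of token ids from a vocabulary of 5.
data = torch.tensor([[0, 1, 2, 3, 4],
                     [1, 2, 3, 4, 0]])
vocab_size = 5

inputs = data[:, :-1]   # tokens the model sees
targets = data[:, 1:]   # the same sequences shifted left by one position

# Stand-in for model output: random logits of shape (batch, time, vocab).
torch.manual_seed(0)
logits = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size)

# cross_entropy expects (N, vocab) logits and (N,) class ids,
# so the batch and time dimensions are flattened together.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

print('inputs :', inputs.tolist())
print('targets:', targets.tolist())
print('loss   :', loss.item())
```

At each position the target is simply the next token of the input sequence, which is what the causal mask makes valid to train on in parallel.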


## References
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)
- [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
- [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794)
- [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751)


*Links only; we do not redistribute slides or papers.*