# Multi-Head Self-Attention

- 📺 **Video:** [https://youtu.be/nHXrdLMo8Uk](https://youtu.be/nHXrdLMo8Uk)

## Overview
Expands on self-attention by introducing multiple attention heads, as implemented in the Transformer model. The video explains that instead of computing a single attention distribution for each word, the Transformer uses h parallel attention heads (8 in the base model of “Attention Is All You Need”), each with its own projection matrices for queries, keys, and values. Each head can therefore focus on a different aspect of the relationships between words.

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- In a translation context, one head might align pronouns, another might handle verb tense, and another might connect noun-adjective pairs.
- In an English self-attention context, one head for a word might attend strongly to its subject (for agreement), while another head attends to an object (for its semantic role).
- Multi-head attention is computed as follows: each head yields its own context vector (output) for each word, and these outputs are then concatenated and linearly transformed to produce the final representation for that word in that layer. This allows the model to capture different types of relationships simultaneously.
- The paper “Attention Is All You Need” is referenced heavily here; the video may walk through that paper's schematic figure of multi-head attention.
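
The concatenate-and-project computation described in the list above can be written compactly. The notation below follows “Attention Is All You Need” rather than the video; $X$ is assumed to be the matrix of input word representations for the layer (one row per word), and $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$ are the learned projection matrices.

$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O
\end{aligned}
$$

In the paper's base model, $h = 8$ and $d_k = d_v = d_{\text{model}} / h = 64$, so the total cost is similar to that of a single head with full dimensionality.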

## Demo

In [None]:
# Scaled dot-product attention for a single head (toy example)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim 4
K = np.random.randn(5, 4)  # 5 keys, dim 4
V = np.random.randn(5, 6)  # 5 values, dim 6

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (3,5)
weights = softmax(scores, axis=-1)       # (3,5)
out = weights @ V                        # (3,6)
print("weights.shape:", weights.shape, "out.shape:", out.shape)
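
The single-head toy can be extended to several heads. The following is a minimal sketch of multi-head self-attention in NumPy, not the lecturer's implementation. The function name `multi_head_self_attention`, the random weight matrices, and the sizes (`n = 5`, `d_model = 8`, `num_heads = 2`) are assumptions chosen for illustration. The sketch also packs every head's projections into one `(d_model, d_model)` matrix and splits its columns per head, a common shortcut that is equivalent to keeping separate per-head matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).

    Hypothetical helper for illustration. Columns of W_q, W_k, W_v are
    split into num_heads equal blocks, one block per head.
    """
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head runs scaled dot-product attention independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                   # rows sum to 1 per head
    heads = weights @ V                                  # (num_heads, n, d_head)

    # Concatenate the heads along the feature axis, then apply W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2  # assumed toy sizes
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print("out.shape:", out.shape, "weights.shape:", weights.shape)
# out.shape: (5, 8) weights.shape: (2, 5, 5)
```

Each slice `weights[i]` is one head's `(n, n)` attention pattern; comparing the slices shows how different heads can attend to different positions for the same word.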


## Try it
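- Modify the demo: change the number of keys or the dimension of `Q` and `K`, and check how `weights` and `out` change shape.
- Add a tiny dataset or counter-example, such as two identical keys, and confirm that they receive equal attention weights.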


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*