# Self-Attention

- 📺 **Video:** [https://youtu.be/10l2NXStROU](https://youtu.be/10l2NXStROU)

## Overview
- Explain self-attention as attention applied within a sequence to relate tokens to one another.
- Highlight how self-attention replaces recurrence with parallelizable operations.

## Key ideas
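- **Queries, keys, and values from one source:** all three are derived from the same input sequence through separate learned projections.
- **Contextualization:** each token attends to the others, and its new representation is a weighted average of their value vectors.
- **Masking:** a causal mask blocks attention to later positions, so a language model cannot see the tokens it is being trained to predict.
- **Computational benefits:** any two positions interact within a single layer, so long-range dependencies need a constant number of sequential steps instead of one step per intervening token as in a recurrent network. The trade-off is compute and memory that grow quadratically with sequence length.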
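
The parallelism point can be checked with a small NumPy sketch. The sizes, the random embeddings `X`, and the projection matrices `W_q` and `W_k` below are made-up stand-ins for illustration, not values from the lecture. Queries and keys are projections of the same input, and because no score depends on another token's output, a token-by-token loop and a single matrix product give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed; the values are arbitrary
n_tokens, d_model, d_k = 4, 3, 2         # hypothetical toy sizes

X = rng.normal(size=(n_tokens, d_model))     # one sequence of token embeddings
W_q = rng.normal(size=(d_model, d_k))        # stand-ins for learned projections
W_k = rng.normal(size=(d_model, d_k))

# Queries and keys are both projections of the same input X.
Q = X @ W_q
K = X @ W_k

# Token-by-token view: each score is an independent scaled dot product.
scores_loop = np.empty((n_tokens, n_tokens))
for i in range(n_tokens):
    for j in range(n_tokens):
        scores_loop[i, j] = Q[i] @ K[j] / np.sqrt(d_k)

# Parallel view: the same scores from one matrix product.
scores_matmul = Q @ K.T / np.sqrt(d_k)

print(np.allclose(scores_loop, scores_matmul))  # → True
```

A recurrent model cannot be rewritten this way, because each hidden state has to wait for the previous one.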

## Demo
This demo applies scaled dot-product self-attention to a sequence of three token embeddings and uses a causal mask to restrict look-ahead, mirroring the lecture ([https://youtu.be/VYBd44C1rBw](https://youtu.be/VYBd44C1rBw)).
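
For reference, the computation in the demo can be written in the notation of *Attention Is All You Need*. The mask matrix $M$ is a name introduced here for illustration, and the code approximates $-\infty$ with `-1e9`:

$$
Q = X W_q, \quad K = X W_k, \quad V = X W_v
$$

$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
$$

Here $d_k$ is the query/key dimension (2 in the demo) and the softmax is taken over each row, so row $i$ holds the weights token $i$ places on tokens $1, \dots, i$.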

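```python
import numpy as np

# Three token embeddings (one per row), each of dimension 3.
X = np.array([
    [0.5, 0.1, 0.3],
    [0.2, 0.4, 0.1],
    [0.7, 0.0, 0.2]
])
# Projection matrices mapping 3-d embeddings to 2-d queries, keys, and values.
W_q = np.array([[0.3, 0.6], [0.5, 0.1], [0.2, 0.4]])
W_k = np.array([[0.4, 0.2], [0.1, 0.7], [0.3, 0.5]])
W_v = np.array([[0.6, 0.3], [0.4, 0.2], [0.1, 0.8]])

# Q, K, and V all come from the same input X.
Q = X @ W_q
K = X @ W_k
V = X @ W_v

# Scaled dot-product scores: entry (i, j) is how strongly token i attends to token j.
scale = np.sqrt(Q.shape[-1])
logits = Q @ K.T / scale

# Causal mask: entries above the diagonal are future positions; adding -1e9
# drives their softmax weight to effectively zero.
mask = np.triu(np.ones_like(logits), k=1) * -1e9
logits_masked = logits + mask

# Row-wise softmax, shifted by the row maximum for numerical stability.
weights = np.exp(logits_masked - logits_masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted average of the value vectors.
context = weights @ V

print('Self-attention weights (causal):')
print(weights)
print()
print('Contextualized representations:')
print(context)
```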


Output:

```text
Self-attention weights (causal):
[[1.         0.         0.        ]
 [0.50565661 0.49434339 0.        ]
 [0.33667649 0.33371378 0.32960973]]

Contextualized representations:
[[0.37       0.41      ]
 [0.33045253 0.31607476]
 [0.36637558 0.33340999]]
```
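
To see what the mask buys, the sketch below wraps the demo computation in a helper and perturbs only the last token. The function name `self_attention`, its `causal` flag, and the replacement embedding are illustrative choices, not part of the lecture. With the causal mask, the outputs for the earlier tokens should stay the same; without it, they should change.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v, causal=True):
    """Scaled dot-product self-attention over one sequence (rows of X)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Positions above the diagonal are future tokens; block them.
        future = np.triu(np.ones_like(logits, dtype=bool), k=1)
        logits = np.where(future, -np.inf, logits)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Same toy inputs as the demo above.
X = np.array([[0.5, 0.1, 0.3], [0.2, 0.4, 0.1], [0.7, 0.0, 0.2]])
W_q = np.array([[0.3, 0.6], [0.5, 0.1], [0.2, 0.4]])
W_k = np.array([[0.4, 0.2], [0.1, 0.7], [0.3, 0.5]])
W_v = np.array([[0.6, 0.3], [0.4, 0.2], [0.1, 0.8]])

# Perturb only the last token (an arbitrary change, chosen for illustration).
X_changed = X.copy()
X_changed[-1] = [0.0, 0.9, 0.9]

ctx_causal, w_causal = self_attention(X, W_q, W_k, W_v, causal=True)
ctx_causal_changed, _ = self_attention(X_changed, W_q, W_k, W_v, causal=True)
ctx_full, w_full = self_attention(X, W_q, W_k, W_v, causal=False)
ctx_full_changed, _ = self_attention(X_changed, W_q, W_k, W_v, causal=False)

# With the mask, tokens 0 and 1 never see token 2, so their outputs are unaffected.
print(np.allclose(ctx_causal[:2], ctx_causal_changed[:2]))  # → True
# Without the mask, every token attends to token 2, so the earlier outputs shift.
print(np.allclose(ctx_full[:2], ctx_full_changed[:2]))      # → False
```

The helper uses `-np.inf` with `np.where` instead of adding `-1e9`; both give masked positions a weight of zero for inputs of this size.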


## Try it
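- Modify the demo: remove the causal mask or change the scale factor, then compare the attention weights.
- Add a tiny dataset or counter-example, such as a longer sequence or a repeated token, and check how the weights change.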


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*