# Self-Attention

- 📺 **Video:** [https://youtu.be/10l2NXStROU](https://youtu.be/10l2NXStROU)

## Overview
Builds on the concept of attention to explain self-attention, in which a sequence attends to itself. Self-attention is the core building block of the Transformer (Vaswani et al., 2017), where it is used in every layer.

```python
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'
```

## Key ideas
- The video explains that in self-attention, each word in a sequence looks at other words in the same sequence to gather context.
- For example, consider a sentence: “The bank near the river was overflowing.” To encode “bank” meaningfully, self-attention allows the model to look at “river” (and maybe “overflowing”) to understand “bank” here means river bank, not financial institution.
- Concretely, self-attention is computed from queries, keys, and values, each a linear projection of a word's embedding or hidden state.
- For each word as a query, we take dot products with every word's key, normalize the scores (typically with a softmax) into attention weights, and use those weights to sum the value vectors. The result is a new representation for that word that integrates information from the others. In a 3-word sequence, for instance, each word's new representation is a weighted mix of all three words' value vectors.
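The steps above can be sketched on a 3-word sequence. This is a minimal illustration, not a trained model: the token list, the dimensions, and the random matrices `X`, `W_q`, `W_k`, and `W_v` are all assumptions standing in for learned embeddings and projections. What makes it self-attention is that queries, keys, and values are all projected from the same `X`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical 3-token sequence; random vectors stand in for learned embeddings.
tokens = ["bank", "river", "overflowing"]
d_model, d_k = 4, 4
X = rng.normal(size=(len(tokens), d_model))

# Random stand-ins for the learned projection matrices (assumption).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Self-attention: queries, keys, and values all come from the same X.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)      # (3, 3): every token scored against every token
weights = softmax(scores, axis=-1)   # each row sums to 1
new_X = weights @ V                  # (3, d_k): weighted mix of value vectors

for tok, row in zip(tokens, weights):
    print(f"{tok:>12} attends with weights {np.round(row, 2)}")
```

Because the weights are random here, they carry no linguistic meaning; in a trained model the row for “bank” would be expected to put noticeable weight on “river”.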

## Demo

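```python
# Scaled dot-product attention (toy)
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Note: with 3 queries and 5 keys this is generic attention.
# In self-attention, Q, K, and V are all projected from the same sequence,
# so the number of queries equals the number of keys.
np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim 4
K = np.random.randn(5, 4)  # 5 keys, dim 4
V = np.random.randn(5, 6)  # 5 values, dim 6

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (3,5)
weights = softmax(scores, axis=-1)       # (3,5), each row sums to 1
out = weights @ V                        # (3,6)
print("weights.shape:", weights.shape, "out.shape:", out.shape)
```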
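The demo divides the scores by the square root of the key dimension. The motivation given in Vaswani et al. (2017) is that dot products grow in magnitude with the dimension, which pushes the softmax toward a near one-hot output. The sketch below compares the two cases on random data; the dimension `d = 256` and the vectors `q` and `K` are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

d = 256                        # assumed key/query dimension
q = rng.normal(size=(d,))      # one random query
K = rng.normal(size=(5, d))    # five random keys

raw = K @ q                    # unscaled scores, variance roughly d
scaled = raw / np.sqrt(d)      # scaled scores, variance roughly 1

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("largest weight without scaling:", round(float(w_raw.max()), 3))
print("largest weight with scaling:   ", round(float(w_scaled.max()), 3))
```

Without scaling, almost all of the weight tends to land on a single key, so the output is close to copying one value vector; with scaling, the weights are spread more evenly across keys.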


## Try it
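- Modify the demo so that `Q`, `K`, and `V` are all computed from the same input matrix, making it true self-attention, and check that `weights` becomes square.
- Add a tiny dataset or counter-example: for instance, shuffle the rows of the input and confirm that the output rows are shuffled the same way, since self-attention without positional information ignores word order.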


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*