# Attention

- 📺 **Video:** [https://youtu.be/q7HY7tpWWi8](https://youtu.be/q7HY7tpWWi8)

## Overview
Introduces the attention mechanism, a game-changing concept first developed for neural machine translation by Bahdanau et al. (2015). The problem that motivated it: in an encoder-decoder RNN model, the encoder compresses the source sentence into a single fixed-size vector, which the decoder then uses to produce the target sentence.
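
To make the single-vector bottleneck concrete, here is a minimal sketch of a vanilla tanh RNN encoder. The sizes `emb_dim` and `hid_dim`, the random weights, and the `encode` helper are assumptions for illustration, not taken from the lecture. It shows that the final hidden state has the same size whether the input has 4 tokens or 40, while the full sequence of states grows with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 8, 16                      # hypothetical sizes
W_x = rng.standard_normal((hid_dim, emb_dim)) * 0.1
W_h = rng.standard_normal((hid_dim, hid_dim)) * 0.1

def encode(embeddings):
    """Run a vanilla tanh RNN over the inputs and return every hidden state."""
    h = np.zeros(hid_dim)
    states = []
    for x in embeddings:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)                   # (sequence length, hid_dim)

short = encode(rng.standard_normal((4, emb_dim)))    # 4-token "sentence"
long = encode(rng.standard_normal((40, emb_dim)))    # 40-token "sentence"

# Without attention the decoder only sees the last state: same size for both.
print(short[-1].shape, long[-1].shape)
# With attention the decoder can consult all states, one per source token.
print(short.shape, long.shape)
```

The last state must summarize the whole sentence no matter how long it is; attention instead keeps every row of the returned array available to the decoder.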

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- The fixed-size vector is a bottleneck for long sentences: it loses a lot of detail, and the decoder has no explicit access to specific parts of the source when generating a given target word.
- Attention addresses this: for each word the decoder generates, it can attend to (focus on) different parts of the source sentence dynamically.
- A standard illustration is translating “The cat sat on the mat” into French.
- When generating the word “chat” (French for “cat”), the attention mechanism scores every source word and places most of its weight on “cat”.
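
The “chat”/“cat” example can be sketched with hand-made vectors. Everything numeric here is an assumption for illustration: the encoder states `H` are one-hot rows rather than learned RNN states, the decoder state `s` is constructed to resemble the state for “cat”, and a plain dot product stands in for the learned additive score used by Bahdanau et al.

```python
import numpy as np

source = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical encoder states: one one-hot row per source position (not learned).
H = np.eye(len(source))                        # (6, 6)

# Hypothetical decoder state while generating "chat": built to resemble the
# encoder state of "cat" (index 1), with a little mass on its neighbours.
s = np.array([0.2, 3.0, 0.2, 0.0, 0.0, 0.0])

scores = H @ s                                 # one score per source word, (6,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # softmax: attention distribution

context = weights @ H                          # weighted sum of encoder states

for word, w in zip(source, weights):
    print(f"{word:>4s}  {w:.3f}")
print("most attended:", source[int(weights.argmax())])
```

The `context` vector is what the decoder would receive for this step; it is recomputed with fresh weights for every target word.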

## Demo

In [None]:
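# Scaled dot-product attention (toy)
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along `axis` for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim 4
K = np.random.randn(5, 4)  # 5 keys, dim 4
V = np.random.randn(5, 6)  # 5 values, dim 6

# Similarity of every query to every key, scaled by sqrt(key dimension)
scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (3,5)
# Each row is a probability distribution over the 5 keys
weights = softmax(scores, axis=-1)       # (3,5)
# Each output row is a weighted average of the value vectors
out = weights @ V                        # (3,6)
print("weights.shape:", weights.shape, "out.shape:", out.shape)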
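
The demo divides the scores by the square root of the key dimension. A small sketch of why that scaling matters, under assumed settings (`d = 256`, standard-normal queries and keys, a hypothetical seed): unscaled dot products grow with the dimension, which pushes the softmax toward a near one-hot distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 256                                   # assumed dimension, larger than the toy demo
Q = rng.standard_normal((3, d))           # 3 queries
K = rng.standard_normal((5, d))           # 5 keys

raw = Q @ K.T                             # unscaled dot products, (3, 5)
w_unscaled = softmax(raw)
w_scaled = softmax(raw / np.sqrt(d))

# Average of the largest weight in each row: closer to 1 means more peaked.
peak_unscaled = w_unscaled.max(axis=-1).mean()
peak_scaled = w_scaled.max(axis=-1).mean()
print("mean max weight, unscaled:", round(float(peak_unscaled), 3))
print("mean max weight, scaled:  ", round(float(peak_scaled), 3))
```

Dividing the scores by a constant greater than 1 can only flatten each row of the softmax, so the scaled weights are never more peaked than the unscaled ones.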


## Try it
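- Modify the demo: change the number of queries or keys, or remove the `np.sqrt` scaling, and see how `weights` changes.
- Add a tiny dataset or counter-example, such as a short sentence with hand-made vectors, and inspect which keys receive the most weight.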


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*