# Attention

- 📺 **Video:** [https://youtu.be/q7HY7tpWWi8](https://youtu.be/q7HY7tpWWi8)

## Overview
- Introduce attention as a mechanism for weighted averaging of value vectors using relevance scores.
- Understand how attention generalizes alignment ideas from machine translation.
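
The weighted average described above can be written out explicitly. The notation here is an assumption chosen for illustration and is not taken from the lecture: $q$ is the query, $k_i$ and $v_i$ are the $i$-th key and value, $\alpha_i$ are the attention weights, $c$ is the resulting context vector, and $\operatorname{score}$ stands for whichever relevance function is in use.

$$
\alpha_i = \frac{\exp\big(\operatorname{score}(q, k_i)\big)}{\sum_{j=1}^{n} \exp\big(\operatorname{score}(q, k_j)\big)},
\qquad
c = \sum_{i=1}^{n} \alpha_i \, v_i
$$

Because the $\alpha_i$ are non-negative and sum to 1, $c$ is a convex combination of the value vectors.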

## Key ideas
- **Query, key, value:** attention scores keys against the query to weight values.
- **Score functions:** dot-product, additive, and scaled dot-product scoring each measure how well a key matches the query, which determines where attention focuses.
- **Context vectors:** weighted sums condition each prediction on relevant inputs.
- **Differentiability:** every step of attention is differentiable, so the parameters that produce the weights are learned end-to-end via backpropagation.
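
The three score functions named above can be sketched side by side in NumPy. This is a minimal illustration, not the lecture's implementation: the function names are made up here, and the additive variant uses randomly initialized, untrained parameters (`W_q`, `W_k`, `v`, hidden size 4) purely to show the shapes involved. In a real model those parameters would be learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_score(q, K):
    # q: (d,), K: (n, d) -> one raw dot product per key, shape (n,)
    return K @ q

def scaled_dot_score(q, K):
    # Same as dot_score, divided by sqrt(d) to keep logits in a moderate range
    return (K @ q) / np.sqrt(K.shape[1])

def additive_score(q, K, W_q, W_k, v):
    # Additive form: v^T tanh(W_q q + W_k k_i) for each key k_i
    hidden = np.tanh(q @ W_q.T + K @ W_k.T)  # shape (n, h)
    return hidden @ v                        # shape (n,)

q = np.array([1.0, 0.5])
K = np.array([[0.8, 0.6], [0.2, 0.9], [0.0, 0.4]])

# Hypothetical, untrained parameters for the additive score (assumption)
rng = np.random.default_rng(0)
W_q = rng.normal(size=(4, 2))
W_k = rng.normal(size=(4, 2))
v = rng.normal(size=4)

weights_dot = softmax(dot_score(q, K))
weights_scaled = softmax(scaled_dot_score(q, K))
weights_add = softmax(additive_score(q, K, W_q, W_k, v))

print('dot-product:       ', weights_dot)
print('scaled dot-product:', weights_scaled)
print('additive:          ', weights_add)
```

All three produce a valid distribution over the keys; they differ only in how relevance is measured before the softmax.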

## Demo
Calculate attention weights for a toy query-key set and observe how changing queries shifts focus, as illustrated in the lecture (https://youtu.be/sVhHhVgZ72E).

```python
import numpy as np

# One query and three key-value pairs, all 2-dimensional
queries = np.array([[1.0, 0.5]])
keys = np.array([[0.8, 0.6], [0.2, 0.9], [0.0, 0.4]])
values = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

# Scaled dot-product scores: divide by sqrt(key dimension)
scale = np.sqrt(keys.shape[1])
logits = queries @ keys.T / scale

# Softmax (subtracting the max for numerical stability)
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Context vector: attention-weighted sum of the values
context = weights @ values

print('Attention weights:', weights.squeeze())
print('Context vector:', context.squeeze())

# Repeat with a query that points mostly along the second dimension
new_query = np.array([[0.1, 1.2]])
logits2 = new_query @ keys.T / scale
weights2 = np.exp(logits2 - logits2.max())
weights2 /= weights2.sum()
context2 = weights2 @ values

print()
print('With new query:')
print('Attention weights:', weights2.squeeze())
print('Context vector:', context2.squeeze())
```

```
Attention weights: [0.44313378 0.32236152 0.2345047 ]
Context vector: [0.56038613 0.43961387]

With new query:
Attention weights: [0.32961847 0.40751098 0.26287054]
Context vector: [0.46105375 0.53894625]
```

## Try it
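- Modify the demo: change the query or drop the `scale` factor and observe how the attention weights shift.
- Add a tiny dataset or counter-example, such as a query that scores every key equally.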
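
One possible counter-example to try is attention without the `sqrt(d)` scaling when the vectors are high-dimensional. The sketch below is an illustration under stated assumptions, not part of the lecture demo: the dimension (512), the number of keys (8), and the random Gaussian query and keys are all arbitrary choices. It compares how concentrated the weights become with and without scaling.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Arbitrary illustrative sizes (assumptions, not from the lecture)
d = 512
n_keys = 8

rng = np.random.default_rng(0)
q = rng.normal(size=d)
K = rng.normal(size=(n_keys, d))

raw = K @ q                         # raw dot products grow with d
unscaled = softmax(raw)             # no scaling
scaled = softmax(raw / np.sqrt(d))  # scaled dot-product

print('largest weight without scaling:', unscaled.max())
print('largest weight with scaling:   ', scaled.max())
```

Dividing the logits by a constant greater than 1 always flattens the softmax, so the scaled weights are spread more evenly across keys. Without scaling, large dot products tend to push nearly all the weight onto a single key.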


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*