# Position Encodings

- 📺 **Video:** [https://youtu.be/a8sTGth7PoU](https://youtu.be/a8sTGth7PoU)

## Overview
Discusses how the Transformer handles word order, given that self-attention by itself doesn't encode sequence position. The video explains positional encoding: the method used to inject position information into the word representations before self-attention is applied. It likely describes the specific approach from Vaswani et al.: adding sine and cosine functions of different frequencies to the embedding at each position, so that each position gets a vector of the same dimension as the model, with components that oscillate at different rates.
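
The sinusoidal scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: the function name `sinusoidal_position_encoding`, the `base` parameter, and the all-zero `token_embeddings` are assumptions made up for the example, and the sketch assumes an even `d_model`.

```python
import math

def sinusoidal_position_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of fixed sine/cosine encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Each (sin, cos) pair shares one frequency; higher i oscillates more slowly.
            angle = pos / (base ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)
            row[2 * i + 1] = math.cos(angle)
        table.append(row)
    return table

pe = sinusoidal_position_encoding(max_len=50, d_model=8)

# Hypothetical toy embeddings (all zeros) for a 3-token input, so that the
# sum below is just the position encoding itself.
token_embeddings = [[0.0] * 8 for _ in range(3)]
inputs = [
    [e + p for e, p in zip(emb, pe[pos])]
    for pos, emb in enumerate(token_embeddings)
]

print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

The encoding is added to the embeddings rather than concatenated, which is why it must have the same dimension as the model.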

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- These sine/cosine values are fixed and deterministic, providing a unique signature for each position index.
- The lecturer might show the formula: PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)), and mention that this allows the model to learn to attend by relative positions: a fixed offset between two positions corresponds to a deterministic phase shift in those sine waves, so PE(pos+k) is a linear function of PE(pos). The idea is that the attention mechanism can pick up on patterns like “distance between two words” via these encodings.
- The video could also mention learned positional embeddings (where the model simply learns one position vector for each position up to a maximum length) as an alternative approach.
- Furthermore, it might touch on the recent research cited (Press et al., 2021 and Kazemnejad et al., 2023), which investigates how positional biases affect a Transformer's ability to generalize to sequences longer than those seen in training. For instance, Press et al.'s “Train Short, Test Long” shows that certain positional encoding schemes can extrapolate better. The course may not dive deeply into those, but acknowledging them highlights that position encoding is an active design consideration.
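
The phase-shift claim can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything taken from the video: `encode`, `shift`, and `dot` are hypothetical helper names, and `d_model = 16` is an arbitrary choice. It rotates each (sin, cos) pair of PE(pos) by an angle that depends only on the offset k and compares the result with PE(pos + k) computed directly.

```python
import math

def encode(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (even d_model assumed)."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def shift(vec, k, d_model, base=10000.0):
    """Map PE(pos) to PE(pos + k) using only the offset k, never pos itself."""
    out = []
    for i in range(d_model // 2):
        s, c = vec[2 * i], vec[2 * i + 1]
        phase = k / (base ** (2 * i / d_model))
        # Angle-addition identities: a rotation of the (sin, cos) pair.
        out += [s * math.cos(phase) + c * math.sin(phase),
                c * math.cos(phase) - s * math.sin(phase)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16
shifted = shift(encode(7, d_model), 5, d_model)
direct = encode(12, d_model)
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_err < 1e-9)

# The dot product of two encodings depends only on their offset (5 here).
near_start = dot(encode(3, d_model), encode(8, d_model))
further_on = dot(encode(40, d_model), encode(45, d_model))
print(abs(near_start - further_on) < 1e-9)
```

Because `shift` never looks at the absolute position, the same linear map works anywhere in the sequence, which is the property that makes relative offsets recoverable from these encodings.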

## Demo

In [None]:
print('Try the exercises below and follow the linked materials.')
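
As one possible starting point for the exercises, here is a rough sketch of the linear-bias idea from Press et al.'s “Train Short, Test Long” (ALiBi), which drops position vectors and instead penalizes attention scores in proportion to distance. The helper names `alibi_slopes` and `alibi_bias` are hypothetical, and the slope formula assumes the number of heads is a power of two; treat this as an illustration, not a reference implementation.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias matrix: query i, key j <= i gets slope * (j - i); future keys are masked."""
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=4, slope=slopes[0])

# The bias is added to the query-key scores before the softmax, so keys
# further in the past receive a larger penalty.
for row in bias:
    print(row)
```

Nothing here depends on a maximum length: the penalty is defined for any distance, which is the intuition behind the extrapolation results.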

## Try it
- Modify the demo
- Add a tiny dataset or counter-example


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*