# Position Encodings

- 📺 **Video:** [https://youtu.be/a8sTGth7PoU](https://youtu.be/a8sTGth7PoU)

## Overview
- Explain why transformers need position information and how sinusoidal encodings provide it.
- Compare sinusoidal and learned positional embeddings.
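
One way to see why position information is needed: self-attention on its own treats its input as an unordered set. The sketch below is a simplification, assuming a single head with identity query/key/value projections and random stand-in embeddings (`self_attention`, `tokens`, and `pos` are illustrative names, not from the lecture). Reordering the tokens only reorders the outputs, and adding position vectors breaks that symmetry.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (illustrative simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))     # 5 stand-in token embeddings, dimension 8
perm = np.array([1, 2, 3, 4, 0])     # a fixed reordering of the sequence

# Without positions: shuffling the input just shuffles the output rows.
out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])
order_blind = np.allclose(out[perm], out_shuffled)

# With position vectors added (random stand-ins here), the same token
# receives a different representation depending on where it appears.
pos = rng.normal(size=(5, 8))
out_pos = self_attention(tokens + pos)
out_pos_shuffled = self_attention(tokens[perm] + pos)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print("order-blind without positions:", order_blind)
print("order-aware with positions:", order_aware)
```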

## Key ideas
- **Absolute positions:** add vectors representing positions to token embeddings.
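- **Sinusoidal design:** fixed sine and cosine waves at geometrically spaced frequencies; the encoding is defined for any position, so it can be computed for sequences longer than those seen in training.
- **Learned encodings:** the model learns one vector per position, which can adapt to the data but may not generalize beyond the training length.
- **Relative encodings:** represent the distance between each pair of tokens directly rather than their absolute indices, for better long-range behavior.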
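
The contrast between learned and relative encodings can be sketched in a few lines. This is a toy illustration under stated assumptions: `learned_table` is a random stand-in for trained parameters, and `learned_position` and `relative_offsets` are hypothetical helper names, not part of any library.

```python
import numpy as np

max_train_len, d_model = 4, 8
rng = np.random.default_rng(0)

# Learned absolute encoding: one trainable row per position index
# (random values stand in for trained parameters).
learned_table = rng.normal(size=(max_train_len, d_model))

def learned_position(pos):
    """Look up the vector for an absolute position; fails past the table size."""
    if pos >= max_train_len:
        raise IndexError(
            f"position {pos} has no learned vector (table has {max_train_len} rows)"
        )
    return learned_table[pos]

def relative_offsets(seq_len):
    """Matrix whose entry (i, j) is the offset j - i between query i and key j."""
    idx = np.arange(seq_len)
    return idx[None, :] - idx[:, None]

print(relative_offsets(4))
```

Offsets seen in a short sequence reappear unchanged in a longer one (the top-left block of `relative_offsets(6)` equals `relative_offsets(4)`), whereas `learned_position(4)` has nothing to return.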

## Demo
Generate sinusoidal position encodings and inspect how the even (sine) and odd (cosine) dimensions vary with position, as shown in the lecture (https://youtu.be/3kumehM_qRQ).
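
For reference, the demo code corresponds to the sinusoidal formulation from *Attention Is All You Need* (listed in the references). The symbols $pos$ (position), $i$ (index of the sine/cosine pair), and $d_{model}$ (embedding size) follow that paper's notation rather than anything defined on this page:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

With $d_{model} = 8$ the four frequencies are $1$, $0.1$, $0.01$, and $0.001$, so at position 1 dimension 2 is $\sin(0.1) \approx 0.0998$, which matches the printed output.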

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) array of sinusoidal encodings; d_model must be even."""
    positions = np.arange(max_len)[:, None]  # column of positions, shape (max_len, 1)
    # One frequency per sin/cos pair: 10000^(-2i / d_model)
    div_terms = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    encodings = np.zeros((max_len, d_model))
    encodings[:, 0::2] = np.sin(positions * div_terms)  # even dimensions
    encodings[:, 1::2] = np.cos(positions * div_terms)  # odd dimensions
    return encodings

enc = sinusoidal_encoding(6, 8)
print('First 6 position encodings (dim=8):')
print(enc)

print()

print('Differences between position 1 and 2:')
print(enc[2] - enc[1])
```

```text
First 6 position encodings (dim=8):
[[ 0.00000000e+00  1.00000000e+00  0.00000000e+00  1.00000000e+00
   0.00000000e+00  1.00000000e+00  0.00000000e+00  1.00000000e+00]
 [ 8.41470985e-01  5.40302306e-01  9.98334166e-02  9.95004165e-01
   9.99983333e-03  9.99950000e-01  9.99999833e-04  9.99999500e-01]
 [ 9.09297427e-01 -4.16146837e-01  1.98669331e-01  9.80066578e-01
   1.99986667e-02  9.99800007e-01  1.99999867e-03  9.99998000e-01]
 [ 1.41120008e-01 -9.89992497e-01  2.95520207e-01  9.55336489e-01
   2.99955002e-02  9.99550034e-01  2.99999550e-03  9.99995500e-01]
 [-7.56802495e-01 -6.53643621e-01  3.89418342e-01  9.21060994e-01
   3.99893342e-02  9.99200107e-01  3.99998933e-03  9.99992000e-01]
 [-9.58924275e-01  2.83662185e-01  4.79425539e-01  8.77582562e-01
   4.99791693e-02  9.98750260e-01  4.99997917e-03  9.99987500e-01]]

Differences between position 1 and 2:
[ 6.78264420e-02 -9.56449142e-01  9.88359141e-02 -1.49375874e-02
  9.99883336e-03 -1.49993750e-04  9.99998833e-04 -1.49999938e-06]
```
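
The raw difference between two encodings varies from dimension to dimension, but a fixed offset does have a simple structure: each sine/cosine pair is rotated by an angle that depends only on the offset, not on the starting position. The sketch below checks this numerically. It repeats the demo function so the block runs on its own, and `shift_matrix` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # Same construction as the demo, repeated so this block is self-contained.
    positions = np.arange(max_len)[:, None]
    div_terms = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    encodings = np.zeros((max_len, d_model))
    encodings[:, 0::2] = np.sin(positions * div_terms)
    encodings[:, 1::2] = np.cos(positions * div_terms)
    return encodings

def shift_matrix(k, d_model):
    """Block-diagonal matrix M such that enc[pos] @ M equals enc[pos + k]."""
    freqs = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for one (sin, cos) pair, as a 2x2 rotation.
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return m

enc = sinusoidal_encoding(10, 8)
shift = shift_matrix(3, 8)

# One matrix maps every position to the position three steps later.
same_map = np.allclose(enc[:-3] @ shift, enc[3:])

# Dot products between encodings depend only on the offset between them.
same_similarity = np.allclose(enc[0] @ enc[3], enc[5] @ enc[8])

print(same_map, same_similarity)
```

Because the matrix does not depend on the starting position, a model can in principle learn to attend by relative offset using a linear map of these encodings.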

## Try it
- Modify the demo: change `max_len` or `d_model` and check which dimensions vary fastest with position.
- Add a tiny dataset or counter-example.


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*