https://huggingface.co/blog/designing-positional-encoding

import torch
from torch import nn
from transformers import AutoTokenizer

model_id = "openai-community/gpt2"
tok = AutoTokenizer.from_pretrained(model_id)

text = "The dog chased another dog"
tokens = tok(text, return_tensors="pt")["input_ids"]
print(tokens)

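print(tok.vocab_size)
# Toy embedding table (randomly initialized, dim 32); no positional information is added.
embedding = nn.Embedding(tok.vocab_size, 32)

emb = embedding(tokens)
# Project the embeddings to q, k, v and run a single multi-head self-attention layer.
qkv_linear = nn.Linear(32, 32 * 3, bias=False)
q, k, v = torch.tensor_split(qkv_linear(emb), 3, dim=-1)
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
out = mha(q, k, v, need_weights=False)[0]

# Token id 3290 appears at index 1 and index 4.
# Without positional information, both occurrences get the same attention output.
dog1 = out[0, 1]
dog2 = out[0, 4]
print(dog1)
print(dog2)
print(torch.allclose(dog1, dog2))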

tensor([[  464,  3290, 26172,  1194,  3290]])
50257
tensor([ 0.0248,  0.0074, -0.0206, -0.0999, -0.0039, -0.0763, -0.1299, -0.1456,
        -0.0880,  0.0881,  0.0426,  0.0459, -0.0709,  0.0687,  0.0449,  0.0836,
        -0.1523,  0.0542,  0.0756, -0.0649, -0.0209, -0.0451, -0.0571,  0.1215,
        -0.0255, -0.0319, -0.0295, -0.0034,  0.0197, -0.1797, -0.1374,  0.0840],
       grad_fn=<SelectBackward0>)
tensor([ 0.0248,  0.0074, -0.0206, -0.0999, -0.0039, -0.0763, -0.1299, -0.1456,
        -0.0880,  0.0881,  0.0426,  0.0459, -0.0709,  0.0687,  0.0449,  0.0836,
        -0.1523,  0.0542,  0.0756, -0.0649, -0.0209, -0.0451, -0.0571,  0.1215,
        -0.0255, -0.0319, -0.0295, -0.0034,  0.0197, -0.1797, -0.1374,  0.0840],
       grad_fn=<SelectBackward0>)
True


# Properties:
1. Unique encoding for each position, independent of sequence length: position 5 gets the same encoding whether the sequence has 10 tokens or 100.
2. Linear relation between encoded positions: the encoding of p + k should follow from the encoding of p by a simple transform that depends only on k (on a number line, going from 2 to 5 is always +3).
3. Generalizes to sequences longer than those seen in training.
4. Generated by a deterministic process that the model can learn.
5. Extensible to multiple dimensions (e.g. images).

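# Integer encoding:
* add the raw position index (1, 2, 3, ...) to the token embedding
* very low signal-to-noise ratio: position values keep growing, while embedding values cluster around 0, so the position swamps the token signal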
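
To make the scale problem concrete, here is a rough sketch. It assumes embedding components drawn from N(0, 0.02) (an assumed, typical small init scale, not a value from these notes) and that the integer position is added to every component; `integer_encode` and `norm` are hypothetical helper names.

```python
import math
import random

random.seed(0)
d = 32
# Assumption: embedding components ~ N(0, 0.02), i.e. clustered around 0.
token_emb = [random.gauss(0.0, 0.02) for _ in range(d)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def integer_encode(vec, pos):
    # Assumption: the raw integer position is added to every component.
    return [x + pos for x in vec]

for pos in (1, 10, 100):
    encoded = integer_encode(token_emb, pos)
    # Share of the encoded vector's norm that the token embedding could account for.
    ratio = norm(token_emb) / norm(encoded)
    print(f"pos={pos:>3}  token norm / encoded norm = {ratio:.4f}")
```

The ratio shrinks as the position grows, which is the low-SNR issue noted above.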

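# Binary position encoding:
* convert the integer position to binary and add it to the token embedding
* periodic: the least significant bit flips between 0 and 1 at every token, while the most significant bit cycles at a much slower rate
* cons: the encoding is a jumpy, discrete function of position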
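
A minimal sketch of the binary scheme, assuming bit i of the position (least significant first) becomes component i of the encoding; `binary_position_encoding` is a hypothetical helper name.

```python
def binary_position_encoding(pos, d):
    # Bit i of pos (least significant bit first) becomes component i.
    return [(pos >> i) & 1 for i in range(d)]

d = 4
for pos in range(8):
    print(pos, binary_position_encoding(pos, d))

# The least significant bit flips at every step; higher bits flip more slowly.
lsb = [binary_position_encoding(p, d)[0] for p in range(8)]
bit2 = [binary_position_encoding(p, d)[2] for p in range(8)]

# "Jumpy": moving one step from 3 to 4 changes three components at once.
jump = sum(a != b for a, b in zip(binary_position_encoding(3, d),
                                  binary_position_encoding(4, d)))
print(lsb, bit2, jump)
```

Each bit is a square wave with its own period, which is what motivates replacing it with a smooth periodic function.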

# Sinusoidal position encoding:
* $PE(\text{pos}, 2i) = \sin(\frac{\text{pos}}{10000^{2i/d}})$
* $PE(\text{pos}, 2i+1) = \cos(\frac{\text{pos}}{10000^{2i/d}})$
* even components use sin, odd components use cos
* pos is the token position index, i indexes the (sin, cos) pair (components 2i and 2i+1 share one frequency), d is the model dimension, and 10000 is the base $\theta$ (base wavelength)
* for a fixed pos: the wavelength increases with i (larger denominator, so the period grows and the frequency drops)
* relative position is a rotation: PE(pos + k) is a linear transform of PE(pos) by a rotation matrix that depends only on k
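
The rotation claim can be checked numerically. This is a sketch rather than a reference implementation: `sinusoidal_pe` and `shift_matrix` are hypothetical helper names, and d = 8 with an offset of k = 5 are arbitrary choices.

```python
import math
import torch

def sinusoidal_pe(num_positions, d, base=10000.0):
    # One frequency per (sin, cos) pair: w_i = 1 / base^(2i/d), i = 0..d/2-1.
    i = torch.arange(d // 2, dtype=torch.float64)
    freqs = 1.0 / torch.pow(base, 2 * i / d)
    positions = torch.arange(num_positions, dtype=torch.float64)
    angles = positions[:, None] * freqs[None, :]
    pe = torch.zeros(num_positions, d, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(angles)  # even components
    pe[:, 1::2] = torch.cos(angles)  # odd components
    return pe, freqs

def shift_matrix(k, freqs):
    # Block-diagonal matrix mapping PE(pos) to PE(pos + k) for every pos.
    blocks = []
    for w in freqs.tolist():
        c, s = math.cos(w * k), math.sin(w * k)
        blocks.append(torch.tensor([[c, s], [-s, c]], dtype=torch.float64))
    return torch.block_diag(*blocks)

pe, freqs = sinusoidal_pe(num_positions=64, d=8)
M = shift_matrix(k=5, freqs=freqs)
# The same matrix M shifts every position by 5.
print(torch.allclose(pe[:-5] @ M.T, pe[5:]))
```

`M` depends only on the offset k, not on pos, which is the linear-relation property from the list at the top.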


Cons:
* generates a separate positional embedding vector and adds it to the token embedding
* relative position is encoded as a rotation, but adding PE to TE still stores the absolute position, which is not really necessary
* $QK^T$ computes affinities through the dot product $a \cdot b = |a| |b| \cos(\theta)$, where $\theta$ here is the angle between $a$ and $b$: rotate the vectors (change the angle) instead of adding to them (which changes the norm)

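# Rotary Positional Encoding:
* rotates each 2D pair of components in q and k by a position-dependent angle
* same frequency schedule as sinusoidal: earlier pairs rotate at higher frequency, later pairs at lower frequency
* relative because the q, k dot product depends only on the angle between the rotated vectors, i.e. on the position offset
* same position -> same rotation -> dot product unchanged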
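
A tiny check of the last two bullets on a single 2D pair. The frequency `w = 0.1` and the q, k values are made up for illustration, and `rotate` and `dot` are hypothetical helpers.

```python
import math

def rotate(vec2, angle):
    # Counter-clockwise rotation of a 2D vector.
    x, y = vec2
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

w = 0.1  # hypothetical frequency for this one pair
q, k = (1.0, 2.0), (0.5, -1.5)

# Same offset (3) at different absolute positions gives the same score.
score_a = dot(rotate(q, w * 2), rotate(k, w * 5))
score_b = dot(rotate(q, w * 40), rotate(k, w * 43))
# Same position for q and k: the rotations cancel, leaving the raw dot product.
score_same = dot(rotate(q, w * 7), rotate(k, w * 7))
print(score_a, score_b, score_same, dot(q, k))
```

Only the offset between the two positions survives in the score; the absolute positions drop out.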


$ R(q, p) = 
\begin{pmatrix}
M_1 &    &        &     \\
    & M_2 &       &     \\
    &    & \ddots &     \\
    &    &        & M_{d/2} \\
\end{pmatrix}
\begin{pmatrix}q_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix}
$

$ M_i = 
\begin{pmatrix}
\cos(w_i p) & -\sin(w_i p) \\
\sin(w_i p) & \cos(w_i p) \\
\end{pmatrix}$

$w_i = \frac{1}{10000^{2(i-1)/d}}, \quad i = 1, \dots, d/2$

$ R(q, p) = 
\begin{pmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \vdots \\ q_{d-1} \\ q_d \end{pmatrix}
\odot
\begin{pmatrix} \cos(w_1 p) \\ \cos(w_1 p) \\ \cos(w_2 p) \\ \cos(w_2 p) \\ \vdots \\ \cos(w_{d/2} p) \\ \cos(w_{d/2} p) \end{pmatrix}
+ 
\begin{pmatrix} -q_2 \\ q_1 \\ -q_4 \\ q_3 \\ \vdots \\ -q_d \\ q_{d-1} \end{pmatrix}
\odot
\begin{pmatrix} \sin(w_1 p) \\ \sin(w_1 p) \\ \sin(w_2 p) \\ \sin(w_2 p) \\ \vdots \\ \sin(w_{d/2} p) \\ \sin(w_{d/2} p) \end{pmatrix}
$
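
The element-wise form above avoids building the sparse block-diagonal matrix. Below is a sketch under stated assumptions: `apply_rope`, `rotate_pairs`, `rope_frequencies`, and `rope_matrix` are hypothetical names, the frequency index is zero-based (so the first pair has frequency 1), and the interleaved pair layout follows the equations here rather than any particular library's layout.

```python
import math
import torch

def rope_frequencies(d, base=10000.0):
    # One frequency per 2D pair, zero-based: 1 / base^(2i/d), i = 0..d/2-1.
    i = torch.arange(d // 2, dtype=torch.float64)
    return 1.0 / torch.pow(base, 2 * i / d)

def rotate_pairs(x):
    # (x1, x2, x3, x4, ...) -> (-x2, x1, -x4, x3, ...)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, d), positions: (seq_len,)
    freqs = rope_frequencies(x.shape[-1], base)
    angles = positions.to(torch.float64)[:, None] * freqs[None, :]
    cos = torch.repeat_interleave(torch.cos(angles), 2, dim=-1)
    sin = torch.repeat_interleave(torch.sin(angles), 2, dim=-1)
    return x * cos + rotate_pairs(x) * sin

def rope_matrix(p, d, base=10000.0):
    # Explicit block-diagonal rotation matrix for position p, for comparison only.
    blocks = []
    for w in rope_frequencies(d, base).tolist():
        c, s = math.cos(w * p), math.sin(w * p)
        blocks.append(torch.tensor([[c, -s], [s, c]], dtype=torch.float64))
    return torch.block_diag(*blocks)

torch.manual_seed(0)
d = 8
q = torch.randn(6, d, dtype=torch.float64)
k = torch.randn(6, d, dtype=torch.float64)
pos = torch.arange(6)

q_rot, k_rot = apply_rope(q, pos), apply_rope(k, pos)
# The element-wise form agrees with the matrix form.
matches_matrix = torch.allclose(q_rot[3], rope_matrix(3, d) @ q[3])

# Shifting every position by the same amount leaves all q-k scores unchanged.
scores = q_rot @ k_rot.T
scores_shifted = apply_rope(q, pos + 100) @ apply_rope(k, pos + 100).T
shift_invariant = torch.allclose(scores, scores_shifted)

# The same vector at positions 1 and 4 (like the two "dog" tokens) is no longer identical.
dog = torch.randn(d, dtype=torch.float64)
dog_rot = apply_rope(dog.expand(2, d), torch.tensor([1, 4]))
dogs_identical = torch.allclose(dog_rot[0], dog_rot[1])
print(matches_matrix, shift_invariant, dogs_identical)
```

In an attention layer the rotation would be applied to q and k per head, before the $QK^T$ product; v is left unrotated.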

RoFormer: https://arxiv.org/pdf/2104.09864  
EleutherAI post: https://blog.eleuther.ai/rotary-embeddings/