**Appendix D – Relative Positional Encoding**

_This notebook contains all the sample code in appendix D._

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-mlp/blob/main/Appendix_D_relative_positional_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-mlp/blob/main/Appendix_D_relative_positional_encoding.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Setup

This project requires Python 3.10 or above:

In [1]:
import sys

assert sys.version_info >= (3, 10)

And PyTorch ≥ 2.6.0:

In [2]:
from packaging.version import Version
import torch

assert Version(torch.__version__) >= Version("2.6.0")

# Bias RPE

In [3]:
import torch
import torch.nn as nn

In [4]:
class MultiheadAttentionWithBiasRPE(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1, r_max=128):
        super().__init__()
        self.h = num_heads
        self.d = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        ###### ADDED
        self.biases = nn.Parameter(torch.zeros(num_heads, 2 * r_max - 1))
        ######

    ###### ADDED
    def gather_biases(self, Lq, Lk):
        h, n_biases = self.biases.shape  # [h, 2 * r_max - 1]
        r_max = (n_biases + 1) // 2
        pos_q = torch.arange(Lq, device=self.biases.device)  # [0, ..., Lq - 1]
        pos_k = torch.arange(Lk, device=self.biases.device)  # [0, ..., Lk - 1]
        rel_pos = pos_q[:, None] - pos_k[None, :]  # [Lq, Lk] (contains i - j)
        rel_pos = rel_pos.clamp(-r_max + 1, r_max - 1) + r_max - 1  # [Lq, Lk]
        return self.biases[:, rel_pos]  # [h, Lq, Lk]
    ######

    def split_heads(self, X):
        return X.view(X.size(0), X.size(1), self.h, self.d).transpose(1, 2)

    def forward(self, query, key, value):
        q = self.split_heads(self.q_proj(query))  # (B, h, Lq, d)
        k = self.split_heads(self.k_proj(key))  # (B, h, Lk, d)
        v = self.split_heads(self.v_proj(value))  # (B, h, Lv, d) with Lv=Lk
        scores = q @ k.transpose(2, 3) / self.d**0.5  # (B, h, Lq, Lk)

        ###### ADDED
        b = self.gather_biases(query.size(-2), key.size(-2))
        scores = scores + b
        ######

        weights = self.dropout(scores.softmax(dim=-1))  # (B, h, Lq, Lk)
        Z = weights @ v  # (B, h, Lq, d)
        Z = Z.transpose(1, 2)  # (B, Lq, h, d)
        Z = Z.reshape(Z.size(0), Z.size(1), self.h * self.d)  # (B, Lq, h × d)
        return (self.out_proj(Z), weights)  # (B, Lq, h × d)

In [5]:
torch.manual_seed(42)
batch_size = 32
d_model = 512
Lq = 700
Lk = Lv = 800

Q = torch.randn(batch_size, Lq, d_model)
K = torch.randn(batch_size, Lk, d_model)
V = torch.randn(batch_size, Lv, d_model)

mha = MultiheadAttentionWithBiasRPE(embed_dim=d_model, num_heads=8, r_max=512)
output, weights = mha(Q, K, V)
output.shape

torch.Size([32, 700, 512])
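
To make the indexing in `gather_biases()` easier to follow, here is a minimal standalone sketch of the clamp-and-shift step on a tiny example. The sizes (`Lq=3`, `Lk=5`, `r_max=2`) and the toy `biases` table are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch

Lq, Lk, r_max = 3, 5, 2  # illustrative sizes only
pos_q = torch.arange(Lq)  # [0, 1, 2]
pos_k = torch.arange(Lk)  # [0, 1, 2, 3, 4]
rel_pos = pos_q[:, None] - pos_k[None, :]  # i - j, shape [Lq, Lk]

# Clamp to [-(r_max - 1), r_max - 1], then shift so indices start at 0
idx = rel_pos.clamp(-r_max + 1, r_max - 1) + r_max - 1  # values in [0, 2 * r_max - 2]
print(idx)

# Toy bias table with 2 heads and 2 * r_max - 1 = 3 entries per head
biases = torch.arange(2 * (2 * r_max - 1)).float().view(2, -1)
gathered = biases[:, idx]  # [h, Lq, Lk], one bias per (head, query, key)
print(gathered.shape)
```

With these sizes the index matrix is `[[1, 0, 0, 0, 0], [2, 1, 0, 0, 0], [2, 2, 1, 0, 0]]`: every key more than `r_max - 1` positions away from the query shares the same clamped index, and therefore the same learned bias.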

# RoPE

This function precomputes the cosines and sines for RoPE. It returns a tuple containing the cosine table and the sine table (in that order), each of shape `[max_len, d]`. This way, we can easily look up the cosine or sine of the angle for the _k_<sup>th</sup> component of the token at a given position.

In [6]:
def precompute_rope_cos_sin(d, max_len, base=10_000):
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # θₖ, shape: [d // 2]
    positions = torch.arange(max_len)  # p, shape: [max_len]
    freqs = torch.outer(positions, theta)  # p * θₖ, shape: [max_len, d // 2]
    freqs_twice = freqs.repeat_interleave(2, dim=-1)  # shape: [max_len, d]
    return freqs_twice.cos(), freqs_twice.sin()  # shape: both [max_len, d]

d, max_len = 64, 4_096
cos_theta, sin_theta = precompute_rope_cos_sin(d, max_len)
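
As a quick illustration of how such tables can be read, the sketch below rebuilds them inline with small, arbitrary sizes (`d=8`, `max_len=16`) so that it runs on its own. The names `cos_table` and `sin_table` and the lookup indices `p` and `k` are hypothetical and chosen only for this example.

```python
import torch

d, max_len, base = 8, 16, 10_000  # small illustrative sizes
theta = base ** (-torch.arange(0, d, 2).float() / d)  # one angle per 2D subspace
angles = torch.outer(torch.arange(max_len).float(), theta)  # [max_len, d // 2]
angles = angles.repeat_interleave(2, dim=-1)  # [max_len, d]
cos_table, sin_table = angles.cos(), angles.sin()

p, k = 5, 2  # hypothetical position and component index
print(cos_table.shape)
print(cos_table[p, k], cos_table[p, k + 1])  # components 2 and 3 form one pair
print(cos_table[0], sin_table[0])  # position 0 has angle 0 everywhere
```

Components `2` and `3` belong to the same 2D subspace, so they share one angle (here `5 * 0.1 = 0.5` radians), and position 0 is never rotated since all its angles are zero.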

If you take a 2D vector (_x_, _y_) and rotate it counterclockwise by an angle _θ_, the result is (_x_ cos(_θ_) – _y_ sin(_θ_), _x_ sin(_θ_) + _y_ cos(_θ_)).
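
For instance, here is a minimal sketch of this formula in plain Python. The helper name `rotate_2d` is hypothetical and only serves this example.

```python
import math

def rotate_2d(x, y, theta):
    """Rotate the point (x, y) counterclockwise by theta radians."""
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Rotating (1, 0) by a quarter turn should land (up to rounding) on (0, 1)
x_rot, y_rot = rotate_2d(1.0, 0.0, math.pi / 2)
print(round(x_rot, 6), round(y_rot, 6))
```

A rotation never changes the length of the vector, which is why RoPE preserves the norm of each token vector.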

Let's see how we can apply the multiple 2D rotations required by RoPE in one shot. The token `t` is interpreted as a sequence of 2D coordinates, one pair per subspace:

```
t = [x0, y0, x1, y1, x2, y2, ...]
```

First, we will swap each pair of coordinates and multiply the resulting horizontal coordinates by –1:

```
t' = [-y0, x0, -y1, x1, -y2, x2, ...]
```

Then, using the cosines and sines we precomputed for all rotation angles (which depend on the subspace and the position), we can compute RoPE like this:

```
t_rope = t * precomputed_cos + t' * precomputed_sin
```

Indeed, this will give us:

```
t_rope = [x0 * cos(θ0) - y0 * sin(θ0), y0 * cos(θ0) + x0 * sin(θ0),
          x1 * cos(θ1) - y1 * sin(θ1), y1 * cos(θ1) + x1 * sin(θ1), ...]
```

That's the result we want: each 2D subspace is rotated by the desired angle.

In [7]:
def rope_rotation(t, cos_theta, sin_theta):
    t_grouped = t.reshape(*t.shape[:-1], -1, 2)  # group pairs of dims
    t_swapped = t_grouped[..., [1, 0]]  # swap 2D axes: (x, y) -> (y, x)
    t_swapped[..., 0] *= -1  # for each pair, (y, x) -> (-y, x)
    t_rotated_half = t_swapped.flatten(start_dim=-2)  # [-y0, x0, -y1, x1,...]
    L = t.size(-2)
    return t * cos_theta[:L] + t_rotated_half * sin_theta[:L]  # same shape as t

In [8]:
torch.manual_seed(42)
batch_size, n_heads, Lq, Lk, d = 32, 8, 800, 800, 64
Q = torch.randn(batch_size, n_heads, Lq, d)
K = torch.randn(batch_size, n_heads, Lk, d)
Q_rope = rope_rotation(Q, cos_theta, sin_theta)
K_rope = rope_rotation(K, cos_theta, sin_theta)
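
As a sanity check of the swap-and-negate trick described above, the following standalone sketch compares it with an explicit 2D rotation of each pair of components. It uses tiny, arbitrary sizes (`d=4`, `L=3`) and recomputes the angles inline so that it runs on its own. It is an independent cross-check under these assumptions, not a replacement for `rope_rotation()`.

```python
import torch

torch.manual_seed(0)
d, L, base = 4, 3, 10_000  # tiny illustrative sizes
t = torch.randn(L, d)  # one token vector per position

theta = base ** (-torch.arange(0, d, 2).float() / d)  # [d // 2]
angles = torch.outer(torch.arange(L).float(), theta)  # [L, d // 2]

# (1) Swap-and-negate trick: t * cos + t' * sin
cos = angles.repeat_interleave(2, dim=-1).cos()  # [L, d]
sin = angles.repeat_interleave(2, dim=-1).sin()  # [L, d]
x, y = t[:, 0::2], t[:, 1::2]  # even and odd components, each [L, d // 2]
t_prime = torch.stack([-y, x], dim=-1).flatten(start_dim=-2)  # [-y0, x0, -y1, x1]
via_trick = t * cos + t_prime * sin

# (2) Explicit 2D rotation of each (x, y) pair
x_rot = x * angles.cos() - y * angles.sin()
y_rot = x * angles.sin() + y * angles.cos()
explicit = torch.stack([x_rot, y_rot], dim=-1).flatten(start_dim=-2)

print(torch.allclose(via_trick, explicit, atol=1e-6))  # do both approaches agree?
print(torch.allclose(via_trick.norm(dim=-1), t.norm(dim=-1), atol=1e-5))  # norms kept?
```

Both checks should print `True`: the one-shot formula performs the same per-subspace rotations, and rotations leave vector norms unchanged.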

In [9]:
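# RoPE using Hugging Face's Llama implementation (requires the transformers
# library). Note: it pairs dimension i with dimension i + d/2 (the "rotate half"
# layout) rather than adjacent dimensions, so its outputs are not numerically
# identical to those of rope_rotation() above.
import transformers.models.llama.modeling_llama as mll

config = mll.LlamaConfig(hidden_size=n_heads * d, num_attention_heads=n_heads,
                         max_position_embeddings=max_len)
rotary_emb = mll.LlamaRotaryEmbedding(config)
position_ids = torch.arange(Lq, device=Q.device).unsqueeze(0)
# This overwrites the cos_theta and sin_theta tables computed earlier
cos_theta, sin_theta = rotary_emb(Q, position_ids)  # Q needed for device/dtype
Q_rope, K_rope = mll.apply_rotary_pos_emb(Q, K, cos_theta, sin_theta)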

# ALiBi

This function generates the ALiBi bias matrix for a given number of heads and sequence length. The output tensor has a shape of `[num_heads, seq_len, seq_len]`.

In [10]:
def get_alibi_biases(n_heads, seq_len):
    head = torch.arange(1, n_heads + 1, dtype=torch.float32)  # shape: [n_heads]
    slopes = torch.pow(2, -8 * head / n_heads)  # [n_heads]
    pos = torch.arange(seq_len)  # [seq_len]
    distance_matrix = -(pos[:, None] - pos[None, :]).abs()  # [seq_len, seq_len]
    return slopes.view(-1, 1, 1) * distance_matrix

In [11]:
biases = get_alibi_biases(n_heads=8, seq_len=4)
biases[0]

tensor([[ 0.0000, -0.5000, -1.0000, -1.5000],
        [-0.5000,  0.0000, -0.5000, -1.0000],
        [-1.0000, -0.5000,  0.0000, -0.5000],
        [-1.5000, -1.0000, -0.5000,  0.0000]])
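
To show where these biases could be used, here is a minimal sketch that adds them to scaled dot-product attention scores before the softmax. The function is copied from above so that the snippet runs on its own. The tensor sizes and the names `q`, `k`, and `bias_only` are illustrative assumptions.

```python
import torch

def get_alibi_biases(n_heads, seq_len):  # same as the function above
    head = torch.arange(1, n_heads + 1, dtype=torch.float32)
    slopes = torch.pow(2, -8 * head / n_heads)
    pos = torch.arange(seq_len)
    distance_matrix = -(pos[:, None] - pos[None, :]).abs()
    return slopes.view(-1, 1, 1) * distance_matrix

torch.manual_seed(42)
batch_size, n_heads, seq_len, d = 2, 8, 4, 16  # illustrative sizes
q = torch.randn(batch_size, n_heads, seq_len, d)
k = torch.randn(batch_size, n_heads, seq_len, d)

scores = q @ k.transpose(2, 3) / d**0.5  # [batch, heads, seq_len, seq_len]
scores = scores + get_alibi_biases(n_heads, seq_len)  # broadcasts over the batch
weights = scores.softmax(dim=-1)
print(weights.shape)

# Effect of the biases alone: the attention weights if all scores were zero
bias_only = get_alibi_biases(n_heads, seq_len).softmax(dim=-1)
print(bias_only[0, 0])  # first head, first query position
```

With zero content scores, the weights for the first query position decrease steadily with distance, which is the locality preference that ALiBi adds on top of the content-based scores.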