# Rotary Positional Embeddings (RoPE)

```{note}
Position encoding enables valuable supervision for dependency modeling between elements at different positions of the sequence.
Rotary Position Embedding (RoPE) is a method for injecting this positional information into self-attention by rotating the query and key vectors.
```

## Background

### Preliminary

Let $\{w_{i}\}_{i=1}^{N}$ be a sequence of $N$ input tokens, and let $\{\mathbf{x}_{i}\}_{i=1}^{N}$ denote the corresponding word embeddings, where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ is the $d$-dimensional word embedding vector of token $w_{i}$ without position information. Self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations:

$$
\begin{aligned}
\mathbf{q}_{m} &= f_{q}(\mathbf{x}_{m}, m)\\
\mathbf{k}_{n} &= f_{k}(\mathbf{x}_{n}, n)\\
\mathbf{v}_{n} &= f_{v}(\mathbf{x}_{n}, n)
\end{aligned}
$$

The queries and keys are then used to compute the attention weights, and the output is the weighted sum of the value representations.

$$
\begin{aligned}
a_{m,n} &= \frac{\exp\left(\frac{\mathbf{q}_{m}^{T}\mathbf{k}_{n}}{\sqrt{d}}\right)}{\sum_{j=1}^{N}\exp\left(\frac{\mathbf{q}_{m}^{T}\mathbf{k}_{j}}{\sqrt{d}}\right)}\\
\mathbf{o}_{m} &= \sum_{n=1}^{N}a_{m,n}\mathbf{v}_{n}
\end{aligned}
$$
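The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `attention` and the random inputs are assumptions made for the example, and the row-wise maximum is subtracted only for numerical stability (it does not change the softmax).

```python
import numpy as np

def attention(q, k, v):
    """q, k, v: arrays of shape (N, d). Returns (weights, outputs)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # scores[m, n] = q_m^T k_n / sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability only
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # a_{m,n}
    outputs = weights @ v                            # o_m = sum_n a_{m,n} v_n
    return weights, outputs

# Hypothetical toy inputs: N = 5 tokens, d = 8 dimensions.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 5, 8))
a, o = attention(q, k, v)
```

Each row of `a` sums to one, so every output $\mathbf{o}_{m}$ is a convex combination of the value vectors.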

### Absolute position embedding

A typical choice of $f_{t:t\in\{q,k,v\}}$ is

$$f_{t:t\in\{q,k,v\}}(\mathbf{x}_{i}, i) := \mathbf{W}_{t:t\in\{q,k,v\}}(\mathbf{x}_{i} + \mathbf{p}_{i})$$

where $\mathbf{p}_{i}\in\mathbb{R}^{d}$ is a $d$-dimensional vector that depends on the position $i$. The original Transformer architecture generates $\mathbf{p}_{i}$ with sinusoidal functions:

$$
\begin{aligned}
\mathbf{p}_{i, 2t} &= \sin(i/10000^{2t/d}) \\
\mathbf{p}_{i, 2t+1} &= \cos(i/10000^{2t/d})
\end{aligned}
$$

where $i$ is the position and $2t,2t+1$ are dimensions.

* low dimension: $t$ small $\to$ $10000^{2t/d}$ small $\to$ high frequency.
* high dimension: $t$ large $\to$ $10000^{2t/d}$ large $\to$ low frequency.
* 10000 is the base: increasing it lengthens the longest wavelength, which allows for processing much longer sequences.
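The sinusoidal table defined above can be generated as follows. This is a small sketch under stated assumptions: the function name `sinusoidal_embedding` and its `base` parameter are illustrative, and positions are counted from 0 here.

```python
import numpy as np

def sinusoidal_embedding(num_positions, d, base=10000.0):
    """Return an array p of shape (num_positions, d) with
    p[i, 2t] = sin(i / base^(2t/d)) and p[i, 2t+1] = cos(i / base^(2t/d))."""
    assert d % 2 == 0, "d must be even"
    i = np.arange(num_positions)[:, None]   # positions, shape (N, 1)
    t = np.arange(d // 2)[None, :]          # pair index, shape (1, d/2)
    angles = i / base ** (2 * t / d)        # shape (N, d/2)
    p = np.empty((num_positions, d))
    p[:, 0::2] = np.sin(angles)             # even dimensions
    p[:, 1::2] = np.cos(angles)             # odd dimensions
    return p

p = sinusoidal_embedding(num_positions=4, d=8)
```

The pair with $t=0$ has angle $i$ (the highest frequency), while the last pair has the largest divisor $10000^{2t/d}$ and therefore changes most slowly with position.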

In the next section, we show that our proposed RoPE
is related to this intuition from the sinusoidal function perspective. However, instead of directly adding the position
to the context representation, RoPE proposes to incorporate the relative position information by multiplying with the
sinusoidal functions.

## Proposed approach

In this section, we discuss the proposed rotary position embedding (RoPE).

### Formulation

Transformer-based language modeling usually leverages the position information of individual tokens through a self-attention
mechanism. The inner product $\mathbf{q}_{m}^{T}\mathbf{k}_{n}$ typically enables knowledge transfer between
tokens at different positions. In order to incorporate relative position information, we require the inner product of query
$\mathbf{q}_{m}$ and key $\mathbf{k}_{n}$ to be given by a function $g$ that takes only the word embeddings $\mathbf{x}_{m}$, $\mathbf{x}_{n}$, and their relative position $m-n$ as input variables:

$$\langle f_{q}(\mathbf{x}_{m}, m),f_{k}(\mathbf{x}_{n}, n) \rangle = g(\mathbf{x}_{m}, \mathbf{x}_{n}, m-n)$$

### 2D case

We begin with the simple case $d=2$. Using the geometric properties of vectors in the 2D plane (equivalently, their complex form), one can show that a solution to the equation above can be built from the 2D rotation matrix $\mathbf{R}_{m,\theta}$, which satisfies the two identities below:

$$
\mathbf{R}_{m,\theta} = \begin{pmatrix}
  \cos m\theta & -\sin m\theta \\
  \sin m\theta & \cos m\theta
\end{pmatrix}
$$

$$
\mathbf{R}_{m,\theta}^{T} = \mathbf{R}_{-m,\theta}
$$

$$
\mathbf{R}_{m,\theta}\mathbf{R}_{n,\theta} = \mathbf{R}_{m+n,\theta}
$$

The solution is then:

$$
\begin{aligned}
f_{q}(\mathbf{x}_{m}, m) &= \mathbf{R}_{m,\theta}\mathbf{W}_{q}\mathbf{x}_{m} \\
f_{k}(\mathbf{x}_{n}, n) &= \mathbf{R}_{n,\theta}\mathbf{W}_{k}\mathbf{x}_{n} \\
g(\mathbf{x}_{m}, \mathbf{x}_{n}, m-n) &= (\mathbf{W}_{q}\mathbf{x}_{m})^{T}\mathbf{R}_{n-m,\theta}(\mathbf{W}_{k}\mathbf{x}_{n})
\end{aligned}
$$

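where $\theta\in\mathbb{R}$ is a non-zero constant. Identifying each 2D vector with a complex number, the same $g$ can be written as $\mbox{Re}[(\mathbf{W}_{q}\mathbf{x}_{m})(\mathbf{W}_{k}\mathbf{x}_{n})^{\ast}e^{i(m-n)\theta}]$, where $\mbox{Re}[\cdot]$ is the real part of a complex number and $(\mathbf{W}_{k}\mathbf{x}_{n})^{\ast}$ is the complex conjugate of $(\mathbf{W}_{k}\mathbf{x}_{n})$. We can further write $f_{q}$ as a matrix product: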

$$f_{q}(\mathbf{x}_{m}, m) = 
\begin{pmatrix}
  \cos m\theta & -\sin m\theta \\
  \sin m\theta & \cos m\theta
\end{pmatrix}
\begin{pmatrix}
  W_{q}^{(11)} & W_{q}^{(12)} \\
  W_{q}^{(21)} & W_{q}^{(22)}
\end{pmatrix}
\begin{pmatrix}
  x_{m}^{(1)} \\
  x_{m}^{(2)}
\end{pmatrix}
$$
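The 2D solution can be checked numerically. Since $\mathbf{R}_{m,\theta}^{T}\mathbf{R}_{n,\theta} = \mathbf{R}_{n-m,\theta}$, the score $f_{q}^{T}f_{k}$ should depend on $m$ and $n$ only through $m-n$. The sketch below uses hypothetical random weights and embeddings and an arbitrary $\theta = 0.3$; the helper names `rot`, `score`, and `g` are chosen for this example only.

```python
import numpy as np

def rot(m, theta):
    """2D rotation matrix R_{m,theta}."""
    a = m * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

# Hypothetical random projections and embeddings for d = 2.
rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(2, 2, 2))
x_m, x_n = rng.normal(size=(2, 2))
theta = 0.3  # any non-zero constant

def score(m, n):
    """Inner product <f_q(x_m, m), f_k(x_n, n)>."""
    q = rot(m, theta) @ W_q @ x_m
    k = rot(n, theta) @ W_k @ x_n
    return q @ k

def g(rel):
    """g(x_m, x_n, m - n) = (W_q x_m)^T R_{n-m,theta} (W_k x_n), with rel = m - n."""
    return (W_q @ x_m) @ rot(-rel, theta) @ (W_k @ x_n)
```

For instance, `score(5, 2)` and `score(13, 10)` agree up to floating-point error, and both match `g(3)`, because the relative position is 3 in each case.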

### General form

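In order to generalize our results in 2D to any $\mathbf{x}_{i}\in\mathbb{R}^{d}$ where $d$ is even, we divide the $d$-dimensional space into $d/2$ two-dimensional sub-spaces and rotate each one independently, so that $f_{\{q,k\}}(\mathbf{x}_{m}, m) = R_{m,\Theta}^{d}\mathbf{W}_{\{q,k\}}\mathbf{x}_{m}$, where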

$$
R_{m,\Theta}^{d} = 
\begin{pmatrix}
  \cos m\theta_{1} & -\sin m\theta_{1} & 0 & 0 & \dots & 0 & 0 \\
  \sin m\theta_{1} &  \cos m\theta_{1} & 0 & 0 & \dots & 0 & 0\\
  0 & 0 &  \cos m\theta_{2} & -\sin m\theta_{2}  & \dots & 0 & 0\\
  0 & 0 &  \sin m\theta_{2} &  \cos m\theta_{2}  & \dots & 0 & 0\\
  \vdots&  \vdots&  \vdots&  \vdots& \ddots &\vdots  &\vdots \\
  0&  0&  0&  0&  \dots&  \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
  0&  0&  0&  0&  \dots&  \sin m\theta_{d/2} &  \cos m\theta_{d/2}
\end{pmatrix}
$$

is the rotary matrix with pre-defined parameters $\Theta = \{\theta_{i}=10000^{-2(i-1)/d},i\in[1,2,\dots,d/2]\}$. RoPE encodes the absolute position with a rotation matrix and, at the same time, incorporates the explicit relative position dependency into the self-attention formulation.
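Because $R_{m,\Theta}^{d}$ is block-diagonal, it never needs to be built explicitly: each adjacent pair of coordinates can be rotated element-wise. The following is a minimal sketch, assuming the pairing of adjacent dimensions shown in the matrix above; the names `rope_angles`, `apply_rope`, and `rope_matrix` are illustrative, and `rope_matrix` exists only to compare against the dense form.

```python
import numpy as np

def rope_angles(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def apply_rope(x, m, base=10000.0):
    """Rotate a vector x of even length d to position m, pair by pair."""
    d = x.shape[-1]
    ang = m * rope_angles(d, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]            # first / second coordinate of each pair
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_matrix(d, m, base=10000.0):
    """Dense block-diagonal rotary matrix, for checking only."""
    R = np.zeros((d, d))
    for i, a in enumerate(m * rope_angles(d, base)):
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[np.cos(a), -np.sin(a)],
                                               [np.sin(a),  np.cos(a)]]
    return R

# Hypothetical query and key vectors (already projected by W_q and W_k).
rng = np.random.default_rng(0)
q_vec, k_vec = rng.normal(size=(2, 8))
```

The element-wise version costs $O(d)$ per vector instead of the $O(d^{2})$ of a dense matrix product, and the inner product of two rotated vectors again depends only on the difference of their positions.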

## Properties of RoPE

**Long-term decay**: as in the sinusoidal embedding, we set $\theta_{i}=10000^{-2(i-1)/d}$ for $i\in[1,2,\dots,d/2]$. One can prove that this setting provides a long-term decay property: the magnitude of the inner product tends to decay as the relative distance between the two positions increases.
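One way to visualize this trend is to look at the partial sums $S_{j} = \sum_{i=1}^{j} e^{\mathrm{i}\,r\,\theta_{i}}$ of the rotation phases at relative distance $r$ and average their magnitudes over $j$. The sketch below assumes this averaged quantity as a proxy for how large the inner product can get; the function name `decay_bound` and the choice $d=128$ are illustrative, and the curve oscillates rather than decreasing strictly monotonically.

```python
import numpy as np

def decay_bound(rel, d=128, base=10000.0):
    """Mean magnitude of the partial sums S_1, ..., S_{d/2} at relative distance rel."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i, i = 1..d/2
    partial = np.cumsum(np.exp(1j * rel * theta))    # S_j for j = 1..d/2
    return np.abs(partial).mean()

# Hypothetical sample of relative distances.
distances = [0, 1, 10, 50, 100, 250]
bounds = [decay_bound(r) for r in distances]
```

At `rel = 0` all phases coincide, so $|S_{j}| = j$ and the average attains its maximum $(d/2+1)/2$; for larger distances the phases spread out and the partial sums partly cancel.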

**RoPE with linear attention**: