Position Encoding

Comments: Positional encoding is crucial in Transformer architectures because the original model lacks an inherent sense of sequence order.

Table of Contents

1. Absolute Position Embedding
2. Sinusoidal Position Encoding
3. Relative Position Encoding
4. Rotary Position Embedding (incl. NTK-aware scaling)

1. Absolute Position Embedding

Implementation

# in the embedding part
import torch
from torch import nn

position_ids = torch.arange(INPUT_LENGTH).unsqueeze(0)        # (1, INPUT_LENGTH)
position_embedding = nn.Embedding(MAX_POSITION, HIDDEN_SIZE)  # learned absolute positions
position_embedding = position_embedding(position_ids)         # (1, INPUT_LENGTH, HIDDEN_SIZE)
# then add into the word embedding
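
A minimal end-to-end sketch (illustrative sizes; every name here is a placeholder rather than any library's API) of how the learned absolute position embedding is added to the token embedding:

# runnable sketch with small illustrative sizes
import torch
from torch import nn

VOCAB_SIZE, MAX_POSITION, HIDDEN_SIZE, INPUT_LENGTH = 100, 512, 16, 8
word_embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)
position_embedding = nn.Embedding(MAX_POSITION, HIDDEN_SIZE)

input_ids = torch.randint(0, VOCAB_SIZE, (1, INPUT_LENGTH))   # (1, INPUT_LENGTH)
position_ids = torch.arange(INPUT_LENGTH).unsqueeze(0)        # (1, INPUT_LENGTH)
hidden_states = word_embedding(input_ids) + position_embedding(position_ids)
print(hidden_states.shape)  # torch.Size([1, 8, 16])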

References

  • Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).
  • Liu, Yinhan et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” ArXiv abs/1907.11692 (2019): n. pag.

2. Sinusoidal Position Encoding

Brief Introduction: $t$ is the position step, $d$ is the hidden size of the model, and $i$ indexes the dimension pairs of the hidden state:

$$ \begin{cases} PE(t, 2i) &= sin(\frac{t}{10000^{2i/d}}) \\ PE(t, 2i+1) &= cos(\frac{t}{10000^{2i/d}}) \end{cases} $$

According to the trigonometric identities $sin(a \pm b) = sin(a)cos(b) \pm cos(a)sin(b)$ and $cos(a \pm b) = cos(a)cos(b) \mp sin(a)sin(b)$:

$$ \begin{cases} PE(t + k, 2i) &= PE(t, 2i) \times PE(k, 2i+1) + PE(t, 2i+1) \times PE(k, 2i) \\ PE(t + k, 2i+1) &= PE(t, 2i+1) \times PE(k, 2i+1) - PE(t, 2i) \times PE(k, 2i) \end{cases} $$

Property 1: $PE_{t+k}$ can be represented as a linear function of $PE_{t}$ whose coefficients depend only on $k$. Therefore $PE_{t}^TPE_{t+k}$ depends on $k$ (made precise in Property 2 below), which indicates that this kind of position encoding has the potential to express relative positions.

let $\theta_i = \frac{1}{10000^{2i/d}}$

$$ PE_{t} = \begin{bmatrix} sin(\theta_0t)\\ cos(\theta_0t)\\ \cdots\\ sin(\theta_{\frac{d}{2}-1}t) \\ cos(\theta_{\frac{d}{2}-1}t) \end{bmatrix} $$

Property2:

$$ \begin{align} PE_{t}^TPE_{t+k} &= \sum_{i=0}^{\frac{d}{2}-1}[sin(\theta_it)sin(\theta_i(t+k)) + cos(\theta_it)cos(\theta_i(t+k))] \\ & = \sum_{i=0}^{\frac{d}{2}-1}cos(\theta_i(t-(t+k))) \\ & = \sum_{i=0}^{\frac{d}{2}-1}cos(\theta_ik)\\ & = PE_{t'}^TPE_{t'+k} \\ & = PE_{t}^TPE_{t-k} \end{align} $$

Since $PE_{t}^TPE_{t+k} = PE_{t}^TPE_{t-k}$, the dot product depends only on $|k|$: sinusoidal position embeddings are unaware of direction.

Implementation

import numpy as np
import torch

# angle table: position_enc[pos, j] = pos / 10000^(2*(j//2)/HIDDEN_SIZE)
position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / HIDDEN_SIZE) for j in range(HIDDEN_SIZE)] for pos in range(MAX_POSITION)])
position_embedding = torch.zeros(MAX_POSITION, HIDDEN_SIZE)
position_embedding[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # even dimensions: sin
position_embedding[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))  # odd dimensions: cos
position_embedding = position_embedding.unsqueeze(0)  # (1, MAX_POSITION, HIDDEN_SIZE)
# inputs + PE
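
A quick numerical check of Property 2, as a self-contained sketch with illustrative sizes: the dot product between two sinusoidal encodings depends only on the offset $k$, and is identical for $+k$ and $-k$.

import numpy as np

MAX_POSITION, HIDDEN_SIZE = 128, 64
position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / HIDDEN_SIZE) for j in range(HIDDEN_SIZE)] for pos in range(MAX_POSITION)])
pe = np.zeros((MAX_POSITION, HIDDEN_SIZE))
pe[:, 0::2] = np.sin(position_enc[:, 0::2])
pe[:, 1::2] = np.cos(position_enc[:, 1::2])

t, k = 50, 7
print(np.allclose(pe[t] @ pe[t + k], pe[t] @ pe[t - k]))            # True: no direction information
print(np.allclose(pe[t] @ pe[t + k], pe[t + 10] @ pe[t + 10 + k]))  # True: depends only on the offset k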

References

  • Vaswani, Ashish et al. “Attention is All you Need.” Neural Information Processing Systems (2017).
  • Yan, Hang et al. “TENER: Adapting Transformer Encoder for Named Entity Recognition.” ArXiv abs/1911.04474 (2019): n. pag.
  • The Annotated Transformer, HarvardNLP's blog

3. Relative Position Encoding

Comments: The core of self-attention is the dot product between queries and keys, which is where relative position information can be injected.

Brief Introduction

  • Original Self-Attention

$W^Q, W^K, W^V$ are parameter matrices. The attention score is calculated as $e_{ij}=\frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d}}$, where $d$ is the hidden size of a single head and $x_i$ is the embedding of the $i^{th}$ token (a row vector here). The attention weight is $\alpha_{ij}=\frac{exp(e_{ij})}{ \sum\limits_{k=1}^{n} exp(e_{ik})}$, and the output is $z_i= \sum\limits_{j=1}^{n} \alpha_{ij}(x_jW^V)$.

  • Fuse Relative Position Information into Self-Attention

The relative position is a pairwise relationship between input elements, represented by vectors $a_{ij}^K, a_{ij}^V$. We first add it to the attention score, $e_{ij}=\frac{(x_iW^Q)(x_jW^K + a_{ij}^K)^T}{\sqrt{d}}$, and then to the output, $z_i= \sum\limits_{j=1}^{n} \alpha_{ij}(x_jW^V + a_{ij}^V)$. In practice, the relative distance is clipped to a maximum range.

  • Transformations

Expanding $q_ik_j^T$ with absolute position embeddings $p_i, p_j$ added to the inputs gives four terms (content-content, content-position, position-content, position-position); different models approximate or replace the position-dependent terms as follows:

$$ \begin{align} q_ik_j^T &= ((x_i+p_i)W^Q)((x_j+p_j)W^K)^T \\ &= x_iW^Q{W^K}^Tx_j^T + x_iW^Q{W^K}^Tp_j^T+p_iW^Q{W^K}^Tx_j^T+p_iW^Q{W^K}^Tp_j^T\\ &\approx x_iW^Q{W^K}^Tx_j^T + x_iW^Q{W^K}^TR_{i-j}^T+uW^Q{W^K}^Tx_j^T+vW^Q{W^K}^TR_{i-j}^T\\ &\approx x_iW^Q{W^K}^Tx_j^T + x_iW^Q{W^{K,R}}^TR_{i-j}^T+u{W^K}^Tx_j^T+v{W^{K,R}}^TR_{i-j}^T \quad (XLNet)\\ &\approx x_iW^Q{W^K}^Tx_j^T + x_iW^Q{W^K}^TR_{i,j}^T+R_{j,i}{W^K}^Tx_j^T \quad (DeBERTa)\\ &\approx x_iW^Q{W^K}^Tx_j^T + \beta_{i,j} \quad (T5)\\ \end{align} $$

Implementation 1

# in the attention part (clipped relative positions, Shaw et al. / BERT-style)
import torch
from torch import nn

position_ids_l = torch.arange(QUERY_LENGTH).view(-1, 1)  # (QUERY_LENGTH, 1)
position_ids_r = torch.arange(KEY_LENGTH).view(1, -1)    # (1, KEY_LENGTH)
distance = position_ids_l - position_ids_r               # (QUERY_LENGTH, KEY_LENGTH)
distance_embedding = nn.Embedding(2 * MAX_POSITION - 1, HEAD_SIZE)
position_embedding = distance_embedding(distance + MAX_POSITION - 1)  # lrd, shifted to non-negative indices
relative_position_scores = torch.einsum("bhld,lrd->bhlr", QUERY, position_embedding)
attention_scores = attention_scores + relative_position_scores

Implementation 2

# in the attention part (sinusoidal relative positions, Transformer-XL style)
import torch
from torch import nn

position_ids = torch.arange(INPUT_LENGTH - 1, -1, -1.0)                        # relative distances, descending
inv_freq = 1 / (10000 ** (torch.arange(0.0, HIDDEN_SIZE, 2.0) / HIDDEN_SIZE))  # (HIDDEN_SIZE/2,)
sinusoid = torch.outer(position_ids, inv_freq)                                 # (INPUT_LENGTH, HIDDEN_SIZE/2)
position_embedding = torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)       # (INPUT_LENGTH, HIDDEN_SIZE)
r = nn.Linear(HIDDEN_SIZE, HEAD * HEAD_SIZE, bias=False)                       # project to per-head relative keys
position_embedding = r(position_embedding).view(INPUT_LENGTH, HEAD, HEAD_SIZE) # (j, n, d)
relative_position_score = torch.einsum("ibnd,jnd->bnij", QUERY, position_embedding)  # QUERY assumed (i, b, n, d)
attention_scores = attention_scores + relative_position_score
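
The T5 row in the transformations above collapses the position terms into a learned scalar bias $\beta_{i,j}$ that depends only on the relative distance. A minimal sketch of that idea (T5 additionally buckets distances log-uniformly, which is omitted here; sizes are illustrative):

# minimal sketch of a T5-style relative position bias (without distance bucketing)
import torch
from torch import nn

QUERY_LENGTH, KEY_LENGTH, MAX_DISTANCE, HEAD = 8, 8, 16, 4
relative_attention_bias = nn.Embedding(2 * MAX_DISTANCE - 1, HEAD)

distance = torch.arange(QUERY_LENGTH).view(-1, 1) - torch.arange(KEY_LENGTH).view(1, -1)
distance = distance.clamp(-(MAX_DISTANCE - 1), MAX_DISTANCE - 1) + MAX_DISTANCE - 1    # shift into [0, 2*MAX_DISTANCE-2]
bias = relative_attention_bias(distance).permute(2, 0, 1).unsqueeze(0)  # (1, HEAD, QUERY_LENGTH, KEY_LENGTH)
# attention_scores = attention_scores + bias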

References

  • Shaw, Peter et al. “Self-Attention with Relative Position Representations.” North American Chapter of the Association for Computational Linguistics (2018).
  • Dai, Zihang et al. “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context.” ArXiv abs/1901.02860 (2019): n. pag.
  • Yang, Zhilin et al. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” Neural Information Processing Systems (2019).
  • Raffel, Colin et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” J. Mach. Learn. Res. 21 (2019): 140:1-140:67.
  • He, Pengcheng et al. “DeBERTa: Decoding-enhanced BERT with Disentangled Attention.” ArXiv abs/2006.03654 (2020): n. pag.
  • 让研究人员绞尽脑汁的Transformer位置编码 (Transformer Position Encodings That Rack Researchers' Brains), Jianlin Su's blog

4. Rotary Position Embedding

Brief Introduction: The primary objective of RoPE (Rotary Position Embedding) is to find an operation that lets the inner product incorporate relative positional information, i.e., to find a solution of the equation $< f(q, m), f(k, n)>=g(q,k,m-n)$. Intuitively, we introduce complex numbers: for an arbitrary $\theta \in (0, \frac{\pi}{2N}]$,

$$ \begin{align} RoPE(x, m) &= xe^{im\theta}\\ < RoPE(q_j,m), RoPE(k_j,n)> &= < q_je^{im\theta}, k_je^{in\theta}>\\ &=(q_je^{im\theta})(k_je^{in\theta})^* \\ &=q_jk_j^*e^{i(m-n)\theta}\\ &=RoPE(q_jk_j^*, m-n) \end{align} $$

For detailed derivation, please refer to the original paper.

Implementation: In two dimensions, multiplication by $e^{im\theta}$ can be written as a matrix, which geometrically corresponds to a rotation:

$$ f(q, m) = qe^{im\theta}= \begin{pmatrix} \cos m\theta & -\sin m\theta\\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \end{pmatrix} $$

For a $d$-dimensional vector, the rotary matrix is a block-diagonal combination of several 2D rotation matrices:

$$ \begin{pmatrix} \cos m\theta_0 & -\sin m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_0 & \cos m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \\ \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix} $$

(Figure omitted: visualization of the implementation.)

Implementation of RoPE:
# in the embedding part (apply rotary position embedding to QUERY and KEY)
import numpy as np
import torch
from torch import nn

position_ids = torch.arange(0, KEY_LENGTH)
position_embedding = nn.Embedding(MAX_POSITION, HEAD_SIZE)
position_embedding.weight.requires_grad = False
# angle table: position_enc[pos, j] = pos / 10000^(2*(j//2)/HEAD_SIZE)
position_enc = np.array([[pos / np.power(10000, 2 * (j // 2) / HEAD_SIZE) for j in range(HEAD_SIZE)] for pos in range(MAX_POSITION)])
position_embedding.weight[:, :HEAD_SIZE // 2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # first half: sin
position_embedding.weight[:, HEAD_SIZE // 2:] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))  # second half: cos
sinusoidal_pos = position_embedding(position_ids)
sin, cos = sinusoidal_pos.chunk(2, dim=-1)
sin_pos = torch.stack([sin, sin], dim=-1).reshape_as(sinusoidal_pos)  # [sin θ_0, sin θ_0, sin θ_1, sin θ_1, ...]
cos_pos = torch.stack([cos, cos], dim=-1).reshape_as(sinusoidal_pos)  # [cos θ_0, cos θ_0, cos θ_1, cos θ_1, ...]
rotate_half_QUERY = torch.stack([-QUERY[..., 1::2], QUERY[..., 0::2]], dim=-1).reshape_as(QUERY)  # [-q_1, q_0, -q_3, q_2, ...]
QUERY = QUERY * cos_pos + rotate_half_QUERY * sin_pos
rotate_half_KEY = torch.stack([-KEY[..., 1::2], KEY[..., 0::2]], dim=-1).reshape_as(KEY)
KEY = KEY * cos_pos + rotate_half_KEY * sin_pos
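
A small self-contained check (a sketch with illustrative sizes) that the rotation above makes the query-key dot product depend only on the relative offset $m-n$:

import numpy as np

def rope(x, pos, base=10000):
    # rotate each 2D pair (x[2i], x[2i+1]) by pos * theta_i
    d = x.shape[-1]
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(pos * theta) - x[1::2] * np.sin(pos * theta)
    out[1::2] = x[0::2] * np.sin(pos * theta) + x[1::2] * np.cos(pos * theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
score_a = rope(q, 5) @ rope(k, 2)      # relative offset 3
score_b = rope(q, 105) @ rope(k, 102)  # same offset, shifted absolute positions
print(np.allclose(score_a, score_b))   # True: only the relative offset matters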

References

  • Su, Jianlin et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” ArXiv abs/2104.09864 (2021): n. pag.

4.1 NTK

NTK-Aware Scaled RoPE allows LLaMA models to have an extended (8k+) context size without any fine-tuning and with minimal perplexity degradation; see the blog post.

Common methods for context length extension:

  • Extrapolation: use positions beyond the training range directly
  • Interpolation (SuperHOT / PI): rescale position $n \rightarrow n/k$ (see the sketch after this list)
  • NTK-aware scaling: see the derivation below
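
As a quick illustration of the interpolation idea (a sketch, assuming positions are simply rescaled into the trained range before the rotary angles are computed; the extension factor k is illustrative):

import torch

k = 4                                            # context extension factor
position_ids = torch.arange(0, 8192)             # extended positions
scaled_position_ids = position_ids.float() / k   # PI: position n -> n/k, back into the trained range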

NTK derivation (simple version):

The sinusoidal (RoPE) encoding at position $n$, written with $\beta = 10000^{2/d}$:

$$ [cos(\frac{n}{\beta^0}), sin(\frac{n}{\beta^0}), \ldots, cos(\frac{n}{\beta^{d/2-1}}), sin(\frac{n}{\beta^{d/2-1}})] $$

NTK aims to combine extrapolation in the high-frequency dimensions (small exponent, $\beta^0$) with interpolation in the low-frequency dimensions (large exponent, $\beta^{d/2-1}$). Specifically, the base $\beta$ is scaled by a factor $\lambda$, chosen so that the lowest-frequency term equals its interpolated value:

$$ \frac{n}{(\lambda\beta)^{d/2-1}} = \frac{n/k}{\beta^{d/2-1}} $$

We get $\lambda=k^{2/(d-2)}$, i.e., the base $\beta$ is enlarged to $\lambda\beta$.
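
A minimal sketch of the resulting inverse frequencies (illustrative constants; in the usual RoPE convention with base 10000, scaling $\beta$ by $\lambda=k^{2/(d-2)}$ is equivalent to scaling the base by $k^{d/(d-2)}$):

import torch

HEAD_SIZE, BASE, k = 64, 10000.0, 4  # k = desired context extension factor

# original RoPE inverse frequencies: 1 / base^(2i/d)
inv_freq = 1.0 / (BASE ** (torch.arange(0, HEAD_SIZE, 2).float() / HEAD_SIZE))

# NTK-aware scaling: enlarge the base so the lowest frequency matches interpolation by k
ntk_base = BASE * k ** (HEAD_SIZE / (HEAD_SIZE - 2))
ntk_inv_freq = 1.0 / (ntk_base ** (torch.arange(0, HEAD_SIZE, 2).float() / HEAD_SIZE))

print(ntk_inv_freq[0])                  # 1.0: highest frequency unchanged (extrapolation)
print(ntk_inv_freq[-1] / inv_freq[-1])  # ~1/k: lowest frequency interpolated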

Another interpretation views RoPE as an encoding in base $\beta$; see reference [1] for details.

Reference

[1] Transformer升级之路:10、RoPE是一种β进制编码 (The Road to Upgrading Transformers, Part 10: RoPE is a Base-β Encoding), Jianlin Su's blog