Comments: Positional encoding is crucial in Transformer architectures because the original model lacks an inherent sense of sequence order.
- Absolute Position Embedding: learnable, absolute
- Sinusoidal Position Encoding: fixed, absolute
- Relative Position Encoding: learnable, relative
- Rotary Position Embedding: fixed, relative
Implementation
# in the embedding part
import torch
from torch import nn
position_ids = torch.arange(INPUT_LENGTH).unsqueeze(0)        # (1, length)
position_embedding = nn.Embedding(MAX_POSITION, HIDDEN_SIZE)
position_embedding = position_embedding(position_ids)         # (1, length, hidden)
# then add to the word embeddings
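The fragment above can be sketched end-to-end with concrete (hypothetical) sizes; the vocabulary size and all dimensions below are illustrative choices, not values from the original:

```python
import torch
from torch import nn

MAX_POSITION, HIDDEN_SIZE, INPUT_LENGTH = 512, 64, 10

word_embedding = nn.Embedding(1000, HIDDEN_SIZE)        # toy vocabulary of 1000 tokens
position_embedding = nn.Embedding(MAX_POSITION, HIDDEN_SIZE)

token_ids = torch.randint(0, 1000, (1, INPUT_LENGTH))   # (batch, length)
position_ids = torch.arange(INPUT_LENGTH).unsqueeze(0)  # (1, length), broadcast over batch

# the position embedding is simply added to the word embedding
hidden_states = word_embedding(token_ids) + position_embedding(position_ids)
print(hidden_states.shape)  # torch.Size([1, 10, 64])
```

Both embedding tables are trained jointly with the rest of the model, which is why this scheme is "learnable, absolute".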
References
- Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).
- Liu, Yinhan et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” ArXiv abs/1907.11692 (2019): n. pag.
Brief Introduction
According to the trigonometric identities
sin(a±b) = sin(a)cos(b) ± cos(a)sin(b)
cos(a±b) = cos(a)cos(b) ∓ sin(a)sin(b)
Property 1: for any fixed offset k, PE(pos+k) is a linear function of PE(pos). Let ω_j = 10000^(-2j/d); then
sin(ω_j(pos+k)) = sin(ω_j pos)cos(ω_j k) + cos(ω_j pos)sin(ω_j k)
cos(ω_j(pos+k)) = cos(ω_j pos)cos(ω_j k) - sin(ω_j pos)sin(ω_j k)
Property 2: PE(t)·PE(t+k) = Σ_j cos(ω_j k), which depends only on |k|, not on the sign of k.
This means that sinusoidal position embeddings are unaware of direction.
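Property 2 can be checked numerically. The sketch below builds the standard sinusoidal encoding (even dimensions sin, odd dimensions cos) and verifies that the dot product is the same for offsets +k and -k; the dimension and positions are arbitrary choices:

```python
import numpy as np

d, base = 64, 10000.0

def pe(pos):
    # standard sinusoidal encoding: even dims get sin, odd dims get cos,
    # with each (2j, 2j+1) pair sharing the frequency base**(-2j/d)
    angles = pos / base ** (2 * (np.arange(d) // 2) / d)
    out = np.empty(d)
    out[0::2] = np.sin(angles[0::2])
    out[1::2] = np.cos(angles[1::2])
    return out

t, k = 20, 7
forward = pe(t) @ pe(t + k)   # looking k positions ahead
backward = pe(t) @ pe(t - k)  # looking k positions behind
print(np.isclose(forward, backward))  # True
```

The two dot products agree because each pair contributes cos(ω_j k), an even function of k.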
Implementation
import numpy as np
import torch
position_enc = np.array([[pos / np.power(10000, 2*(j//2)/HIDDEN_SIZE) for j in range(HIDDEN_SIZE)] for pos in range(MAX_POSITION)])
position_embedding = torch.zeros(MAX_POSITION, HIDDEN_SIZE)
position_embedding[:, 0::2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))
position_embedding[:, 1::2] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))
position_embedding = position_embedding.unsqueeze(0)  # (1, max_position, hidden)
# inputs + PE
References
- Vaswani, Ashish et al. “Attention is All you Need.” Neural Information Processing Systems (2017).
- Yan, Hang et al. “TENER: Adapting Transformer Encoder for Named Entity Recognition.” ArXiv abs/1911.04474 (2019): n. pag.
- The Annotated Transformer, HarvardNLP's blog
Comments: The core of self-attention is the dot product between queries and keys, which makes it a natural place to inject relative position information.
Brief Introduction
- Original Self-Attention
- Fuse Relative Position Information into Self-Attention
The relative position is in fact a pair-wise relationship between input elements, represented by vectors.
- Transformations
Implementation 1
# in the attention part (clipping)
import torch
from torch import nn
position_ids_l = torch.arange(QUERY_LENGTH).view(-1, 1)
position_ids_r = torch.arange(KEY_LENGTH).view(1, -1)
distance = position_ids_l - position_ids_r  # (query_length, key_length)
distance_embedding = nn.Embedding(2*MAX_POSITION-1, HEAD_SIZE)
position_embedding = distance_embedding(distance + MAX_POSITION - 1)  # lrd; shift so indices are non-negative
relative_position_scores = torch.einsum("bhld,lrd->bhlr", QUERY, position_embedding)
attention_scores = attention_scores + relative_position_scores
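A runnable version of Implementation 1 with concrete (hypothetical) sizes; the batch, head, and length values below are illustrative:

```python
import torch
from torch import nn

BATCH, HEADS, MAX_POSITION, HEAD_SIZE = 2, 4, 128, 16
QUERY_LENGTH = KEY_LENGTH = 8

query = torch.randn(BATCH, HEADS, QUERY_LENGTH, HEAD_SIZE)           # bhld
attention_scores = torch.zeros(BATCH, HEADS, QUERY_LENGTH, KEY_LENGTH)

position_ids_l = torch.arange(QUERY_LENGTH).view(-1, 1)              # (L, 1)
position_ids_r = torch.arange(KEY_LENGTH).view(1, -1)                # (1, R)
distance = position_ids_l - position_ids_r                           # (L, R), in [-(R-1), L-1]

distance_embedding = nn.Embedding(2 * MAX_POSITION - 1, HEAD_SIZE)
position_embedding = distance_embedding(distance + MAX_POSITION - 1)  # (L, R, d)
scores = torch.einsum("bhld,lrd->bhlr", query, position_embedding)
attention_scores = attention_scores + scores
print(attention_scores.shape)  # torch.Size([2, 4, 8, 8])
```

Each query position l scores against one learned vector per relative distance, so the same embedding is reused wherever that distance occurs.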
Implementation 2
# in the attention part (sinusoidal, Transformer-XL style)
import torch
from torch import nn
position_ids = torch.arange(INPUT_LENGTH - 1, -1, -1.0)  # reversed relative distances
inv_freq = 1 / (10000 ** (torch.arange(0.0, HIDDEN_SIZE, 2.0) / HIDDEN_SIZE))
sinusoid = torch.outer(position_ids, inv_freq)                       # (l, d/2)
position_embedding = torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)  # (l, d)
r = nn.Linear(HIDDEN_SIZE, HEAD * HEAD_SIZE, bias=False)
position_embedding = r(position_embedding).view(INPUT_LENGTH, HEAD, HEAD_SIZE)
relative_position_score = torch.einsum("ibnd,jnd->bnij", QUERY, position_embedding)
attention_scores = attention_scores + relative_position_score
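A self-contained sketch of Implementation 2 with concrete (hypothetical) sizes, checking that the shapes line up; all dimensions are illustrative:

```python
import torch
from torch import nn

INPUT_LENGTH, HIDDEN_SIZE, HEAD, HEAD_SIZE, BATCH = 8, 32, 4, 8, 2

# reversed positions: entry j holds the relative distance INPUT_LENGTH - 1 - j
position_ids = torch.arange(INPUT_LENGTH - 1, -1, -1.0)
inv_freq = 1 / (10000 ** (torch.arange(0.0, HIDDEN_SIZE, 2.0) / HIDDEN_SIZE))
sinusoid = torch.outer(position_ids, inv_freq)                     # (L, d/2)
pos_emb = torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)      # (L, d)

# project the fixed sinusoids into per-head key space
r = nn.Linear(HIDDEN_SIZE, HEAD * HEAD_SIZE, bias=False)
r_head_k = r(pos_emb).view(INPUT_LENGTH, HEAD, HEAD_SIZE)          # (L, h, d_head)

query = torch.randn(INPUT_LENGTH, BATCH, HEAD, HEAD_SIZE)          # ibnd layout
score = torch.einsum("ibnd,jnd->bnij", query, r_head_k)
print(score.shape)  # torch.Size([2, 4, 8, 8])
```

Unlike Implementation 1, the relative embeddings here are fixed sinusoids; only the projection `r` is learned.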
References
- Shaw, Peter et al. “Self-Attention with Relative Position Representations.” North American Chapter of the Association for Computational Linguistics (2018).
- Dai, Zihang et al. “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context.” ArXiv abs/1901.02860 (2019): n. pag.
- Yang, Zhilin et al. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” Neural Information Processing Systems (2019).
- Raffel, Colin et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” J. Mach. Learn. Res. 21 (2019): 140:1-140:67.
- He, Pengcheng et al. “DeBERTa: Decoding-enhanced BERT with Disentangled Attention.” ArXiv abs/2006.03654 (2020): n. pag.
- 让研究人员绞尽脑汁的Transformer位置编码 (Transformer position encodings that rack researchers' brains), Jianlin Su's blog
Brief Introduction
The primary objective of RoPE (Rotary Position Embedding) is to find an operation that injects relative position information through the inner product, i.e. to find functions f and g satisfying ⟨f(q, m), f(k, n)⟩ = g(q, k, m-n).
For detailed derivation, please refer to the original paper.
Implementation
In the two-dimensional case, a complex number can be represented as a matrix, which geometrically corresponds to a rotation. In higher dimensions, the rotary matrix is a block-diagonal combination of several 2D rotation matrices.
(Figure: visualization of the implementation.)
# in the embedding part (apply the rotary position embedding to QUERY and KEY)
import numpy as np
import torch
from torch import nn
position_enc = np.array([[pos / np.power(10000, 2*(j//2)/HEAD_SIZE) for j in range(HEAD_SIZE)] for pos in range(MAX_POSITION)])
weight = torch.zeros(MAX_POSITION, HEAD_SIZE)
weight[:, :HEAD_SIZE//2] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))  # first half: sin
weight[:, HEAD_SIZE//2:] = torch.FloatTensor(np.cos(position_enc[:, 0::2]))  # second half: cos
position_embedding = nn.Embedding.from_pretrained(weight, freeze=True)
position_ids = torch.arange(0, KEY_LENGTH)
sinusoidal_pos = position_embedding(position_ids)
sin, cos = sinusoidal_pos.chunk(2, dim=-1)
sin_pos = torch.stack([sin, sin], dim=-1).reshape_as(sinusoidal_pos)
cos_pos = torch.stack([cos, cos], dim=-1).reshape_as(sinusoidal_pos)
rotate_half_QUERY = torch.stack([-QUERY[..., 1::2], QUERY[..., ::2]], dim=-1).reshape_as(QUERY)
QUERY = QUERY * cos_pos + rotate_half_QUERY * sin_pos
rotate_half_KEY = torch.stack([-KEY[..., 1::2], KEY[..., ::2]], dim=-1).reshape_as(KEY)
KEY = KEY * cos_pos + rotate_half_KEY * sin_pos
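RoPE's defining property is that the rotated inner product depends only on the relative offset m-n. A minimal NumPy sketch checking this, with an arbitrary head size and positions:

```python
import numpy as np

HEAD_SIZE = 8
base = 10000.0

def rotate(x, pos):
    # rotate each 2D pair (x[2j], x[2j+1]) by the angle pos * theta_j
    theta = base ** (-2 * np.arange(HEAD_SIZE // 2) / HEAD_SIZE)
    ang = pos * theta
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=HEAD_SIZE), rng.normal(size=HEAD_SIZE)

# <q_m, k_n> depends only on the relative offset m - n (here 4 in both cases)
s1 = rotate(q, 10) @ rotate(k, 6)
s2 = rotate(q, 25) @ rotate(k, 21)
print(np.isclose(s1, s2))  # True
```

This holds because R(m)ᵀR(n) = R(n-m) for rotation matrices, so the score is a function of q, k, and m-n only.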
References
- Su, Jianlin et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” ArXiv abs/2104.09864 (2021): n. pag.
- Transformer升级之路:2、博采众长的旋转式位置编码 (Upgrading the Transformer, Part 2: rotary position embedding), Jianlin Su's blog
- Rotary Embeddings: A Relative Revolution, Eleuther's blog
- Positional Encodings I. Main Approaches, Medium
NTK-Aware Scaled RoPE allows LLaMA models to have an extended (8k+) context size without any fine-tuning and with minimal perplexity degradation. See blog.
Common methods for context length extension: direct extrapolation, position interpolation (linearly rescaling position indices), and NTK-aware scaling of the RoPE base.
NTK derivation (simple version):
The sinusoidal position encoding at position n uses angles nθ_j with θ_j = 10000^(-2j/d) for j = 0, ..., d/2-1, so θ_0 is the highest frequency and θ_{d/2-1} the lowest.
NTK-aware scaling wants to combine extrapolation for the high-frequency components with interpolation for the low-frequency ones. Specifically, it replaces the base 10000 with 10000·λ, where λ = k^(d/(d-2)) and k is the context-extension factor.
With the new base, the angle at dimension j becomes nθ_j·λ^(-2j/d): at j = 0 it is unchanged (pure extrapolation), while at j = d/2-1 it becomes nθ_{d/2-1}/k, which is exactly position interpolation n → n/k at the lowest frequency.
Another interpretation is that RoPE encodes the position n as digits in base β = 10000^(2/d); NTK-aware scaling enlarges the base, so the same number of "digits" can represent more positions.
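The two boundary cases of the derivation can be verified directly; the head dimension and extension factor below are arbitrary:

```python
import numpy as np

d, k = 64, 8                          # head dim, desired context-extension factor
base = 10000.0
new_base = base * k ** (d / (d - 2))  # NTK-aware scaled base

j = np.arange(d // 2)
theta = base ** (-2 * j / d)          # original RoPE frequencies
theta_ntk = new_base ** (-2 * j / d)  # frequencies after base scaling

print(np.isclose(theta_ntk[0], theta[0]))        # True: highest frequency unchanged
print(np.isclose(theta_ntk[-1], theta[-1] / k))  # True: lowest frequency interpolated by k
```

Intermediate dimensions are scaled by smoothly varying factors between 1 and 1/k, which is why no fine-tuning is needed.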
References
- Transformer升级之路:10、RoPE是一种β进制编码 (Upgrading the Transformer, Part 10: RoPE is a base-β encoding), Jianlin Su's blog