# Embedding

----

This chapter will introduce about the **Embedding**.


We have the tokenizer now. It will recieve a text, and convert it to a list of index. As we begin the tokenizer, we have a intro to build a dictionary. Yeah, what tokenizer do is to look up the dictionary and return the list of index. However this index is discreted. Take a example the number $1$ and the number $2$. Between them, we have many numbers like $1.1, 1.1123, 1.3, 1.5, 1.679, ...$. So we say the integer index is discreted. And this is not good for representing semantics. Like the word `cat`'s index is 100, and the index of word `dog` is 200, and the index of word `car` is 300. We know `cat` and `dog` is more similar, they are animals. But they `dog` has the same distance to `cat` and `car`. We need some method that is continuous to demenstrate those semantics, so there comes **embedding**.


In LLM, the embedding recieves the output of tokenizer's encode the index of the text, and embedding will generate a matrix or 2 dimension tensor. How it could be? In truth we can see embedding as a table, and in the table all are float-point number. They are continuous. And with a index you will find one row vector to it. It likes a index table. 

Now we will feel that.

---

## 1 token embedding

In [1]:
import torch
print("torch version:", torch.__version__)

torch version: 2.6.0+cu124


Let's import tokenizer and the prepared corpus

In [2]:
import sys

sys.path.append("../")

import ast
from module.code.tokenizer import Tokenizer

with open("../corpus/pretext.txt", 'r', encoding="utf-8") as file:
    str_vocab = file.read()

list_vocab = ast.literal_eval(str_vocab)

length = len(list_vocab)

tokenizer = Tokenizer(list_vocab)

In [3]:
# Set rand seed, and the result will easily repeat
torch.manual_seed(89)

embed = torch.rand((length, 4))

print(embed.shape)

for index in range(length):
    print(f"{index}: {embed[index]}")

torch.Size([134, 4])
0: tensor([0.9217, 0.2847, 0.5199, 0.6047])
1: tensor([0.0738, 0.3831, 0.8120, 0.5090])
2: tensor([0.4314, 0.4678, 0.0935, 0.5897])
3: tensor([0.0644, 0.6565, 0.9434, 0.9326])
4: tensor([0.5674, 0.8739, 0.7428, 0.2756])
5: tensor([0.8946, 0.3077, 0.7842, 0.6342])
6: tensor([0.3387, 0.6168, 0.4277, 0.4966])
7: tensor([0.0829, 0.2089, 0.8593, 0.9363])
8: tensor([0.3632, 0.3655, 0.4200, 0.4381])
9: tensor([0.5909, 0.2297, 0.5391, 0.3824])
10: tensor([0.4746, 0.7611, 0.1532, 0.8853])
11: tensor([0.2583, 0.0882, 0.6690, 0.6178])
12: tensor([0.3439, 0.8106, 0.7882, 0.7749])
13: tensor([0.3926, 0.1875, 0.8925, 0.5845])
14: tensor([0.5174, 0.0086, 0.7794, 0.7607])
15: tensor([0.9466, 0.6627, 0.4734, 0.8191])
16: tensor([0.4282, 0.4492, 0.6313, 0.9083])
17: tensor([0.9886, 0.2532, 0.3203, 0.8473])
18: tensor([0.4188, 0.9219, 0.8381, 0.3018])
19: tensor([0.1925, 0.4732, 0.1322, 0.7411])
20: tensor([0.2841, 0.9052, 0.3563, 0.1473])
21: tensor([0.7996, 0.0381, 0.4922, 0.6123])

In [4]:
text = "What can I hold you with?"
ids = tokenizer.encode(text)
print(ids)

[16, 35, 14, 58, 132, 127, 11]


print the first row vector

In [5]:
print(embed[0])

tensor([0.9217, 0.2847, 0.5199, 0.6047])


then, print the ids list Corresponding vector

In [6]:
print(embed[ids])

tensor([[0.4282, 0.4492, 0.6313, 0.9083],
        [0.6386, 0.5200, 0.8779, 0.3204],
        [0.5174, 0.0086, 0.7794, 0.7607],
        [0.2931, 0.6655, 0.5696, 0.2135],
        [0.4737, 0.5361, 0.4853, 0.9429],
        [0.9863, 0.3026, 0.6688, 0.8363],
        [0.2583, 0.0882, 0.6690, 0.6178]])


In [7]:
for id in ids:
    print(f"{id}: {embed[id]}")

16: tensor([0.4282, 0.4492, 0.6313, 0.9083])
35: tensor([0.6386, 0.5200, 0.8779, 0.3204])
14: tensor([0.5174, 0.0086, 0.7794, 0.7607])
58: tensor([0.2931, 0.6655, 0.5696, 0.2135])
132: tensor([0.4737, 0.5361, 0.4853, 0.9429])
127: tensor([0.9863, 0.3026, 0.6688, 0.8363])
11: tensor([0.2583, 0.0882, 0.6690, 0.6178])


Actually, like what we say above, it's a index table. With the index and you can find it's vector.

That's it! May you feel a little bit curious. Aha, then the embedding ends? It is and it is not. Actually in the LLM, the embedding will be trained. In practice, we will give it a sentence, and let it to predict the next word is. That's what the LLM do. Given some word, and predixt the next word, then append the new word to the original sentence, then go on predict the next word. So In the progress, we need to give the LLM with original sentence(input), and the next word. The next word is our target.

So we need to prepare them the original sentence and the next word. Actually in the LLM, we will do like this. Given the sentence **"What does LLM do to structure that"**.  We use a `sliding window`. Assuming the window can contain 4 words. And the first input is: **"What does LLM do"**, and it's target is: **"does LLM do to"**. Then the second input is: **"does LLM do to"**, corresponding target is: **"LLM do to structure"**. That's it, the window will move one step to generate the target based on the input.

And the description is drawed to a picture as below.

<center><img src="https://github.com/gzqccnu/img/blob/main/LLM-input-target.png?raw=true"></center>

We use `context_size` to replace how many words the sliding window contaian.

First of all, we import `re` for we must have the original text spliting to words

In [8]:
import re

In [9]:
context_size = 4
orig_stc = "What does LLM do to structure that"
preprocess_stc = re.split('(\s|[,.;:\'\"()])', orig_stc)

for i in range(len(preprocess_stc)):
    print(f"input:{preprocess_stc[i: i + context_size]} ==> target: {preprocess_stc[i + 1: i + 1 + context_size]}")

input:['What', ' ', 'does', ' '] ==> target: [' ', 'does', ' ', 'LLM']
input:[' ', 'does', ' ', 'LLM'] ==> target: ['does', ' ', 'LLM', ' ']
input:['does', ' ', 'LLM', ' '] ==> target: [' ', 'LLM', ' ', 'do']
input:[' ', 'LLM', ' ', 'do'] ==> target: ['LLM', ' ', 'do', ' ']
input:['LLM', ' ', 'do', ' '] ==> target: [' ', 'do', ' ', 'to']
input:[' ', 'do', ' ', 'to'] ==> target: ['do', ' ', 'to', ' ']
input:['do', ' ', 'to', ' '] ==> target: [' ', 'to', ' ', 'structure']
input:[' ', 'to', ' ', 'structure'] ==> target: ['to', ' ', 'structure', ' ']
input:['to', ' ', 'structure', ' '] ==> target: [' ', 'structure', ' ', 'that']
input:[' ', 'structure', ' ', 'that'] ==> target: ['structure', ' ', 'that']
input:['structure', ' ', 'that'] ==> target: [' ', 'that']
input:[' ', 'that'] ==> target: ['that']
input:['that'] ==> target: []


To encode it.

In [10]:
# Attention: here we must use the original setence as input
encoded_stc = tokenizer.encode(orig_stc)

for i in range(len(encoded_stc)):
    print(f"input:{encoded_stc[i: i + context_size]} ==> target: {encoded_stc[i + 1: i + 1 + context_size]}")

input:[16, 10, 10, 10] ==> target: [10, 10, 10, 116]
input:[10, 10, 10, 116] ==> target: [10, 10, 116, 10]
input:[10, 10, 116, 10] ==> target: [10, 116, 10, 110]
input:[10, 116, 10, 110] ==> target: [116, 10, 110]
input:[116, 10, 110] ==> target: [10, 110]
input:[10, 110] ==> target: [110]
input:[110] ==> target: []


mock the progress of generating.

In [11]:
for i in range(len(encoded_stc)):
    print(f"{encoded_stc[:i]} == > {encoded_stc[i]}")

[] == > 16
[16] == > 10
[16, 10] == > 10
[16, 10, 10] == > 10
[16, 10, 10, 10] == > 116
[16, 10, 10, 10, 116] == > 10
[16, 10, 10, 10, 116, 10] == > 110


In [12]:
for i in range(len(encoded_stc)):
    print(f"{tokenizer.decode(encoded_stc[:i])} ==> {tokenizer.decode([encoded_stc[i]])}")

 ==> What
What ==> <|unk|>
What <|unk|> ==> <|unk|>
What <|unk|> <|unk|> ==> <|unk|>
What <|unk|> <|unk|> <|unk|> ==> to
What <|unk|> <|unk|> <|unk|> to ==> <|unk|>
What <|unk|> <|unk|> <|unk|> to <|unk|> ==> that


Note that: here our corpus is a poetry, so it will look like very curious.

Let's has the first sentence of the poetry to have a try.

In [13]:
text = "What can I hold you with?"

encoded_text = tokenizer.encode(text)

for i in range(len(encoded_text)):
    print(f"{tokenizer.decode(encoded_text[:i])} ==> {tokenizer.decode([encoded_text[i]])}")

 ==> What
What ==> can
What can ==> I
What can I ==> hold
What can I hold ==> you
What can I hold you ==> with
What can I hold you with ==> ?


Here, we can also use `PyTorch`'s embedding. And it requires two parameters. First is the `vocabulary size` then is the `output_dim`. The `output_dim` dedicates how long the embeded vector is after embedding.

In [14]:
vocab_size = length
out_dim = 128

token_embedding_layer = torch.nn.Embedding(vocab_size, out_dim)

token_embedding = token_embedding_layer(torch.tensor(encoded_stc))

print(token_embedding)

tensor([[-8.5620e-01, -9.0666e-02, -2.2929e-01, -2.8675e-01,  1.2585e+00,
         -4.6703e-01,  4.9767e-01,  9.7761e-01, -6.9830e-01,  1.1584e+00,
         -6.2008e-01, -1.3283e+00, -1.6209e+00, -2.4759e+00, -6.3094e-02,
          6.2485e-02,  1.6153e+00,  5.6440e-01, -6.1894e-01, -1.6521e+00,
         -2.5804e-01, -4.0598e-01, -1.7543e+00, -6.0936e-01,  1.1187e+00,
         -6.3776e-01, -2.8997e-01,  2.6506e-01,  7.7799e-01, -1.1255e+00,
         -1.0479e-01, -9.3177e-01,  1.1920e+00,  1.4861e+00, -1.0375e+00,
          1.5590e+00,  4.8431e-01,  1.7005e+00, -1.9703e+00, -1.4964e-01,
          9.1853e-01, -8.2487e-01,  2.2310e+00, -5.8567e-01, -1.1222e+00,
         -8.3964e-01,  6.5284e-01, -4.3925e-01, -7.0254e-01, -1.2963e+00,
         -7.7522e-01,  1.0023e+00, -5.2820e-01,  1.1389e+00,  2.4697e+00,
          2.0219e+00,  1.5224e-01,  4.8015e-01, -1.7240e-01,  1.5409e+00,
         -1.2310e-01, -1.3589e+00,  4.5504e-01, -1.6065e+00,  1.7667e+00,
         -1.2207e+00, -1.2935e-01,  8.

---

## 2 position embedding

Just now, we implement the token embedding. And now we must to embed its position. In practice, there are two ways, such as **Learned Positional Embedding**, **Sinusoidal Positional Embedding** and **Rotary Position Embedding** also called **RoPE**

In this section, we talk about all of the three embedding methods.

---
### 2.1 Learned Position Embedding

This is the most simple position embedding, and it's implemention is also simple.

First, we initialize a group of learnable vectors.

Unlike token embedding recieve `vocab_size` and embed it to `out_dim`. Position embedding should recieve the `context_size` and embed it to `out_dim` for we are embedding the position. Here we must note: the token embedding's `out_dim` must be the same with position embedding's `out_dim`. Because in the later we will plus them. And the operated one is finally what we want. It contains the semantics also the position information.

In [15]:
context_size = 4

pos_embedding_layer = torch.nn.Embedding(context_size, out_dim)

pos_embedding = pos_embedding_layer(torch.arange(context_size))

Then we plus the token_embedding and pos_embedding. It's the final embedding of the `input`

In [16]:
try:
    input_embedding = token_embedding + pos_embedding
    print(input_embedding)
except Exception as e:
    print("error:", e)

error: The size of tensor a (7) must match the size of tensor b (4) at non-singleton dimension 0


F***, why there make a mistake? Encounting this, we must have to learn to read the error information. It said: the tensor don't match in the first dimension. And we have a check.

In [17]:
print("token_embedding shape:", token_embedding.shape)
print("pos_embedding shape:", pos_embedding.shape)

token_embedding shape: torch.Size([7, 128])
pos_embedding shape: torch.Size([4, 128])


And we see their shape don't match. Here actually casued for the `context_size`, we set it to 4. But the stence is not 4. Here I play a little joke. 

In [18]:
context_size = 7

pos_embedding_layer = torch.nn.Embedding(context_size, out_dim)
pos_embedding = pos_embedding_layer(torch.arange(context_size))

input_embedding = token_embedding + pos_embedding

print(input_embedding)

tensor([[ 0.4802, -1.0503, -1.6657, -0.9118, -0.2137, -0.4465, -0.3929,  1.7841,
          0.3638,  0.7323, -0.8227, -1.7517,  0.9578, -4.7607,  1.1408,  0.4597,
          2.1275, -0.1138, -2.9951, -2.5843,  0.6740, -0.9421, -0.9980,  1.5200,
         -0.5968, -0.3624,  0.8909, -0.6683, -0.0524, -1.9236, -2.0040, -0.4103,
          0.8952,  0.7189, -1.4367,  0.9294,  2.4446,  1.5465, -2.5476,  0.8322,
          1.3615, -0.9421,  2.6232,  0.2398,  0.0478, -1.3550,  0.4974, -1.3270,
         -0.8104, -2.0113, -1.9560,  1.0423, -0.2453,  1.7721,  1.9592,  1.3576,
          0.9166,  0.4446,  0.3640,  1.6002, -0.2627, -3.0990, -0.5889, -3.1357,
          0.4318, -1.1536,  0.7579,  0.0335, -0.3818,  2.9200,  1.2804,  2.4870,
          2.8983, -2.7241,  1.0430,  1.0653,  0.8239, -0.1541,  0.4653, -1.6424,
          1.3140, -0.8844, -0.1992,  0.4843,  2.8931, -0.2785,  0.4063,  1.3585,
          2.1602,  1.4551, -1.9558, -1.7038,  1.9775, -0.6636,  0.3163, -0.1401,
          1.4877,  0.5716,  

But there a question: what if another sentence come with different size? Then we should build another embeddingm and if we continously recieve sentence, our memory will overflow. So we should introduce some method to partition long sentence to fixed length piece. There we use `Dataset` and `DataLoader`. If you don't know what this is, This is [Dataset](../basic_pytorch/data/Dataset/Dataset.ipynb) and this is [DataLoader](../basic_pytorch/data/DataLoader/DataLoader.ipynb)

In [19]:
from torch.utils.data import Dataset

class CustomDatset(Dataset):
    
    # instantiate function
    def __init__(self, text, tokenizer, max_len, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(text)
        assert len(token_ids) > max_len, "Number of tokenized inputs must at %max_len + 1" % max_len

        for i in range(0, len(token_ids) - max_len, stride):
            input_chunk = token_ids[i : i + max_len]
            target_chunk = token_ids[i + 1 : i + 1 + max_len]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


Have the dataset, then we can use PyTorch's DataLoader to automatically generate data for us. It will call the Dataset's `__getitem__` method to return datas for us.

Here we don't use our own tokenizer. That means we don't use the corpus of that English poem. We use pre-trained tokenizer instead.

In [20]:
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

In [21]:
def create_dataloader(
        text, batch_size=4, max_len=128, stride=4,
        shuffle=True, drop_last=True, num_workers=0,
        model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", cache_dir="../tokenizer/deepseek"
):
    # Initialize pre-trained tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    
    # Create dataset, using our CustomDataset class
    dataset = CustomDatset(text, tokenizer, max_len, stride)

    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Test

In [22]:
with open("../corpus/poetry.txt", "r") as file:
    raw_text = file.read()
print(raw_text)

What can I hold you with?

I offer you lean streets, desperate sunsets, the moon of the jagged suburbs.

I offer you the bitterness of a man who has looked long and long at the lonely moon.

I offer you my ancestors, my dead men, the ghosts that living men have honoured in bronze: my father's father killed in the frontier of Buenos Aires, two bullets through his lungs, bearded and dead, wrapped by his soldiers in the hide of a cow; my mother's grandfather --just twentyfour--heading a charge of three hundred men in Peru, now ghosts on vanished horses.

I offer you whatever insight my books may hold, whatever manliness or humour my life.

I offer you the loyalty of a man who has never been loyal.

I offer you that kernel of myself that I have saved, somehow --the central heart that deals not in words, traffics not with dreams, and is untouched by time, by joy, by adversities.

I offer you the memory of a yellow rose seen at sunset, years before you were born.

I offer you explanations of

In [23]:
dataloader = create_dataloader(
    raw_text, batch_size=1, max_len=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)

print(first_batch)

[tensor([[151646,   3838,    646,    358]]), tensor([[3838,  646,  358, 3331]])]


In [24]:
second_batch = next(data_iter)

print(second_batch)

[tensor([[3838,  646,  358, 3331]]), tensor([[ 646,  358, 3331,  498]])]


In [25]:
dataloader = create_dataloader(
    raw_text, batch_size=8, max_len=4, stride=4, shuffle=False 
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Inputs:\n", inputs)
print("\ntargets:\n", targets)

Inputs:
 tensor([[151646,   3838,    646,    358],
        [  3331,    498,    448,   1939],
        [    40,   3010,    498,  15651],
        [ 14371,     11,  27395,   7015],
        [  4917,     11,    279,  17788],
        [   315,    279,  26742,   3556],
        [ 46913,    382,     40,   3010],
        [   498,    279,  78996,    315]])

targets:
 tensor([[ 3838,   646,   358,  3331],
        [  498,   448,  1939,    40],
        [ 3010,   498, 15651, 14371],
        [   11, 27395,  7015,  4917],
        [   11,   279, 17788,   315],
        [  279, 26742,  3556, 46913],
        [  382,    40,  3010,   498],
        [  279, 78996,   315,   264]])


So, we can use the format data, they all have the same length. So they don't need to use so many Embedding, instead one.

Cause we have replace our tokenizer to the pretrained one. So it's vocab changed. We must use the pretrained's vocab, or we will encounter error of out of index for the tokenizered id exceeds our own vocab.

In [26]:
# get the vocab's size
# first we initialize the tokenizer
model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
cache_dir="../tokenizer/deepseek"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

vocab_size = len(tokenizer.get_vocab())
out_dim = 4
context_size = 4

token_embedding_layer = torch.nn.Embedding(vocab_size, out_dim)

pos_embedding_layer = torch.nn.Embedding(context_size, out_dim)

print("-"*100)
print("token_embedding_layer shape:", token_embedding_layer.weight.shape)
print("token_embedding_layer:", token_embedding_layer.weight)
print("-"*100)
print("pos_embedding_layer shape:", pos_embedding_layer.weight.shape)
print("pos_embedding_layer:", pos_embedding_layer.weight)

----------------------------------------------------------------------------------------------------
token_embedding_layer shape: torch.Size([151665, 4])
token_embedding_layer: Parameter containing:
tensor([[ 0.7939,  0.8734,  0.6884,  1.7565],
        [-1.5711,  0.4150,  1.3614, -0.6327],
        [-1.4541,  2.1737,  0.5489,  0.5346],
        ...,
        [ 0.0847,  0.3511, -0.0932, -0.6103],
        [-0.6879,  1.9135,  0.4878,  0.7240],
        [ 1.4361, -0.9227, -0.8750,  1.1611]], requires_grad=True)
----------------------------------------------------------------------------------------------------
pos_embedding_layer shape: torch.Size([4, 4])
pos_embedding_layer: Parameter containing:
tensor([[-0.7671, -0.1381, -1.2786,  0.1141],
        [-0.1188, -1.3545, -1.5097,  0.0086],
        [-1.2312, -0.3264, -1.4235, -0.1373],
        [-0.4512,  2.0651,  0.7018, -0.4913]], requires_grad=True)


In [27]:
token_embedding = token_embedding_layer(inputs)
pos_embedding = pos_embedding_layer(torch.arange(context_size))

input_embedding = token_embedding + pos_embedding

print("-" * 100)
print("token_embedding shape:", token_embedding.shape)
print("token_embedding:\n", token_embedding)
print("-" * 100)
print("pos_embedding shape:", pos_embedding.shape)
print("pos_embedding:\n", pos_embedding)
print("-" * 100)
print("inpu_embedding shape:", input_embedding.shape)
print("input embedding:\n", input_embedding)

----------------------------------------------------------------------------------------------------
token_embedding shape: torch.Size([8, 4, 4])
token_embedding:
 tensor([[[ 1.6524, -0.5745,  0.4926, -0.0387],
         [-0.7643, -0.6054, -3.2532, -0.1840],
         [ 0.1903,  0.3190,  0.4248,  0.8736],
         [ 0.8863, -0.0117, -0.8368,  0.9156]],

        [[-1.1193,  0.8943, -0.7088, -0.6777],
         [-0.5820,  2.1907, -1.4669, -1.0537],
         [ 0.9073,  0.4651, -0.6132, -0.7197],
         [ 0.5991,  2.3032,  0.7870,  0.5889]],

        [[-0.7504,  0.9921,  1.7085,  0.5891],
         [-0.1655, -0.8471, -0.2278,  0.7334],
         [-0.5820,  2.1907, -1.4669, -1.0537],
         [ 1.4607, -0.4790, -1.2052,  0.0586]],

        [[ 0.3028, -1.1045,  0.2325, -0.3734],
         [ 0.5084, -0.7684, -1.5536,  0.7362],
         [ 0.8154,  1.1250,  0.7514, -0.5589],
         [-0.3895,  1.0679,  1.5536, -1.5172]],

        [[-0.5075, -0.1102,  0.3038, -1.1680],
         [ 0.5084, -0.7684, -

---

### 2.2 Sinusoidal Position Embedding

In the original transformer's paper, it uses this embedding method.

And we first present the math mechanism and implement it in code.

Using `sin` and `cos` can indicate the relative position. We know `sin` and `cos` functions are periodic function. For a offset `k` and position `pos+k`'s pos embedding, we can demonstrate it by `pos`'s pos embedding. 

First we must to know the embedding expression: <br>
$$ PE_{(pos, 2i)} = \sin \big(\frac{pos}{1000^{2i / d}} \big) $$
$$ PE_{(pos, 2i + 1)} = \cos \big(\frac{pos}{1000^{2i + 1 / d}} \big) $$
in which $ \mathbf{i} $ is the dimension index. $ \mathbf{pos} $ is the position index. $ \mathbf{d} $ is the model dimension.

We have this principle:
$$ PE_{pos + k} = \textbf{M}_k \cdot PE_{pos} $$

Here's proof.

Using $\omega_i = \frac{1}{1000^{2i / d}}$, so
$$ PE_{(pos, 2i)} = \sin ( \omega_i * pos ) $$
$$ PE_{(pos, 2i + 1)} = \cos ( \omega_i * pos) $$
Then, consider position `pos + k`'s embedding(having the same dimension index)
$$ PE_{(pos + k, 2i)} = \sin ( \omega_i * (pos + k)) = \sin ( \omega_i * pos + \omega_i * k) $$
$$ PE_{(pos + k, 2i + 1)} = \cos ( \omega_i * (pos + k)) = \cos (\omega_i * pos + \omega_i * k) $$
let $ A = \omega_i * pos $, $B = \omega_i * k $. The original formula is equal to
$$ PE_{(pos+k, 2i)}   = \sin (ω_i * pos + ω_i * k) = \sin (ω_i * pos) \cos (ω_i * k) + \cos (ω_i * pos) \sin (ω_i * k) $$
$$ PE_{(pos+k, 2i+1)} = \cos (ω_i * pos + ω_i * k) = \cos (ω_i * pos) \cos (ω_i * k) - \sin (ω_i * pos) \sin (ω_i * k) $$
Look closely at the right side of the above two equations. They are exactly a linear combination of $ PE_{(pos, 2i)} $ and $ PE_{(pos, 2i 1)} $:
$$ PE_{(pos+k, 2i)}   = [\cos (ω_i * k)] * PE_{(pos, 2i)} + [\sin (ω_i * k)] * PE_{(pos, 2i+1)} \tag{1} $$
$$ PE_{(pos+k, 2i+1)} = [-\sin (ω_i * k)] * PE_{(pos, 2i)} + [\cos (ω_i * k)] * PE_{(pos, 2i+1)} \tag{2} $$

And we can use the matrix multiplication to simplify it.
let 
$$ 
\bf{M}_k^{(i)} = \begin{pmatrix}
            \cos (ω_i * k) & \sin (ω_i * k) \newline
            -\sin (ω_i * k) & \cos (ω_i * k) \newline
            \end{pmatrix} $$
It's a rotation matrix. More importantly, the matrix is only related to paramter: $ k $ the `offset`. It means: The model can be transformed by learning linear transformations $ \bf{M}_k $ to capture relative position relationships without explicitly training position parameters. If you don't know about matirx, you just need to know the previous version equation `(1)` and `(2)` that don't use matrix and the meaning.

> More over, if you want to know about rotation matirx, please click here: [Introduction to Rotaion Matrix](https://articulatedrobotics.xyz/tutorials/coordinate-transforms/rotation-matrices-2d/)

Then we implement it.

In [28]:
from torch import nn
import math

class SinPosEmbedding(nn.Module):

    def __init__(
            self, d_model: int, max_len: int=5000
    ):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even"

        # create pos index matrix
        position = torch.arange(max_len).unsqueeze(1) # (max_len, 1)

        # compute 1 / 1000^{2i / d} here d == d_model
        # Here we use the exp and ln to prevent overflow
        # 1000^{2i / d_model} = exp(ln(1000^{2i / d_model}))
        # = exp(2i / d_model * ln(1000))
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(1000.0) / d_model)
        )

        # compute pos embedding matrix (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # register as untrainable buffer
        self.register_buffer('pe', pe)
        self.d_model = d_model
        self.max_len = max_len
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: input tensor in other words the original embeded token (batch_size, seq_len, d_model)
        return: tensor with pos embedding
        """
        seq_len = x.size(1)
        # check length of sequence
        if seq_len > self.max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_len {self.max_len}")
        
        pos_encoding = self.pe[:seq_len, :]

        return x + pos_encoding.unsqueeze(0)


In [29]:
pos_encoder = SinPosEmbedding(d_model=512)

x = torch.randn(32, 100, 512)

encoded_x = pos_encoder(x)

print(encoded_x.shape)

torch.Size([32, 100, 512])


-- -
### 2.3 RoPE (Rotary Position Embedding)

It's now most the LLMs' embedding method.

Here we also present its math machanism and implement it in code.

<div class="alert alert-info">
    <h4>
    Note
    </h4>
    <p>
        Below is all prerequisite knowledge and math proof, if you are not interested in it, you can skip it. However in my advice, you can simply read it, cause I write it very very simple for you. No intermediate steps are omitted. So you can just read it. About this part of math mechanism, you can also read it here <a href="https://arxiv.org/pdf/2104.09864">Original Paper of RoPE</a>. The following proof can be seen as presengting some Omitted intermediate steps.
    </p>
</div>

<hr style="border: 0; height: 3px; background: linear-gradient(90deg, #ff0000, #00ff00); margin: 1em 0;">

0. Background:
    <br>

    RoPE is relative to a module named transformer. That we will introduce after this in [transformer](../transformer/transformer.ipynb). And here we simply talk about it. In transformer, it recieves the text, actually after the embedding i.e. word vector with the pos embedding. It will use three matrix: $ Q $, $ K $, $ V $ multiply every input text generating three new vector named: $ q $, $ k $, $ v $. Then compute the dot mutiplication of $ q $ and $ k $. Remeber that we have a the pos embedding on each word of the text. So the $ q $, $ k $ of the word also have it. Then we could begin derivation.

Before that, we should have some more math base. Below we will introduce.

1. Complex number:
    <br>

    Note: $ i = \sqrt{-1} $, $ i^{2} = -1 $.
    And the complex number's base format is: $ a + b i$, in which $ a $ and $ b $ all are real number. In which we call $ a $ the real part. $ b $ the imaginative part.
    We also have another notation of complex number. Do you know the god's formula? It like this: $ e^{i \pi} + 1 = 0 $. It comes from 
    $$ 
    e^{i \theta} = \cos \theta + i \sin \theta \tag{3} 
    $$ 
    which is called complex number's **complex exponential form** or called **Euler's formula**. If you let $ \theta $ equals to $ \pi $.
    $$
    e^{i \pi} = \cos \pi + i \sin \pi
    $$
    We know $ \cos \pi = -1 $, $ \sin \pi = 0 $, so we get $ e^{i \pi} + 1 = 0 $.
    Here, we take a example or deep into it. Like real number, we have Cartesian coordinate system. With it we can decide a point by its `x` axis coordinate and `y` axis coordinate. Also we can use this point to (0, 0) distance $ r $, and $ \theta $ the angle with the x-axis to decide its position. That's the same in complex number. But the Cartesian coordinate system of complex number we call it **Complex plane**. Its x axis is the same with Cartesian coordinate system. However the y axis's coordinate unit is $ i $. So the complex number $ 3 + 4 i $ in the complex plane like this:

    <center><img src=https://github.com/gzqccnu/img/blob/main/complex_plane.png?raw=true)></center>

    In which, $ r = \sqrt{3^{2} + 4^{2}} = 5 $. Generally, given a complex number $ a + b i $, its $ r $ equals to $ \sqrt{a^{2} + b^{2}} $. 

2. Conjugate complex number:
    <br>

    $ a + b i $'s conjugate complex number equals to $ a - b i $. That's to say: one complex number and its conjugate complex number they have the same real part but the opposite imaginary part. If we note $ x = a + b i $, then we have $ x^{*} = a - b i $.  We expand to **complex exponential form**. If we have a complex number $ a + bi $, 
    we can convert it to(assume $ \theta $ stands for its angle in complex plane ): $ \cos \theta + i \sin \theta $. And its conjugate complex number can expressed in: $ \cos \theta - i \sin \theta $.
3. Dot of complex numbers:
    <br>

    Consider we have two complex numbers: $ x = a + b i $ and $ y = c + d i $. Then we want to compute the dot mutiplication of $ x $ and $ y $.
    We have the principle:
    $$
    x \cdot y = \langle x, y* \rangle = (a + b i) \cdot (c - d i) = ac + b(-d)i^{2} = ac + bd \tag{4}
    $$

Note we begin with a $ \textbf{2D} $ case.
Consider we already have the $ \boldsymbol{x}_{q} $ and $ \boldsymbol{x}_{k} $. Their positions are respectly $ m $ and $ n $. Now we must to find a way or function:
$$
\boldsymbol{f} (\boldsymbol{vec}, pos)
$$
So we have 
$$
\boldsymbol{q}_{m} = \boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) \tag{5}
$$

$$
\boldsymbol{k}_{n} = \boldsymbol{f}_{k}(\boldsymbol{x}_{k}. n) \tag{6}
$$

for vector $ \boldsymbol{q}_{m} \cdot \boldsymbol{k}_{n} $ only depend on relative position $ m-n $. And now consider this case: we have vectors $ \boldsymbol{A} $ and vector $ \boldsymbol{B} $ are parallel to x-axis. Then we rotate $ \boldsymbol{A} $ by $ +30 $ degrees with $ \boldsymbol{B} $ by $ +40 $ degrees. And the angle between them is 10 degrees. Expressed in math is:
$$
\cos \langle \boldsymbol{A}, \boldsymbol{B} \rangle = \frac{\boldsymbol{A} \cdot \boldsymbol{B}}{||A|| \ ||B||}
$$
Analogy to position encoding, we see position $ m $ as a rotation angle $ m \theta $. So, the dot multiplication of $ \boldsymbol{q}_{m} $ and $ \boldsymbol{k}_{n} $ only depends on $ (m - n) \theta $ i.e. the relative position $ m - n $.

That's it. Now we formally entry proof.

Note the function only depends on relative position, that's to say, it has the property:
$$
\boldsymbol{q}_{m}^{\boldsymbol{T}} \boldsymbol{k}_{n} = \boldsymbol{f}(\boldsymbol{x}_{q}, m) \cdot \boldsymbol{f}(\boldsymbol{x}_{k}, n) = \langle \boldsymbol{f}(\boldsymbol{x}_{q}, m), \boldsymbol{f}(\boldsymbol{x}_{k}, n) \rangle = g(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, m - n) \tag{7}
$$
$ g $ is for better reading experience.

To solve the identity equation, we must to have some initial conditions:
$$
\boldsymbol{f}_{q} (\boldsymbol{x}_{q}, 0) = \boldsymbol{q} \tag{8}
$$
$$
\boldsymbol{f}_{k} (\boldsymbol{x}_{k}, 0) = \boldsymbol{k} \tag{9}
$$

Based on the **complex exponential form** and take advantage of the geometric meaning of vector in $ \textbf{2D} $ and its complex counter part,
decompose functions in Equations (5) and (6) into
$$
\boldsymbol{f}_{q} (\boldsymbol{q}, m) = R_{q}(\boldsymbol{x}_{q}, m) e^{i \Theta_{q}(\boldsymbol{x}_{q}, m)} \tag{9}
$$

$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = R_{k}(\boldsymbol{x}_{k}, n) e^{i \Theta_{k}(\boldsymbol{x}_{k}, n)} \tag{10}
$$
(6) into
$$
g(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m) = R_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m) e^{i \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m)}
$$

where $ R_{f} $ , $ R_{g} $ and $ \Theta_{f} $ , $ \Theta_{g} $ are the radical and angular components for $ \boldsymbol{f}_{\{q, k\}} $ and $ g $, respectively. Plug them into Equation (7), we get the relation:
$$
R_{q}(\boldsymbol{x}_{q}, m)R_{k}(\boldsymbol{x}_{k}, n) = R_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m) \tag{11}
$$
$$
\Theta_{k}(\boldsymbol{x}_{k}, n) - \Theta_{q}(\boldsymbol{x}_{q}, m) = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m) \tag{12}
$$

<hr style="border: 3px dashed #ccc; width: 100%;">

Here, I want to explain why in 
$$
\Theta_{k}(\boldsymbol{x}_{k}, n) - \Theta_{q}(\boldsymbol{x}_{q}, m) 
$$
genrate `negative sign`. Do you remember Equation (4). We say two complex number's dot product is equal to the product of the conjugate of one complex number and another complex number. Using **complex exponential form**, we have
$$
x = a + bi = \sqrt{a^{2} + b^{2}} (\cos \theta + i \sin \theta) = \sqrt{a^{2} + b^{2}} e^{i \theta}
$$
$$
y = c + di = \sqrt{c^{2} + d^{2}} (\cos \theta + i \sin \theta) = \sqrt{c^{2} + d^{2}} e^{i \theta}
$$
So we have
$$
y^{*} = c - di = \sqrt{c^{2} + d^{2}} (\cos \theta - i \sin \theta) \tag{13}
$$
And how its complex exponential form like? Do you remember formula (3)?
We could Solving by substitution method. 
We know for a complex number: $ a + b i$, its complex exponential form is: 
$$ 
r e^{i \theta} = r ( \cos \theta + i \sin \theta ) 
$$ 
It's conjugate form like:  
$$ 
r ( \cos \theta - i \sin \theta ) 
$$
If we use $ i(-\theta) $ replace $ \theta $ in that form, we get
$$
e^{i (-\theta)} = \cos (-\theta) + i \sin (-\theta) \tag{14}
$$
We know
$$
\cos (-\theta) = \cos \theta
$$
$$
\sin (-\theta) = - \sin \theta
$$
So equation (12) like
$$
e^{i (-\theta)} = \cos \theta - i \sin \theta
$$
So equation (13) equals to
$$
y^{*} = c - di = \sqrt{c^{2} + d^{2}} (\cos \theta - i \sin \theta) = \sqrt{c^{2} + d^{2}} e^{ - i \theta}
$$
According to the law of exponentiation, we obtain equation (11) and (12)
<hr style="border: 3px dashed #ccc; width: 100%;">
with the corresponding initial condition as:

$$
\boldsymbol{q} = ||\boldsymbol{q}|| e^{i \theta_{q}} = R_{q}(\boldsymbol{x}_{q}, 0) e^{i \Theta_{q}(\boldsymbol{x}_{q}, 0)}
$$
$$
\boldsymbol{k} = ||\boldsymbol{k}|| e^{i \theta_{k}} = R_{k}(\boldsymbol{x}_{k}, 0) e^{i \Theta_{k}(\boldsymbol{x}_{k}, 0)}
$$
where $ ||\boldsymbol{q}|| $, $ ||\boldsymbol{k}|| $ and $ \theta_{q} $, $ \theta_{k} $ are the radial and angular part of $ \boldsymbol{q} $ and $ \boldsymbol{k} $ on the $ \textbf{2D} $ plane.
Next, we set $ m = n $ in Equation (11) (12) and take into account initial conditions in Equation (8) (9):
$$
R_{q}(\boldsymbol{x}_{q}, m)R_{k}(\boldsymbol{x}_{k}, m) = R_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 0) = R_{q}(\boldsymbol{x}_{q}, 0)R_{k}(\boldsymbol{x}_{k}, 0) = ||\boldsymbol{q}|| \ ||\boldsymbol{k}|| \tag{15}
$$
$$
\Theta_{k}(\boldsymbol{x}_{k}, m) - \Theta_{q}(\boldsymbol{x}_{q}, m) = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 0) = \Theta_{k}(\boldsymbol{x}_{k}, 0) - \Theta_{q}(\boldsymbol{x}_{q}, 0) = \theta_{k} - \theta_{q} \tag{16}
$$
for we have
$$
R_{g}(\boldsymbol{x}_{q}, m) = R_{q}(\boldsymbol{x}_{q}, 0) = ||\boldsymbol{q}||
$$
$$
R_{k}(\boldsymbol{x}_{k}, n) = R_{k}(\boldsymbol{x}_{k}, 0) = ||\boldsymbol{k}||
$$
$$
R_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, n - m) = R_{g} (\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 0) = ||\boldsymbol{q}|| \ ||\boldsymbol{k}||
$$
interprets that the radial functions $ R_{q} $ ,$ R_{k} $ and $ R_{g} $ are independent from the position information.

For Equation (16), we do a transposition then get:
$$
\Theta_{q}(\boldsymbol{x}_{}, m) - \theta_{q} = \Theta_{k}(\boldsymbol{x}_{k}, n) - \theta_{k}
$$
indicates that the angular functions does not dependent on $ \boldsymbol{x} $
so the term 
$$
\Theta_{f}(\boldsymbol{x}_{\{q, k\}}, m ) - \theta_{\{q, k\}}
$$
is a function of $ m $ and we denote it as $ \phi(m) $, yielding:
$$
\Theta_{f}(\boldsymbol{x}_{\{q, k\}}, m ) = \phi(m) + \theta_{\{q, k\}} \tag{17}
$$
Further, by plugging $ n = m + 1$ to Equation (11) (12) we get
$$
\Theta_{k}(\boldsymbol{x}_{k}, m + 1) - \Theta_{q}(\boldsymbol{x}_{q}, m) = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 1)
$$
and consider the above Equation (17) we have
$$
(\phi(m + 1) + \theta_{k}) - (\phi(m) + \theta_{q}) = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 1)
$$
so we get
$$
\phi(m + 1) - \phi(m) = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 1) + \theta_{q} - \theta_{k}
$$
Since RHS is a constant irrelevanttom $ \phi(m) $ with continuous integer inputs produce an arithmetic progression, we set 
$$
\theta = \Theta_{g}(\boldsymbol{x}_{q}, \boldsymbol{x}_{k}, 1) + \theta_{q} - \theta_{k}
$$
then set $ m $ to $(1, 2, 3..) $, we get:
$$
\phi(1) - \phi(0) = \theta
$$
$$
\phi(2) - \phi(1) = \theta
$$
$$
\phi(3) - \phi(2) = \theta
$$
$$
\vdots
$$
$$
\phi(m) - \phi(m - 1) = \theta
$$
So we get:
$$
\phi(m) - \phi(0) = m \theta
$$
we set
$$
\phi(0) = \gamma
$$
we plus all above differential terms regarding $ \phi $
$$
\phi(m) - \phi(m - 1) = \theta \newline
+ \newline
\phi(m - 1) - \phi(m - 2) = \theta \newline
+ \newline
\vdots \newline
+ \newline
\phi(3) - \phi(2) = \theta \newline
+ \newline
\phi(2) - \phi(1) = \theta \newline
+ \newline
\phi(1) - \phi(0) = \theta
$$ 
finally we get
$$
\phi(m) = m \theta + \gamma \tag{18}
$$
where $ \theta $, $ \gamma \in \mathbb{R}$ are constants and $ \theta $ is non-zero.

Equation (18) reveal:
$$
\phi(m) = \Theta_{f}(\boldsymbol{x}_{\{q, k\}}, m ) - \theta_{\{q, k\}} = m \theta + \gamma
$$
so
$$
\Theta_{f}(\boldsymbol{x}_{\{q, k\}}, m ) = \theta_{\{q, k\}} + m \theta + \gamma
$$
so we have equation (9) and (10)
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = R_{q}(\boldsymbol{x}_{q}, m) e^{i \Theta_{q}(\boldsymbol{x}_{q}, m)} = R_{q}(\boldsymbol{x}_{q}, m) e^{i (\theta_{q} + m \theta + \gamma)} = ||\boldsymbol{q}|| e^{i (\theta_{q} + m \theta + \gamma)} \tag{19}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = R_{k}(\boldsymbol{x}_{k}, n) e^{i \Theta_{k}(\boldsymbol{x}_{k}, n)} = R_{k}(\boldsymbol{x}_{k}, n) e^{i (\theta_{k} + n \theta + \gamma)} = ||\boldsymbol{k}|| ^{i (\theta_{k} + n \theta + \gamma)} \tag{20}
$$
in view of
$$
||\boldsymbol{q}|| e^{i \theta_{q}} = \boldsymbol{q}
$$
$$
||\boldsymbol{k}||e^{i \theta_{k}} = \boldsymbol{k}
$$
we plug them into Equation (19) and (20), we get:
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = \boldsymbol{q} e^{i(m \theta + \gamma)}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = \boldsymbol{k} e^{i (n \theta + \gamma)}
$$
we set $ \gamma = 0 $,
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = \boldsymbol{q} e^{i m \theta} \tag{21}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = \boldsymbol{k} e^{i n \theta} \tag{22}
$$
we define 
$$
\boldsymbol{q} = \boldsymbol{f}(\boldsymbol{x}_{m}, 0) = \boldsymbol{W}_{q} \boldsymbol{x}_{m}
$$
$$
\boldsymbol{k} = \boldsymbol{f}(\boldsymbol{x}_{n}, 0) =\boldsymbol{W}_{k} \boldsymbol{x}_{n}
$$
where $ W_{q} $, $ W_{k} $ are all 2D matrix. Plug them into Equation (21) (22) we get:
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = (\boldsymbol{W}_{q} \boldsymbol{x}_{m}) e^{i m \theta} \tag{23}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = (\boldsymbol{W}_{k} \boldsymbol{x}_{n}) e^{i n \theta} \tag{24}
$$
So finally we get the position embedding function in $ \textbf{2D} $
<hr style="border: 3px dashed #ccc; width: 100%;">

Here we will introduce you a new theory a complex number can be expressed in **matrix** like:
$$
a + bi = \sqrt{a^{2} + b^{2}} e^{i \theta} = \sqrt{a^{2} + b^{2}} \begin{pmatrix}
             \cos \theta & - \sin \theta \newline
             \sin \theta & \cos \theta \newline
             \end{pmatrix} = \begin{pmatrix}
                             a & -b \newline
                             b & a \newline
                             \end{pmatrix} \tag{25}
$$
where
$$
\theta = \arctan (\frac{b}{a})
$$
Here we do not present strict mathematical derivation proofs, rather conduct validating proofs from algebraic operations perspective.

Assume we have two complex number:
$$
z_{1} = a + bi \tag{26}
$$
$$
z_{2} = c + di \tag{27}
$$
according to Equation (25) we get the equivalent form of Equation (26)
$$
z_{1} = \begin{pmatrix}
        a & - b \newline
        b & a \newline
        \end{pmatrix} \tag{28}
$$
and Equation (27)
$$
z_{2} = \begin{pmatrix}
        c & -d \newline
        d & c \newline
        \end{pmatrix} \tag{29}
$$
their product(not dot product) in Equation (26) (27) form, we get
$$
z_{1} z_{2} = (a + bi)(c + di) = ac + ad i + bc i + bd(i^{2}) = (ac - bd) + (ad + bc)i \tag{30}
$$
in Equation (28) (29) form, we get
$$
z_{1} z_{2} = 
        \begin{pmatrix}
        a & - b \newline
        b & a \newline
        \end{pmatrix} 
        \begin{pmatrix}
        c & -d \newline
        d & c \newline
        \end{pmatrix} = 
        \begin{pmatrix}
        ac - bd & - (ad + bc) \newline
        ad + bc & ac - bd \newline
        \end{pmatrix}
$$
and return it to complex number form, it equals to Equation (30)
<hr style="border: 3px dashed #ccc; width: 100%;">

According to Equation (25), we have new version of Equation (23) (24)
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = (\boldsymbol{W}_{q} \boldsymbol{x}_{m}) \begin{pmatrix}
                                                                                    \cos m \theta & - \sin m \theta \newline
                                                                                    \sin m \theta & \cos m \theta
                                                                                    \end{pmatrix}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = (\boldsymbol{W}_{k} \boldsymbol{x}_{n}) \begin{pmatrix}
                                                                                    \cos n \theta & - \sin n \theta \newline
                                                                                    \sin n \theta & \cos n \theta
                                                                                    \end{pmatrix}
$$
We can move $ e^{\{m, n\} \theta} $ forward, and the rotation matrix can also be moved forward.
$$
\boldsymbol{f}_{q}(\boldsymbol{x}_{q}, m) = \begin{pmatrix}
                                            \cos m \theta & - \sin m \theta \newline
                                            \sin m \theta & \cos m \theta
                                            \end{pmatrix} (\boldsymbol{W}_{q} \boldsymbol{x}_{m}) \tag{31}
$$
$$
\boldsymbol{f}_{k}(\boldsymbol{x}_{k}, n) = \begin{pmatrix}
                                            \cos n \theta & - \sin n \theta \newline
                                            \sin n \theta & \cos n \theta
                                            \end{pmatrix} (\boldsymbol{W}_{k} \boldsymbol{x}_{n}) \tag{32}
$$
cause all the above are in 2D dimension. So we can write Equation (31) (32) in this unified format
$$
\boldsymbol{f}_{\{q, k\}}(\boldsymbol{x}_{m}, m) = \begin{pmatrix}
                                                          \cos m \theta & - \sin m \theta \newline
                                                          \sin m \theta  & \cos m \theta
                                                          \end{pmatrix}
                                                          \begin{pmatrix}
                                                          W_{\{q, k\}}^{11} & W_{\{q, k\}}^ {12} \newline
                                                          W_{\{q, k\}}^{21} & W_{\{q, k\}}^{22}
                                                          \end{pmatrix}
                                                          \begin{pmatrix}
                                                          x_{m}^{(1)} \newline
                                                          x_{m}^{2}
                                                          \end{pmatrix} \tag{33}
$$
where $ (x_{m}^{(1)}, x_{m}^{(2)}) $ is $ x_{m} $ expressed in the 2D coordinate.
Then we promoted Equation (33) to a general form
$$
f_{\{q,k\}}(\boldsymbol{x}_m, m) = R_{\Theta, m}^d \, W_{\{q,k\}} \, \boldsymbol{x}_m
$$

where

$$
R_{\Theta, m}^d = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \newline
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \newline
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \newline
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \newline
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \newline
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \newline
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix} \tag{34}
$$
$$
x = \begin{bmatrix}
    x_{1} \newline
    x_{2} \newline
    x_{3} \newline
    \vdots \newline
    x_{d} \newline
    \end{bmatrix} \tag{35}
$$
and we split matrix (34) to block matrix format
$$
R_{\Theta, m}^d = \begin{pmatrix}
\begin{pmatrix}
\cos m\theta_1 & - \sin m\theta_1 \newline
\sin m\theta_1 & \cos m\theta_1 \newline
\end{pmatrix}
& \begin{pmatrix} 0 & 0 \newline 0 & 0 \newline \end{pmatrix}
& \cdots
& \begin{pmatrix} 0 & 0 \newline 0 & 0 \newline \end{pmatrix} \newline[6pt]

\begin{pmatrix} 0 & 0 \newline 0 & 0  \newline \end{pmatrix}
& \begin{pmatrix}
\cos m\theta_2 & -\sin m\theta_2 \newline
\sin m\theta_2 & \cos m\theta_2 \newline
\end{pmatrix}
& \cdots
& \begin{pmatrix} 0 & 0 \newline 0 & 0 \end{pmatrix} \newline[6pt]

\vdots & \vdots & \ddots & \vdots \newline[6pt]

\begin{pmatrix} 0 & 0 \newline 0 & 0  \newline \end{pmatrix}
& \begin{pmatrix} 0 & 0 \newline 0 & 0  \newline \end{pmatrix}
& \cdots
& \begin{pmatrix}
\cos m\theta_{d/2} & -\sin m\theta_{d/2} \newline
\sin m\theta_{d/2} & \cos m\theta_{d/2} \newline
\end{pmatrix}
\end{pmatrix} \tag{36}
$$
with vector (35)
$$
\boldsymbol{x} = \begin{bmatrix}
    \begin{bmatrix}
    x_{1} \newline
    x'_{1} \newline 
    \end{bmatrix} \newline
    \begin{bmatrix}
    x_{2} \newline
    x'_{2} \newline 
    \end{bmatrix} \newline
    \vdots \newline
    \begin{bmatrix}
    x_{d / 2} \newline
    x'_{d / 2} \newline 
    \end{bmatrix}
    \end{bmatrix} \tag{37}
$$
we take a submatrix from matrix (36) except all zero matrix
$$
\boldsymbol{M}_{i} = \begin{pmatrix}
                     \cos m \theta_{i} & - \sin m \theta_{i} \newline
                     \sin m \theta_{i} & \sin m \theta_{i} \newline
                     \end{pmatrix} \ \text{or}
                     \begin{pmatrix}
                     0 & 0 \newline
                     0 & 0 \newline
                     \end{pmatrix}
$$
we see $ \boldsymbol{M}_{i} $ as a basic element of matrix (36)

we take a subvector from vector (37)
$$
\boldsymbol{x}_{i} = \begin{bmatrix}
                     x_{i} \newline
                     x'_{i} \newline
                     \end{bmatrix}
$$
denote it to $ \boldsymbol{x}_{i} $

we see $ \boldsymbol{x}_{i} $ as a basic element of vector $ \boldsymbol{x} $.

<a id="matrix-vector"></a>

<span style="color: red;">
    When matrix (36) multiply vector (37), the computing algorithm of them is: 
    Multiply each element in each row of the matrix by each element in x one by one. 
    That's to say we see a row of the matrix as a vector, and what we do is let the row vector 
    of the matrix multiply the vector x and the result is still the row vector of the matrix. 
    we can see it as below
</span>

$$
\boldsymbol{row\_vec}_{i} = \begin{bmatrix}
                 \boldsymbol{M}_{i, 1} & \boldsymbol{M}_{i, 2} & \cdots & \boldsymbol{M}_{i, d / 2}
                 \end{bmatrix}
$$
where $ i $ is row index of matrix, $ i \in \{1, 2, \cdots, d / 2\}$. 
$$
\boldsymbol{x} = \begin{bmatrix}
                 \boldsymbol{x}_{1} & \boldsymbol{x}_{2} & \cdots & \boldsymbol{x}_{d / 2}
                 \end{bmatrix}
$$
so the new row vector is
$$
\boldsymbol{new\_row\_vec}_{i} = \begin{bmatrix}
                           \boldsymbol{M}_{i, 1} * \boldsymbol{x}_{1} & \boldsymbol{M}_{i, 2} * \boldsymbol{x}_{2} & \cdots & \boldsymbol{M}_{i, d / 2} \boldsymbol{x}_{d / 2}
                           \end{bmatrix}
$$
so when matrix (36) multiply vector (37), the final result like
$$
R_{\Theta, m}^d \boldsymbol{x} = \begin{pmatrix}

                                    \begin{pmatrix}
                                    \cos m \theta_{1} & - \sin m \theta_{1} \newline
                                    \sin m \theta_{1} & \cos m \theta_{1} \newline
                                    \end{pmatrix}
                                    \begin{bmatrix}
                                    x_{1} \newline
                                    x'_{1} \newline
                                    \end{bmatrix} & \begin{pmatrix}
                                                    0 \newline
                                                    0 \newline
                                                    \end{pmatrix} & \cdots & \begin{pmatrix}
                                                                             0 \newline
                                                                             0 \newline
                                                                             \end{pmatrix} 

                                    \newline

                                    \begin{pmatrix}
                                    0 \newline
                                    0 \newline
                                    \end{pmatrix} & \begin{pmatrix}
                                                    \cos m \theta_{2} & - \sin m \theta_{2} \newline
                                                    \sin m \theta_{2} & \cos m \theta_{2} \newline
                                                    \end{pmatrix}
                                                    \begin{bmatrix}
                                                    x_{2} \newline
                                                    x'_{2} \newline
                                                    \end{bmatrix} & \cdots & \begin{pmatrix}
                                                                            0 \newline
                                                                            0 \newline
                                                                             \end{pmatrix}

                                    \newline

                                    \vdots & \vdots & \ddots & \vdots

                                    \newline

                                    \begin{pmatrix}
                                    0 \newline
                                    0 \newline
                                    \end{pmatrix} & \begin{pmatrix}
                                                    0 \newline
                                                    0 \newline
                                                    \end{pmatrix} & \cdots & \begin{pmatrix}
                                                                             \cos m \theta_{d / 2} & - \sin m \theta_{d / 2} \newline
                                                                             \sin m \theta_{d / 2} & \cos m \theta_{d / 2} \newline
                                                                             \end{pmatrix}
                                                                             \begin{bmatrix}
                                                                             x_{d / 2} \newline
                                                                             x'_{d / 2} \newline
                                                                             \end{bmatrix}

                                 \end{pmatrix} \tag{38}
$$
and we can see many zero-submatrix in it, and when computing we actually don't need to compute them, we just to compute the **diagonal part**, that will speed computation up. And we can also not storage the hole matrix with lots of zero-submatrix. That will free a lot of memory. 

<hr style="border: 0; height: 3px; background: linear-gradient(90deg, #ff0000, #00ff00); margin: 1em 0;">

Above is the ending of math proof. Here we give a more  computional efficient realization of multiplication of $ R_{\Theta}^{d} $ and $ \boldsymbol{x} \in \mathbb{R}^{d} $. From Equation (38) we know We just need to compute
$$
e = \boldsymbol{M}_{i} \ \boldsymbol{x}_{i} = \begin{pmatrix}
                                            \cos m \theta_{i} & - \sin m \theta_{i} \newline
                                            \sin m \theta_{i} & \cos m \theta_{i}
                                            \end{pmatrix}
                                            \begin{bmatrix}
                                            x_{i} \newline
                                            x'_{i}
                                            \end{bmatrix} \tag{39}
$$
on the **diagonal part**.

Notice we split vector $ \boldsymbol{x} $ to two elements in a group like Equation (37). So when implementing in code, we should do first. Then we expand the computation formula (39). We said above, when multiplying a matrix with a vector, we see the row of the matrix as a vector. Then we multiply the row vector with given vector.
$$
\boldsymbol{fir\_row} = \begin{bmatrix}
                 \cos m \theta_{i} & - \sin m \theta_{i}
                 \end{bmatrix} \newline
\boldsymbol{sec\_row} = \begin{bmatrix}
                 \sin m \theta_{i} & \cos m \theta_{i}
                 \end{bmatrix}
$$
then compute as
$$
\boldsymbol{fir\_row} \ \boldsymbol{x}_{i} = \cos m \theta_{i} \ x_{i} - \sin m \theta_{i} \ x'_{i} = \begin{pmatrix}
                                                                                                       x_{i} \newline
                                                                                                       - x'_{i}
                                                                                                       \end{pmatrix}
                                                                                                       \begin{pmatrix}
                                                                                                       \cos m \theta_{i} \newline
                                                                                                       \sin m \theta_{i}
                                                                                                       \end{pmatrix} \tag{40}
$$
$$
\boldsymbol{sec\_row} \ \boldsymbol{x}_{i} = \sin m \theta_{i} \ x_{i} + \cos m \theta_{i} \ x'_{i} = \begin{pmatrix}
                                                                                                       x_{i} \newline
                                                                                                       x'_{i}
                                                                                                       \end{pmatrix}
                                                                                                       \begin{pmatrix}
                                                                                                       \sin m \theta_{i} \newline
                                                                                                       \cos m \theta_{i}
                                                                                                       \end{pmatrix} \tag{41}
$$
for the entire one, we can do like this:
$$
R_{\Theta, m}^{d} \ \boldsymbol{x} = \begin{pmatrix}
                    x_{1} \newline
                    x_{1} \newline
                    x_{3} \newline
                    x_{4} \newline
                    \vdots \newline
                    x_{d - 1} \newline
                    x_{d} \newline
                    \end{pmatrix}
                    \otimes
                    \begin{pmatrix}
                    \cos m \theta_{1} \newline
                    \cos m \theta_{1} \newline
                    \cos m \theta_{2} \newline
                    \cos m \theta_{2} \newline
                    \vdots \newline
                    \cos m \theta_{d / 2} \newline
                    \cos m \theta_{d / 2} \newline
                    \end{pmatrix}
                    +
                    \begin{pmatrix}
                    - x_{2} \newline
                    x_{1} \newline
                    - x_{4} \newline
                    x_{3} \newline
                    \vdots \newline
                    - x_{d} \newline
                    x_{d - 1} \newline
                    \end{pmatrix}
                    \otimes
                    \begin{pmatrix}
                    \sin m \theta_{1} \newline
                    \sin m \theta_{1} \newline
                    \sin m \theta_{2} \newline
                    \sin m \theta_{2} \newline
                    \vdots \newline
                    \sin m \theta_{d / 2} \newline
                    \sin m \theta_{d / 2} \newline
                    \end{pmatrix}
$$

<hr style="border: 0; height: 3px; background: linear-gradient(90deg, #ff0000, #00ff00); margin: 1em 0;">
Hoo, the math part ends.

When implementing, we first give a function that wil go through how **RoPE** computing. Then we will encapsulate it into a class.

And here we should note, how $ \theta $ generate. Generally like this
$$
\theta = \frac{1}{base^{\frac{2i}{dim}}}
$$
where $ i $ is the dimension index. Always we set $ base $ to $ 10000 $, so
$$
\theta = \frac{1}{10000^{\frac{2i}{dim}}}
$$

In [30]:
def RoPE_func(x, base=10000):
    """

    Args:
        x: input tensor, shape [batch_size, seq_len, num_heads, head_dim]
           Note: Now you may not know what `num_heads` and `head_dim` is,
                 it's not important now, you just need what the shape like
        base: frequence base, default to 10000
    
    return:
        the tensor applied RoPE, shape is the same with input

    """
    # get shape info
    batch_size, seq_len, num_heads, head_dim = x.shape

    # generate position inidex (0 - seq_len - 1)
    # shape: [seq_len]
    position = torch.arange(seq_len, dtype=torch.float32, device=x.device)

    # generate dimension index (0 - head_dim / 2 - 1), you can analogy it to
    # the R_{\Theta, m}^{d}'s dimension, also the `d` in it.
    dim = torch.arange(head_dim // 2, dtype=torch.float32, device=x.device)

    # compute frequence parameter (theta_i)
    # theta_i = 1 / (base^{2i / head_dim})
    # shape: [head_dim // 2]
    freq = 1.0 / (base ** (2 * dim / head_dim))

    # compute each position and dimension's rotary angle
    # m * theta_i where m is position index
    # shape: [seq_len, head_dim // 2]
    angles = position[:, None] * freq[None, :]

    # compute cos and sin
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    # change shape for broadcasting, for x.shape = [batch_size, seq_len, num_heads, head_dim]
    # [seq_len, head_dim // 2] ==> [1, seq_len, 1, head_dim // 2]
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]
    # we can also use `unsqueeze` function
    # first convertion: [seq_len, head_dim // 2] ==> [1, seq_len, head_dim // 2]
    # >>> cos = cos.unsqueeze(0)
    # second convertion: [1, seq_len, head_dim // 2] ==> [1, seq_len, 1, head_dim // 2]
    # >>> cos = cos.unsqueeze(2) 
    # so in all, we can do like this
    # >>> cos = cos.unsqueeze(0).unsqueeze(2)
    # so do with sin.
    # another realization
    # cos = cos.unsqueeze(0).unsqueeze(2)
    # sin = sin.unsqueeze(0).unsqueeze(2)

    # split input tensor x to two subsection
    # x0: even index section
    # x1: odd index section
    x0 = x[..., 0::2] # same with x = x[:, :, :, 0::2]
    x1 = x[..., 1::2]

    # apply rotary formula
    # x0_rot = x0 * cos - x1 * sin | reference to Equation (40)
    # x1_rot = x0 * sin + x2 * cos | reference to Equation (41)
    x0_rot = x0 * cos - x1 * sin
    x1_rot = x0 * sin + x1 * cos

    # interleaved combination of the two parts after rotation
    x_rot = torch.stack([x0_rot, x1_rot], dim=-1)
    x_rot = x_rot.reshape(x.shape)

    return x_rot

<div class="alert alert-warning">
    <h4>WARNING</h4>
    <p>
        <b>head_dim</b> must to be even, or unable to generate integer pairs of dimensions
    </p>
</div>

In [31]:
batch_size = 2
seq_len = 10
num_heads = 4
head_dim = 64

x = torch.randn(batch_size, seq_len, num_heads, head_dim)

x_rot = RoPE_func(x)

print("-" * 50)
print("Input shape:", x.shape)
print("Input:\n", x)
print("-" * 50)
print("Output shape:", x_rot.shape)
print("Output:\n", x_rot)

--------------------------------------------------
Input shape: torch.Size([2, 10, 4, 64])
Input:
 tensor([[[[-0.4219,  1.0965, -0.1230,  ..., -1.1313,  1.1512,  0.0749],
          [-0.0478,  0.1124, -0.8396,  ..., -0.5118,  0.2506, -1.2593],
          [ 0.5098, -0.2037, -0.3307,  ...,  0.6599,  1.4446, -0.3898],
          [-0.2465, -1.2241, -1.7328,  ..., -0.2628,  0.0987,  0.4318]],

         [[ 0.8173, -0.5064, -0.5369,  ..., -0.9785, -1.3159,  1.2071],
          [-1.2622,  1.2874,  1.3678,  ..., -0.2201,  1.0068,  0.3634],
          [-1.1442, -0.8554,  0.5581,  ..., -1.5964, -0.6175,  0.8353],
          [-0.7543, -0.1788, -1.1366,  ...,  0.6146,  1.3140, -0.1259]],

         [[-0.7148,  1.2496, -0.4481,  ..., -0.5338,  0.0083, -0.2982],
          [ 0.5944, -0.6552, -0.7020,  ..., -1.2831,  1.6571, -0.2123],
          [ 0.2385,  0.1594, -0.3189,  ...,  0.3165,  2.2159,  0.4421],
          [ 0.0330, -0.6653,  1.0141,  ..., -0.0921, -0.7405,  1.4728]],

         ...,

         [[-0.40

We encapsulate `rope` function to a class

In [32]:
class RoPE(nn.Module):

    def __init__(self, head_dim, base=10000):
        """

        Args:
            head_dim: head dimension must be even
            base: frequency base

        """
        super().__init__()

        # ensure head_dim is even
        assert head_dim % 2 == 0, "Head dimension must be even"
        
        self.head_dim = head_dim
        self.base = base

        # precompute theta_i
        # shape: [head_dim // 2]
        dim = torch.arange(head_dim // 2, dtype=torch.float32)
        self.freq = 1.0 / (base ** (2 * dim / head_dim))

    def forward(self, x):
        """
        Args:
            x: input tensor, shape [bacth_size, seq_len, num_heads, head_dim]
        """
        # get input tensor info
        batch_size, seq_len, num_heads, head_dim = x.shape

        # generate position index (0, seq_len - 1)
        position = torch.arange(seq_len, dtype=torch.float32, device=x.device)

        # compute each position and dim's rotary angle
        # m * theta_i, where m is position index
        # shape: [seq_len, head_dim // 2]
        angles = position[:, None] * self.freq[None, :].to(x.device)

        # compute cos and sin
        cos = torch.cos(angles)
        sin = torch.sin(angles)

        # reshape cos and sin for broadcasting
        # below is another implemention
        # >>> cos = cos.unsqueeze(0).unsqueeze(2)
        # >>> sin = sin.unsqueeze(0).unsqueeze(2)
        cos = cos[None, :, None, :]
        sin = sin[None, :, None, :]

        # split input to even index and odd index
        x0 = x[..., 0::2]
        x1 = x[..., 1::2]

        # apply rotary formula
        # x0_rot = x0 * cos - x1 * sin
        # x1_rot = x0 * sin + x1 * cos
        x0_rot = x0 * cos - x1 * sin
        x1_rot = x0 * cos + x1 * sin

        # interleaved combination of the two parts after rotation
        x_rot = torch.stack([x0_rot, x1_rot], dim=-1) # on dimension head_dim
        x_rot = x_rot.reshape(x.shape)

        return x_rot


In [33]:
batch_size = 2
seq_len = 10
num_heads = 4
head_dim = 64

rope = RoPE(head_dim=head_dim, base=10000)

x = torch.randn(batch_size, seq_len, num_heads, head_dim)

# apply RoPE
x_rot = rope.forward(x)

print("-" * 50)
print("Input shape:", x.shape)
print("Input:\n", x)
print("-" * 50)
print("Output shape:", x_rot.shape)
print("Output:\n", x_rot)

--------------------------------------------------
Input shape: torch.Size([2, 10, 4, 64])
Input:
 tensor([[[[-1.8583,  0.2212, -1.1290,  ..., -0.8682,  1.5962, -0.1677],
          [-0.6703,  0.9843, -0.9771,  ...,  0.5599, -0.3622, -1.0328],
          [-0.4438, -0.1608,  0.1577,  ..., -0.3047,  2.2329,  0.4616],
          [-0.9899,  0.0988,  0.4164,  ..., -2.1633, -0.5897,  2.0057]],

         [[-0.8417,  2.0367,  0.1312,  ...,  0.8909, -1.1462, -0.5552],
          [ 0.0460,  0.9578,  0.3740,  ..., -0.9421,  0.4441, -0.3186],
          [-1.9181,  1.4679, -1.3905,  ..., -0.5547, -0.0833, -0.9468],
          [-0.8776, -0.3287,  0.0851,  ..., -0.5063,  0.0654, -1.4231]],

         [[ 1.8964, -0.7096,  0.2442,  ..., -1.0329, -0.2469, -0.9901],
          [ 0.8872, -0.1786, -0.9266,  ..., -0.1241,  0.5860, -1.4444],
          [-1.9193,  0.6593,  2.2972,  ..., -2.0557, -0.4427, -0.5476],
          [-0.2967, -0.6527, -0.3180,  ..., -0.3312,  1.0724,  1.0232]],

         ...,

         [[ 0.15

> [!ATTENTION]
>
> `x_rot` is just position embedding, so you need to plus the original input `x`

In [34]:
embed_input = x + x_rot

print("embed_input shape:", embed_input.shape)
print("embed_input:\n", embed_input)

embed_input shape: torch.Size([2, 10, 4, 64])
embed_input:
 tensor([[[[-3.7166, -1.6371, -2.2580,  ..., -0.4860,  3.1925,  1.4285],
          [-1.3406,  0.3140, -1.9542,  ..., -0.2787, -0.7244, -1.3950],
          [-0.8875, -0.6046,  0.3155,  ...,  1.6502,  4.4657,  2.6944],
          [-1.9798, -0.8911,  0.8328,  ..., -2.9449, -1.1795,  1.4160]],

         [[-3.0102,  3.2958, -0.7963,  ...,  0.6125, -2.2923, -1.7014],
          [-0.7351,  1.7887,  0.7162,  ..., -0.4825,  0.8882,  0.1254],
          [-4.1897,  1.6667, -2.6439,  ...,  0.0793, -0.1665, -1.0302],
          [-1.0752, -1.0795,  0.4489,  ..., -0.0616,  0.1309, -1.3579]],

         [[ 1.7525, -2.1441,  0.1212,  ..., -1.4409, -0.4935, -1.2372],
          [ 0.6804, -0.7102, -0.8793,  ...,  2.8591,  1.1723, -0.8588],
          [-1.7201,  2.0575,  2.0738,  ..., -2.6212, -0.8853, -0.9904],
          [ 0.4203, -1.1228, -1.3958,  ...,  0.5535,  2.1446,  2.0959]],

         ...,

         [[ 0.9331, -1.5387,  0.2928,  ..., -0.6235,  0

---

## 3 Loading pretrained embedding

Here must to indicate: here we use above three embedding methods, but sometimes we will use `Word2Vec`, `GloVe` etc.

After embedding, we must to build a model, and train it. It would connect to `transformer` and other things. Here we don't do this, we just use the pretrained one.

Do you remember download one pretrained tokenzier? We will use its embedding also.

In [35]:
# from transformers import AutoModel
# before we imported AutoTokenizer, so hear, we just import AutoModel
from transformers import AutoModel

<div class="alert alert-info">
    <h4>Note</h4>
    <p>
        Below <code>embeddings</code> don't have <b>position embedding</b>, its just <b>token embedding</b>. <code>RoPE</code> is defined in the <b>transformer</b>'s forward function.
    </p>
</div>


Download the model. Code below can resume from breakpoint, so you can interupt it anytime then start it again.

In [36]:
cachedir = "./deepseek-1.5B"
modelname = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

text = "I hold you and you hold me"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(modelname, cache_dir=cachedir)
model = AutoModel.from_pretrained(modelname, cache_dir=cachedir, device_map=device if device.startswith("cuda") else "cpu",
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32)
embeddings = model.get_input_embeddings()

# Tokenize the text
# tokens: ['I', 'Ġhold', 'Ġyou', 'Ġand', 'Ġyou', 'Ġhold', 'Ġme']
# Note: 'Ġ' stands for space here
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Find all positions of the token "Ġhold" (note the space prefix)
hold_positions = [i for i, token in enumerate(tokens) if token == "Ġhold"]
print("'hold' appears at positions:", hold_positions)

# Check if we found any occurrences
if not hold_positions:
    print("Error: 'hold' token not found in tokenized text")
    exit()

# Get token embeddings without position information
inputs = tokenizer(text, return_tensors="pt").to(device)
token_embeddings = embeddings(inputs["input_ids"])
print("\nToken embeddings (no position info):")
print(token_embeddings[0, hold_positions[0], :5])  # First occurrence of "hold"

# Get model output (with position information)
with torch.no_grad():
    outputs = model(inputs["input_ids"], output_hidden_states=True)

# Get the last hidden state (with position information)
final_hidden_states = outputs.last_hidden_state
print("\nFinal hidden states (with position info):")
print(final_hidden_states[0, hold_positions[0], :5])  # First occurrence of "hold"

# Compare the same token at different positions
if len(hold_positions) >= 2:
    print("\nDifference between positions:")
    diff = torch.abs(final_hidden_states[0, hold_positions[0]] - final_hidden_states[0, hold_positions[1]])
    print("Mean absolute difference:", diff.mean().item())
    print("First 5 differences:", diff[:5].tolist())
else:
    print("\nNot enough occurrences of 'hold' to compare positions.")


Tokens: ['I', 'Ġhold', 'Ġyou', 'Ġand', 'Ġyou', 'Ġhold', 'Ġme']
'hold' appears at positions: [1, 5]

Token embeddings (no position info):
tensor([ 0.0459,  0.0172, -0.0317,  0.0264, -0.0166], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward0>)

Final hidden states (with position info):
tensor([-1.0303, -3.3359,  2.7383,  1.6855,  1.1680], device='cuda:0',
       dtype=torch.float16)

Difference between positions:
Mean absolute difference: 1.92578125
First 5 differences: [0.24169921875, 2.48828125, 2.078125, 3.12890625, 0.12890625]
