## Regular Transformer Embeddings

In transformers, embedding layers convert the input and output tokens into vectors of dimension $d_{model}$. The same weight matrix is shared between:


1.   Input Embedding
2.   Output Embedding
3.   Pre-softmax Linear Transformation

The output of each embedding layer is multiplied by $\sqrt{d_{model}}$.
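The weight sharing described above can be sketched as a single parameter used both for the embedding lookup and for the pre-softmax projection. This is a minimal illustration, not a reference implementation: the class name `TiedEmbedding`, its method names, and the uniform initialization are assumptions made for the example.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedEmbedding(nn.Module):
    """Hypothetical module: one weight matrix serves as both the token
    embedding table and the pre-softmax linear transformation."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        # Shape (vocab_size, d_model); the init scheme is an arbitrary choice here
        self.weight = nn.Parameter(torch.empty(vocab_size, d_model))
        nn.init.uniform_(self.weight)

    def embed(self, token_ids):
        # Row lookup, scaled by sqrt(d_model)
        return F.embedding(token_ids, self.weight) * math.sqrt(self.d_model)

    def logits(self, hidden):
        # Pre-softmax projection with the same matrix: hidden @ weight.T
        return F.linear(hidden, self.weight)


tied = TiedEmbedding(vocab_size=100, d_model=6)
ids = torch.tensor([[42, 10, 4, 2]])
vectors = tied.embed(ids)      # shape (1, 4, 6)
scores = tied.logits(vectors)  # shape (1, 4, 100)
print(vectors.shape, scores.shape)
```

In a full model, `logits` would be applied to the decoder output rather than directly to the embeddings; it is called on `vectors` here only to show the shapes.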

In [1]:
import torch
import torch.nn as nn
import numpy as np

In [25]:
class TransformerEmbedding(nn.Module):
    def __init__(self, vocab_size, n_dim):
        super().__init__()
        self.n_dim = n_dim
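        # One learnable row of size n_dim per vocabulary token
        self.weights = nn.Parameter(torch.zeros((vocab_size, n_dim)))
        nn.init.uniform_(self.weights)  # in-place init from U(0, 1)

    def forward(self, x):
        # Look up each token ID's row, then scale by sqrt(n_dim)
        return self.weights[x] * self.n_dim ** 0.5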

In [26]:
embed = TransformerEmbedding(vocab_size=100, n_dim=6)

In [27]:
# Input: a batch of 3 sentences, each with 4 token IDs
token_ids = torch.tensor([
    [42, 10, 4, 2],    # First sentence
    [8, 76, 15, 2],    # Second sentence
    [24, 92, 7, 2]     # Third sentence
])

# Forward pass through embedding layer
embedded_tokens = embed(token_ids)

In [28]:
embedded_tokens.shape

torch.Size([3, 4, 6])

In [29]:
embed.weights.shape

torch.Size([100, 6])

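The weight matrix has shape `[100, 6]`: one 6-dimensional vector for each of the 100 tokens in the vocabulary. The output shape `[3, 4, 6]` corresponds to 3 sentences of 4 tokens each, with every token ID replaced by its 6-dimensional embedding.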
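As a sanity check, the indexing-based lookup should agree with PyTorch's built-in `nn.Embedding` when both use the same weights. The sketch below assumes a randomly generated weight matrix (the name `shared_weights` is hypothetical) rather than the trained parameters of any model.

```python
import math

import torch
import torch.nn as nn

vocab_size, n_dim = 100, 6

# Hypothetical weights standing in for a learned embedding table
shared_weights = torch.rand(vocab_size, n_dim)

# Built-in embedding layer initialized from the same matrix
reference = nn.Embedding.from_pretrained(shared_weights)

token_ids = torch.tensor([
    [42, 10, 4, 2],
    [8, 76, 15, 2],
    [24, 92, 7, 2],
])

# Manual lookup by indexing versus the built-in layer, both scaled by sqrt(n_dim)
manual = math.sqrt(n_dim) * shared_weights[token_ids]
builtin = math.sqrt(n_dim) * reference(token_ids)

print(manual.shape)
print(torch.allclose(manual, builtin))
```

Indexing a weight matrix with an integer tensor and calling `nn.Embedding` perform the same row lookup, so the custom layer differs from the built-in one only in the $\sqrt{d_{model}}$ scaling and the initialization.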