# Demonstrating Positional Encoding in Transformers using PyTorch
## This notebook explores positional encoding, the component of Transformer models that lets them use the order of tokens in a sequence.

### **Positional Encoding Formula**
In Transformers, **positional encoding** provides information about the order of tokens in a sequence. The formula for computing positional encodings is:

$$
PE_{(pos, 2i)} = \sin\bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\bigg)
$$

$$
PE_{(pos, 2i+1)} = \cos\bigg(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\bigg)
$$

where:
- \( pos \) is the **position** of a token in the sequence.
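- \( i \) is the **index of a sine/cosine pair**, with \( 0 \le i < d_{model}/2 \); dimensions \( 2i \) and \( 2i+1 \) of the embedding vector share the same frequency.
- \( d_{model} \) is the **embedding dimension**.
- The denominator \( 10000^{2i/d_{model}} \) grows with \( i \), so low dimensions oscillate quickly with position and high dimensions oscillate slowly.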

These encodings are **added** to input embeddings before passing them into the Transformer model.
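As a quick illustration of the formula, the sketch below evaluates single entries with plain Python. The helper name `pe_entry` and its column argument `k` are introduced here for illustration only; `k` is the column of the encoding, so the pair index is `k // 2`.

```python
import math

def pe_entry(pos: int, k: int, d_model: int) -> float:
    """Return PE[pos, k] computed directly from the formula (illustrative helper)."""
    i = k // 2  # columns 2i and 2i+1 share the same denominator
    angle = pos / (10000 ** (2 * i / d_model))
    # Even columns use sine, odd columns use cosine
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

# Encoding of position 1 for a 6-dimensional model, rounded to 4 decimals
row = [round(pe_entry(1, k, 6), 4) for k in range(6)]
print(row)  # [0.8415, 0.5403, 0.0464, 0.9989, 0.0022, 1.0]
```

The first pair changes quickly with position, while the last pair barely moves away from (0, 1).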


In [1]:
import torch
import torch.nn as nn

# Define the maximum sequence length and the embedding dimension for a model:

max_sequence_length = 10: Specifies the maximum number of tokens a sequence can have. If a sequence is shorter, it may be padded; if longer, it may be truncated.

d_model = 6: Defines the size of each token’s embedding vector, meaning each token will be represented as a 6-dimensional vector.

In [2]:
max_sequence_length = 10
d_model = 6

# Create the indices i, one for each sine/cosine pair:

Creates a tensor with values ranging from 0 to (d_model//2 - 1). Only d_model//2 values are needed because each value is shared by one sine column and one cosine column.

Converts it to a floating-point tensor.

In [3]:
i = torch.arange(0, d_model//2).float()
i

tensor([0., 1., 2.])

# Compute a denominator tensor used in positional encoding for Transformers:

torch.pow(10000, 2*i/d_model):

Raises 10000 to the power of (2*i / d_model), where i is a tensor of indices.

Each resulting value sets the wavelength of one sine/cosine pair, so different embedding dimensions vary with position at different rates.

In [4]:
denominator = torch.pow(10000, 2*i/d_model)
denominator

tensor([  1.0000,  21.5443, 464.1590])
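To make the link between denominators and wavelengths concrete, here is a small sketch. It relies only on the fact that sin(pos / denominator) completes one cycle when pos / denominator reaches 2π; the variable name `wavelength` is introduced here for illustration.

```python
import math
import torch

d_model = 6
i = torch.arange(0, d_model // 2).float()
denominator = torch.pow(10000, 2 * i / d_model)

# One full cycle needs pos / denominator = 2*pi,
# so the wavelength measured in positions is 2*pi*denominator.
wavelength = 2 * math.pi * denominator
print(wavelength)
```

With a maximum sequence length of 10, only the first pair completes a full cycle; the other two pairs cover just a small part of theirs.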

# Create a tensor representing token positions in a sequence:

Generates a sequence of numbers from 0 to max_sequence_length - 1, where each number represents the position of one token in the sequence.

Reshapes the tensor into a column vector of shape (max_sequence_length, 1), so that dividing it by the denominator tensor broadcasts to one value per (position, frequency) combination.

In [5]:
token_pos = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)
token_pos

tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]])
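The column shape matters because of broadcasting. The sketch below rebuilds the two tensors and inspects the shape of their quotient; the name `angles` is introduced here for illustration.

```python
import torch

max_sequence_length, d_model = 10, 6
token_pos = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)
denominator = torch.pow(10000, 2 * torch.arange(0, d_model // 2).float() / d_model)

# (10, 1) divided by (3,) broadcasts to (10, 3):
# angles[p, j] holds position p divided by the j-th denominator.
angles = token_pos / denominator
print(token_pos.shape, denominator.shape, angles.shape)
# torch.Size([10, 1]) torch.Size([3]) torch.Size([10, 3])
```

Without the reshape, a tensor of shape (10,) could not be divided by one of shape (3,), and PyTorch would raise a shape mismatch error.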

# Initialize a positional encoding matrix with zeros:

Create a tensor of shape (max_sequence_length, d_model), filled with zeros.

Acts as a placeholder for storing positional encodings for each token position.

In [6]:
PE = torch.zeros(max_sequence_length, d_model)
PE

tensor([[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]])

# Update the positional encoding matrix with sine values for even indices:

Compute the sine function of token_pos / denominator.

Assign the result to even-indexed columns (0, 2, 4, ...) of PE.

In [7]:
PE[:, 0::2] = torch.sin(token_pos / denominator)
PE

tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.8415,  0.0000,  0.0464,  0.0000,  0.0022,  0.0000],
        [ 0.9093,  0.0000,  0.0927,  0.0000,  0.0043,  0.0000],
        [ 0.1411,  0.0000,  0.1388,  0.0000,  0.0065,  0.0000],
        [-0.7568,  0.0000,  0.1846,  0.0000,  0.0086,  0.0000],
        [-0.9589,  0.0000,  0.2300,  0.0000,  0.0108,  0.0000],
        [-0.2794,  0.0000,  0.2749,  0.0000,  0.0129,  0.0000],
        [ 0.6570,  0.0000,  0.3192,  0.0000,  0.0151,  0.0000],
        [ 0.9894,  0.0000,  0.3629,  0.0000,  0.0172,  0.0000],
        [ 0.4121,  0.0000,  0.4057,  0.0000,  0.0194,  0.0000]])

# Update the positional encoding matrix with cosine values for odd indices:

Compute the cosine function of token_pos / denominator.

Assign the result to odd-indexed columns (1, 3, 5, ...) of PE.

In [8]:
PE[:, 1::2] = torch.cos(token_pos / denominator)
PE

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]])
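One way to sanity-check the finished matrix, sketched here as a suggestion rather than part of the original walkthrough: since sin² + cos² = 1 for every pair, each row should have a squared norm of d_model / 2, which is 3 in this case.

```python
import torch

max_sequence_length, d_model = 10, 6
i = torch.arange(0, d_model // 2).float()
denominator = torch.pow(10000, 2 * i / d_model)
token_pos = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)

PE = torch.zeros(max_sequence_length, d_model)
PE[:, 0::2] = torch.sin(token_pos / denominator)
PE[:, 1::2] = torch.cos(token_pos / denominator)

# Each (sin, cos) pair lies on the unit circle, so the squares in each row
# sum to the number of pairs, d_model / 2.
row_sq_norm = (PE ** 2).sum(dim=1)
print(row_sq_norm)  # every entry should be close to 3.0
```

A consequence is that every position receives an encoding vector of the same length; only its direction changes with position.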

# Implement a PositionalEncoding Class
## Write a PyTorch class PositionalEncoding, which:
### 1) Accepts parameters for embedding dimension (d_model) and maximum sequence length (max_sequence_length).
### 2) Defines a forward() method that
#### 2.1) Computes sinusoidal positional embeddings and stores them as a tensor.
#### 2.2) Returns the positional encodings.

In [9]:
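class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model
        # Placeholder table of shape (max_sequence_length, d_model)
        self.PE = torch.zeros(max_sequence_length, d_model)

    def forward(self):
        # Pair indices 0 .. d_model//2 - 1, one per sine/cosine pair
        i = torch.arange(0, self.d_model//2).float()
        denominator = torch.pow(10000, 2*i/self.d_model)
        # Column vector of positions, shape (max_sequence_length, 1)
        token_pos = torch.arange(self.max_sequence_length, dtype=torch.float).reshape(self.max_sequence_length, 1)
        # Sine fills the even columns, cosine fills the odd columns
        self.PE[:, 0::2] = torch.sin(token_pos / denominator)
        self.PE[:, 1::2] = torch.cos(token_pos / denominator)
        # Return this instance's table, not the global PE from the earlier cells
        return self.PE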

# Initializes a positional encoding with d_model=6 and max_sequence_length=10, then computes the encodings via forward().

In [10]:
pe = PositionalEncoding(d_model=6, max_sequence_length=10)
pe.forward() 

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]])
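The class above returns the encoding table on its own. As a possible next step, the sketch below shows one way the table could be added to token embeddings, as described at the start of the notebook. This is a variation, not the notebook's class: the name `AddPositionalEncoding`, the `vocab_size` of 100, and the batch of random token ids are all assumptions made for illustration. It precomputes the table once and stores it with `register_buffer`, so it follows the module across devices without being treated as a trainable parameter.

```python
import torch
import torch.nn as nn

class AddPositionalEncoding(nn.Module):
    """Hypothetical variant: precompute the table once, add it in forward()."""

    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        i = torch.arange(0, d_model // 2).float()
        denominator = torch.pow(10000, 2 * i / d_model)
        token_pos = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)
        PE = torch.zeros(max_sequence_length, d_model)
        PE[:, 0::2] = torch.sin(token_pos / denominator)
        PE[:, 1::2] = torch.cos(token_pos / denominator)
        # A buffer is saved with the module but is not a trainable parameter
        self.register_buffer("PE", PE)

    def forward(self, x):
        # x has shape (batch, seq_len, d_model); use the first seq_len rows of the table
        return x + self.PE[: x.size(1)]

torch.manual_seed(0)
vocab_size, d_model = 100, 6                    # assumed values for illustration
embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 4))   # batch of 2 sequences, 4 tokens each
x = embedding(tokens)

add_pe = AddPositionalEncoding(d_model=d_model, max_sequence_length=10)
out = add_pe(x)
print(out.shape)  # torch.Size([2, 4, 6])
```

The same four rows of the table are added to both sequences in the batch, because the encoding depends only on position and not on the token itself.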