# Embeddings with Positional Encoding for Transformers

This notebook builds on our previous work with custom tokenizers and embeddings to implement positional encodings as used in transformer models. Unlike RNNs, transformers process all tokens in parallel and have no inherent understanding of sequence order. Positional encodings solve this by explicitly adding position information to token embeddings.

We'll cover:
1. **Token Embeddings**: Review from previous notebook
2. **Positional Encodings**: Implementation of sinusoidal position representations
3. **Embedding with Position**: Combining token embeddings with positional information
4. **Visualization**: Understanding how positional encodings work
5. **Practical Usage**: Applying these components in a transformer context

Let's first import the necessary libraries and our previously implemented components.

In [1]:
from typing import Dict, List, Union, Tuple, Optional, Any
import numpy as np
import matplotlib.pyplot as plt
import math

# Import our custom tokenizer and embedding from previous notebook
from utils.tokenizer import Tokenizer
from utils.embedding import Embedding

## 1. Positional Encoding Implementation

In the original "Attention Is All You Need" paper, the authors used sinusoidal functions to create positional encodings. The key benefits of this approach are:

1. It is defined for every position, so encodings can be computed for sequences longer than those seen during training
2. It creates a unique pattern for each position
3. For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which may make it easier for the model to attend by relative position
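
The third benefit can be checked numerically. For a single sine/cosine pair with frequency $\omega$, moving from position $pos$ to $pos + k$ is a rotation by the angle $k\omega$, and that rotation does not depend on $pos$. The following is a minimal sketch; the helper `pair` and the values of `omega` and `k` are chosen here purely for illustration.

```python
import numpy as np

def pair(pos: float, omega: float) -> np.ndarray:
    """Return the (sin, cos) pair for one frequency at one position."""
    return np.array([np.sin(pos * omega), np.cos(pos * omega)])

omega = 1.0 / 10000 ** (2 / 8)   # frequency of pair i=1 when d_model=8 (illustrative)
k = 3                            # fixed offset between two positions

# Rotation that maps the pair at `pos` to the pair at `pos + k`.
# It depends only on k and omega, not on pos.
rotation = np.array([
    [np.cos(k * omega),  np.sin(k * omega)],
    [-np.sin(k * omega), np.cos(k * omega)],
])

for pos in (0, 5, 40):
    shifted = rotation @ pair(pos, omega)
    assert np.allclose(shifted, pair(pos + k, omega))
```

Because the same matrix works for every starting position, a linear layer could in principle learn to compare positions by their offset.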

The formula for position encoding is:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

Where:
- $pos$ is the position in the sequence
- $i$ indexes the sine/cosine pairs, so embedding dimensions $2i$ (sine) and $2i+1$ (cosine) share one frequency
- $d_{model}$ is the embedding dimension
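
To make the indexing concrete, here is a loop-based transcription of the two formulas. It is a sketch meant for reading rather than speed, and the name `pe_reference` is introduced here for illustration only.

```python
import numpy as np

def pe_reference(seq_length: int, d_model: int) -> np.ndarray:
    """Literal, loop-based transcription of the two formulas above."""
    pe = np.zeros((seq_length, d_model))
    for pos in range(seq_length):
        for k in range(d_model):          # k is the column (dimension) index
            i = k // 2                    # columns 2i and 2i+1 share one frequency
            angle = pos / 10000 ** (2 * i / d_model)
            pe[pos, k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)
    return pe

pe = pe_reference(seq_length=3, d_model=4)
print(np.round(pe, 4))
```

At position 0 every angle is zero, so the first row alternates between $\sin(0) = 0$ and $\cos(0) = 1$.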

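Before implementing this, let's build token embeddings for a sample sentence so we have something to add the encodings to: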

In [2]:
# Sample data for tokenizer
texts = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a lazy dog",
    "the lazy dog sleeps all day"
]

# Create tokenizer
tokenizer = Tokenizer()

# Fit tokenizer on texts
tokenizer.fit_on_texts(texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)

embedding_dim = 5  # Small dimension for demonstration
embedding = Embedding(
    vocab_size=tokenizer.vocab_size,
    embedding_dim=embedding_dim,
    padding_idx=tokenizer.word_to_index[tokenizer.pad_token]
)

# Get embedding for a single token
token_idx = tokenizer.word_to_index.get("the", tokenizer.word_to_index[tokenizer.unk_token])
token_embedding = embedding(token_idx)

In [3]:
# Look up each word's index, falling back to the unknown token
token_idxs = [tokenizer.word_to_index.get(word, tokenizer.word_to_index[tokenizer.unk_token])
              for word in texts[0].split()]
token_embeddings = [embedding(token_idx) for token_idx in token_idxs]

In [4]:
# 9 words -> 9 tokens, each with a 5-dimensional embedding
print(f'Dimension of sentence one in texts is '
      f'({len(token_embeddings)},{len(token_embeddings[0])})')

Dimension of sentence one in texts is (9,5)


## Adding positional encoding

Since the embedding has 5 dimensions, each position needs a 5-dimensional positional encoding so that the two can be added element-wise.

The standard sinusoidal positional encoding from the "Attention Is All You Need" paper uses a specific pattern of sine and cosine functions.
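
With an odd embedding dimension such as 5, the even columns hold sines and the odd columns hold cosines, so the last sine column has no cosine partner. A small sketch of how the columns split, using only NumPy slicing (the variable names are illustrative):

```python
import numpy as np

d_model = 5
columns = np.arange(d_model)

sin_columns = columns[0::2]      # even column indices
cos_columns = columns[1::2]      # odd column indices
pair_index = columns // 2        # the i in 2i and 2i+1 for each column

print("sine columns:  ", sin_columns.tolist())    # [0, 2, 4]
print("cosine columns:", cos_columns.tolist())    # [1, 3]
print("pair index i:  ", pair_index.tolist())     # [0, 0, 1, 1, 2]
```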

In [5]:
def get_positional_encoding(seq_length: int, d_model: int) -> np.ndarray:
    """
    Create sinusoidal positional encodings for sequences.
    
    Args:
        seq_length: Length of the sequence
        d_model: Dimensionality of the embeddings
        
    Returns:
        Positional encoding matrix of shape (seq_length, d_model)
    """
    # Positions 0 .. seq_length-1 as a column vector, shape (seq_length, 1)
    position: np.ndarray = np.arange(seq_length)[:, np.newaxis]

    # One frequency per sine/cosine pair: 1 / 10000^(2i / d_model)
    div_term: np.ndarray = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pos_encoding: np.ndarray = np.zeros((seq_length, d_model))
    print(f'The shape of the positional encoding is {pos_encoding.shape}')
    angles: np.ndarray = position * div_term
    pos_encoding[:, 0::2] = np.sin(angles)
    # For odd d_model there is one fewer cosine column than sine columns
    pos_encoding[:, 1::2] = np.cos(angles[:, :d_model // 2])
    
    return pos_encoding

In [6]:

# Get token embeddings for each token in the sequence
token_idxs: List[int] = [tokenizer.word_to_index.get(word, tokenizer.word_to_index[tokenizer.unk_token]) 
                        for word in texts[0].split()]
token_embeddings: np.ndarray = np.array([embedding(token_idx) for token_idx in token_idxs])

# Generate positional encodings; the sequence length is the number of tokens
seq_length: int = len(token_embeddings)
pos_encodings: np.ndarray = get_positional_encoding(seq_length, embedding_dim)

# Add positional encodings to token embeddings
token_pos_embeddings: np.ndarray = token_embeddings + pos_encodings

The shape of the positional encoding is (9, 5)
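
One way to see what the addition achieves: the word "the" appears at positions 0 and 6 of the first sentence, so its token embedding is identical in both places, but the combined vectors differ. The sketch below is self-contained and uses a stand-in vocabulary and a random embedding table (both hypothetical) in place of the `Tokenizer` and `Embedding` classes.

```python
import numpy as np

rng = np.random.default_rng(0)
words = "the quick brown fox jumps over the lazy dog".split()

# Stand-in vocabulary and embedding table (hypothetical; the notebook uses
# the Tokenizer and Embedding classes instead).
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(words))}
d_model = 5
table = rng.normal(size=(len(vocab), d_model))
token_vectors = table[[vocab[w] for w in words]]          # shape (9, 5)

# Sinusoidal encodings written directly from the formula.
pos = np.arange(len(words))[:, np.newaxis]                # shape (9, 1)
i = np.arange(d_model) // 2                               # pair index per column
angles = pos / np.power(10000.0, 2 * i / d_model)         # shape (9, 5)
pos_vectors = np.where(np.arange(d_model) % 2 == 0, np.sin(angles), np.cos(angles))

combined = token_vectors + pos_vectors

# "the" occurs at positions 0 and 6: same token embedding, different sum.
assert np.array_equal(token_vectors[0], token_vectors[6])
assert not np.allclose(combined[0], combined[6])
print(combined.shape)  # (9, 5)
```

Element-wise addition keeps the shape at `(seq_length, d_model)`, so the result can be passed on wherever the plain token embeddings were expected.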
