# Token Embeddings

In this notebook, we'll explore how Large Language Models (LLMs) convert tokens into numerical vectors called **embeddings**. We'll cover:

1. **Token Embeddings**: Converting token IDs into dense vectors
2. **Positional Embeddings**: Encoding position information
3. **Input Embeddings**: Combining token and positional embeddings

Let's start by importing the required libraries and loading our data.

In [67]:
import torch
import sys
import tiktoken
from pathlib import Path

# Add the parent directory to the path so we can import from core
blog_dir = Path.cwd().parent
sys.path.insert(0, str(blog_dir))

file_path = "../data/romeo_juliet_gutenberg.txt"

In [80]:
# Load the text data
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

print(f"Data '{file_path}' loaded successfully.")

Data '../data/romeo_juliet_gutenberg.txt' loaded successfully.


## Part 1: Simple Token Embeddings (Toy Example)

Before working with GPT-2 scale models, let's understand embeddings with a simple example.

**What are Token Embeddings?**
- Token embeddings convert discrete token IDs into continuous vector representations
- Each token gets mapped to a learned vector of fixed dimension
- After training, tokens that appear in similar contexts tend to end up with similar embeddings

Let's create a toy vocabulary of 6 tokens and map them to 3-dimensional vectors.

In [81]:
# Example: A simple sequence of 4 token IDs
input_ids = torch.tensor([2, 3, 5, 1])

# Define our toy vocabulary and embedding dimensions
toy_vocab_size = 6        # Total number of tokens in vocabulary (0-5)
toy_embedding_dim = 3     # Each token will be represented by a 3D vector

# Create an embedding layer
# This creates a lookup table: toy_vocab_size rows x toy_embedding_dim columns
toy_embedding_layer = torch.nn.Embedding(toy_vocab_size, toy_embedding_dim)

print(f"Created embedding layer: {toy_vocab_size} tokens → {toy_embedding_dim}D vectors")

Created embedding layer: 6 tokens → 3D vectors


In [70]:
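# View the embedding weight matrix
# Each row represents the embedding vector for one token
# Note: the weights are randomly initialized. The output shown here is from an
# earlier run (In [70]), so its rows differ from the lookups in the later cells.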
print("Embedding weight matrix (6 tokens × 3 dimensions):")
print(toy_embedding_layer.weight)
print(f"\nShape: {toy_embedding_layer.weight.shape}")

Embedding weight matrix (6 tokens × 3 dimensions):
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

Shape: torch.Size([6, 3])
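
Because the weights are randomly initialized, the printed values change every time the layer is created. A minimal sketch of making them repeatable, assuming we are happy to seed PyTorch's global random number generator (the helper `make_toy_layer` and the seed values are illustrative, not part of this notebook):

```python
import torch


def make_toy_layer(seed: int) -> torch.nn.Embedding:
    """Create a 6 x 3 embedding layer with repeatable initial weights."""
    # Seeding the global RNG fixes the values drawn during initialization
    torch.manual_seed(seed)
    return torch.nn.Embedding(6, 3)


layer_a = make_toy_layer(seed=123)
layer_b = make_toy_layer(seed=123)
layer_c = make_toy_layer(seed=456)

# Same seed gives identical weights; a different seed gives different ones
print(torch.equal(layer_a.weight, layer_b.weight))
print(torch.equal(layer_a.weight, layer_c.weight))
```

Seeding right before the layer is created is what keeps the weight matrix and any later lookups consistent with each other across notebook re-runs.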


In [82]:
# Embed a single token (token ID = 2)
# This retrieves the 3rd row (index 2) from the embedding matrix
single_token_embedding = toy_embedding_layer(torch.tensor([2]))
print("Embedding for token ID 2:")
print(single_token_embedding)

Embedding for token ID 2:
tensor([[-1.5919,  0.8597, -0.6055]], grad_fn=<EmbeddingBackward0>)


In [83]:
# Embed our sequence of 4 tokens
# Each token ID is replaced with its corresponding 3D vector
toy_token_embeddings = toy_embedding_layer(input_ids)
print("Embeddings for sequence [2, 3, 5, 1]:")
print(toy_token_embeddings)
print(f"\nShape: {toy_token_embeddings.shape} → (sequence_length=4, embedding_dim=3)")

Embeddings for sequence [2, 3, 5, 1]:
tensor([[-1.5919,  0.8597, -0.6055],
        [ 0.3278, -1.2925,  2.9336],
        [ 0.7666,  2.5459, -1.1468],
        [-0.3735, -1.1398, -0.2094]], grad_fn=<EmbeddingBackward0>)

Shape: torch.Size([4, 3]) → (sequence_length=4, embedding_dim=3)
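
The "lookup table" description can be checked directly. The sketch below, which uses its own small layer and a fixed seed rather than the notebook's variables, compares three ways of getting the same vectors: calling the layer, indexing rows of the weight matrix, and multiplying one-hot vectors by the weight matrix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Embedding(6, 3)
ids = torch.tensor([2, 3, 5, 1])

# 1) The usual call
via_layer = layer(ids)

# 2) Plain row indexing into the weight matrix
via_indexing = layer.weight[ids]

# 3) One-hot vectors times the weight matrix: each one-hot row selects one weight row
one_hot = F.one_hot(ids, num_classes=6).float()
via_matmul = one_hot @ layer.weight

print(torch.equal(via_layer, via_indexing))
print(torch.allclose(via_layer, via_matmul))
```

The matrix-multiplication view shows why an embedding layer is equivalent to a linear layer applied to one-hot inputs, while the indexing view shows why it is much cheaper in practice.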


## Part 2: GPT-2 Scale Token Embeddings

Now let's work with GPT-2's actual vocabulary and embedding dimensions.

**GPT-2 Specifications:**
- Vocabulary size: 50,257 tokens
- Embedding dimension: 768 for GPT-2 small (we'll use 256 in this tutorial)
- The embedding layer learns a unique vector for each token during training

In [84]:
# Load GPT-2 tokenizer to understand the vocabulary size
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
print(f"GPT-2 vocabulary size: {gpt2_tokenizer.n_vocab:,}")

# Note: To explore newer tokenizers like GPT-4o's, you can use:
# gpt4o_tokenizer = tiktoken.get_encoding("o200k_base")

GPT-2 vocabulary size: 50,257


In [85]:
# Create GPT-2 scale token embedding layer
GPT2_VOCAB_SIZE = 50257      # GPT-2's vocabulary size
GPT2_EMBEDDING_DIM = 256     # Embedding dimension (GPT-2 small uses 768, we use 256 for tutorial)

token_embedding_layer = torch.nn.Embedding(GPT2_VOCAB_SIZE, GPT2_EMBEDDING_DIM)
print(f"Token embedding layer: {GPT2_VOCAB_SIZE:,} tokens → {GPT2_EMBEDDING_DIM}D vectors")
print(f"Total parameters: {GPT2_VOCAB_SIZE * GPT2_EMBEDDING_DIM:,}")

Token embedding layer: 50,257 tokens → 256D vectors
Total parameters: 12,865,792
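
The parameter count is simply rows times columns of the lookup table. As a quick sanity check, the sketch below counts the parameters PyTorch actually allocates and compares the tutorial's 256 dimensions with the 768 used by GPT-2 small (the helper name `embedding_param_count` is illustrative):

```python
import torch


def embedding_param_count(vocab_size: int, embedding_dim: int) -> int:
    """An embedding table holds one vector of size embedding_dim per token."""
    return vocab_size * embedding_dim


# Count what PyTorch really allocates for the tutorial-sized layer
layer = torch.nn.Embedding(50257, 256)
counted = sum(p.numel() for p in layer.parameters())

print(counted == embedding_param_count(50257, 256))
print(f"256-dim table: {embedding_param_count(50257, 256):,} parameters")
print(f"768-dim table: {embedding_param_count(50257, 768):,} parameters")
```

Tripling the embedding dimension triples the size of the table, which is why a smaller dimension keeps this tutorial light.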


### Create a Data Batch

Let's create a batch of tokenized sequences to work with. We'll use a context length of 4 tokens and a batch size of 8.

In [86]:
# Import the data loader we created earlier
from core.data_loader import create_dataloader


# Define sequence parameters
CONTEXT_LENGTH = 4  # Maximum sequence length (number of tokens)
BATCH_SIZE = 8      # Number of sequences to process together

# Create dataloader with our text
dataloader = create_dataloader(
    text, 
    batch_size=BATCH_SIZE, 
    max_length=CONTEXT_LENGTH,
    stride=CONTEXT_LENGTH,  # No overlap between sequences
    shuffle=False
)

# Get one batch of data
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Batch shape: {inputs.shape} → (batch_size={BATCH_SIZE}, sequence_length={CONTEXT_LENGTH})")

Batch shape: torch.Size([8, 4]) → (batch_size=8, sequence_length=4)
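
`create_dataloader` lives in `core.data_loader` and its code is not shown in this notebook. As a rough sketch of the sliding-window chunking such a loader typically performs, the function below cuts a list of token IDs into fixed-length inputs and targets. The name `sliding_windows`, and the assumption that targets are the inputs shifted by one token, are illustrative and may not match the project's actual implementation.

```python
import torch


def sliding_windows(token_ids, max_length, stride):
    """Split token IDs into (input, target) windows of max_length tokens.

    Assumption: each target window is its input window shifted right by one
    token, as is usual for next-token prediction.
    """
    inputs, targets = [], []
    # Stop early enough that every target window is also full length
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start : start + max_length])
        targets.append(token_ids[start + 1 : start + max_length + 1])
    return torch.tensor(inputs), torch.tensor(targets)


toy_ids = list(range(10))  # stand-in for real token IDs
x, y = sliding_windows(toy_ids, max_length=4, stride=4)
print(x)
print(y)
```

With `stride` equal to `max_length`, consecutive input windows do not overlap, which matches the "No overlap between sequences" setting used above.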


In [87]:
# Examine the token IDs in our batch
print("Token IDs:\n", inputs)
print(f"\nInputs shape: {inputs.shape}")

Token IDs:
 tensor([[  464,  4935, 20336, 46566],
        [  286, 43989,   290, 38201],
        [  198,   220,   220,   220],
        [  220,   198,  1212, 47179],
        [  318,   329,   262,   779],
        [  286,  2687,  6609,   287],
        [  262,  1578,  1829,   290],
        [  198,  1712,   584,  3354]])

Inputs shape: torch.Size([8, 4])


### Convert Token IDs to Token Embeddings

Now we'll convert each token ID into its corresponding embedding vector.

In [88]:
# Apply token embedding layer to our batch
token_embeddings = token_embedding_layer(inputs)

print(f"Token embeddings shape: {token_embeddings.shape}")
print(f"  → (batch_size={BATCH_SIZE}, sequence_length={CONTEXT_LENGTH}, embedding_dim={GPT2_EMBEDDING_DIM})")
print(f"\nEach of the {BATCH_SIZE} sequences has {CONTEXT_LENGTH} tokens,")
print(f"and each token is now a {GPT2_EMBEDDING_DIM}-dimensional vector.")

# Uncomment to see the actual embedding values (they're randomly initialized)
# print("\nToken embeddings:\n", token_embeddings)

Token embeddings shape: torch.Size([8, 4, 256])
  → (batch_size=8, sequence_length=4, embedding_dim=256)

Each of the 8 sequences has 4 tokens,
and each token is now a 256-dimensional vector.


## Part 3: Positional Embeddings

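**Problem:** Token embeddings alone don't capture word order!
- A token maps to the same vector wherever it appears, so "dog bites man" and "man bites dog" produce the same three vectors, just in a different order
- Nothing in the vectors themselves says which token came first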

**Solution:** Add positional embeddings
- Each position (0, 1, 2, 3, ...) gets its own learnable embedding
- These are added to token embeddings to inject position information
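
The order problem can be made concrete with a small sketch. It uses a hypothetical three-word vocabulary and tiny 4-dimensional embeddings (not the GPT-2 tokenizer) to show that reversing the sentence merely reverses the rows of token embeddings, and that adding positional embeddings breaks this symmetry.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy vocabulary, for illustration only
vocab = {"dog": 0, "bites": 1, "man": 2}
tok_layer = torch.nn.Embedding(len(vocab), 4)
pos_layer = torch.nn.Embedding(3, 4)


def embed(words, with_positions):
    """Embed a list of words, optionally adding positional embeddings."""
    ids = torch.tensor([vocab[w] for w in words])
    out = tok_layer(ids)
    if with_positions:
        out = out + pos_layer(torch.arange(len(words)))
    return out


a = embed(["dog", "bites", "man"], with_positions=False)
b = embed(["man", "bites", "dog"], with_positions=False)
# Token embeddings only: b is just a with its rows reversed
same_without_positions = torch.equal(a, b.flip(0))

a_pos = embed(["dog", "bites", "man"], with_positions=True)
b_pos = embed(["man", "bites", "dog"], with_positions=True)
# With positions: "dog" at position 0 differs from "dog" at position 2
same_with_positions = torch.allclose(a_pos, b_pos.flip(0))

print(same_without_positions, same_with_positions)
```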

Let's create positional embeddings for our sequence.

In [89]:
# Create positional embedding layer
# We need one embedding for each possible position in our context
pos_embedding_layer = torch.nn.Embedding(CONTEXT_LENGTH, GPT2_EMBEDDING_DIM)

print(f"Positional embedding layer: {CONTEXT_LENGTH} positions → {GPT2_EMBEDDING_DIM}D vectors")
print(f"This means we can handle sequences up to {CONTEXT_LENGTH} tokens long")

# Uncomment to see the positional embedding weights
# print("\nPositional embedding weights:")
# print(pos_embedding_layer.weight)

Positional embedding layer: 4 positions → 256D vectors
This means we can handle sequences up to 4 tokens long


In [91]:
# Generate positional embeddings for positions [0, 1, 2, 3]
position_indices = torch.arange(CONTEXT_LENGTH)
pos_embeddings = pos_embedding_layer(position_indices)

print(f"Position indices: {position_indices}")
print(f"Positional embeddings shape: {pos_embeddings.shape}")
print(f"  → (sequence_length={CONTEXT_LENGTH}, embedding_dim={GPT2_EMBEDDING_DIM})")

# Uncomment to see the actual positional embedding values
# print("\nPositional embeddings:\n", pos_embeddings)

Position indices: tensor([0, 1, 2, 3])
Positional embeddings shape: torch.Size([4, 256])
  → (sequence_length=4, embedding_dim=256)
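
A learned positional table has exactly one row per position, so it also sets a hard limit on sequence length. The sketch below uses a small stand-alone layer (8 dimensions instead of 256) and assumes it runs on the CPU, where an out-of-range lookup raises `IndexError`.

```python
import torch

context_length = 4
pos_layer = torch.nn.Embedding(context_length, 8)

# A shorter sequence simply uses the first rows of the table
short = pos_layer(torch.arange(3))
print(short.shape)

# A sequence of 5 tokens needs position index 4, which has no row
try:
    pos_layer(torch.arange(context_length + 1))
    too_long_ok = True
except IndexError:
    too_long_ok = False

print(f"Lookup for {context_length + 1} positions succeeded: {too_long_ok}")
```

This is why the context length has to be chosen before training when positions are learned this way: positions beyond the table have no embedding to look up.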


## Part 4: Combining Token and Positional Embeddings

Input embeddings = token embeddings + positional embeddings (an element-wise sum).

**Broadcasting in PyTorch:**
- `token_embeddings`: shape `[8, 4, 256]` (batch, sequence, embedding)
- `pos_embeddings`: shape `[4, 256]` (sequence, embedding)
- PyTorch automatically broadcasts `pos_embeddings` across the batch dimension
- Result: shape `[8, 4, 256]`

This means every sequence in the batch gets the same positional encoding added to it.

In [92]:
# Combine token and positional embeddings
input_embeddings = token_embeddings + pos_embeddings

print(f"Token embeddings shape:      {token_embeddings.shape}")
print(f"Positional embeddings shape: {pos_embeddings.shape}")
print(f"Input embeddings shape:      {input_embeddings.shape}")
print(f"\n✓ Broadcasting worked! Positional embeddings added to each sequence in the batch.")

# Uncomment to see the final input embeddings
# print("\nInput embeddings:\n", input_embeddings)

Token embeddings shape:      torch.Size([8, 4, 256])
Positional embeddings shape: torch.Size([4, 256])
Input embeddings shape:      torch.Size([8, 4, 256])

✓ Broadcasting worked! Positional embeddings added to each sequence in the batch.
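
To see what broadcasting does behind the scenes, the sketch below spells out the expansion by hand and compares it with the implicit version. It uses random stand-in tensors of the same shapes rather than the notebook's variables.

```python
import torch

torch.manual_seed(0)
token_emb = torch.randn(8, 4, 256)  # (batch, sequence, embedding)
pos_emb = torch.randn(4, 256)       # (sequence, embedding)

# Implicit: PyTorch aligns trailing dimensions and repeats over the batch
broadcast_sum = token_emb + pos_emb

# Explicit: add a batch dimension of size 1, then expand it to 8
# (expand creates a view, so no data is copied)
expanded_pos = pos_emb.unsqueeze(0).expand(8, -1, -1)
explicit_sum = token_emb + expanded_pos

print(expanded_pos.shape)
print(torch.equal(broadcast_sum, explicit_sum))
```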


## Summary

**What we learned:**

1. **Token Embeddings**: Convert discrete token IDs → continuous vectors
2. **Positional Embeddings**: Encode position information for each token
3. **Input Embeddings**: Token + Positional embeddings = what the model actually processes

**Key dimensions in this tutorial:**
- Vocabulary size: 50,257 tokens (same as GPT-2)
- Embedding dimension: 256 (GPT-2 small uses 768)
- Context length: 4 tokens (GPT-2 uses 1,024)

These input embeddings are now ready to be fed into the transformer's attention layers!
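
As a recap, the two lookups and the addition can be bundled into one small module. This is a minimal sketch under the tutorial's settings, not GPT-2's actual implementation. The class name `InputEmbedding` and its attribute names are illustrative.

```python
import torch
from torch import nn


class InputEmbedding(nn.Module):
    """Token embeddings plus learned absolute positional embeddings."""

    def __init__(self, vocab_size: int, context_length: int, embedding_dim: int):
        super().__init__()
        self.token = nn.Embedding(vocab_size, embedding_dim)
        self.position = nn.Embedding(context_length, embedding_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids has shape (batch, sequence)
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        # (batch, seq, dim) + (seq, dim) broadcasts over the batch dimension
        return self.token(token_ids) + self.position(positions)


embedder = InputEmbedding(vocab_size=50257, context_length=4, embedding_dim=256)
batch = torch.randint(0, 50257, (8, 4))  # random stand-in for real token IDs
out = embedder(batch)
print(out.shape)
```

Building the position indices from the input's own sequence length lets the same module handle any batch whose sequences are no longer than `context_length`.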