In [15]:
# Executes all the code in main.ipynb, making its variables, functions, and classes available in the current notebook
%run main.ipynb

PyTorch version: 2.5.1
CUDA available: False
✅ All tests passed successfully!
✅ All positional encoding tests passed successfully!
tensor([[1.0000e+01],
        [2.0869e+01],
        [5.5094e+01],
        [1.9833e+02],
        [1.0751e+03],
        [1.0000e+04],
        [1.8968e+05],
        [9.2131e+06],
        [1.5473e+09],
        [1.3358e+12],
        [1.0000e+16],
        [1.2946e+21],
        [7.2048e+27]])
✅ Output shape test passed!
✅ Attention weights shape test passed!
✅ Softmax test passed!
✅ Deterministic output test passed!
✅ Masking test passed!
🎉 All ScaledDotProductAttention tests passed!
✅ Output shape test passed!
✅ Tensor type test passed!
✅ Deterministic output test passed!
✅ Masking test passed!
🎉 All MultiHeadAttention tests passed!
✅ Output shape test passed!
✅ Tensor type test passed!
✅ ReLU activation test passed!
✅ Deterministic output test passed!
✅ Gradient computation test passed!
🎉 All PositionwiseFeedForward tests passed!


The `TokenEmbedding` class is essentially a simple wrapper around PyTorch's `nn.Embedding` layer. Let's break down its role in the Transformer model and develop some intuition for why it's needed.

---

### **Why Do We Need Token Embeddings?**
Transformers operate on continuous-valued vectors, but raw text is represented as discrete token indices. The `TokenEmbedding` layer maps each token index (a unique integer in a vocabulary) to a learnable dense vector of size `d_model`. These embeddings serve as input features to the Transformer.
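
To make the index-to-vector step concrete, here is a minimal sketch. The toy vocabulary `vocab` and the sentence are hypothetical and only for illustration; a real pipeline would get its indices from a tokenizer.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; a real tokenizer would build this from data.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
tokens = ["the", "cat", "sat"]

# Raw text becomes discrete integer indices (dtype int64), not continuous features.
token_ids = torch.tensor([[vocab[t] for t in tokens]])  # shape (1, 3)

# The embedding layer maps each index to a learnable dense vector of size d_model.
d_model = 8
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
vectors = embedding(token_ids)  # shape (1, 3, d_model)

print(token_ids.dtype, tuple(token_ids.shape))
print(vectors.dtype, tuple(vectors.shape))
```

The integer tensor goes in, and a float tensor with one extra `d_model` axis comes out.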

---

### **Understanding the Components**
1. **`nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)`**
   - `num_embeddings=vocab_size`: The number of rows in the lookup table, one per unique token in the vocabulary. Valid token indices run from `0` to `vocab_size - 1`.
   - `embedding_dim=d_model`: Each token is mapped to a `d_model`-dimensional dense vector.

2. **`forward(x)`**
   - The input `x` is a tensor of shape `(batch_size, seq_len)`, where each value in `x` is an integer representing a token index.
   - The `nn.Embedding` layer takes these indices and outputs a tensor of shape `(batch_size, seq_len, d_model)`, where each token index is replaced with its corresponding learned embedding vector.

---

### **Intuition**
Think of an embedding layer as a lookup table:
- Suppose `vocab_size = 10,000` and `d_model = 512`.
- The embedding layer maintains a weight matrix of shape `(10,000, 512)`, where each row represents the learnable embedding vector for a token.
- When we input a sequence of token indices (e.g., `[2, 45, 398]`), the layer extracts the corresponding rows from the embedding matrix and returns them as a tensor.
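
The lookup-table view can be checked directly. The sketch below uses a plain `nn.Embedding` (which `TokenEmbedding` wraps) with the sizes from the bullets above, and compares the layer's output with manually indexing rows of its weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so repeated runs print the same values

layer = nn.Embedding(num_embeddings=10_000, embedding_dim=512)
idx = torch.tensor([2, 45, 398])

out = layer(idx)          # forward pass through the embedding layer
rows = layer.weight[idx]  # manual row lookup in the weight matrix

print(tuple(layer.weight.shape))  # shape of the lookup table
print(tuple(out.shape))           # one row per input index
print(torch.equal(out, rows))     # whether both routes give identical values
```

If the two results match, the forward pass is nothing more than row selection.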

---

### **Example Usage**
```python
vocab_size = 10000  # Example vocab size
d_model = 512  # Embedding dimension

embedding_layer = TokenEmbedding(vocab_size, d_model)

# Example input: batch of 2 sequences of length 4
x = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])

embedded_x = embedding_layer(x)
print(embedded_x.shape)  # Expected output: torch.Size([2, 4, 512])
```

---

### **Why Not One-Hot Encoding?**
Instead of learning embeddings, we could represent tokens using one-hot encoding. However, one-hot encoding has limitations:
- Each vector would have `vocab_size` entries, all but one of them zero, which wastes memory and compute for large vocabularies.
- One-hot vectors don't capture semantic relationships between words: every pair of distinct tokens is orthogonal and equally far apart, so "king" is no closer to "queen" than to any other word.

By learning embeddings, the model can discover useful relationships between words in a lower-dimensional space.
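
The two representations are closely related: multiplying a one-hot vector by the embedding matrix selects exactly one row. A small sketch, assuming the toy sizes `vocab_size = 20` and `d_model = 5`, shows that the lookup gives the same result without ever building the large one-hot tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 20, 5
layer = nn.Embedding(vocab_size, d_model)
idx = torch.tensor([1, 5, 10])

# One-hot route: build a (3, vocab_size) matrix and multiply by the weights.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
via_matmul = one_hot @ layer.weight  # shape (3, d_model)

# Embedding route: select the rows directly.
via_lookup = layer(idx)

print(torch.allclose(via_matmul, via_lookup))

# Distinct one-hot vectors share no non-zero position, so their dot product is zero.
print((one_hot[0] @ one_hot[1]).item())
```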

Playing around with `TokenEmbedding` in a Jupyter Notebook is a good way to build intuition about how embeddings transform discrete tokens into dense vector representations. Here’s a step-by-step approach to explore it interactively:

---

## **1. Setup**
First, ensure that PyTorch is installed:
```python
!pip install torch
```

Then, import necessary libraries:
```python
import torch
import torch.nn as nn
```

---

## **2. Define `TokenEmbedding`**
We’ll define the `TokenEmbedding` class as a thin wrapper around `nn.Embedding`:
```python
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        """
        Initializes the embedding layer.

        Args:
            vocab_size (int): Number of unique tokens in the vocabulary.
            d_model (int): Dimension of the embedding vectors.
        """
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)  

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for token embedding.

        Args:
            x (torch.Tensor): Tensor of shape (batch_size, seq_len) containing token indices.

        Returns:
            torch.Tensor: Tensor of shape (batch_size, seq_len, d_model) containing embedded representations.
        """
        return self.embedding(x)
```

---

## **3. Initialize the Embedding Layer**
Set up an example vocabulary size and embedding dimension:
```python
vocab_size = 20  # Suppose our vocab has 20 words
d_model = 5      # Each token gets a 5-dimensional embedding

embedding_layer = TokenEmbedding(vocab_size, d_model)
```

---

## **4. Pass Sample Input**
Create a batch of tokenized sequences and feed them into the embedding layer:
```python
# Example tokenized sequences (batch_size=2, seq_len=4)
x = torch.tensor([[1, 5, 10, 15], [2, 6, 11, 16]])

# Pass through the embedding layer
embedded_x = embedding_layer(x)

# Print shapes
print(f"Input shape: {x.shape}")  # (batch_size, seq_len)
print(f"Output shape: {embedded_x.shape}")  # (batch_size, seq_len, d_model)

# View actual embeddings
print("\nEmbedded representations:\n", embedded_x)
```

---

## **5. Visualizing the Embeddings**
### **Checking the Embedding Matrix**
Since `nn.Embedding` maintains a lookup table of size `(vocab_size, d_model)`, we can directly inspect it:
```python
print("Embedding matrix (weight parameters):\n")
print(embedding_layer.embedding.weight)  # Shows the entire embedding matrix
```

Each row corresponds to one token index in the vocabulary, and each column represents one of its embedding dimensions.
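
Because the matrix is a learnable parameter, it also receives gradients. The sketch below is a rough illustration with a plain `nn.Embedding` and a dummy loss (the sum of the outputs, chosen only for convenience). It lists which rows get a non-zero gradient after one backward pass.

```python
import torch
import torch.nn as nn

layer = nn.Embedding(num_embeddings=20, embedding_dim=5)
x = torch.tensor([[1, 5, 10, 15]])

# Dummy scalar loss: sum of all embedded values, then backpropagate.
layer(x).sum().backward()

grad = layer.weight.grad  # same shape as the embedding matrix: (20, 5)

# Indices of rows whose gradient is not all zeros.
touched = grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(layer.weight.numel())  # total number of learnable values
print(touched)               # rows that received a gradient
```

Only the rows for tokens that appeared in the input are updated by this step; the rest of the table is left untouched.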

---

## **6. Experimenting with Different Inputs**
Try passing:
- A single sequence instead of a batch.
- A single token index.
- A random tensor with invalid values (to see the error).
```python
# Single sequence
single_seq = torch.tensor([3, 7, 14])
print("\nSingle sequence embedding:\n", embedding_layer(single_seq))

# Single token index
single_token = torch.tensor([4])
print("\nSingle token embedding:\n", embedding_layer(single_token))
```
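
The third item in the list, invalid values, is not covered by the snippet above. A minimal sketch, assuming a CPU tensor and a plain `nn.Embedding` with the same toy sizes: an index equal to `vocab_size` is one past the last valid row, so the lookup fails. On a GPU the failure may surface differently.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 20, 5
layer = nn.Embedding(vocab_size, d_model)

# Valid indices are 0..19, so 20 is out of range.
bad = torch.tensor([3, 7, 20])

try:
    layer(bad)
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```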

---

### **What's Next?**
- Try increasing `d_model` to see how it affects embeddings.
- Use different token values to verify the lookup behavior.
- Add `PositionalEncoding` to see how positional information is incorporated.


In [16]:
vocab_size = 20  # Suppose our vocab has 20 words
d_model = 5      # Each token gets a 5-dimensional embedding

embedding_layer = TokenEmbedding(vocab_size, d_model)
embedding_layer

TokenEmbedding(
  (embedding): Embedding(20, 5)
)

In [17]:
# Example tokenized sequences (batch_size=2, seq_len=4)
x = torch.tensor([[1, 5, 10, 15], [2, 6, 11, 16]])

# Pass through the embedding layer
embedded_x = embedding_layer(x)

# Print shapes
print(f"Input shape: {x.shape}")  # (batch_size, seq_len)
print(f"Output shape: {embedded_x.shape}")  # (batch_size, seq_len, d_model)

# View actual embeddings
print("\nEmbedded representations:\n", embedded_x)


Input shape: torch.Size([2, 4])
Output shape: torch.Size([2, 4, 5])

Embedded representations:
 tensor([[[ 0.5315, -0.8051,  0.0987,  0.8236, -1.0337],
         [ 0.2961, -1.8903,  0.3494,  0.5855, -0.2111],
         [-3.0105,  1.2251, -0.9130,  0.6408, -0.3248],
         [-0.0534, -0.8222,  0.1004,  1.3967,  1.2444]],

        [[ 2.3621, -0.4427,  1.5551,  2.0475,  0.0160],
         [ 0.6351, -2.1685, -0.9456,  1.4489, -0.2879],
         [-0.5526,  1.2502, -0.8365, -0.5334,  0.3436],
         [ 0.6687, -0.9704, -0.7501, -0.0662, -0.3389]]],
       grad_fn=<EmbeddingBackward0>)


In [18]:
print("Embedding matrix (weight parameters):\n")
print(embedding_layer.embedding.weight)  # Shows the entire embedding matrix


Embedding matrix (weight parameters):

Parameter containing:
tensor([[-0.1952, -0.5487, -0.0722,  0.2438, -1.1966],
        [ 0.5315, -0.8051,  0.0987,  0.8236, -1.0337],
        [ 2.3621, -0.4427,  1.5551,  2.0475,  0.0160],
        [ 0.4441, -0.1668, -0.8881,  0.6682, -0.3665],
        [-0.1184, -0.8980,  0.8748, -0.0442, -2.3425],
        [ 0.2961, -1.8903,  0.3494,  0.5855, -0.2111],
        [ 0.6351, -2.1685, -0.9456,  1.4489, -0.2879],
        [-0.8877, -1.0203, -0.0877, -0.0692,  0.7135],
        [-0.4515, -1.2082,  0.7028,  0.0785,  1.3558],
        [-0.0853,  1.0107,  0.4004, -0.9564,  0.7705],
        [-3.0105,  1.2251, -0.9130,  0.6408, -0.3248],
        [-0.5526,  1.2502, -0.8365, -0.5334,  0.3436],
        [-0.3987,  0.0290, -1.8351, -1.4687,  0.2974],
        [-0.4602, -0.8119,  0.2518, -0.3838,  0.1372],
        [-0.0932,  0.3097, -1.1411,  1.1009, -0.7020],
        [-0.0534, -0.8222,  0.1004,  1.3967,  1.2444],
        [ 0.6687, -0.9704, -0.7501, -0.0662, -0.3389],
    

In [19]:
# Single sequence
single_seq = torch.tensor([3, 7, 14])
print("\nSingle sequence embedding:\n", embedding_layer(single_seq))

# Single token index
single_token = torch.tensor([4])
print("\nSingle token embedding:\n", embedding_layer(single_token))



Single sequence embedding:
 tensor([[ 0.4441, -0.1668, -0.8881,  0.6682, -0.3665],
        [-0.8877, -1.0203, -0.0877, -0.0692,  0.7135],
        [-0.0932,  0.3097, -1.1411,  1.1009, -0.7020]],
       grad_fn=<EmbeddingBackward0>)

Single token embedding:
 tensor([[-0.1184, -0.8980,  0.8748, -0.0442, -2.3425]],
       grad_fn=<EmbeddingBackward0>)
