# Lesson 5: Building the Language Model

Welcome to Lesson 5! In this lesson, we’re going to put together everything you’ve learned so far to build a complete language model. By the end of this lesson, you’ll understand how to assemble multiple transformer blocks into a functioning model that can process and generate text.

## Recap of Previous Lessons
- **Lesson 1**: Introduction to LLMs and tokenization.
- **Lesson 2**: Embeddings and how they represent tokens.
- **Lesson 3**: Single-head attention mechanism.
- **Lesson 4**: Multi-head attention and transformer blocks.

Now, we’ll take those transformer blocks and stack them to create a language model capable of understanding and generating text, using our familiar Chinese example: *法國紅酒慢煮阿根廷牛舌 配 煙肉洋蔥炒著仔*.

## What You’ll Learn
- Why we need positional encoding and how it works.
- How to stack multiple transformer blocks to build a deeper model.
- How to add an output layer for generating predictions.
- The basics of training a language model.
- A practical example using mock data inspired by our Chinese text.

Let’s dive in!

In [1]:
import torch
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

## Understanding Positional Encoding

### Why Do We Need Positional Encoding?
Imagine you’re reading *法國紅酒慢煮阿根廷牛舌*. The order of words matters: “France red wine slow-cook Argentina beef tongue” makes sense, but “beef tongue Argentina slow-cook red wine France” is confusing! Unlike older models such as RNNs, which read tokens one at a time, transformers process all tokens at once (in parallel). That is fast, but self-attention on its own has no notion of order: shuffle the input tokens and each token’s output stays the same, just shuffled the same way. **Positional encoding** fixes this by adding information about each token’s position in the sequence.

### How Does It Work?
Positional encoding adds a unique “fingerprint” to each token’s embedding based on its position (e.g., 1st, 2nd, 3rd). In our code, we use a mathematical trick with **sine and cosine functions**:
- For each position (e.g., 0 for `法國`, 1 for `紅酒`), we create a vector of the same size as the token embedding (e.g., 64 numbers).
- The vector’s values alternate between sine and cosine. Each pair of dimensions has its own frequency, and the position determines where along each wave the value is read. This creates a unique pattern for every spot in the sequence.
- We add this positional vector to the token’s embedding, so the model knows both *what* the token is (from the embedding) and *where* it is (from the positional encoding).

### Example
- Token `法國` at position 0 might get embedding `[0.1, -0.2, 0.5, ...]`.
- Positional encoding for position 0 is `[0.0, 1.0, 0.0, ...]`, because sin(0) = 0 and cos(0) = 1.
- Combined: `[0.1, 0.8, 0.5, ...]`. The embedding now carries a marker for the first position.
- `紅酒` at position 1 gets a different positional vector, like `[0.84, 0.54, ...]`, shifting its embedding.
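
The numbers in this example can be reproduced with a few lines of PyTorch. This is a minimal sketch: `sinusoidal_encoding` is a hypothetical helper written for illustration, `embed_dim=4` is chosen only to keep the output readable, and the embedding values are made up.

```python
import math
import torch

def sinusoidal_encoding(position, embed_dim):
    """Return the sine/cosine positional vector for a single position."""
    # One frequency per pair of dimensions: 1 / 10000^(2i / embed_dim)
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * -(math.log(10000.0) / embed_dim))
    vec = torch.zeros(embed_dim)
    vec[0::2] = torch.sin(position * div_term)  # even dimensions
    vec[1::2] = torch.cos(position * div_term)  # odd dimensions
    return vec

pe0 = sinusoidal_encoding(0, 4)  # sin(0) = 0 and cos(0) = 1 in every pair
pe1 = sinusoidal_encoding(1, 4)  # starts with sin(1) ≈ 0.84 and cos(1) ≈ 0.54

embedding = torch.tensor([0.1, -0.2, 0.5, 0.3])  # made-up embedding for 法國
combined = embedding + pe0

print("position 0:", pe0)
print("position 1:", pe1)
print("combined:  ", combined)
```

The first two values of `pe1` match the `[0.84, 0.54, ...]` vector quoted above; the later values change much more slowly from one position to the next.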

### Why Sine and Cosine?
- These functions create smooth, repeating patterns that the model can learn.
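- They come from a fixed formula rather than learned weights, so a vector can be computed for any position. Our code precomputes them up to `max_seq_length`.
- The range of frequencies (high for early dimensions, low for later ones) lets the model distinguish nearby vs. far-away positions: fast waves separate neighboring positions, slow waves separate distant ones.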

In short, positional encoding is like giving each word a seat number at a table—it helps the transformer know who’s sitting where!
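
One way to see the "nearby vs. far-away" property is to compare positional vectors with a dot product. The sketch below is illustrative only: `build_pe` is a hypothetical helper that mirrors the sine/cosine recipe, and the positions compared are arbitrary choices.

```python
import math
import torch

def build_pe(max_len, embed_dim):
    """Build a [max_len, embed_dim] table of sine/cosine positional vectors."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * -(math.log(10000.0) / embed_dim))
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = build_pe(max_len=50, embed_dim=64)

sim_near = torch.dot(pe[5], pe[6])            # positions 1 step apart
sim_far = torch.dot(pe[5], pe[25])            # positions 20 steps apart
sim_near_shifted = torch.dot(pe[30], pe[31])  # also 1 step apart, further along

print("1 step apart:           ", sim_near.item())
print("20 steps apart:         ", sim_far.item())
print("1 step apart (shifted): ", sim_near_shifted.item())
```

Neighboring positions score higher than distant ones, and the score depends only on the gap between the two positions, not on where in the sequence they sit.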

In [2]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_seq_length):
        super().__init__()
        # Create a fixed positional encoding matrix
        pe = torch.zeros(max_seq_length, embed_dim)
        position = torch.arange(0, max_seq_length).unsqueeze(1)  # [max_seq_length, 1]
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * -(np.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(position * div_term)  # Even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd indices
        self.register_buffer('pe', pe)  # Store it as a buffer (not trainable)

    def forward(self, x):
        # Add positional encoding to the input embeddings
        seq_len = x.size(1)
        x = x + self.pe[:seq_len, :]
        return x

In [3]:
# Transformer components from Lesson 4, simplified with PyTorch's built-in attention
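class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        x = x.transpose(0, 1)  # [seq_len, batch_size, embed_dim]
        seq_len = x.size(0)
        # Causal mask: True above the diagonal hides future tokens from each position
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_output, attn_weights = self.attention(x, x, x, attn_mask=causal_mask)
        attn_output = attn_output.transpose(0, 1)  # [batch_size, seq_len, embed_dim]
        return attn_output, attn_weights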

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden_dim, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_output, attn_weights = self.attention(x)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x, attn_weights

In [4]:
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, ff_hidden_dim, max_seq_length, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = PositionalEncoding(embed_dim, max_seq_length)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_hidden_dim, dropout)
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)  # [batch_size, seq_length] -> [batch_size, seq_length, embed_dim]
        x = self.positional_encoding(x)
        for block in self.transformer_blocks:
            x, _ = block(x)  # Pass through each transformer block
        logits = self.output_layer(x)  # [batch_size, seq_length, vocab_size]
        return logits

## Is This Like LLaMA 2, Just Smaller?

You might wonder if this model is a mini version of LLaMA 2. The short answer is **yes, but with simplifications**. Let’s compare:

### Similarities to LLaMA 2
- **Transformer-Based**: Both use stacked transformer blocks with multi-head attention, feed-forward networks, and layer normalization.
- **Positional Encoding**: Both models inject position information. LLaMA 2 uses a variant called RoPE (Rotary Position Embedding), while we use the classic sine/cosine method.
- **Decoder-Only**: Our model follows the decoder-only layout of LLaMA 2: it predicts the next token from the previous ones, which suits text generation. This relies on causal masking, so that each position attends only to itself and earlier positions.
- **Scalability**: The structure (embedding → positional encoding → transformer blocks → output) mirrors LLaMA 2’s core design, just with fewer layers and smaller sizes.
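
The "predict the next token based on previous ones" behavior depends on a causal attention mask. Here is a minimal sketch of what such a mask looks like with `nn.MultiheadAttention`; the tiny sizes and the random input are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, embed_dim, num_heads = 4, 8, 2

# Boolean mask: True above the diagonal means "this future position is hidden"
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)  # [batch_size, seq_len, embed_dim]
_, weights = attention(x, x, x, attn_mask=causal_mask)

print(causal_mask)
print(weights[0])  # row i has non-zero weights only in columns 0..i
```

Each row of the printed weights still sums to 1, but all of the weight sits on the current and earlier positions.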

### Differences from LLaMA 2
- **Size**: LLaMA 2 models range from 7B to 70B parameters (i.e., billions of weights), with `embed_dim` of 4096 or more, 32 or more layers, and 32 or more heads. Ours is tiny: with `embed_dim=64`, 2 layers, 4 heads, and the 1,000-token vocabulary used later in this lesson, it has roughly 230 thousand parameters.
- **Positional Encoding**: LLaMA 2 uses RoPE, which rotates the query and key vectors inside each attention layer according to position, so attention scores depend on relative position. We use the original Transformer’s sine/cosine method, added once to the embeddings.
- **Optimization**: LLaMA 2 includes refinements such as grouped-query attention (several query heads share one set of keys and values; used in its larger variants) and RMSNorm (a simplified variant of LayerNorm). We stick to basic multi-head attention and standard LayerNorm for simplicity.
- **Training Data**: LLaMA 2 was trained on a massive, diverse dataset (about 2 trillion tokens). Ours is a toy model with mock data.
- **Efficiency**: Models at LLaMA 2’s scale are typically run with optimized attention kernels (such as FlashAttention) and mixed precision for speed. We use PyTorch’s built-in `nn.MultiheadAttention` with default settings, which favors readability over speed.
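
To make the LayerNorm vs. RMSNorm difference concrete, here is a hedged sketch of the RMSNorm idea: scale each vector by its root-mean-square, without subtracting the mean. The class below is written for illustration and is not LLaMA 2's actual code; the `eps` value and tensor sizes are assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative RMSNorm: divide by the root-mean-square, then apply a learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-dimension scale

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

torch.manual_seed(0)
x = torch.randn(2, 3, 64)  # [batch_size, seq_len, embed_dim]

rms_out = RMSNorm(64)(x)
ln_out = nn.LayerNorm(64)(x)

# RMSNorm fixes the scale of each vector; LayerNorm also centers it at zero
print("RMSNorm output RMS:   ", rms_out.pow(2).mean(dim=-1).sqrt()[0])
print("LayerNorm output mean:", ln_out.mean(dim=-1)[0])
```

Dropping the mean subtraction and the bias term makes RMSNorm slightly cheaper than LayerNorm while keeping activations at a stable scale.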

### Verdict
This is a **scaled-down, simplified cousin** of LLaMA 2. It captures the core ideas—transformer blocks, attention, and next-token prediction—but skips the advanced optimizations and massive scale that make LLaMA 2 a state-of-the-art model. Think of ours as a learning tool: it’s the same family, just small enough to fit in your hands!
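
To put a number on "small", the sketch below counts trainable parameters for a model of this shape. It is an approximation by construction: it uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for our transformer block (same attention, feed-forward, and LayerNorm sizes), and the hyperparameters are the ones assumed in this lesson's practical example.

```python
import torch.nn as nn

# Hyperparameters assumed from this lesson's practical example
vocab_size, embed_dim, num_heads, num_layers, ff_hidden_dim = 1000, 64, 4, 2, 256

def count(module):
    """Number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())

embedding = nn.Embedding(vocab_size, embed_dim)
blocks = nn.ModuleList([
    # Stand-in block with the same layer sizes as our TransformerBlock
    nn.TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward=ff_hidden_dim)
    for _ in range(num_layers)
])
output_layer = nn.Linear(embed_dim, vocab_size)

total = count(embedding) + count(blocks) + count(output_layer)
print(f"embedding:    {count(embedding):,}")
print(f"blocks:       {count(blocks):,}")
print(f"output layer: {count(output_layer):,}")
print(f"total:        {total:,}")
```

The total lands in the low hundreds of thousands, several orders of magnitude below LLaMA 2's 7B, and more than half of it sits in the embedding and output layers rather than in the transformer blocks.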

## Input and Output

### Input
- A batch of token indices (e.g., `[0, 1, 2]` for `法國`, `紅酒`, `慢煮`).
- Shape: `[batch_size, seq_length]`.

### Processing
- **Embedding**: Turns indices into vectors (e.g., `[batch_size, seq_length, embed_dim]`).
- **Positional Encoding**: Adds position info.
- **Transformer Blocks**: Update token representations with attention and feed-forward layers.
- **Output Layer**: Produces, at each position, logits over the vocabulary for the token that should come next.

### Output
- Logits: Scores for each possible next token (e.g., `[batch_size, seq_length, vocab_size]`).
- For training, we predict the next token: input `法國紅酒慢煮` → target `紅酒慢煮阿根廷`.
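
The input/target shift can be written as two slices of the same token sequence. This is a small sketch with a toy four-token vocabulary; the token-to-index mapping is an assumption made up for the example, not the output of a real tokenizer.

```python
import torch

tokens = ["法國", "紅酒", "慢煮", "阿根廷"]
vocab = {token: index for index, token in enumerate(tokens)}  # toy vocabulary (assumption)
ids = torch.tensor([[vocab[t] for t in tokens]])  # [batch_size=1, 4]

inputs = ids[:, :-1]   # everything except the last token
targets = ids[:, 1:]   # the same sequence shifted left by one

id_to_token = {index: token for token, index in vocab.items()}
for src, tgt in zip(inputs[0].tolist(), targets[0].tolist()):
    print(f"{id_to_token[src]} -> {id_to_token[tgt]}")
```

Each input position is paired with the token that follows it, so one sequence of length 4 yields three training examples.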

## Training the Language Model

Training teaches the model to predict the next token accurately. Here’s how:

- **Loss Function**: Cross-entropy loss compares the model’s predictions (logits) to the true next tokens.
- **Optimizer**: Adam adjusts the model’s weights to reduce the loss.
- **Training Loop**: Repeatedly process batches of data, calculate loss, and update weights.

Sample training loop (conceptual):
```python
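model = LanguageModel(...)  # pass your hyperparameters here
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for inputs, targets in data_loader:  # both [batch_size, seq_length]
        optimizer.zero_grad()  # clear gradients from the previous step
        logits = model(inputs)  # [batch_size, seq_length, vocab_size]
        # CrossEntropyLoss expects [N, vocab_size] logits and [N] targets, so flatten both
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()  # compute gradients
        optimizer.step()  # update weights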
```

This loop refines the model over time, making it better at guessing what comes next!
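
For a version of this loop that runs end to end in a second or two, here is a hedged sketch on toy data. The model is a deliberately simple stand-in (an embedding plus a linear layer, no attention), and the data, sizes, learning rate, and step count are all assumptions chosen so the loss visibly drops.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 8

# Toy data (assumption): every token is followed by (token + 1) % vocab_size
tokens = (torch.arange(0, 17) % vocab_size).unsqueeze(0)  # [1, 17]
inputs, targets = tokens[:, :-1], tokens[:, 1:]           # shifted by one

# Stand-in model (assumption): predicts the next token from the current one only
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    logits = model(inputs)  # [1, 16, vocab_size]
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Swapping the stand-in for `LanguageModel` and the toy tensor for real tokenized batches gives the same loop structure as the conceptual version above.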

In [5]:
# Practical Example with Mock Data

# Step 1: Mock Data (simulating tokenized Chinese text)
vocab_size = 1000  # Imagine 1000 unique tokens like 法國, 紅酒, etc.
seq_length = 10    # Short sequence length
batch_size = 2     # Two example sentences

# Random token indices (pretend these are from our sentence)
tokens = torch.randint(0, vocab_size, (batch_size, seq_length + 1))
inputs = tokens[:, :-1]   # all tokens except the last
targets = tokens[:, 1:]   # same tokens shifted left by one for next-token prediction

# Step 2: Define the Model
embed_dim = 64
num_heads = 4
num_layers = 2
ff_hidden_dim = 256
max_seq_length = 50

model = LanguageModel(vocab_size, embed_dim, num_heads, num_layers, ff_hidden_dim, max_seq_length)

# Step 3: Forward Pass
logits = model(inputs)
print("Logits shape:", logits.shape)  # [2, 10, 1000]

# Note: In a real setup, you’d use a tokenizer on 'data.csv' and train with actual data.

Logits shape: torch.Size([2, 10, 1000])
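
To read a prediction off a logits tensor of this shape, take the scores at the last position and turn them into probabilities. The sketch below uses random numbers as a stand-in for model output, so the "prediction" is meaningless; it only shows the mechanics and the shapes.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 10, 1000)  # stand-in for model output: [batch_size, seq_length, vocab_size]

last_logits = logits[:, -1, :]              # scores for the token after the sequence: [2, 1000]
probs = torch.softmax(last_logits, dim=-1)  # probabilities over the vocabulary
next_token = probs.argmax(dim=-1)           # most likely next token per sequence: [2]

print("probs shape:", probs.shape)
print("next_token shape:", next_token.shape)
```

With an untrained model the probabilities are close to uniform noise; after training, the same three lines pick the model's best guess for the next token.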


## Exercises

1. **Change Layers**: Set `num_layers` to 1 or 4. Does the output shape change? Why or why not?
2. **Adjust Embedding Size**: Try `embed_dim = 32` or `128`. Run the model and check the logits shape.
3. **Visualize Positional Encoding**: Create a small `PositionalEncoding` instance (e.g., `embed_dim=4`, `max_seq_length=5`) and print `pe` to see the patterns.
4. **Attention Weights**: Modify `LanguageModel.forward` to return attention weights from the last block. Plot them for one input sequence.

Experiment below!

In [None]:
# Your exercise code here



## Summary

Great job! You’ve built a language model from scratch. Here’s what we covered:
- **Positional Encoding**: Adds order to tokens using sine/cosine patterns.
- **Stacking Blocks**: Stacked transformer blocks so that each layer refines the token representations produced by the one before it.
- **Output Layer**: Predicts the next token with logits.
- **Training Basics**: Learned how loss and optimization work.
- **Comparison to LLaMA 2**: Saw how our model is a simplified version of advanced LLMs.

Next, in Lesson 6, we’ll fine-tune this model and generate text with it. You’re almost ready to create your own mini-LLaMA!