# Lab Project: Deconstructing the Transformer

Related video: https://www.youtube.com/watch?v=kCc8FmEb1nY

**"Attention is All You Need" - From Scratch**

- **Duration:** 2 Weeks
   
- **Tools:** Python, PyTorch (recommended) or TensorFlow/JAX
   
- **Dataset:** ["Tiny Shakespeare"](https://huggingface.co/datasets/karpathy/tiny_shakespeare) (Character-level text generation)
   


## 1. Project Overview

The goal of this lab is not to use a pre-built library like Hugging Face `transformers` to fine-tune a model. Instead, you will implement the Transformer architecture **layer-by-layer** using basic tensor operations.

By the end of this assignment, you will have a working **Decoder-Only Transformer** (a mini-GPT) capable of generating Shakespearean-style text. You will understand the exact flow of gradients through Self-Attention, Layer Normalization, and Residual Connections.


```python
import requests
```

## 2. The Dataset & Preprocessing

We will use the "Tiny Shakespeare" dataset. It is small, trains quickly on a CPU/low-end GPU, and allows for immediate visual verification (i.e., does the output look like English?).

**Task 0: Setup**

1. Download `input.txt` (Tiny Shakespeare).
   
2. Create a tokenizer: Build a dictionary mapping unique characters to integers (encoding) and integers back to characters (decoding).
   
3. Create a PyTorch `Dataset` or data loader that serves batches of context blocks (e.g., block size of 32 or 64 characters).


```python
input_file = 'input.txt'
file_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

print(f'Downloading file {input_file} from {file_url}...')
response = requests.get(file_url)
response.raise_for_status()  # fail early if the download did not succeed
# Write and read with the same explicit encoding so the text round-trips on every platform
with open(input_file, 'w', encoding='utf-8') as f:
    f.write(response.text)

with open(input_file, 'r', encoding='utf-8') as f:
    data = f.read()
```

```
Downloading file input.txt from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt...
```


```python
print(f'Data length: {len(data)}')
```

```
Data length: 1115394
```
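
Steps 2 and 3 of Task 0 can be sketched as follows. This is a minimal illustration, not a reference solution: the names `stoi`, `itos`, `encode`, `decode`, and `get_batch` are hypothetical, and a short stand-in string replaces the contents of `input.txt` so the snippet runs on its own.

```python
import torch

text = "hello world"  # stand-in for the contents of input.txt

# Character-level vocabulary: one integer id per unique character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


data_ids = torch.tensor(encode(text), dtype=torch.long)


def get_batch(data_ids, block_size, batch_size):
    """Sample random context blocks x and their next-character targets y."""
    starts = torch.randint(len(data_ids) - block_size, (batch_size,))
    x = torch.stack([data_ids[s : s + block_size] for s in starts])
    y = torch.stack([data_ids[s + 1 : s + block_size + 1] for s in starts])
    return x, y


xb, yb = get_batch(data_ids, block_size=4, batch_size=2)
```

The targets `yb` are the inputs `xb` shifted by one character, so every position in a block provides a training example. Wrapping the same logic in a `torch.utils.data.Dataset` is an equally valid choice.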


## 3. Implementation Milestones

You must implement the architecture in an object-oriented fashion. Do not use `torch.nn.Transformer` or `torch.nn.MultiheadAttention`. You must build these classes yourself using `torch.nn.Linear`, `torch.matmul`, etc.

### Part I: Positional Embeddings

Transformers process tokens in parallel, meaning they have no inherent sense of order. You must inject this information.

- **Requirement:** Implement the sinusoidal positional encoding as described in the original paper _Attention is All You Need_.
   
- **Formula:**
   
    $$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
   
    $$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
   
- **Deliverable:** A class `PositionalEncoding(d_model, max_len)` and a plot visualizing the embeddings (heatmap).
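
One way to compute the table defined by the two formulas above is sketched below. The function name `sinusoidal_table` is hypothetical, the sketch assumes an even `d_model`, and it computes $1 / 10000^{2i/d_{model}}$ through `exp` and `log`, which is a common numerical choice rather than something the assignment requires.

```python
import math
import torch


def sinusoidal_table(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal encodings (d_model even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # the values 2i
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)     # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)  # even columns: sine
    pe[:, 1::2] = torch.cos(pos * inv_freq)  # odd columns: cosine
    return pe


pe = sinusoidal_table(max_len=64, d_model=32)
```

Inside a `PositionalEncoding` module, a table like this would typically be stored with `register_buffer` and sliced to the current sequence length before being added to the token embeddings. The same tensor can be passed to a plotting library to produce the heatmap.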

### Part II: Scaled Dot-Product Attention (The Core)

This is the mathematical engine of the Transformer.

- **Requirement:** Implement a function that takes Query ($Q$), Key ($K$), and Value ($V$) matrices and computes:
   
    $$Attention(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
   
- **Critical Detail:** You must implement a **Mask**. Since this is a decoder-only model for text generation, the model cannot "see the future." You must apply a lower-triangular mask (setting upper values to $-\infty$) before the softmax so that position $t$ can only attend to positions $0$ through $t$.
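
The formula and the causal mask can be combined as in the sketch below. The function name `causal_attention` is hypothetical, and the mask is rebuilt on every call for clarity; in a module you would more likely precompute it once.

```python
import math
import torch
import torch.nn.functional as F


def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask; q, k, v are (..., T, d_k)."""
    T, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., T, T)
    # True on and below the diagonal: position t may attend to positions 0..t.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
q = k = v = torch.randn(1, 5, 8)
out, weights = causal_attention(q, k, v)
```

A quick sanity check is that the first output position equals the first value vector, since position 0 can only attend to itself.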


### Part III: Multi-Head Attention (MHA)

Single-head attention captures one type of relationship. MHA allows the model to focus on different positions jointly from different representation subspaces.

- **Requirement:** Create a `MultiHeadAttention` class.
   
    1. Linear projections for $Q$, $K$, and $V$.
       
    2. Split the heads (reshape the tensors).
       
    3. Apply Scaled Dot-Product Attention (from Part II).
       
    4. Concatenate heads and apply a final linear projection.
       
- **Code Hint:** Be careful with tensor shapes.
   
    - Input: `(Batch, Time, Channels)`
       
    - Reshape to: `(Batch, Time, Heads, Head_Dim)`
       
    - Transpose for matmul: `(Batch, Heads, Time, Head_Dim)`
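
The shape changes in the hint above can be checked in isolation. The sketch below uses small made-up sizes and only demonstrates splitting and re-merging the heads, not the attention computation itself.

```python
import torch

B, T, C, num_heads = 2, 5, 12, 3
head_dim = C // num_heads  # C must be divisible by num_heads

x = torch.randn(B, T, C)

# (B, T, C) -> (B, T, H, head_dim) -> (B, H, T, head_dim)
heads = x.view(B, T, num_heads, head_dim).transpose(1, 2)

# Inverse: (B, H, T, head_dim) -> (B, T, H, head_dim) -> (B, T, C).
# transpose returns a non-contiguous view, so .contiguous() is needed before .view.
merged = heads.transpose(1, 2).contiguous().view(B, T, C)
```

Head `h` sees channels `h * head_dim` through `(h + 1) * head_dim - 1` of each token, and merging restores the original layout exactly.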


### Part IV: The Transformer Block

Assemble the components into a repeatable layer.

- **Requirement:** Create a `Block` class containing:
   
    1. **Layer Normalization:** Applied _before_ the sub-layer (Pre-Norm formulation is generally more stable than Post-Norm).
       
    2. **Multi-Head Attention:** Your class from Part III.
       
    3. **Feed-Forward Network:** A simple MLP that expands the dimension by 4x and projects it back (`d_model` -> `4*d_model` -> `d_model`), with a ReLU or GELU activation in between.
       
    4. **Residual Connections:** $x + Sublayer(Norm(x))$.
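
The Pre-Norm residual pattern $x + Sublayer(Norm(x))$ can be sketched with the feed-forward network as the sub-layer. The class names `FeedForward` and `PreNormResidual` are hypothetical, and a full `Block` would apply the same pattern twice, first around your attention module and then around the MLP.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise MLP: d_model -> 4*d_model -> d_model."""

    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class PreNormResidual(nn.Module):
    """Wrap a sub-layer as x + sublayer(LayerNorm(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
d_model = 16
ffn_block = PreNormResidual(d_model, FeedForward(d_model))
x = torch.randn(2, 5, d_model)
y = ffn_block(x)  # same shape as x
```

Because the residual path carries `x` through unchanged, the output shape always matches the input shape, which is what lets you stack $N$ blocks.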

## 4. Assembly and Training

### The Model

Create a `GPTLanguageModel` class that stacks:

1. Token Embeddings + Positional Encodings.
   
2. $N$ layers of your `Block` (Try $N=4$ to $6$).
   
3. Final Layer Norm.
   
4. Final Linear Head (projecting to vocabulary size).

### Training Loop

- **Hyperparameters:**
   
    - Batch size: 32 or 64
       
    - Block size (context length): 128 or 256
       
    - Embedding dimension ($d_{model}$): 384
       
    - Heads: 6
       
    - Learning Rate: 3e-4 (use AdamW optimizer)
       
- **Metric:** Calculate Cross Entropy Loss.
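
A single optimization step with these settings might look like the sketch below. It is an illustration under assumptions: an `nn.Embedding` stands in for your `GPTLanguageModel`, the batch is random integers rather than real text, and `vocab_size = 65` is an assumed value that should be replaced by the size of your own character vocabulary.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, vocab_size = 4, 8, 65  # vocab_size is assumed; use the size of your tokenizer's vocabulary

# Stand-in model: maps each token id directly to a row of logits.
model = nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

xb = torch.randint(vocab_size, (B, T))  # stand-in inputs
yb = torch.randint(vocab_size, (B, T))  # stand-in next-character targets

logits = model(xb)  # (B, T, vocab_size)
# cross_entropy expects (N, classes) logits and (N,) targets, so flatten batch and time.
loss = F.cross_entropy(logits.view(B * T, vocab_size), yb.view(B * T))

optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()

# Reference point: a model that predicts a uniform distribution scores ln(vocab_size).
uniform_loss = F.cross_entropy(torch.zeros(B * T, vocab_size), yb.view(B * T))
```

The uniform-prediction loss, $\ln(\text{vocab\_size})$, is a useful baseline: training is working if the loss on real data drops clearly below it.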


### Generation Function

Implement a `generate` function.

- Take a starting context (e.g., a single character).
   
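- Pass the context, cropped to the last `block_size` tokens, through the model to get logits.
   
- Take the logits at the last time step and apply softmax to get a probability distribution over the next character.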
   
- Sample from the distribution (`torch.multinomial`).
   
- Append the new character and repeat.
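
The sampling loop described above can be sketched as a standalone function. This is one possible shape for it, with assumptions: the model is taken to map `(B, T)` token ids to `(B, T, vocab_size)` logits, `block_size` is passed in explicitly, and an `nn.Embedding` serves as a toy stand-in model so the snippet runs without your full implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressively extend idx of shape (B, T) by max_new_tokens token ids."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                    # keep at most block_size tokens
        logits = model(idx_cond)                           # (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)        # next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)             # append and continue
    return idx


vocab_size = 10
toy_model = nn.Embedding(vocab_size, vocab_size)  # stand-in for the real model
start = torch.zeros((1, 1), dtype=torch.long)     # a single starting token
out = generate(toy_model, start, max_new_tokens=20, block_size=8)
```

Cropping the context matters because the causal mask and the positional encodings are only defined up to the maximum context length. The resulting ids can then be turned back into text with your decoding function.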

### Analysis Questions (Include in Report)

1. **Scaling:** Why do we divide by $\sqrt{d_k}$ in the attention formula? What happens to the Softmax gradients if we don't?
   
2. **Positional Encoding:** Why do we add positional encodings to the embeddings rather than concatenating them?
   
3. **Complexity:** What is the Big-O time complexity of the Self-Attention mechanism with respect to the sequence length $T$? Why is this a problem for very long texts?

## 5. Starter Code Snippet (Helper)

Here is the signature for your Multi-Head Attention to get you started:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embd, block_size, dropout=0.2):
        super().__init__()
        # One projection each for keys, queries, and values, covering all heads at once
        self.key = nn.Linear(n_embd, head_size * num_heads, bias=False)
        self.query = nn.Linear(n_embd, head_size * num_heads, bias=False)
        self.value = nn.Linear(n_embd, head_size * num_heads, bias=False)
        # Register the causal mask as a buffer so it moves with the module but is not a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        # Output projection: maps the concatenated heads back to n_embd
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # B: Batch, T: Time (Sequence Length), C: Channels (Embed size)
        B, T, C = x.shape
       
        # Implementation goes here...
        pass
```