# LLM From Scratch

This is a notebook I'm using to re-create the GPT-2 style architecture from the book "Build a Large Language Model (From Scratch)."
I'm trying to do as much as possible from memory, other than having some notes on what classes and methods to implement.

**Required classes and config:**
1. LayerNorm
2. GELU
3. FeedForward
4. MultiHeadAttention
5. TransformerBlock
6. GPT_CONFIG_124M
7. GPTModel

In [2]:
# Import torch and nn.Module for class definitions
import torch
import torch.nn as nn

## 1. LayerNorm

This class is responsible for layer normalization, which takes place _multiple times_ in the GPT architecture.
Its purpose is to keep activations on a consistent scale from layer to layer, which stabilizes training and helps avoid vanishing and exploding gradients.
The concrete goal is to adjust the outputs to have a mean of zero and a variance of one.

To accomplish this, we need two values:
- the mean: µ = (x_1 + x_2 + ... + x_n) / n
- the variance: v = [(x_1 - µ)² + (x_2 - µ)² + ... + (x_n - µ)²] / n

The normalized vector is then: [(x_1 - µ)/√v, (x_2 - µ)/√v, ..., (x_n - µ)/√v]

NOTE: we divide by n (for the mean and the variance) and by √v (for the normalization), so we need to make sure neither is ever zero. We know that n (the embedding dimension) will never be zero, but the variance is zero whenever all the inputs are equal. For that reason, we add a minuscule value epsilon to the variance before taking the square root.
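
The normalization described above can be sketched in plain Python (no PyTorch) so each step is visible. This is only an illustration: the function name `layer_normalize` and the sample values are assumptions made for this example, and the learnable scale and shift are left out.

```python
import math

def layer_normalize(values, epsilon=1e-5):
    """Illustrative helper: normalize a list of floats to ~zero mean, ~unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Population (biased) variance: divide by n, not n - 1
    variance = sum((x - mean) ** 2 for x in values) / n
    # epsilon keeps the denominator nonzero when every value is the same
    return [(x - mean) / math.sqrt(variance + epsilon) for x in values]

normalized = layer_normalize([1.0, 2.0, 3.0, 4.0])
constant = layer_normalize([5.0, 5.0, 5.0])  # variance is 0 for this input

print(normalized)
print(constant)
```

For the input `[1.0, 2.0, 3.0, 4.0]`, the mean is 2.5 and the variance is 1.25, so the output has a mean of zero and a variance just under one (epsilon shrinks it slightly). The constant input comes out as all zeros instead of raising a division error.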

In [1]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim: int):
        super().__init__()
        self.emb_dim = emb_dim
        self.epsilon = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
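    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics are taken over the last (embedding) dimension
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance (divide by n); epsilon guards against division by zero
        variance = x.var(dim=-1, keepdim=True, unbiased=False) + self.epsilon
        norm = (x - mean) / torch.sqrt(variance)
        # Learnable per-feature scale and shift
        return self.scale * norm + self.shift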

## 2. GELU

GELU, or Gaussian Error Linear Unit, is the activation function we'll be using. It's similar to ReLU, but it's smooth and differentiable everywhere (even at zero, where ReLU is continuous but has a sharp corner). GELU is also slightly negative for negative inputs, most noticeably between about -3 and 0, rather than flatly zero like ReLU. Negative inputs therefore still produce a small nonzero gradient, which gives the network more signal to train on.

The exact GELU is x · Φ(x), where Φ is the cumulative distribution function of the standard Gaussian. Φ has no closed form in elementary functions, so we'll use a very close tanh-based approximation here instead.
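
To get a feel for how close the approximation is, here is a small sketch in plain Python that compares it against the exact definition, x · Φ(x) with Φ the standard normal CDF, computed through `math.erf`. The function names `gelu_exact` and `gelu_tanh` and the sampling grid are assumptions made for this example; the tanh formula is the same one the `GELU` class uses.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written in terms of the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Sample the interval [-5, 5] in steps of 0.1
points = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in points)

print(f"largest gap on [-5, 5]: {max_gap:.2e}")
print(f"gelu_exact(-1.0) = {gelu_exact(-1.0):.4f}")  # negative, unlike ReLU
```

On this grid the two versions differ by less than 1e-3, which is why the approximation is a reasonable stand-in here.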

In [None]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

## 3. FeedForward
To be implemented.

In [None]:
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()

## 4. MultiHeadAttention
To be implemented.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super().__init__()

## 5. TransformerBlock
To be implemented.

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self):
        super().__init__()

## 6. GPT_CONFIG_124M
The configuration parameters for our GPT-2 implementation. These come directly from the book.

In [None]:
from typing import TypedDict

class GPTConfigDict(TypedDict):
    vocab_size: int        # the number of tokens in the vocabulary
    context_length: int    # the maximum number of token vectors to consider at once
    emb_dim: int           # the width of the token vectors
    n_heads: int           # the number of heads to use for multi-head attention
    n_layers: int          # the number of transformer layers to use
    drop_rate: float       # the dropout probability (0.1 drops 10% of activations)
    qkv_bias: bool         # whether the query, key, and value projections include a bias term

GPT_CONFIG_124M: GPTConfigDict = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
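
A few of these values constrain each other, so a quick sanity check can catch typos before any model code runs. The sketch below is an assumption-laden illustration: `validate_config` is a hypothetical helper (not from the book), and it assumes that multi-head attention will split `emb_dim` evenly across `n_heads`, which is the usual design but is not implemented in this notebook yet.

```python
def validate_config(cfg: dict) -> int:
    """Hypothetical helper: sanity-check a config and return the per-head width."""
    # Assumption: each attention head gets an equal slice of the embedding
    if cfg["emb_dim"] % cfg["n_heads"] != 0:
        raise ValueError("emb_dim must be divisible by n_heads")
    # drop_rate is a probability, not a percentage
    if not 0.0 <= cfg["drop_rate"] < 1.0:
        raise ValueError("drop_rate must be in the range [0, 1)")
    return cfg["emb_dim"] // cfg["n_heads"]

# Copy of the relevant values so this snippet runs on its own
config = {"emb_dim": 768, "n_heads": 12, "drop_rate": 0.1}
head_dim = validate_config(config)
print(head_dim)  # 768 // 12 = 64
```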

## 7. GPTModel
Top-level GPT-2 model class.

In [None]:
class GPTModel(nn.Module):
    """
    Top-level GPT-2 model.
    """
    # cfg: GPTConfigDict -> GPTModel
    def __init__(self, cfg: GPTConfigDict):
        """Initialize model with config."""
        super().__init__()
        pass

    # in_idx: torch.Tensor -> logits: torch.Tensor
    def forward(self, in_idx: torch.Tensor) -> torch.Tensor:
        """Forward pass: input indices to logits."""
        raise NotImplementedError("GPTModel.forward is not implemented yet")