# LLM From Scratch

This is a notebook I'm using to re-create the GPT-2 style architecture from the book "Build a Large Language Model (From Scratch)."
I'm trying to do as much as possible from memory, other than having some notes on what classes and methods to implement.

**Required components:**
1. `LayerNorm`
2. `GELU`
3. `GPT_CONFIG_124M`
4. `FeedForward`
5. `MultiHeadAttention`
6. `TransformerBlock`
7. `GPTModel`

In [1]:
# Import torch and nn.Module for class definitions
import torch
import torch.nn as nn

## 1. LayerNorm

This class is responsible for layer normalization, which takes place _multiple times_ in the GPT architecture.
Its purpose is to keep activations, and with them gradient magnitudes, within a stable range, to avoid the problems of vanishing gradients and exploding gradients.
The concrete goal is to adjust the outputs to have a mean of zero and a variance of one.

To accomplish this, we need two values:
- the mean: $\mu = \frac{x_1 + x_2 + ... + x_n}{n}$
- the variance: $v = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2}{n} + \epsilon$

The normalized vector is then: $[\frac{x_1 - \mu}{\sqrt{v}}, \frac{x_2 - \mu}{\sqrt{v}}, ..., \frac{x_n - \mu}{\sqrt{v}}]$

NOTE: we're dividing by both n and $\sqrt{v}$ and we need to make sure we never divide by zero. We know that n (the embedding dimension) will never be zero, but the variance could be (when every value in the vector is identical). For that reason, we add a minuscule value epsilon to the variance.
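
To make the arithmetic concrete, here is a small worked example on a hand-picked vector. It is only a sketch: the input values are made up, and `torch.nn.functional.layer_norm` is used purely as a cross-check for the manual calculation.

```python
import torch
import torch.nn.functional as F

# One "token" with four features, chosen by hand for easy arithmetic.
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
eps = 1e-5

mean = x.mean(dim=-1, keepdim=True)                     # 2.5
variance = x.var(dim=-1, keepdim=True, unbiased=False)  # 1.25 (divide by n, not n - 1)
normalized = (x - mean) / torch.sqrt(variance + eps)

print(normalized)
print(normalized.mean().item(), normalized.var(unbiased=False).item())

# Cross-check against PyTorch's built-in functional layer norm (no scale or shift).
reference = F.layer_norm(x, normalized_shape=(4,), eps=eps)
print(torch.allclose(normalized, reference, atol=1e-6))
```

The resulting variance comes out a hair under one rather than exactly one, because epsilon is included in the denominator.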

In [2]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim: int):
        super().__init__()
        self.emb_dim = emb_dim
        self.epsilon = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        variance = x.var(dim=-1, keepdim=True, unbiased=False) + self.epsilon
        norm = (x - mean) / torch.sqrt(variance)
        return self.scale * norm + self.shift

## 2. GELU

GELU, or Gaussian Error Linear Unit, is the activation function we'll be using. It's similar to ReLU, but it's smooth everywhere (even at zero, where ReLU has a sharp corner and is not differentiable). GELU is also slightly negative for negative inputs, most visibly between about -3 and 0, rather than flatly zero like ReLU. This provides a richer range of values for the network to train on.

The exact GELU is $x \cdot \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard Gaussian. Calculating that for real would take us out of closed-form math, so we'll use a very close tanh-based approximation here instead.
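
To get a feel for how close the approximation is, here is a quick comparison using only the standard library. This is a sketch: the function names `gelu_exact` and `gelu_tanh` and the sampling grid are my own choices for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with the Gaussian CDF written in terms of the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # The tanh-based approximation, with the same constants as the GELU class.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]  # -5.00 to 5.00 in steps of 0.01
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(f"largest gap on the grid: {max_gap:.6f}")
print(f"gelu(-1.0) = {gelu_exact(-1.0):.4f}")  # slightly negative, unlike ReLU
```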

In [3]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 0.5 * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

## 3. GPT_CONFIG_124M
The configuration parameters for our GPT-2 implementation. These come directly from the book.

In [4]:
from typing import TypedDict

class GPTConfigDict(TypedDict):
    vocab_size: int        # the number of tokens in the vocabulary
    context_length: int    # the maximum number of token vectors to consider at once
    emb_dim: int           # the width of the token vectors
    n_heads: int           # the number of heads to use for multi-head attention
    n_layers: int          # the number of transformer layers to use
    drop_rate: float       # the dropout probability (0.1 means 10% of values are zeroed)
    qkv_bias: bool         # whether the Q, K, and V projections include a bias term

GPT_CONFIG_124M: GPTConfigDict = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

## 4. FeedForward

The feed-forward network (or multi-layer perceptron) is the fundamental neural network used in the GPT model.
It expands the number of outputs in a hidden layer before shrinking back down to the original size for the output.
This allows the network to explore a richer space, while preserving the input and output dimensions to keep the overall architecture simple.

In this example, we'll expand the dimensions by a factor of 4 for the internal layer. I would normally say that should be configurable, but the book just has it fixed at 4. Anyway, that means that each 768-dimensional token vector will expand to 3,072 dimensions, then shrink back down to 768 for output.

### How many layers?

If you look at a diagram of a feed-forward network, you'll see three layers:
1. a left-most layer with n units
2. a middle layer with n*4 units (or some other factor)
3. a right-most layer with n units again.

However, if you look at the implementation below, it kind of seems like there are only two linear layers.
Well, as you might guess, the middle layer is really the connection between the first and the second `nn.Linear`.
The first has `dim_internal` outputs, and the second has `dim_internal` inputs. These represent overlapping,
connected points, just as you might see in the diagram.

You could think about it like this: each `nn.Linear` has two sides, and of the four total sides there are two that overlap in the center. Thus you get three layers!
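
Here's a quick shape check of that idea. It's a sketch with the two linear layers pulled apart so the "middle layer" is visible as a tensor; `nn.GELU()` is PyTorch's built-in activation, standing in for the custom class, and the batch and token counts are arbitrary.

```python
import torch
import torch.nn as nn

emb_dim, expansion = 768, 4  # values taken from the text above

expand = nn.Linear(emb_dim, expansion * emb_dim)    # 768 -> 3072
contract = nn.Linear(expansion * emb_dim, emb_dim)  # 3072 -> 768
activation = nn.GELU()  # built-in GELU, standing in for the custom class

x = torch.randn(2, 3, emb_dim)   # [batch, num_tokens, emb_dim]
hidden = activation(expand(x))   # the "middle layer" of the diagram
out = contract(hidden)

print(x.shape, hidden.shape, out.shape)
```

The linear layers only act on the last dimension, so every token in every batch is transformed independently.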

In [5]:
class FeedForward(nn.Module):
    def __init__(self, cfg: GPTConfigDict): 
        super().__init__()
        expansion_factor = 4
        dim_external = cfg["emb_dim"]
        dim_internal = expansion_factor * dim_external
        self.layers = nn.Sequential(
            nn.Linear(dim_external, dim_internal),
            GELU(),
            nn.Linear(dim_internal, dim_external),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

## 5. MultiHeadAttention

This is the heart of what makes GPT different to earlier language models. The attention mechanism tweaks context vectors in response to earlier tokens in the sequence, shifting their "meaning" to become much richer and more specific than a single word could be.

### Motivating Examples

For example, take the sentence "the cat sat on the mat because it was warm." The word "it" has one particular vector embedding in the vocabulary, which might relate loosely to concepts like "noun" and "non-human." That's not enough to capture the meaning of "it" in this sentence, where it most likely refers to "mat." Attention allows the system to change the vector for the "it" token to resemble the vector for "mat," clarifying its meaning in the context of the sentence.

That's about the simplest possible example, but in reality each token is pushed and pulled in much more subtle ways by many more tokens in the sequence, so that by the end it somehow represents the meaning of the entire sequence of text. Ultimately, the attention-modulated vector of the final token in the sequence is _the only input needed_ to predict the next token. That's pretty wild.

For a more contrived example of what this means, take another example sequence: "This gritty, romantic, windswept, ornate, melancholic city is none other than". The word "than" has nothing to do with any particular city or place, but by the time its vector is modulated by this long series of words preceding it, it will be something that appears close (in embedding space) to cities like Lisbon and Istanbul. Indeed, those are the two most likely predictions for the final word in the sequence from GPT-3.

### Implementation

Multi-head attention was first described in "Attention is All You Need" (2017), in sections 3.2.1 (scaled dot-product attention) and 3.2.2 (extending to multiple heads). I'll be using that paper as a reference for the following two sections.

#### Scaled Dot-Product Attention

Each attention head is an instance of something called "scaled dot-product attention," which is given by:

$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$

That is, the attention output for matrices Q, K, and V is the result of applying softmax to Q times K-transpose, divided by the square root of the Key dimension $d_k$, all multiplied by V.

I'll try to break that down a bit more:
- Q, K, and V are matrices produced by multiplying the input token vectors by trainable weight matrices. In the single-head case they have the same width as the token embedding vectors. The names are short for Query, Key, and Value.
  - I think of the Query as representing what a token is "looking for" to know if another token is worth attending to.
  - To continue that metaphor, the Key is what other tokens "look like" to the Query.
  - The Value is the real identity of the tokens that are found, their deeper reality beneath the appearance presented by the Key.
  - To sum up, a token's Query is used to examine every other token's Key to see if it's a good match. If it is, that token's Value gets a large weight in the output.
- Multiplying Q by the transpose of K gives us the dot product of every Query row against every Key row. In other words, it tells us how aligned every Query is with every Key.
- We scale that by the inverse square root of the Key dimensions to counteract a known issue with dot-product attention: "for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients." ("Attention is All You Need," p. 4). In other words, the dot product of two rows is going to tend to get larger the more columns you have, and these large values make it hard for training to adjust weights effectively. Scaling by the square root of the number of columns helps to solve this.
- Applying softmax turns these scaled dot products into weights.
- Multiplying the weights by V gives, for each Query, a weighted average of the Value rows.

Note: it's not described in detail in the paper, but there's an important step carried out here called masking. Essentially, we only want Queries to find Keys that _precede_ them in the sequence. We accomplish this by zeroing out values above the main diagonal. To make sure that these values are zero _after_ softmax, we first set them to minus-infinity.
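
The steps above, including the mask, can be sketched for a single head in a few lines. The Q, K, and V matrices here are random stand-ins and the sizes are made up; the point is only to show the shapes and what the mask does to the weights.

```python
import math
import torch

torch.manual_seed(0)
num_tokens, d_k = 4, 8

# Random stand-ins for the Query, Key, and Value rows of one short sequence.
q = torch.randn(num_tokens, d_k)
k = torch.randn(num_tokens, d_k)
v = torch.randn(num_tokens, d_k)

scores = q @ k.T / math.sqrt(d_k)  # [num_tokens, num_tokens]

# Causal mask: True above the main diagonal, i.e. wherever the Key comes later.
future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # each row sums to 1
context = weights @ v                    # [num_tokens, d_k]

print(weights)
```

The printed matrix is lower triangular: the first token can only attend to itself, so its row is a single 1 followed by zeros, and its context vector is just its own Value.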

#### Multi-Head Attention

In single-headed dot-product attention, Q, K, and V all have the same width as the input and output embeddings. To use multiple heads, we split that width evenly across the heads, run scaled dot-product attention on each slice independently, and concatenate the results. This gives the same overall dimensions, but each head works with its own set of columns and can learn to attend to a different kind of relationship:

$\text{MultiHead}(Q, K, V) = \text{Concat}(head_1, ..., head_h)W^O$

$\text{ where } head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

$\text{ where } W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, $W^O \in \mathbb{R}^{hd_v \times d_{model}}$
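
In code, the "split into heads" and "concatenate" steps usually come down to reshaping rather than literal concatenation. Here is a small sketch with made-up sizes, showing that each head really does see its own block of columns and that the merge step undoes the split exactly.

```python
import torch

batch, num_tokens, d_out, num_heads = 2, 5, 12, 3
head_width = d_out // num_heads  # 4 columns per head

# Sequential values make it easy to see which columns end up where.
x = torch.arange(batch * num_tokens * d_out, dtype=torch.float32)
x = x.view(batch, num_tokens, d_out)

# Split: [batch, num_tokens, d_out] -> [batch, num_heads, num_tokens, head_width]
heads = x.view(batch, num_tokens, num_heads, head_width).transpose(1, 2)

# Head 1 of the first sequence sees columns 4..7 of every token.
same_slice = torch.equal(heads[0, 1], x[0, :, 4:8])

# Merge: undo the transpose, then flatten the last two dimensions back into d_out.
merged = heads.transpose(1, 2).contiguous().view(batch, num_tokens, d_out)

print(heads.shape, same_slice, torch.equal(merged, x))
```

The `.contiguous()` call is needed because `transpose` only changes how the tensor is indexed, and `view` requires the underlying memory to be laid out in order.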


In [6]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in: int, d_out: int, context_length: int, dropout: float, num_heads: int, qkv_bias: bool=False):
        super().__init__()
        if d_out % num_heads != 0:
            raise ValueError("The number of heads must evenly divide d_out.")
        self.d_in = d_in
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_width = d_out // num_heads
        self.qkv_bias = qkv_bias

        # construct the weights for Q, K, and V.
        # these will be registered as trainable parameters automatically.
        self.w_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.w_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.w_value = nn.Linear(d_in, d_out, bias=qkv_bias)

        # and the output projection, also trainable.
        self.w_out = nn.Linear(d_out, d_out)
        
        # and the dropout layer. not trainable, just drops random values
        # to zero with a probability determined by the dropout parameter
        self.dropout = nn.Dropout(dropout)

        # and the mask, which prevents each token from "seeing" later ones
        mask = torch.triu( # an upper triangular matrix
            torch.ones(context_length, context_length), # consisting of ones
            diagonal=1, # starting one row above the diagonal, leaving the diagonal itself as zeroes.
        )
        self.register_buffer("mask", mask) # register this tensor as non-trainable, but keep it on the same device
        self.mask: torch.Tensor # to make the type-checker happy

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, num_tokens, d_in = x.shape
        queries = self.w_query(x)
        keys = self.w_key(x)
        values = self.w_value(x)

        # Split the last dimension of the tensors into multiple heads
        q_heads = queries.view(batch, num_tokens, self.num_heads, self.head_width)
        k_heads = keys.view(batch, num_tokens, self.num_heads, self.head_width)
        v_heads = values.view(batch, num_tokens, self.num_heads, self.head_width)

        #                                  [  0  ,     1     ,    2     ,      3    ]
        # {q,k,v}_heads now have the shape [batch, num_tokens, num_heads, head_width],
        # but we want them to be:          [batch, num_heads, num_tokens, head_width]
        q_heads = q_heads.transpose(1, 2)
        k_heads = k_heads.transpose(1, 2)
        v_heads = v_heads.transpose(1, 2)

        # now we need to calculate the raw dot-product attention scores between Q and K^T,
        # where K^T has the shape [batch, num_heads, head_width, num_tokens].
        # that gives attention_scores the shape [batch, num_heads, num_tokens, num_tokens]
        attention_scores = q_heads @ k_heads.transpose(2, 3)
        # and apply the causal mask
        mask = self.mask[:num_tokens, :num_tokens]
        attention_scores = attention_scores.masked_fill(mask == 1, float('-inf'))

        # and we construct the weights using softmax on the scaled final dimension
        attention_weights = torch.softmax(attention_scores / self.head_width**0.5, dim=-1)
        # and apply dropout
        attention_weights = self.dropout(attention_weights)

        #                                 [  0  ,     1    ,     2     ,     3     ]
        # attention_weights has the shape [batch, num_heads, num_tokens, num_tokens]
        # v_heads has the shape:          [batch, num_heads, num_tokens, head_width]
        # if we multiply them, we get:    [batch, num_heads, num_tokens, head_width]
        # but in the end, we want:        [batch, num_tokens, d_out]
        context = attention_weights @ v_heads # [batch, num_heads, num_tokens, head_width]

        # so we need to first transpose and get [batch, num_tokens, num_heads, head_width]
        context = context.transpose(1, 2)
        # and then concatenate the last two dimensions together to get d_out
        context = context.contiguous().view(batch, num_tokens, self.d_out)
        # and multiply by the output projection
        return self.w_out(context)

## 6. TransformerBlock

This version of the transformer block is loosely based on "Attention is All You Need" section 3, but includes _only_ the decoder stack. The encoder stack is omitted from the GPT architecture, and thus from the Build a Large Language Model (From Scratch) book.

The transformer block goes a little something like this:
```
Input -> LayerNorm 1 -> MultiHeadAttention -> Dropout -> (+) -> LayerNorm 2 -> FeedForward -> Dropout -> (+) -> Output
```

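Where `(+)` represents a shortcut connection: the input to that half of the block is added back to its output. This gives the signal (and, during training, the gradient) a direct path through the network, so it doesn't fade away as it passes through many layers.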
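
To see why the shortcut helps, here is a sketch of the extreme case where the sublayer contributes nothing at all. The names `with_shortcut` and `sublayer`, the small width, and the use of `nn.Linear` and `nn.LayerNorm` as stand-ins are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim = 16  # small hypothetical width, just for the demo

norm = nn.LayerNorm(emb_dim)
sublayer = nn.Linear(emb_dim, emb_dim)  # stand-in for attention or feed-forward
dropout = nn.Dropout(0.1)

# Extreme case: silence the sublayer completely by zeroing its parameters.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

def with_shortcut(x: torch.Tensor) -> torch.Tensor:
    # LayerNorm -> sublayer -> Dropout -> (+)
    return x + dropout(sublayer(norm(x)))

x = torch.randn(2, 3, emb_dim, requires_grad=True)
out = with_shortcut(x)
out.sum().backward()

print(torch.equal(out, x))                            # the input passes straight through
print(torch.allclose(x.grad, torch.ones_like(x)))     # and so does the gradient
```

Without the `x +` term, both the output and the gradient would be zero in this situation, and nothing upstream could learn.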

As far as requirements:
- I've already implemented the LayerNorm, MultiHeadAttention, and FeedForward classes.
- `nn.Dropout` is provided by PyTorch.
- Shortcut connections just use ordinary variables and addition.

So we're all set to put these elements together below.

In [7]:
class TransformerBlock(nn.Module):
    """
    A single GPT-2 transformer block.
    """
    def __init__(self, cfg: GPTConfigDict):
        super().__init__()
        self.layer_norm_1 = LayerNorm(cfg["emb_dim"])
        self.attention = MultiHeadAttention(
            cfg["emb_dim"],
            cfg["emb_dim"],
            cfg["context_length"],
            cfg["drop_rate"],
            cfg["n_heads"],
            cfg["qkv_bias"],
        )
        self.drop_rate = cfg["drop_rate"]
        self.layer_norm_2 = LayerNorm(cfg["emb_dim"])
        self.feedforward = FeedForward(cfg)
        self.dropout = nn.Dropout(self.drop_rate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.layer_norm_1(x)
        x = self.attention(x)
        x = self.dropout(x)
        x = x + shortcut

        shortcut = x
        x = self.layer_norm_2(x)
        x = self.feedforward(x)
        x = self.dropout(x)
        x = x + shortcut
        return x

## 7. GPTModel

This is the big one, where everything comes together.
The hard parts are all pretty much done; this is going to be just a bit more glue.

The flow here goes like:
```
Tokenized Text -> Token Embedding Layer -> Positional Embedding Layer -> Dropout -> TransformerBlocks -> LayerNorm -> Output
```

Or, in detail:
1. Tokenized Text: the tokenizer is outside of this module; we'll get to that later.
2. Token Embedding Layer: this is a trainable `nn.Embedding` layer that starts out with random weights. It maps tokens to the embedding space.
3. Positional Embedding Layer: very similar to the Token Embedding Layer, but encodes positional information rather than "semantic" content. Its output is added to the token embeddings rather than replacing them.
4. Dropout: provided by `nn.Dropout` with a configurable drop rate.
5. TransformerBlocks: implemented above. We'll have a number of these set by config, and they run in serial.
6. LayerNorm: also implemented above. This normalizes each token vector to a mean of 0 and a variance of 1 (before the learned scale and shift are applied). It does not clamp values to a fixed range.
7. Output: the outputs are called "logits," and they are unnormalized scores for how likely each token ID is to come next (softmax turns them into probabilities). In order to project these from the previous LayerNorm, we'll need a linear layer whose weights have the size $\text{emb\_dim} \times \text{vocab\_size}$.
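
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the config. This is plain arithmetic, assuming the layer shapes described in this notebook (biases on every linear layer except Q, K, V and the output head).

```python
# Hyperparameters copied from GPT_CONFIG_124M above.
vocab_size, context_length, emb_dim, n_layers = 50257, 1024, 768, 12

token_embedding = vocab_size * emb_dim
positional_embedding = context_length * emb_dim

layer_norm = 2 * emb_dim                            # scale + shift
attention = 3 * emb_dim * emb_dim                   # Q, K, V projections (qkv_bias=False)
attention += emb_dim * emb_dim + emb_dim            # output projection, with bias
feed_forward = emb_dim * 4 * emb_dim + 4 * emb_dim  # expanding layer, with bias
feed_forward += 4 * emb_dim * emb_dim + emb_dim     # contracting layer, with bias
block = 2 * layer_norm + attention + feed_forward

output_head = emb_dim * vocab_size                  # bias=False

total = (token_embedding + positional_embedding
         + n_layers * block + layer_norm + output_head)

print(f"{total:,}")                # 163,009,536
print(f"{total - output_head:,}")  # 124,412,160
```

The "124M" in the config name matches the second figure, which is what you get when the output layer reuses the token embedding weights instead of having its own. The model as written in this notebook keeps them separate, so it has the larger count.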

In [8]:
class GPTModel(nn.Module):
    """
    Top-level GPT-2 model.
    """
    def __init__(self, cfg: GPTConfigDict):
        """Initialize model with config."""
        super().__init__()
        self.token_embedding = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.positional_embedding = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.dropout = nn.Dropout(cfg["drop_rate"])
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.layer_norm = LayerNorm(cfg["emb_dim"])
        self.output = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx: torch.Tensor) -> torch.Tensor:
        """Forward pass: input indices to logits."""
        batch_size, sequence_length = in_idx.shape
        token_embeddings = self.token_embedding(in_idx)
        positional_embeddings = self.positional_embedding(
            # get the first N positional embeddings, where N is the sequence length
            torch.arange(sequence_length, device=in_idx.device)
        )

        x = token_embeddings + positional_embeddings
        x = self.dropout(x)
        x = self.transformer_blocks(x)
        x = self.layer_norm(x)
        logits = self.output(x)
        return logits

# Smoke Test

If everything above has worked, then we should be able to exactly replicate the results from the book as long as we use the same seed (123).

Use the `smoke_test` function below to get the predicted completion for a given prompt from the _untrained_ LLM.

Note: because the LLM is still untrained, the result will be total garbage.

In [9]:
import tiktoken

def generate_text_simple(model, idx, max_new_tokens, context_size):
    """
    A helper function used by smoke_test. It's easier to pass the prompt to smoke_test, rather than call this directly.
    """
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]
        probabilities = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probabilities, dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

def smoke_test(prompt):
    """
    Pass the prompt to the (untrained) GPT model with a manual seed. Should correspond to the expected output.
    """
    torch.manual_seed(123)
    tokenizer = tiktoken.get_encoding("gpt2")
    model = GPTModel(GPT_CONFIG_124M)
    encoded = tokenizer.encode(prompt)
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    model.eval()
    out = generate_text_simple(
        model,
        encoded_tensor,
        6,
        GPT_CONFIG_124M["context_length"]
    )
    decoded_text = tokenizer.decode(out.squeeze(0).tolist())
    print(decoded_text)

smoke_test("Hello, I am") # should output "Hello, I am Featureiman Byeswickattribute argue"

Hello, I am Featureiman Byeswickattribute argue
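
One small observation about the greedy step in `generate_text_simple`: softmax preserves the ordering of the logits, so taking the argmax of the probabilities picks the same token as taking the argmax of the raw logits. Here is a sketch with random logits and a tiny made-up vocabulary.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 10)  # [batch, vocab_size], with a tiny hypothetical vocabulary

probabilities = torch.softmax(logits, dim=-1)
from_probabilities = torch.argmax(probabilities, dim=-1, keepdim=True)
from_logits = torch.argmax(logits, dim=-1, keepdim=True)

print(from_probabilities.tolist(), from_logits.tolist())
```

The softmax call is still useful for readability, and it becomes necessary once sampling replaces the argmax.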


# Training a smaller GPT-2

What follows is the code to train a version of this architecture. Because training is computationally expensive, I'm going to reduce the context length to a more manageable 256.

In [10]:
GPT_CONFIG_MINI: GPTConfigDict = {**GPT_CONFIG_124M, "context_length": 256} # 1024 is just too big to train locally

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_MINI)
model.eval(); # disables dropout

## Convenience Functions and Example

We're adding a few functions to make it easier to interact with the model. These might've been useful in the smoke test above, so maybe I'll refactor a bit later.

Also, another quick example of how they work.

In [11]:
def text_to_token_ids(text: str, tokenizer: tiktoken.Encoding) -> torch.Tensor:
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids: torch.Tensor, tokenizer: tiktoken.Encoding) -> str:
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())

# example:
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_MINI["context_length"],
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren


# Calculating Loss

This is a temporary helper to show the training and validation loss scores for a given corpus. It still only uses the untrained model, so the results are guaranteed to be garbage.
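
The loss itself is cross-entropy over every position in every sequence. Here is a sketch of what the `flatten` calls are doing, with tiny made-up sizes and random logits, checked against the same calculation done by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_tokens, vocab_size = 2, 3, 7  # tiny hypothetical sizes

logits = torch.randn(batch, num_tokens, vocab_size)
targets = torch.randint(0, vocab_size, (batch, num_tokens))

# cross_entropy wants [N, classes] and [N], so fold batch and position together.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# The same number by hand: mean negative log-probability of each correct token.
log_probs = torch.log_softmax(logits, dim=-1)
picked = log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
manual = -picked.mean()

print(loss.item(), manual.item())
```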

In [12]:
from torch.utils.data import Dataset, DataLoader
import time

class GPTDatasetV1(Dataset):
    def __init__(self, text: str, tokenizer: tiktoken.Encoding, max_length: int, stride: int):
        self.input_ids = []
        self.target_ids = []
        
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

        for i in range(0, len(token_ids) - max_length, stride):
            start = i
            end = start + max_length
            input_chunk = token_ids[start:end]
            target_chunk = token_ids[start+1:end+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self) -> int:
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

class LossCalculator:
    def __init__(self, cfg: GPTConfigDict, train_ratio:float=0.9):
        self.cfg = cfg
        self.model = GPTModel(self.cfg)
        self.tokenizer = tiktoken.get_encoding("gpt2")
        self.train_ratio = train_ratio
        self.device = self.get_device()

    def get_device(self) -> torch.device:
        if torch.cuda.is_available(): # type: ignore[attr-defined]
            return torch.device("cuda")
        elif torch.backends.mps.is_available(): # type: ignore[attr-defined]
            return torch.device("mps")
        else:
            return torch.device("cpu")

    def create_dataloader(self, text: str, batch_size:int=4, max_length:int=256, stride:int=128, shuffle:bool=True, drop_last:bool=True, num_workers:int=0) -> DataLoader:
        dataset = GPTDatasetV1(text, self.tokenizer, max_length, stride)
        return DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            drop_last=drop_last,
            num_workers=num_workers
        )

    def loss_for_batch(self, input_batch: torch.Tensor, target_batch: torch.Tensor):
        input_batch, target_batch = input_batch.to(self.device), target_batch.to(self.device)
        logits = self.model(input_batch)
        return nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

    def calc_loss_loader(self, data_loader, num_batches=None):
        total_loss = 0
        if len(data_loader) == 0:
            return float("nan")
        elif num_batches is None:
            num_batches = len(data_loader)
        else:
            num_batches = min(num_batches, len(data_loader))
        
        for i, (input_batch, target_batch) in enumerate(data_loader):
            if i < num_batches:
                loss = self.loss_for_batch(input_batch, target_batch)
                total_loss += loss.item()
            else:
                break
        return total_loss / num_batches
    
    def run(self, text: str, max_length:int=0, stride:int=0):
        split_idx = int(self.train_ratio * len(text))
        train_data = text[:split_idx]
        validation_data = text[split_idx:]
        torch.manual_seed(123)
        if stride == 0:
            stride = self.cfg["context_length"]
        if max_length == 0:
            max_length = self.cfg["context_length"]
        train_loader = self.create_dataloader(
            train_data,
            batch_size=2,
            max_length=max_length,
            stride=stride,
            drop_last=True,
            shuffle=True,
            num_workers=0,
        )
        validation_loader = self.create_dataloader(
            validation_data,
            batch_size=2,
            max_length=max_length,
            stride=stride,
            drop_last=True,
            shuffle=True,
            num_workers=0
        )
        self.model.eval()
        self.model.to(self.device)
        start = time.time()
        with torch.no_grad():
            training_loss = self.calc_loss_loader(train_loader)
            validation_loss = self.calc_loss_loader(validation_loader)
        elapsed = time.time() - start
        return {
            "training_loss": training_loss,
            "validation_loss": validation_loss,
            "device_type": self.device.type,
            "time_seconds": elapsed
        }


# Example of Loss

The `verdict_loss()` function is basically another smoke test. It loads the public-domain short story _The Verdict_ and passes it to the untrained model to calculate the loss metrics.

In [13]:
import os
import urllib.request

def verdict_loss():
    file_path = "the-verdict.txt"
    url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    text_data = ""
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode('utf-8')
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else:
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()
    lc = LossCalculator(GPT_CONFIG_MINI)
    return lc.run(text_data, max_length=256, stride=8)

verdict_loss()

{'training_loss': 10.98558121919632,
 'validation_loss': 10.998483152950511,
 'device_type': 'cuda',
 'time_seconds': 4.307438611984253}
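
Are those numbers reasonable? One way to check is to compare against a model that guesses uniformly at random. The sketch below is just arithmetic on the vocabulary size and the training loss printed above.

```python
import math

vocab_size = 50257  # from GPT_CONFIG_124M

# A model that assigns probability 1/vocab_size to every token has
# cross-entropy -ln(1/vocab_size) = ln(vocab_size) on any text.
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")

# Perplexity is exp(loss): roughly the effective number of choices per token.
observed_loss = 10.98558121919632  # training loss from the output above
print(f"observed perplexity: {math.exp(observed_loss):,.0f}")
```

The uniform baseline is about 10.82, so a loss near 11 is what an untrained model should produce. It lands slightly above the baseline because random initial weights don't give a perfectly uniform distribution.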

# Training a Model

The following class is nearly a copy of the LossCalculator class above, but it actually trains the model.
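
The piece that separates training from loss calculation is the optimizer step. Here is a minimal sketch of one step on a toy stand-in model. The `nn.Linear` model and the random data are assumptions for illustration; the AdamW settings match the ones used in the class below.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
model = nn.Linear(8, 4)  # hypothetical toy model standing in for GPTModel
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

inputs = torch.randn(16, 8)
targets = torch.randint(0, 4, (16,))
before = model.weight.detach().clone()

model.train()          # training mode (matters for dropout; a no-op for nn.Linear)
optimizer.zero_grad()  # clear gradients left over from the previous step
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()        # compute gradients of the loss for every parameter
optimizer.step()       # nudge the parameters using those gradients

print(loss.item(), (model.weight - before).abs().max().item())
```

A full training loop repeats these four calls for every batch, and periodically switches to `model.eval()` under `torch.no_grad()` to measure validation loss.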

In [137]:
from enum import Enum
import json

class DecodingStrategy(Enum):
    Greedy = 1
    Multinomial = 2
    TopK = 3

class TrainGPT:
    def __init__(self, cfg: GPTConfigDict, eval_frequency:int=5, train_ratio:float=0.9, force_cpu:bool=False):
        self.cfg = cfg
        self.tokenizer = tiktoken.get_encoding("gpt2")
        self.train_ratio = train_ratio
        self.force_cpu = force_cpu
        self.device = self.get_device()
        self.model = GPTModel(self.cfg).to(self.device)
        self.optimizer = torch.optim.AdamW( # type: ignore[attr-defined]
            self.model.parameters(),
            lr=0.0004,
            weight_decay=0.1,
        )
        self.eval_frequency = eval_frequency
        self.tokens_seen, self.global_step = 0, -1

    def save(self, name: str):
        torch.save({
            "model_state_dict": self.model.state_dict(),
            "optimizer_state_dict": self.optimizer.state_dict()
        },
        f"{name}.pth")

    def load(self, name: str):
        checkpoint = torch.load(f"{name}.pth", map_location=self.device)  # load onto the current device, even if saved elsewhere
        self.model.load_state_dict(checkpoint["model_state_dict"])
        self.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

    def get_device(self) -> torch.device:
        if self.force_cpu:
            return torch.device("cpu")
        if torch.cuda.is_available(): # type: ignore[attr-defined]
            return torch.device("cuda")
        elif torch.backends.mps.is_available(): # type: ignore[attr-defined]
            return torch.device("mps")
        else:
            return torch.device("cpu")

    def create_dataloader(self, text: str, batch_size:int=4, max_length:int=256, stride:int=128, shuffle:bool=True, drop_last:bool=True, num_workers:int=0) -> DataLoader:
        dataset = GPTDatasetV1(text, self.tokenizer, max_length, stride)
        return DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            drop_last=drop_last,
            num_workers=num_workers
        )

    def loss_for_batch(self, input_batch: torch.Tensor, target_batch: torch.Tensor) -> torch.Tensor:
        input_batch, target_batch = input_batch.to(self.device), target_batch.to(self.device)
        logits = self.model(input_batch)
        return nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten()).to(self.device)

    def calc_loss_loader(self, data_loader, num_batches=None) -> float:
        total_loss = 0
        if len(data_loader) == 0:
            return float("nan")
        elif num_batches is None:
            num_batches = len(data_loader)
        else:
            num_batches = min(num_batches, len(data_loader))
        
        for i, (input_batch, target_batch) in enumerate(data_loader):
            if i < num_batches:
                loss = self.loss_for_batch(input_batch, target_batch)
                total_loss += loss.item()
            else:
                break
        return total_loss / num_batches
    
    def loaders(self, text: str, max_length:int=0, stride:int=0) -> tuple[DataLoader, DataLoader]:
        split_idx = int(self.train_ratio * len(text))
        train_data = text[:split_idx]
        validation_data = text[split_idx:]
        torch.manual_seed(123)
        if stride == 0:
            stride = self.cfg["context_length"]
        if max_length == 0:
            max_length = self.cfg["context_length"]
        train_loader = self.create_dataloader(
            train_data,
            batch_size=2,
            max_length=max_length,
            stride=stride,
            drop_last=True,
            shuffle=True,
            num_workers=0,
        )
        validation_loader = self.create_dataloader(
            validation_data,
            batch_size=2,
            max_length=max_length,
            stride=stride,
            drop_last=True,
            shuffle=True,
            num_workers=0
        )
        return (train_loader, validation_loader)
    
    def evaluate(self, text: str, max_length:int=0, stride:int=0, epoch:int=0, prompt:str=""):
        train_loader, validation_loader = self.loaders(text, max_length, stride)
        self.model.eval()
        self.model.to(self.device)
        start = time.time()
        with torch.no_grad():
            training_loss = self.calc_loss_loader(train_loader)
            validation_loss = self.calc_loss_loader(validation_loader)
        elapsed = time.time() - start
        summary = {
            "training_loss": training_loss,
            "validation_loss": validation_loss,
            "device_type": self.device.type,
            "time_seconds": elapsed,
            "epoch": epoch
        }
        if len(prompt) > 0:
            example_output = self.prompt(prompt)
            summary["example_output"] = example_output
        return summary

    def choose(self, strategy: DecodingStrategy, logits: torch.Tensor, temperature:float=1.0, k:int=10):
        match strategy:
            case DecodingStrategy.Greedy:
                probabilities = torch.softmax(logits, dim=-1)
                result = torch.argmax(probabilities, dim=-1, keepdim=True)
                return result
            case DecodingStrategy.Multinomial:
                scaled = logits / temperature
                probabilities = torch.softmax(scaled, dim=-1)
                result = torch.multinomial(probabilities, num_samples=1)
                return result
            case DecodingStrategy.TopK:
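                # keep the k largest logits per row, along with their vocabulary indices
                top_logits, top_pos = torch.topk(logits, k)
                # start from all -inf, then write each kept logit back to its original
                # vocabulary index; softmax later turns every -inf into probability 0
                filtered = torch.full_like(logits, -torch.inf)
                filtered.scatter_(dim=1, index=top_pos, src=top_logits)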
                scaled = filtered / temperature
                probabilities = torch.softmax(scaled, dim=-1)
                if torch.any(torch.isnan(probabilities)) or torch.any(probabilities < 0):
                    print("Bad probabilities:", probabilities)
                    print("Logits:", logits)
                    raise ValueError("NaNs or invalid values in probabilities")
                return torch.multinomial(probabilities, num_samples=1)

    def generate_text_simple(self, token_ids: torch.Tensor, max_new_tokens, context_size):
        self.model.to(self.device)
        for _ in range(max_new_tokens):
            idx_cond = token_ids[:, -context_size:]
            with torch.no_grad():
                logits = self.model(idx_cond)
            logits = logits[:, -1, :]
            idx_next = self.choose(DecodingStrategy.TopK, logits, temperature=0.1)
            token_ids = torch.cat((token_ids, idx_next), dim=1)
        return token_ids

    def prompt(self, text: str, max_tokens:int=10) -> str:
        encoded = self.tokenizer.encode(text)
        encoded_tensor = torch.tensor(encoded).unsqueeze(0).to(self.device)
        self.model.eval()
        self.model.to(self.device)
        out = self.generate_text_simple(
            encoded_tensor,
            max_tokens,
            self.cfg["context_length"],
        )
        decoded_text = self.tokenizer.decode(out.squeeze(0).tolist())
        return decoded_text

    def generate_and_print_sample(self, prompt: str) -> None:
        print(self.prompt(prompt))

    def train_loader(self, training_loader: DataLoader, epochs:int=1):
        torch.manual_seed(123)
        self.model.to(self.device)
        self.tokens_seen = 0

        for epoch in range(epochs):
            self.model.train()
            for input_batch, target_batch in training_loader:
                self.optimizer.zero_grad()
                loss = self.loss_for_batch(
                    input_batch, target_batch
                )
                loss.backward()
                self.optimizer.step()
                self.tokens_seen += input_batch.numel()
                self.global_step += 1


    def train(self, text: str, max_length:int=0, stride:int=0, epochs:int=10, prompt:str="Hello, I am "):
        torch.manual_seed(123)
        loss_summaries = []
        training_loader, _ = self.loaders(text, max_length, stride)
        self.model.to(self.device)
        self.tokens_seen = 0

        for epoch in range(epochs):
            self.model.train()
            for input_batch, target_batch in training_loader:
                self.optimizer.zero_grad()
                loss = self.loss_for_batch(
                    input_batch, target_batch
                )
                loss.backward()
                self.optimizer.step()
                self.tokens_seen += input_batch.numel()
                self.global_step += 1

                if self.global_step % self.eval_frequency == 0:
                    summary = self.evaluate(text, max_length, stride, epoch, prompt)
                    summary["tokens_seen"] = self.tokens_seen
                    print(summary)
                    loss_summaries.append(summary)
        
        self.generate_and_print_sample(prompt)
        return loss_summaries
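
The `TopK` branch of `choose` is easier to follow on a tiny example. Here's a minimal plain-Python sketch of the same steps: keep the k largest logits, mask the rest to negative infinity, divide by the temperature, then softmax. The name `top_k_probabilities` and the sample logits are made up for illustration; the real method works on batched tensors.

```python
import math

def top_k_probabilities(logits: list[float], k: int, temperature: float = 1.0) -> list[float]:
    """Plain-Python illustration of top-k filtering plus temperature-scaled softmax."""
    # indices of the k largest logits
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    keep = set(top_idx)
    # everything outside the top k becomes -inf, which softmax maps to probability 0
    filtered = [logits[i] if i in keep else -math.inf for i in range(len(logits))]
    scaled = [v / temperature for v in filtered]
    # subtract the max before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = top_k_probabilities([2.0, 1.0, 0.5, -1.0], k=2, temperature=1.0)
sharp = top_k_probabilities([2.0, 1.0, 0.5, -1.0], k=2, temperature=0.1)
```

Lowering the temperature pushes almost all of the probability onto the largest logit, which is why `temperature=0.1` in `generate_text_simple` behaves close to greedy decoding.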

In [65]:
def verdict_train():
    file_path = "the-verdict.txt"
    url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    text_data = ""
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode('utf-8')
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else:
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()
    trainable_model = TrainGPT(GPT_CONFIG_MINI, force_cpu=False, eval_frequency=5)
    return trainable_model.train(text_data, epochs=10, prompt="Every effort moves you")

# verdict_train() # uncomment to see the results of training this LLM on "The Verdict"

# Training on Project Gutenberg

To see how far I can take this, I'm going to try to train a model on more and more text. I don't really know how this is going to go, and I'm way beyond either the book or the lectures now.

In [66]:
class BigDatasetTrainer:
    def __init__(self, trainable_model: TrainGPT, dataset):
        self.trainable_model = trainable_model
        self.dataset = dataset
        self.next_item = 0
        self.done = []
        self.tokenizer = tiktoken.get_encoding("gpt2")
    
    def save(self, name: str):
        self.trainable_model.save(name)
        json_state = {
            'done': self.done,
            'next_item': self.next_item,
        }
        with open(f"{name}.json", 'w') as f:
            f.write(json.dumps(json_state))
        print(f"Saved {name}.pth and {name}.json")
    
    def load(self, name: str):
        self.trainable_model.load(name)
        json_state = {}
        with open(f"{name}.json", 'r') as f:
            contents = f.read()
            json_state = json.loads(contents)
        self.done = json_state['done']
        self.next_item = json_state['next_item']
        print(f"Restored state from {name}.pth and {name}.json")
    
    def train_idx(self, n:int):
        self.next_item = n
        self.train_next() # feels a little dirty to do it this way, but I can refactor later
    
    def validate(self, text: str) -> bool:
        # skip texts too short to be worth a training pass
        tokens = self.tokenizer.encode(text)
        return len(tokens) >= 1000
    
    def train_next(self):
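        if self.next_item in self.done:
            print(f"Skipping {self.next_item}: already done.")
            self.next_item += 1  # advance so the next call does not retry this item
            return
        text = self.dataset["train"][self.next_item]["text"]
        if not self.validate(text):
            print(f"Skipping {self.next_item}: too short.")
            self.next_item += 1  # advance past items that fail validation
            return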
        self.trainable_model.train(text, max_length=256, stride=128, epochs=1, prompt="It is good")
        self.done.append(self.next_item)
        self.next_item += 1

In [146]:
from pathlib import Path
from datasets import DatasetDict
import random

class TokenChunkDataset(Dataset):
    """Designed for turning just one chunk of a larger dataset into a small dataset"""
    def __init__(self, token_ids: list[int], start_pos: int, max_length: int):
        input_chunk = token_ids[start_pos:(start_pos+max_length)]        
        target_chunk = token_ids[(start_pos+1):(start_pos+max_length+1)]
        self.input_ids = torch.tensor(input_chunk, dtype=torch.long)
        self.target_ids = torch.tensor(target_chunk, dtype=torch.long)

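    def __len__(self) -> int:
        # one (input, target) pair per chunk; returning the token count here
        # would make a DataLoader yield the same chunk once per token
        return 1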
    
    def __getitem__(self, idx):
        return self.input_ids, self.target_ids


class HugeDatasetTrainer:
    def __init__(self, trainable_model: TrainGPT, dataset, validation_text: str, max_length:int=512, stride:int=256, tokenizer=tiktoken.get_encoding("gpt2"), cache_name:str="hugedataset", use_cache=True):
        self.trainable_model = trainable_model
        self.dataset = dataset
        self.batches = []
        self.max_length = max_length
        self.stride = stride
        self.tokenizer = tokenizer
        self.cache_name = cache_name
        self.use_cache = use_cache
        self.done = set()
    
    # ensures that self.batches is populated with list[(item_idx, start_pos)]
    def preprocess(self):
        if len(self.batches) > 0:
            return
        if self.use_cache:
            cache_path = f"{self.cache_name}_batches.json"
            cache_file = Path(cache_path)
            cache_contents = ""
            if cache_file.is_file():
                with open(cache_file, 'r') as f:
                    cache_contents = f.read()
                batches = json.loads(cache_contents)
                self.batches = [(item_idx, start_pos) for item_idx, start_pos in batches]
                print(f"Read from cache at {cache_path}")
                return
        # batches not available yet
        for idx in range(0, len(self.dataset['train'])):
            text = self.dataset['train'][idx]['text']
            tokens = self.tokenizer.encode(text)
            # each window needs max_length + 1 tokens so the shifted target chunk is full length
            num_batches = ((len(tokens) - self.max_length - 1) // self.stride) + 1
            for b in range(num_batches):
                self.batches.append(
                    (idx, b*self.stride)
                )
            if idx % 500 == 0:
                print(f"Processed item {idx} out of {len(self.dataset['train'])}")
        if self.use_cache:
            cache_path = f"{self.cache_name}_batches.json"
            with open(cache_path, 'w') as f:
                f.write(json.dumps(self.batches))

    def save_progress(self):
        self.trainable_model.save(f"{self.cache_name}")
        done_path = f"{self.cache_name}_done.json"
        with open(done_path, 'w') as f:
            f.write(json.dumps(list(self.done)))
        print(f"List of completed batches saved as {done_path}")
    
    def restore_progress(self):
        self.trainable_model.load(f"{self.cache_name}")
        done_path = f"{self.cache_name}_done.json"
        with open(done_path, 'r') as f:
            done = json.loads(f.read())
            self.done = { (item_idx, start_pos) for item_idx, start_pos in done }
        print(f"Restored list of completed batches from {done_path}")
    
    def batches_completed(self) -> int:
        return len(self.done)

    def batches_remaining(self) -> int:
        return len(self.batches) - len(self.done)
    
    def train_one(self):
        if len(self.done) == len(self.batches):
            print(f"Nothing more to do! All {len(self.batches)} batches are complete.")
            return
        next_batch = random.choice(self.batches)
        while next_batch in self.done:
            # no, this isn't super efficient.
            next_batch = random.choice(self.batches)
        item_idx, start_pos = next_batch
        token_ids = self.tokenizer.encode(self.dataset['train'][item_idx]['text'])
        batch_dataset = TokenChunkDataset(token_ids, start_pos, self.max_length)
        batch_dataloader = DataLoader(batch_dataset)
        self.trainable_model.train_loader(batch_dataloader)
        self.done.add((item_idx, start_pos))
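
The retry loop in `train_one` gets slower as `done` fills up, because more and more random picks land on windows that are already finished. One possible alternative, sketched here with a hypothetical helper called `remaining_in_random_order`, is to shuffle the unfinished windows once and then walk through them in order.

```python
import random

def remaining_in_random_order(batches, done, seed=None):
    """Return the (item_idx, start_pos) windows not yet in `done`, shuffled once."""
    rng = random.Random(seed)
    # filter out finished windows up front instead of re-rolling on every pick
    remaining = [b for b in batches if b not in done]
    rng.shuffle(remaining)
    return remaining

# toy data, not taken from the real dataset
batches = [(0, 0), (0, 256), (1, 0), (1, 256)]
done = {(0, 256)}
order = remaining_in_random_order(batches, done, seed=123)
```

Each window then shows up exactly once, and resuming after a restart only needs the saved `done` set.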



In [147]:
from datasets import load_dataset

pg = load_dataset("deepmind/pg19")
hdt = HugeDatasetTrainer(TrainGPT(GPT_CONFIG_MINI), pg, validation_text="the sky is blue", cache_name="pg19")
hdt.preprocess()

Read from cache at pg19_batches.json
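
The sliding-window bookkeeping behind `preprocess` can be checked on small numbers. This is a sketch under my own assumptions: `window_starts` is a hypothetical helper, and it only counts windows that have `max_length + 1` tokens available, so the target chunk (shifted right by one token) is never shorter than the input chunk.

```python
def window_starts(num_tokens: int, max_length: int, stride: int) -> list[int]:
    """Start positions of all windows that leave room for a full shifted target."""
    # the last usable start leaves max_length + 1 tokens after it
    last_start = num_tokens - max_length - 1
    if last_start < 0:
        return []
    return list(range(0, last_start + 1, stride))

starts = window_starts(num_tokens=1025, max_length=512, stride=256)
too_short = window_starts(num_tokens=512, max_length=512, stride=256)
```

With 1025 tokens, the last window starts at 512: its input covers tokens 512 through 1023 and its target covers 513 through 1024. With exactly 512 tokens there is no room for a shifted target, so no window is produced.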


In [150]:
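import os
# only takes effect if set before CUDA is initialized, so restart the kernel and run this cell first
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"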

print(f"Remaining: {hdt.batches_remaining()}")
print(f"Done: {hdt.batches_completed()}")
#hdt.save_progress()
completed = hdt.batches_completed()
while completed < 10:
    hdt.train_one()
    completed = hdt.batches_completed()
    if completed % 10_000 == 0:
        print(f"Completed: {completed}")
        hdt.save_progress()

Remaining: 11935694
Done: 0


RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
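
One common cause of a device-side assert during training is an out-of-range index in an embedding lookup: a token ID at or above the vocabulary size, or a sequence longer than the positional embedding table. I haven't confirmed that this is what happened here. A cheap way to rule it out is to validate a batch on the CPU before it reaches the model. The helper `check_batch` below is hypothetical, and the limits in the example are placeholder values, not the actual `GPT_CONFIG_MINI` settings.

```python
def check_batch(token_rows: list[list[int]], vocab_size: int, context_length: int) -> list[str]:
    """Collect human-readable problems for a batch of token ID rows."""
    problems = []
    for row_idx, row in enumerate(token_rows):
        # a row longer than the positional embedding table cannot be embedded
        if len(row) > context_length:
            problems.append(f"row {row_idx}: length {len(row)} exceeds context_length {context_length}")
        # token IDs must index into the token embedding table
        bad = [t for t in row if t < 0 or t >= vocab_size]
        if bad:
            problems.append(f"row {row_idx}: token ID out of range, e.g. {bad[0]}")
    return problems

# placeholder limits for illustration only
issues = check_batch([[1, 5, 9], [2, 70000, 3, 4, 8]], vocab_size=50257, context_length=4)
clean = check_batch([[1, 2]], vocab_size=10, context_length=4)
```

With tensors, `input_batch.tolist()` produces the nested lists this expects.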


In [125]:
print(f"Model global step: {hdt.trainable_model.global_step}")
print(hdt.trainable_model.prompt(max_tokens=15, text="Because"))

Model global step: -1
Because Comeswtrompt Philip Corpse 1840 quietlymb CLSID 520.[ DwarfkwDrive Cox


In [129]:
batch_dataloader = hdt.train_one()

In [142]:
[(input.dtype, target) for input, target in batch_dataloader]

[(torch.float32,
  tensor([[4.3800e+02, 1.1690e+03, 1.2838e+04, 2.8600e+02, 6.8300e+02, 7.0500e+02,
           6.2000e+01, 8.7270e+03, 2.6891e+04, 2.6200e+02, 1.0170e+03, 2.7940e+03,
           1.1000e+01, 2.9000e+02, 2.2360e+03, 2.2410e+03, 3.0700e+02, 1.9800e+02,
           6.6490e+03, 3.9100e+02, 6.2000e+01, 3.0960e+04, 3.8300e+02, 8.7830e+03,
           5.0800e+02, 1.2760e+03, 2.4760e+03, 4.4040e+03, 4.6500e+02, 3.8770e+03,
           2.0600e+03, 1.2000e+01, 1.3638e+04, 1.9800e+02, 3.2826e+04, 4.7700e+02,
           4.0100e+02, 3.6400e+02, 1.1000e+01, 2.9000e+02, 3.0250e+03, 2.2100e+03,
           2.7300e+02, 1.2760e+03, 5.8300e+02, 3.1740e+03, 1.0110e+03, 4.6500e+02,
           1.2950e+03, 2.6000e+01, 2.9000e+02, 1.9800e+02, 4.9190e+03, 1.4680e+03,
           4.2800e+02, 1.2838e+04, 7.4300e+02, 3.0700e+02, 1.1000e+01, 1.7700e+03,
           1.3000e+01, 3.6914e+04, 2.6300e+02, 4.6800e+02, 7.8170e+03, 5.1400e+02,
           3.6930e+03, 2.4250e+03, 6.0000e+01, 1.9800e+02, 1.9800e+02,

In [136]:
for x in range(0):
    print("hi")