# GPT from Scratch using Lightning AI and Lance!

This notebook follows the code that I wrote during my talk at the Lightning AI meetup in London on 8th November.

I implement a GPT model from scratch (including all the modules, such as `CausalAttentionHead`, `MultiHeadedAttention` and the FFN), then bind it all together in a LightningAI module and train it with Lightning.

## Notes on Data Tokenization and Lance
I am using [Lance](https://github.com/lancedb/lance/) to load our training data. It is a modern columnar data format for ML and LLMs implemented in Rust.

The problem was limited memory and compute: we can't load the entire TinyStories dataset (about 2.3 GB in size) and tokenize it in one go. The solution was to pre-tokenize the dataset, convert it into a PyArrow table and save it in Lance format.

**Lance essentially allows us to only load some indices of the data at any given moment instead of loading the entire dataset and maxing out the memory**.

If you want to play with Lance and check out its other use cases, you can see its repository: https://github.com/lancedb/lance/

In [1]:
%%sh
pip install -q pyarrow
pip install -q pylance
pip install -q lightning
pip install -q transformers datasets

In [2]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import lightning

import lance
import pyarrow as pa

from tqdm.auto import tqdm
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")




## Dataset and Creation

We load the TinyStories dataset, tokenize a subset of its rows (1K in the cell below; increase `total_rows` if you have the hardware) and save the tokens as a PyArrow table in a Lance dataset.

In [3]:
from datasets import load_dataset
dataset = load_dataset("roneneldan/TinyStories", data_files={'train': 'TinyStoriesV2-GPT4-train.txt'})


In [4]:
# Join 'total_rows' number of sentences one after another
all_tokens = []
# Only choosing 1K sentences for now. Increase if you want to train it for longer on larger hardware
total_rows = 1000
data = dataset['train'].select(range(total_rows))
for row in tqdm(data['text'], total=len(data)):
    row = row.replace("<|endoftext|>", " ")
    encoded = tokenizer(row)['input_ids']
    all_tokens.extend(encoded)

pa_table = pa.Table.from_arrays([all_tokens], names=['value'])
lance.write_dataset(pa_table, "tiny_stories_gpt4_encoded.lance", mode='create')

print(f"Total tokens in tokenized dataset: {len(all_tokens):,.0f}")


Total tokens in tokenized dataset: 31,603


## Model and Training

In [5]:
class Config:
    vocab_size = 50304 # GPT-2's 50257 rounded up to the next multiple of 64, which speeds up matrix ops
    n_epochs = 50
    batch_size = 48
    lr = 3e-4
    wd = 1e-6
    n_embed = 256
    num_blocks = 12
    num_heads = 12
    head_size = n_embed // num_heads
    context_len = 224
    attn_dropout_val = 0.2
    mha_dropout_val = 0.2
    ffn_dropout_val = 0.2
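
To make the two derived numbers in `Config` concrete, here is a minimal sketch of the arithmetic. The helper `round_up_to_multiple` is a hypothetical name I'm introducing just for illustration; it is not part of the notebook.

```python
def round_up_to_multiple(n, multiple):
    """Round n up to the next multiple of `multiple` (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple

# GPT-2's tokenizer has 50257 tokens; pad the vocabulary up to a multiple of 64
vocab_size = round_up_to_multiple(50257, 64)

# Per-head width with the values used in Config
n_embed, num_heads = 256, 12
head_size = n_embed // num_heads        # integer division
concat_width = num_heads * head_size    # width after concatenating all heads

print(vocab_size, head_size, concat_width)
```

Note that 12 heads of size 21 give a concatenated width of 252 rather than 256, which is why the projection layer in `MultiHeadedAttention` takes `num_heads*head_size` inputs and maps them back to `n_embed`.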

## Attention - `CausalAttentionHead` and `MultiHeadedAttention`

In [6]:
class CausalAttentionHead(nn.Module):
    def __init__(self, config):
        super(CausalAttentionHead, self).__init__()
        self.config = config

        # QKV layers
        self.query = nn.Linear(config.n_embed, config.head_size, bias=False)
        self.key = nn.Linear(config.n_embed, config.head_size, bias=False)
        self.value = nn.Linear(config.n_embed, config.head_size, bias=False)
        self.attn_drop = nn.Dropout(config.attn_dropout_val)

        # Mask for ensuring causality during training
        self.register_buffer('mask', torch.tril(torch.ones(config.context_len, config.context_len)))

    def forward(self, x):
        # Shape of x: [bs, context_len, embed_dim]
        bs, context_len, embed_dim = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Get the attention weights (scores scaled by sqrt(head_size), as in scaled dot-product attention)
        attn_filter = torch.bmm(q, k.transpose(1, 2)) / (self.config.head_size ** 0.5)
        attn_filter = attn_filter.masked_fill(self.mask[:context_len, :context_len]==0, float('-inf'))
        attn_weights = F.softmax(attn_filter, dim=-1)
        attn_weights = self.attn_drop(attn_weights)

        # Now we do weighted aggregation of values to get the output of attention
        # attn_weights [bs, c, c] x V [bs, c, h] = output [bs, c, head_size]
        output = torch.bmm(attn_weights, v)
        return output
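
The masking step is easier to see on a tiny example. Below is a pure-Python sketch (no PyTorch; `causal_softmax` is an illustrative helper, not something the notebook defines) of what `masked_fill` with `-inf` followed by `softmax` does to a 3x3 matrix of raw attention scores: each position ends up with zero weight on every later position.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) are set to -inf, so exp() turns them into 0
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only look at itself, the second splits its attention between the first two tokens, and only the last token sees the whole sequence.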

In [7]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, config):
        super(MultiHeadedAttention, self).__init__()
        self.config = config

        # Turn all the AttentionHeads into a ModuleList
        self.heads = nn.ModuleList(
            [CausalAttentionHead(config) for _ in range(config.num_heads)]
        )

        # Projection (followed by Dropout) that maps mha_output back to n_embed dim
        self.proj = nn.Linear(config.num_heads*config.head_size, config.n_embed)
        self.mha_drop = nn.Dropout(config.mha_dropout_val)

    def forward(self, x):
        # Concatenate all the attention head outputs together
        mha_output = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.mha_drop(self.proj(mha_output))

## FeedForward Network

In [8]:
class FeedForwardNet(nn.Module):
    def __init__(self, config):
        super(FeedForwardNet, self).__init__()

        self.ffn = nn.Sequential(
            nn.Linear(config.n_embed, config.n_embed*4),
            nn.GELU(),
            nn.Linear(config.n_embed*4, config.n_embed),
            nn.Dropout(config.ffn_dropout_val)
        )

    def forward(self, x):
        return self.ffn(x)

## One Single Block of the GPT model

In [9]:
class Block(nn.Module):
    def __init__(self, config):
        super(Block, self).__init__()

        # Architecture of one block of GPT
        self.mha = MultiHeadedAttention(config)
        self.ln1 = nn.LayerNorm(config.n_embed)
        self.ffn = FeedForwardNet(config)
        self.ln2 = nn.LayerNorm(config.n_embed)

    def forward(self, x):
        x = self.ln1(x + self.mha(x))
        x = self.ln2(x + self.ffn(x))
        return x
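
Each sub-layer in `Block` is wrapped as `LayerNorm(x + sublayer(x))`. As a rough illustration of what this does to a single token vector, here is a pure-Python sketch. It assumes the LayerNorm scale and bias are still at their initial values of 1 and 0, and `layer_norm` / `post_ln_residual` are hypothetical helpers written for this example only.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_ln_residual(x, sublayer):
    """Add the sub-layer output to its input, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
# Stand-in for attention or the FFN: any function from vector to vector
out = post_ln_residual(x, lambda v: [0.5 * a for a in v])

out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print(out_mean, out_var)
```

Whatever the sub-layer adds, the vector handed to the next layer is re-centred and re-scaled, which keeps activations in a similar range across all 12 blocks.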

## Entire GPT model, end-to-end

In [10]:
class GPT(lightning.LightningModule):
    def __init__(self, config):
        super(GPT, self).__init__()
        self.config = config
        self.save_hyperparameters()

        # Define token and positional embeddings
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embed)
        self.positional_embedding = nn.Embedding(config.context_len, config.n_embed)

        # Define the blocks
        self.backbone = nn.Sequential(*[Block(config) for _ in range(config.num_blocks)])

        # Define the LM head
        self.lm_head = nn.Linear(config.n_embed, config.vocab_size)

    def forward(self, x):
        # Apply token embeddings through the data (B, C) -> (B, C, n_embed)
        tok_emb = self.token_embedding(x)

        # Get positional embeddings using torch.arange
        pos_emb = self.positional_embedding(torch.arange(x.shape[1], device=self.device))

        # Add both embeddings
        x = tok_emb + pos_emb

        # Pass the input data through all blocks
        x = self.backbone(x)

        # Pass it through the lm head
        logits = self.lm_head(x)
        return logits

    def get_loss(self, predictions, target):
        B, C, V = predictions.shape
        predictions = predictions.view(B*C, V)
        target = target.view(B*C)
        loss = F.cross_entropy(predictions, target)
        return loss

    def training_step(self, batch, batch_idx):
        text, target = batch
        text = text.long()
        target = target.long()
        logits = self(text)
        loss = self.get_loss(logits, target)

        # Log the loss for every step and averaged over each epoch
        self.log('loss', loss, prog_bar=True, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=self.config.lr, weight_decay=self.config.wd)
        return [opt], []

@torch.no_grad()
def generate(model, prompt, max_tokens, temperature=0.7):
    """
    Generates text based on the provided prompt.
    Sampling randomness is controlled by temperature
    (must be > 0; lower is more deterministic, higher is more varied but less coherent)
    """
    model.eval()
    for _ in range(max_tokens):
        # Feed only the most recent context_len tokens to the model
        context = prompt[:, -model.config.context_len:]
        logits = model(context)
        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        prompt = torch.cat((prompt, next_token), dim=1)
    return prompt
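
To see what the `temperature` argument does before sampling, here is a small pure-Python sketch on three made-up logits. `softmax_with_temperature` is a hypothetical helper for illustration; `generate` does the same division on the model's real logits.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, probs in [("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures pile more probability onto the top-scoring token, while higher temperatures spread it out, so `torch.multinomial` picks less likely tokens more often.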

## GPTDataset for efficient and fast data loading
Thanks to Lance!

In [11]:
class GPTDataset(Dataset):
    def __init__(self, dataset_path, context_len):
        # Load the lance dataset from the saved path
        self.ds = lance.dataset(dataset_path)
        self.context_len = context_len
        # Doing this so the sampler never asks for an index at the end of text
        self.length = self.ds.count_rows() - context_len

    def __len__(self):
        return self.length

    def from_idxs(self, idxs):
        """
        Little Utility function to get the data from lance
        """
        data = self.ds.take(idxs).to_pylist()
        data = torch.tensor(list(map(lambda x: x['value'], data)))
        return data

    def __getitem__(self, idx):
        """
        Generate a list of indices starting from the current idx to idx+context_len+1
        Use the from_idxs function to get data in said indexes and then divide it into features (x) and target (y)
        """
        current_window_idxs = np.arange(idx, idx+self.context_len+1)
        data = self.from_idxs(current_window_idxs)
        x = data[0:self.context_len]
        y = data[1:self.context_len+1] # shifted by 1 because the target is the input text one step ahead
        return x, y
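
The windowing logic in `__getitem__` can be checked on a toy token list without Lance. This is only a sketch: `make_window` is a hypothetical stand-in that slices a Python list instead of calling `self.ds.take`.

```python
def make_window(tokens, idx, context_len):
    """Return (x, y) where y is x shifted one token to the right."""
    window = tokens[idx : idx + context_len + 1]
    return window[:context_len], window[1 : context_len + 1]

tokens = list(range(10))                  # stand-in for token ids
context_len = 4
num_windows = len(tokens) - context_len   # same rule as GPTDataset.__init__

first = make_window(tokens, 0, context_len)
last = make_window(tokens, num_windows - 1, context_len)
print(first)
print(last)

# Same arithmetic for this notebook's run: 31,603 tokens, context_len=224, batch_size=48
notebook_windows = 31_603 - 224
batches_per_epoch = -(-notebook_windows // 48)   # ceiling division
print(notebook_windows, batches_per_epoch)
```

Subtracting `context_len` from the row count guarantees that even the last window still has a full target sequence, so the sampler never reads past the end of the data.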

## Finally, let's train the model!

We'll train the model for 50 epochs, which should take ~5 hours. Change the number of epochs and other hyperparams in the `Config` class if you are training for longer and on more powerful hardware.

In [12]:
if __name__ == "__main__":
    # Path of the encoded lance dataset
    dataset_path = "tiny_stories_gpt4_encoded.lance"

    # Init config
    config = Config()

    # Init model
    gpt = GPT(config)

    # Init the dataset
    dataset = GPTDataset(dataset_path, config.context_len)
    loader = DataLoader(
        dataset,
        batch_size=config.batch_size,
        shuffle=False,
    )

    # Init the trainer
    trainer = lightning.Trainer(accelerator='auto', max_epochs=config.n_epochs)

    # Fit on the data
    trainer.fit(gpt, loader)

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name                 | Type       | Params
----------------------------------------------------
0 | token_embedding      | Embedding  | 12.9 M
1 | positional_embedding | Embedding  | 57.3 K
2 | backbone             | Sequential | 9.4 M 
3 | lm_head              | Linear     | 12.9 M
----------------------------------------------------
35.3 M    Trainable params
0         Non-trainable params
35.3 M    Total params
141.128   Total estimated model params size (MB)
/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performanc


INFO: `Trainer.fit` stopped: `max_epochs=50` reached.
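
The parameter counts in the summary above can be reproduced by hand from the `Config` values. This is a back-of-the-envelope sketch rather than anything Lightning runs; it assumes float32 weights (4 bytes each) for the size estimate.

```python
vocab_size, n_embed, context_len = 50304, 256, 224
num_blocks, num_heads = 12, 12
head_size = n_embed // num_heads

token_embedding = vocab_size * n_embed
positional_embedding = context_len * n_embed
lm_head = n_embed * vocab_size + vocab_size            # weights + bias

# One Block: attention heads, output projection, FFN and two LayerNorms
heads = num_heads * 3 * n_embed * head_size            # Q, K, V layers, no bias
proj = num_heads * head_size * n_embed + n_embed       # weights + bias
ffn = (n_embed * 4 * n_embed + 4 * n_embed) + (4 * n_embed * n_embed + n_embed)
layer_norms = 2 * 2 * n_embed                          # scale + bias for each LayerNorm
backbone = num_blocks * (heads + proj + ffn + layer_norms)

total = token_embedding + positional_embedding + backbone + lm_head
size_mb = total * 4 / 1e6                              # 4 bytes per float32 parameter

print(f"{token_embedding:,} {positional_embedding:,} {backbone:,} {lm_head:,}")
print(f"{total:,} params, {size_mb:.3f} MB")
```

Roughly three quarters of the parameters sit in the token embedding and the LM head, which is what you would expect for a small `n_embed` paired with a 50K-token vocabulary.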


## Generate some text!

Let's see how much our model learnt about the text data by asking it to generate some text given a prompt.

In [13]:
# Generate some predictions
prompt = "My cat is" # Change the prompt to whatever you want

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpt = gpt.to(device)
prompt = tokenizer.encode(prompt, return_tensors='pt').to(device)
generated_text = generate(gpt, prompt, max_tokens=config.context_len, temperature=0.7)
generated_text = tokenizer.decode(generated_text.tolist()[0])
print(generated_text)

My cat is very special. She named her angel Lily.One day, Anna and Lily went to the park with her mom. They saw a big slide, a swing, and a sandbox. Anna wanted to play with everything. Anna has a swing, but they were playing in the slide, but they all lived happily in the slide. She asked her mom, "Can I go on the slide, mom?""Yes, Anna. She likes to play too," her mom said.Anna nodded and slid down. She laughed and said, "Whee! That was fun, Lily!"Anna was fun, "Look, you can fly like a real angel!"Then she went to the sandbox. She pushed Lily on top and said, "Look, Lily, Lily. She is my angel!"Anna was having a lot of fun. She put Lily on top and said, "You are the castle, Lily!"Anna was having a lot of fun. But she did not see the unknown boy who came to the sandbox. He was bigger than Anna and wanted to take her. He saw Lily from the castle and
