# LLM from scratch

The goal is to code a Transformer from scratch in PyTorch, more precisely a GPT-like model with a decoder-only architecture.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
batch_size = 64 # number of sequences processed in parallel
n_token = 512 # vocabulary size (number of distinct tokens learned by the tokenizer)

## Part 1: Playing with the dataset

Let's download the dataset.

In [3]:
from datasets import load_dataset
ds = load_dataset("huggingartists/bob-dylan", split="train")
ds = ds.train_test_split(test_size=0.1)

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2016
    })
    test: Dataset({
        features: ['text'],
        num_rows: 225
    })
})

In [5]:
ds["train"][12]

{'text': 'They say everything can be replaced\nYet every distance is not near\nSo I remember every face\nOf every man who put me here\nI see my light come shining\nFrom the west unto the east\nAny day now, any day now\nI shall be released\nThey say every man needs protection\nThey say every man must fall\nYet I swear I see my reflection\nSome place so high above this wall\nI see my light come shining\nFrom the west unto the east\nAny day now, any day now\nI shall be released\nStanding next to me in this lonely crowd\nIs a man who swears hes not to blame\nAll day long I hear him shout so loud\nCrying out that he was framed\nI see my light come shining\nFrom the west unto the east\nAny day now, any day now\nI shall be released'}

## Part 2: Subword tokenizer (BPE)

Splitting a text into tokens is a subtle problem. Here we use `minbpe`, which implements the Byte Pair Encoding (BPE) algorithm commonly used for LLM tokenization.
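To give a feel for what BPE does, here is a minimal sketch of the merge loop on raw bytes: repeatedly find the most frequent pair of adjacent tokens and replace it by a new token. This is not `minbpe`'s code. The helper names `count_pairs`, `merge_pair` and `toy_bpe_train` are made up for this illustration, and the sketch skips the regex pre-splitting that `RegexTokenizer` applies before merging.

```python
from collections import Counter

def count_pairs(ids):
    """Count occurrences of each pair of adjacent token ids."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` by `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def toy_bpe_train(text, vocab_size):
    """Learn at most `vocab_size - 256` merges on top of the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return ids, merges

# three merges on a toy string (hypothetical input)
toy_ids, toy_merges = toy_bpe_train("aaabdaaabac", 256 + 3)
print(toy_ids, toy_merges)
```

After three merges the 11-byte string is encoded with 5 tokens. With a vocabulary of 512 tokens, the same procedure would learn 256 merges.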

In [6]:
from minbpe.minbpe import RegexTokenizer

In [7]:
partial_text = "\n".join(ds["train"][x]["text"] for x in range(20))
partial_text[502:600]

're\nIve been walking that lonesome valley\nJust tryin to get to heaven before they close the door\nPe'

In [8]:
tokenizer = RegexTokenizer()
# vocabulary of n_token tokens: the 256 byte values plus n_token - 256 learned merges
tokenizer.train(partial_text, n_token, verbose=False)

We encode the full training and validation texts and store each of them as a `torch.Tensor`.

In [9]:
train_data = "\n".join(ds["train"][x]["text"] for x in range(ds["train"].num_rows))
val_data = "\n".join(ds["test"][x]["text"] for x in range(ds["test"].num_rows))

train_data = torch.tensor(tokenizer.encode(train_data), dtype=torch.long)
val_data = torch.tensor(tokenizer.encode(val_data), dtype=torch.long)

train_data.shape, train_data[:50]

(torch.Size([1238679]),
 tensor([330, 261, 480, 408, 359, 116, 276, 319, 300, 409,  10, 330, 447, 261,
         297, 117, 418, 419, 322, 267, 260, 107, 105, 304,  10,  73, 280, 385,
         331, 100, 276, 329, 114, 502, 267, 319, 308, 265, 117, 100, 100, 121,
         331, 409,  10,  87, 351, 267, 295, 342]))

In [10]:
test = "hello world"
test_encoded = tokenizer.encode(test)
test_encoded, [tokenizer.decode([x]) for x in test_encoded], tokenizer.decode(test_encoded)

([257, 273, 111, 369, 317], ['he', 'll', 'o', ' wor', 'ld'], 'hello world')

The longest tokens:

In [11]:
token_list = sorted([tokenizer.decode([x]) for x in range(n_token)], 
                    key=len, 
                    reverse=True)
token_list[:20]

[' lonesome',
 ' tonight',
 'onesome',
 ' heaven',
 ' gonna',
 ' right',
 ' broke',
 ' every',
 ' worry',
 ' heave',
 'arling',
 ' that',
 ' your',
 ' with',
 ' hard',
 ' baby',
 ' Lord',
 ' dont',
 ' want',
 ' been']

## Part 3: Evaluating and training with batches

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


Our models will take `context_length` tokens as input and predict the next token.

In [13]:
context_length_test = 64

ix = torch.randint(len(train_data) - context_length_test, (1,))
sample = train_data[ix:ix+context_length_test+1]
sample_list = sample.tolist()
print("Input: ", sample_list[:-1], 
      "\nTarget: ", sample_list[-1])
print("\nHuman version:\nInput: ", tokenizer.decode(sample_list[:-1]), 
      "\nTarget: ", tokenizer.decode([sample_list[-1]]))

Input:  [114, 424, 10, 80, 310, 97, 315, 44, 465, 114, 115, 46, 32, 72, 312, 311, 44, 465, 114, 115, 46, 32, 72, 312, 311, 44, 306, 310, 97, 315, 10, 80, 310, 97, 315, 44, 465, 114, 115, 46, 32, 72, 312, 311, 44, 465, 114, 115, 46, 32, 72, 312, 311, 44, 306, 310, 97, 315, 10, 73, 109, 442, 301, 302] 
Target:  422

Human version:
Input:  rake
Please, Mrs. Henry, Mrs. Henry, please
Please, Mrs. Henry, Mrs. Henry, please
Im down on my 
Target:   kn


In [14]:
def get_batch(split, context_length):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - context_length - 1, (batch_size,))
    X = torch.stack([data[i:i+context_length] for i in ix])
    Y = torch.stack([data[i+1:i+context_length+1] for i in ix])
    return X, Y

X,Y = get_batch("train", 8)
X.shape, Y.shape

(torch.Size([64, 8]), torch.Size([64, 8]))
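Each row of `Y` is the corresponding row of `X` shifted by one token, so position `t` of `Y` holds the token that follows position `t` of `X`. A small sketch on made-up data (the tensor `toy_data` and the fixed start index `i` are assumptions chosen for readability):

```python
import torch

toy_data = torch.arange(10, 20)   # stand-in for train_data
context_length_toy = 4
i = 2                             # fixed start index instead of a random one

x = toy_data[i:i + context_length_toy]
y = toy_data[i + 1:i + context_length_toy + 1]
print(x.tolist(), y.tolist())
```

The MLP of Part 4 only uses the last column of `Y`, that is, the token following the whole context.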

We illustrate below how the cross-entropy loss is computed on a batch: by default, it is the mean of the losses of the individual examples in the batch. Most PyTorch functions handle a leading batch dimension in the same way, which means that writing code for batches is almost as easy as writing it for a single example!

In [15]:
batch_size_test = 3
number_classes_test = 5

logits = torch.randn(batch_size_test, number_classes_test)
target = torch.randint(number_classes_test, (batch_size_test,), dtype=torch.int64)
loss = F.cross_entropy(logits, target)
print("logits: ", logits, "\ntarget: ", target, "\nloss: ", loss.item())

logits:  tensor([[ 0.1251,  0.5692,  1.3950,  1.1641,  0.3252],
        [-1.7733, -0.5491, -0.5605,  0.2494, -1.7450],
        [ 2.8234,  1.2044,  1.8606,  0.0484,  2.3630]]) 
target:  tensor([4, 0, 1]) 
loss:  2.451202869415283
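To make the "mean over the batch" explicit, the following sketch recomputes the loss by hand: take the log-softmax of each row of logits, pick the log-probability of the target class, negate, and average. The names `logits_toy`, `target_toy`, `per_example` and `manual_loss` are introduced here for illustration only.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits_toy = torch.randn(3, 5)          # 3 examples, 5 classes
target_toy = torch.tensor([4, 0, 1])

log_probs = F.log_softmax(logits_toy, dim=-1)
# negative log-probability assigned to the correct class, one value per example
per_example = -log_probs[torch.arange(3), target_toy]
manual_loss = per_example.mean()

print(per_example, manual_loss.item(), F.cross_entropy(logits_toy, target_toy).item())
```

The last two printed values should agree up to floating-point error.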


Let us write the boilerplate code for evaluating and training models.

In [16]:
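def estimate_loss(model, eval_iters):
    # average the loss over eval_iters random batches, for each split
    out = {}
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, model.context_length)
            # move the batch to the same device as the model
            X, Y = X.to(device), Y.to(device)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    return out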

In [17]:
def train(model, n_iterations, learning_rate, eval_interval, eval_iters):
    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for step in range(n_iterations):
        # every once in a while evaluate the loss on train and validation sets
        if step % eval_interval == 0 or step == n_iterations - 1:
            with torch.no_grad():
                losses = estimate_loss(model, eval_iters)
            print(f"step {step}: train loss {losses['train']:.4f}, validation loss {losses['val']:.4f}")

        X, Y = get_batch("train", model.context_length)
        # Tensor.to is not in-place: keep the returned tensors
        X, Y = X.to(device), Y.to(device)
        _, loss = model(X, Y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

## Part 4: A Multi Layer Perceptron (MLP) model

We essentially implement the model from the paper "**A Neural Probabilistic Language Model**" by Bengio et al. (2003).

The first component is an `Embedding` layer: this is simply a lookup table, as illustrated below. It maps every token to a vector of fixed dimension `n_embed`. Since this dimension is much smaller than the number of tokens, the embedding layer is intuitively pushed to map similar tokens to nearby vectors.

In [18]:
n_token_test = 3
n_embed_test = 4

embedding = torch.nn.Embedding(n_token_test, n_embed_test)
print("Weights of the embedding:\n", embedding.weight)
print("Result of embedding token number 1:\n", embedding(torch.tensor([1])))

Weights of the embedding:
 Parameter containing:
tensor([[ 0.6620,  1.2090,  1.1550,  0.7025],
        [-1.4797, -0.8685,  1.0905,  0.0762],
        [-0.4418,  0.0631,  0.0292, -1.0604]], requires_grad=True)
Result of embedding token number 1:
 tensor([[-1.4797, -0.8685,  1.0905,  0.0762]], grad_fn=<EmbeddingBackward0>)
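Another way to see the lookup table: embedding a token is the same as multiplying its one-hot encoding by the weight matrix. The sketch below checks this on a small example (the names `emb`, `tokens`, `looked_up` and `via_matmul` are introduced here for illustration).

```python
import torch
from torch.nn import functional as F

emb = torch.nn.Embedding(3, 4)
tokens = torch.tensor([1, 2, 1])

looked_up = emb(tokens)                              # row lookup, shape (3, 4)
one_hot = F.one_hot(tokens, num_classes=3).float()   # shape (3, 3)
via_matmul = one_hot @ emb.weight                    # same rows, via a matrix product

print(torch.allclose(looked_up, via_matmul))
```

The lookup is preferred in practice since it avoids building the one-hot matrix.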


Let us describe how the model works. 

Recall that a datapoint is a tensor `x` of `context_length` token ids. Each of the `context_length` tokens is embedded, yielding a vector of dimension `n_embed`. The resulting embeddings are concatenated into a single vector of dimension `context_length * n_embed`, which is then fed into a standard feed-forward network. The embedding step is illustrated below (without the network), with batches.

In [19]:
batch_size_test = 2
context_length_test = 3
n_token_test = 4
n_embed_test = 5

blank_token_test = n_token_test

idx = torch.randint(high = n_token_test, size = (batch_size_test, context_length_test))
print("Input tokens:\n", idx)
embedding = torch.nn.Embedding(n_token_test + 1, n_embed_test)
print("Weights of the embedding:\n", embedding.weight)

print("*******************\n")

x = embedding(idx)
x.shape, x

Input tokens:
 tensor([[0, 0, 1],
        [2, 3, 1]])
Weights of the embedding:
 Parameter containing:
tensor([[-1.1578, -1.6146,  0.6327, -0.4978, -0.4117],
        [-0.1228, -0.9923, -0.2605,  0.6617, -0.3110],
        [-0.4971,  1.0045, -0.7694, -1.7252,  0.6582],
        [-0.6170,  0.0245,  1.2861,  0.9131,  1.1492],
        [ 1.3680, -0.6166,  0.3669,  0.3993,  2.0496]], requires_grad=True)
*******************



(torch.Size([2, 3, 5]),
 tensor([[[-1.1578, -1.6146,  0.6327, -0.4978, -0.4117],
          [-1.1578, -1.6146,  0.6327, -0.4978, -0.4117],
          [-0.1228, -0.9923, -0.2605,  0.6617, -0.3110]],
 
         [[-0.4971,  1.0045, -0.7694, -1.7252,  0.6582],
          [-0.6170,  0.0245,  1.2861,  0.9131,  1.1492],
          [-0.1228, -0.9923, -0.2605,  0.6617, -0.3110]]],
        grad_fn=<EmbeddingBackward0>))
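The concatenation step amounts to flattening the last two dimensions, which `view` does without copying data. A short sketch with the same shapes as above (the names `x3d` and `x2d` are introduced here for illustration):

```python
import torch

B, T, n_embed_toy = 2, 3, 5
emb = torch.nn.Embedding(4, n_embed_toy)
idx_toy = torch.tensor([[0, 0, 1], [2, 3, 1]])

x3d = emb(idx_toy)        # shape (B, T, n_embed_toy)
x2d = x3d.view(B, -1)     # shape (B, T * n_embed_toy)

print(x3d.shape, x2d.shape)
```

Row `b` of `x2d` is the embedding of the first token of sequence `b`, followed by the embedding of the second token, and so on.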

In [20]:
class MLP(nn.Module):
    def __init__(self,
                 context_length = 32,
                 n_embed = 64, 
                 n_hidden = 512):
        super().__init__()
        self.context_length = context_length
        self.n_embed = n_embed
        self.n_hidden = n_hidden
        self.model_type = "MLP"
        self.token_embedding_table = nn.Embedding(n_token, n_embed)
        self.net = nn.Sequential(
            nn.Linear(context_length * n_embed, n_hidden),
            nn.Tanh(),
            nn.Linear(n_hidden, n_token)
        )

    def forward(self, idx, y=None):
        B, T = idx.shape
        # if training: B = batch_size, otherwise B = 1
        # T = context_length

        x = self.token_embedding_table(idx).view(B, -1)
        # x.shape = (B, T * n_embed)
        
        logits = self.net(x) 
        # logits.shape = (B, n_token)
        
        if y is None:
            loss = None
        else:
            # y.shape = (B, T)
            logits = logits.view(B, -1)
            # we only consider the last token for prediction
            y = y[:,-1].view(B)
            loss = F.cross_entropy(logits, y)
        return logits, loss 

In [21]:
model = MLP(context_length = 32,
            n_embed = 64, 
            n_hidden = 512)
model.to(device)
print(sum(p.numel() for p in model.parameters())/1e6, ' M parameters')

1.344512  M parameters
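The parameter count can be checked by hand from the layer sizes. The sketch below redoes the arithmetic for the hyperparameters used above. Each `nn.Linear` layer has a weight matrix and a bias vector, while the embedding table has no bias.

```python
n_token, context_length, n_embed, n_hidden = 512, 32, 64, 512

embedding_params = n_token * n_embed                              # lookup table
hidden_params = context_length * n_embed * n_hidden + n_hidden    # weights + biases
output_params = n_hidden * n_token + n_token                      # weights + biases

total = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
```

Most of the parameters sit in the first linear layer, whose input size grows linearly with `context_length`.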


In [22]:
def generate(model, context_length, max_new_tokens = 2000, topk = 5):
    # start from a context filled with newline tokens
    idx = torch.ones((1, context_length), dtype=torch.long) * tokenizer.encode("\n")[0]
    idx = idx.to(device)  # Tensor.to is not in-place
    for _ in range(max_new_tokens):
        # we crop at context_length
        idx_cond = idx[:, -context_length:]
        # forward pass (no gradients are needed for sampling)
        with torch.no_grad():
            logits, _ = model(idx_cond)
        
        # for MLP:
        # logits.shape = (batch_size, n_token)
        # for Transformers:
        # logits.shape = (batch_size, context_length, n_token)
        # in that case we keep only the prediction at the last position
        if model.model_type == "sliding windows":
            logits = logits[:,-1,:]

        # topk: mask out every logit below the k-th largest one
        v, _ = torch.topk(logits, topk)
        logits[logits < v[:, [-1]]] = -float('Inf')

        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1).view((1,1))
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
    # drop the initial newline context before decoding
    return tokenizer.decode(idx[0][context_length:].tolist())
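The top-k step of `generate` can be looked at in isolation. In the sketch below, with made-up logits and `k = 2`, every logit below the second largest is set to minus infinity, so the softmax gives those tokens probability zero and renormalizes over the two survivors. The names `logits_toy`, `k` and `filtered` are introduced here for illustration.

```python
import torch
from torch.nn import functional as F

logits_toy = torch.tensor([[2.0, 0.5, 1.0, -1.0, 3.0]])
k = 2

v, _ = torch.topk(logits_toy, k)     # values sorted in decreasing order
filtered = logits_toy.clone()
# v[:, [-1]] is the k-th largest logit, kept as a column for broadcasting
filtered[filtered < v[:, [-1]]] = -float("inf")
probs = F.softmax(filtered, dim=-1)

print(probs)
```

Only the tokens at positions 0 and 4 can be sampled here. A small `topk` makes generation more conservative, while a `topk` equal to the vocabulary size recovers plain sampling.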

In [23]:
print(generate(model, model.context_length, max_new_tokens = 50))

B��ryin me:imGBSB:ryin:imNow: mele:�=�Bim�d wahatNndnyndqΏ4 had&�ad l(3But�V@�j


In [24]:
train(model, 
      n_iterations = 10000,
      learning_rate = 1e-3,
      eval_interval = 1000,
      eval_iters = 100)

step 0: train loss 6.2815, validation loss 6.2749
step 1000: train loss 4.0349, validation loss 4.1508
step 2000: train loss 3.6604, validation loss 3.8464
step 3000: train loss 3.4801, validation loss 3.7293
step 4000: train loss 3.3946, validation loss 3.6793
step 5000: train loss 3.2864, validation loss 3.5923
step 6000: train loss 3.2060, validation loss 3.6118
step 7000: train loss 3.1557, validation loss 3.4880
step 8000: train loss 3.0884, validation loss 3.4130
step 9000: train loss 3.0442, validation loss 3.4620
step 9999: train loss 2.9619, validation loss 3.3815
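As a sanity check on these numbers, a model that predicts a uniform distribution over the `n_token` tokens has a cross-entropy loss of `log(n_token)`, which is close to the loss observed at step 0. The sketch below computes this baseline, together with the perplexity `exp(loss)` for the final validation loss reported above. Perplexity is not used elsewhere in this notebook and is only introduced here as a reading aid.

```python
import math

n_token = 512
uniform_loss = math.log(n_token)     # loss of a uniform guess over n_token classes

final_val_loss = 3.3815              # validation loss at step 9999, from the log above
perplexity = math.exp(final_val_loss)

print(round(uniform_loss, 4), round(perplexity, 1))
```

A loss of `L` corresponds roughly to hesitating uniformly between `exp(L)` tokens, so training has reduced the effective number of candidates from 512 to a few dozen.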


In [26]:
print(generate(model, model.context_length, max_new_tokens = 200))

rong you you you

m

r you
It you tale

Youre you
It you are come und
Yourre ass you
You wonna love me
Mr one
It compens
It you sheping a big
Fiers were burwice orr
You couldnt take me to be soeras
When they deace in the cartigs swell
To make your mades quarp
Mever ato cartctor
The cants and the wice
Loo, swier there aint
Nocongooo will
You know gonna have no dongers soon
G
