# Let's build GPT: from scratch, in code, spelled out.

A development notebook for building a nanoGPT, following along with Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out." tutorial on YouTube.

GitHub: https://github.com/jt-thorpe/Lets-Build-GPT.git.

## The Dataset

In [71]:
# Get the tiny Shakespeare dataset
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-02-11 10:39:58--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.3’


2024-02-11 10:39:59 (2.44 MB/s) - ‘input.txt.3’ saved [1115394/1115394]



In [72]:
# Read file
with open('input.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [73]:
print("Length of text: ", len(text))

Length of text:  1115394


In [74]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



### Processing

We need to process the dataset into something workable:
- get the unique characters in the dataset; this is the ***vocabulary***
- record useful information, such as the size of the vocabulary
- tokenise the text; there are many ways to do this
  - character level (which we are doing)
  - word level
  - subword level, e.g. tiktoken from OpenAI or SentencePiece from Google

In [75]:
# Gather the vocabulary; that is, the unique characters in the text
vocab = sorted(list(set(text)))
vocab_size = len(vocab)
print("Vocabulary size: ", vocab_size)
print(f"Vocabulary: {vocab}")

Vocabulary size:  65
Vocabulary: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [76]:
# Create a mapping from character to numerical representation
char_to_index = {char: index for index, char in enumerate(vocab)}
index_to_char = {index: char for index, char in enumerate(vocab)}

encode = lambda text: [char_to_index[char] for char in text]  # encoder: takes a string, returns a list of integers
decode = lambda enc_text: ''.join([index_to_char[index] for index in enc_text])  # decoder: takes a list of integers, returns a string

print("Encoded text: ", encode("It's free real estate..."))
print("Decoded text: ", decode(encode("It's free real estate...")))

Encoded text:  [21, 58, 5, 57, 1, 44, 56, 43, 43, 1, 56, 43, 39, 50, 1, 43, 57, 58, 39, 58, 43, 8, 8, 8]
Decoded text:  It's free real estate...


**NB:** Mapping each character to an integer keeps the vocabulary very small (65 entries here), but makes the encoded sequences long. Per-word or per-subword encodings make the opposite trade: a much larger vocabulary, but shorter encoded sequences.
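To make that trade-off concrete, here is a minimal sketch comparing the two schemes on a short sample. The `sample` string and the naive whitespace "word tokeniser" are my own illustrative assumptions (loosely based on the play's opening lines, lowercased with punctuation removed); they are not how a real word-level or subword tokeniser works.

```python
# Illustrative sample only: lowercased and stripped of punctuation so that
# the character vocabulary is limited to letters plus the space character.
sample = (
    "first citizen before we proceed any further hear me speak all speak speak "
    "you are all resolved rather to die than to famish "
    "first you know caius marcius is chief enemy to the people "
    "let us kill him and have corn at our own price"
)

# Character level: tiny vocabulary, long encoded sequence
char_vocab = sorted(set(sample))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_to_id[ch] for ch in sample]

# Word level (naive whitespace split, an assumption): larger vocabulary, short encoded sequence
words = sample.split()
word_vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_to_id[w] for w in words]

print(f"char level: vocab size = {len(char_vocab)}, encoded length = {len(char_ids)}")
print(f"word level: vocab size = {len(word_vocab)}, encoded length = {len(word_ids)}")
```

Even on this tiny sample the word-level vocabulary is already larger than the character-level one, while the character-level encoding is several times longer.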

In [77]:
# Encode the entire text, store it in a tensor
import torch

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:250])  # The same 250 characters as before, but now encoded as integers

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

## Train/Validation Split

In [78]:
# Split the data into a training and validation set
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

#TODO: Why did Andrej use 0.9 for the training set here, why so much, 90% seems quite high? or is it not? Find out.

When training a model, we do not feed all of our data to the model in one go; that is far too computationally expensive. Instead, we feed in chunks, or blocks, at a time and train on those.

In addition, we want our model to be "*used to*" predicting from a context as small as a single character, so that later, even if it only sees one character, it "*knows*" how to make a prediction, and likewise for every context length in between, all the way up to block_size. As such, we need to define our block size.

## block_size

In [79]:
# Our data block_size
block_size = 8

train_data[:block_size + 1]  # +1 ensures we actually have 8 training examples for our block_size of 8

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [80]:
# An example to show why +1 is necessary
x = train_data[:block_size]
y = train_data[1:block_size + 1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f" When input is {context} then target is {target}")

 When input is tensor([18]) then target is 47
 When input is tensor([18, 47]) then target is 56
 When input is tensor([18, 47, 56]) then target is 57
 When input is tensor([18, 47, 56, 57]) then target is 58
 When input is tensor([18, 47, 56, 57, 58]) then target is 1
 When input is tensor([18, 47, 56, 57, 58,  1]) then target is 15
 When input is tensor([18, 47, 56, 57, 58,  1, 15]) then target is 47
 When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) then target is 58


Essentially, when we pass a block to the model, it is not just learning to predict the next token after the full block; it learns from **every** prefix of that block (in order), each paired with the token that follows it.

So, as demonstrated above, without the +1 we would only have 7 training examples for our block_size of 8. We would miss the chance to learn the next token given a full 8-character context, hence the additional character (token).

For each index i in the input, the token at index i+1 of the chunk is the target: the token whose probability the model is trained to raise, given everything up to and including index i.

## batch_size

We also need to define a batch_size. GPUs are designed for parallel processing, so it is most efficient to train on many chunks at once, stacked into a batch and processed in parallel, independently of each other.

In [81]:
torch.manual_seed(1337)  # Set seed for reproducibility

batch_size = 4  # How many independent sequences to process in parallel
block_size = 8  # The length of each sequence (i.e. the context size for prediction)

def get_batch(split):
    # Generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # Starting index for each sequence
    x = torch.stack([data[i:i+block_size] for i in ix])  # The input sequences
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # The target sequences
    return x, y

xb, yb = get_batch('train')
print("Inputs: ")
print(xb.shape)
print(xb)
print("\nTargets: ")
print(yb.shape)
print(yb)

print("\n-----\n")

for b in range(batch_size):  # batch dimension
    for t in range(block_size):  # time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context} then target is {target}")

Inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

Targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

-----

When input is tensor([24]) then target is 43
When input is tensor([24, 43]) then target is 58
When input is tensor([24, 43, 58]) then target is 5
When input is tensor([24, 43, 58,  5]) then target is 57
When input is tensor([24, 43, 58,  5, 57]) then target is 1
When input is tensor([24, 43, 58,  5, 57,  1]) then target is 46
When input is tensor([24, 43, 58,  5, 57,  1, 46]) then target is 43
When input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) then target is 39
When input is tensor([44]) then target is 53
When input is tensor([44, 53]) then target is 56
When input is tensor([44, 53, 56

`x = torch.stack([data[i:i+block_size] for i in ix])  # The input sequences`

`y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # The target sequences`

By using torch.stack here, we take each 1-D tensor (e.g. [24, 43, 58,  5, 57,  1, 46, 43]) and stack them as rows into a single 2-D tensor of shape 4x8 (batch_size x block_size), as shown in the output above.
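As a quick side-by-side, here is a small sketch of what torch.stack does compared with torch.cat. The tensor values are made up for illustration.

```python
import torch

# Two 1-D tensors of equal length (illustrative values)
rows = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8])]

stacked = torch.stack(rows)      # adds a new leading dimension: shape (2, 4)
concatenated = torch.cat(rows)   # joins along the existing dimension: shape (8,)

print(stacked.shape)
print(concatenated.shape)
```

torch.stack is what we want in get_batch, because each sequence must stay a separate row of the batch rather than being glued end to end.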

## Feeding the Neural Net

In [82]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)  # Set seed for reproducibility

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)  # (vocab_size, vocab_size) table: row i holds the next-token logits for token i

    def forward(self, idx, targets=None):
        """Forward phase of the network."""

        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B, T, C) tensor; here Batch = 4, Time = 8, Channels = vocab_size = 65

        if targets is None:
            loss = None
        else:
            # Reshape so the tensors match the layout cross_entropy expects: (N, C) logits and (N,) targets
            B, T, C = logits.shape
            logits = logits.view(B*T, C)  # Flatten the batch and time dimensions into one; C becomes the 2nd dimension
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """Generate a sequence of new tokens.
        
        Works on the level of batches.
        """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities; softmax maps the values into the range 0-1 so that they sum to 1
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


So, the output above is the random gibberish we get from our model while it is untrained. Let's now train it.
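As a sanity check on the loss of 4.8786 printed above, we can work out what a model that spreads its probability evenly over all 65 characters would score. This is a minimal sketch; the all-zero logits and random targets are stand-ins of mine, not values from the model.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 65  # size of the character vocabulary found earlier

# Loss of a model that assigns probability 1/65 to every character
uniform_loss = -math.log(1.0 / vocab_size)

# The same number via cross_entropy: all-zero logits give a uniform softmax
logits = torch.zeros(32, vocab_size)
targets = torch.randint(vocab_size, (32,))  # arbitrary targets; the loss does not depend on them here
torch_loss = F.cross_entropy(logits, targets).item()

print(f"{uniform_loss:.4f} {torch_loss:.4f}")
```

Both come out at roughly 4.17, so our untrained model is doing slightly worse than a uniform guess. That is what we would expect from randomly initialised logits.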

## Training the NNet

The optimiser adjusts the parameters of the model, using the gradients of the loss, so that the loss goes down.

In [83]:
# Create a PyTorch optimiser
optim = torch.optim.Adam(m.parameters(), lr=1e-3)  # 3e-4 is a popular learning rate; smaller networks like this one can tolerate higher rates

In [84]:
# A standard training loop
batch_size = 32

for steps in range(10000):

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = m(xb, yb)

    # Backward pass
    optim.zero_grad()
    loss.backward()
    optim.step()

print(f"Step {steps}, loss: {loss.item()}")

Step 9999, loss: 2.3796486854553223


**NB:** Remember, the model's parameters persist between runs of the training cell. Re-running it continues training from the current parameters rather than starting afresh, so the loss keeps improving.
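Here is a toy sketch of that behaviour. The single-parameter "model", the loss `(w - 3)^2` and the `train` helper are all hypothetical, chosen only so the effect is easy to see; they are not part of the notebook's model.

```python
import torch

w = torch.nn.Parameter(torch.tensor([5.0]))  # hypothetical one-parameter "model"
opt = torch.optim.Adam([w], lr=1e-2)

def train(steps):
    """Run `steps` optimiser updates on the toy loss (w - 3)^2; return the last loss seen."""
    for _ in range(steps):
        loss = ((w - 3.0) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

first = train(20)
second = train(20)  # picks up from where the first call stopped; w is not reset
print(first, second)
```

The second call reports a lower loss than the first, because both `w` and the optimiser state carry over between calls. To start afresh we would have to re-create both the model and the optimiser.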

In [85]:
# Let's see what our model has learned
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


llo br. ave aviasurf my, mayo t ivee iuedrd whar ksth y h bora s be hese, woweee; the! KI 'de, ulseecherd d o blllando;

Whe, oraingofof win!
RIfans picspeserer hee tha,
TOFonk? me ain ckntoty dedo bo'llll st ta d:
ELIS me hurf lal y, ma dus pe athouo
By bre ndy; by s afreanoo adicererupa anse tecorro llaus a!
OLeneerithesinthengove fal amas trr
TI ar I t, mes, n sar; my w, fredeeyong
THek' merer, dd
We ntem lud engitheso; cer ize helorowaginte the?
Thak orblyoruldvicee chot, pannd e Yolde Th li


As we can see, it is still gibberish, but this gibberish is starting to take form! (So fucking cool...)

### So why is this model still producing nothing useful?

Well, in this iteration the tokens are not "talking" to each other; they have no influence on one another. When making a prediction, the model only looks at the last character in the context.
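We can check this directly with a small sketch. It uses a stand-alone nn.Embedding as a stand-in for the bigram model's lookup table, and the two contexts are made-up indices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65  # matches the notebook's vocabulary size
table = nn.Embedding(vocab_size, vocab_size)  # same lookup-table idea as the bigram model

# Two different contexts that end in the same token (illustrative indices)
ctx_a = torch.tensor([[18, 47, 56, 57]])
ctx_b = torch.tensor([[1, 1, 1, 57]])

logits_a = table(ctx_a)[:, -1, :]  # next-token logits at the last time step, shape (1, 65)
logits_b = table(ctx_b)[:, -1, :]

print(torch.equal(logits_a, logits_b))  # True: the earlier tokens have no influence
```

However long the context is, only its final token decides the prediction.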

We need to have these tokens look at each other within their context, to make better predictions. This leads us onto building a **transformer**.

## Building a Transformer

### A mathematical trick in self-attention

In [86]:
# Consider the following toy example
torch.manual_seed(1337)
B, T, C = 4,8,2 # Batch, Time, Channels - The batches, the time steps and the channels (i.e. some information at each point in the sequence)
x = torch.randn(B, T, C) # Random input
x.shape

torch.Size([4, 8, 2])

So we have 8 tokens in each sequence of the batch, and currently they do not talk to each other. We want to "couple" them.

In particular, we want to couple them in a very specific way. For example, the token in the 5th position should **not** couple with the tokens in the 6th, 7th and 8th positions, i.e. with future tokens. It should only couple with those in the past: the 1st, 2nd, 3rd and 4th tokens.

We cannot use any information from the future, as the future is exactly what we are trying to *predict*. I.e. we're trying to predict the 6th token using the information from the previous tokens.

So how does a token communicate with the previous tokens? The simplest way would be to average the preceding elements.

So if I am the 5th token, I want to take the channels that hold the information at my step, but also the channels from the 4th, 3rd, 2nd and 1st steps. I average these and get a feature vector that summarises me (the 5th token) in the context of my history.

This is a ***very weak form of interaction***. It is very lossy: we lose a ton of information about the spatial arrangement of those tokens, but this is okay for now.

So what do we want to do for now:
- for every batch element independently, and for every t'th token in that sequence
- calculate the average of all the vectors of the previous steps plus this step

In [87]:
# We want `xbow[b,t] = mean_{i<=t} x[b,i]`
xbow = torch.zeros((B, T, C))  # bow = bag of words; a common phrase when averaging over words

# Loop not efficient, but good for understanding
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]  # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0)  # Average over the time dimension

In [88]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [89]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

- So the 1st row of xbow is just the 1st row of x (an average of a single row)
- the 2nd row of xbow is the average of the 1st and 2nd rows of x
- the 3rd row of xbow is the average of the 1st, 2nd and 3rd rows of x
- etc

### Here comes the trick!

For efficiency, we don't need to loop; we can do this very efficiently with matrix multiplication.

In [90]:
# Another toy example
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print("a=")
print(a)
print("--")
print("b=")
print(b)
print("--")
print("c=")
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


So, with some matrix magic, we can take a lower-triangular matrix of 1s and 0s (i.e. 0s at every index i,j above the main diagonal).

When we now do the matrix multiplication, each output row t is the sum of rows 0 to t of the other matrix: the same "this step plus all previous steps" pattern as in the loop above.

In [91]:
torch.tril(torch.ones(3,3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [92]:
# Another toy example
torch.manual_seed(42)

a = torch.tril(torch.ones(3,3))
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print("a=")
print(a)
print("--")
print("b=")
print(b)
print("--")
print("c=")
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


Voila! We now have efficient summing of rows without any looping, using matrix magic. Bon ap. With a small adjustment, we can then make this into the averages we desire.

In [93]:
# Another toy example
torch.manual_seed(42)

a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a, dim=1, keepdim=True)  # Normalise the rows
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print("a=")
print(a)
print("--")
print("b=")
print(b)
print("--")
print("c=")
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Here we see that each row of c is now the average of the corresponding row of b and all the rows before it.

### Vectorising the loop

In [113]:
# Consider the following toy example
torch.manual_seed(42)
B, T, C = 4,8,2 # Batch, Time, Channels - The batches, the time steps and the channels (i.e. some information at each point in the sequence)
x = torch.randn(B, T, C) # Random input
x.shape

torch.Size([4, 8, 2])

In [114]:
"""Version 1"""
# We want `xbow[b,t] = mean_{i<=t} x[b,i]`
xbow = torch.zeros((B, T, C))  # bow = bag of words; a common phrase when averaging over words

# Loop not efficient, but good for understanding
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]  # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0)  # Average over the time dimension

In [115]:
"""Version 2"""
wei = torch.tril(torch.ones(T,T))
wei = wei / wei.sum(1, keepdim=True)  # Normalise the rows
xbow2 = wei @ x  # (T,T) @ (B,T,C) ----> (B,T,C) (PyTorch broadcasts wei across the batch dimension, as if it were (B,T,T))

torch.allclose(xbow, xbow2)

True
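To see what that broadcast in Version 2 is doing, here is a small sketch that compares it with one explicit matrix multiplication per batch element. The seed and the variable names `out_broadcast` and `out_explicit` are my own.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # (T, T) averaging weights

out_broadcast = wei @ x                # (T, T) @ (B, T, C) -> (B, T, C)
out_explicit = torch.stack([wei @ x[b] for b in range(B)])  # one (T, T) @ (T, C) per batch element

print(out_broadcast.shape)
print(torch.allclose(out_broadcast, out_explicit))
```

The same (T, T) weight matrix is applied to every sequence in the batch, so no information ever crosses between batch elements.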

In [116]:
"""Version 3"""
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # Set the upper triangle to -inf : "The future cannot communicate with the past"
wei = F.softmax(wei, dim=1)  # Softmax over the time dimension
xbow3 = wei @ x

torch.allclose(xbow, xbow3)

True
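The reason Version 3 matches the plain average comes down to what softmax does with 0 and -inf: exp(0) = 1 and exp(-inf) = 0. A minimal sketch on a single made-up row, for a token that can see two positions out of four:

```python
import torch
import torch.nn.functional as F

# One row of wei after masking: two visible positions (0) and two future positions (-inf)
row = torch.tensor([0.0, 0.0, float('-inf'), float('-inf')])

probs = F.softmax(row, dim=0)
print(probs)  # tensor([0.5000, 0.5000, 0.0000, 0.0000])
```

The visible positions share the weight equally and the masked positions get exactly zero, which is the same as row 2 of the normalised lower-triangular matrix.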

In [117]:
xbow[0], xbow2[0], xbow3[0]

(tensor([[ 1.9269,  1.4873],
         [ 1.4138, -0.3091],
         [ 1.1687, -0.6176],
         [ 0.8657, -0.8644],
         [ 0.5422, -0.3617],
         [ 0.3864, -0.5354],
         [ 0.2272, -0.5388],
         [ 0.1027, -0.3762]]),
 tensor([[ 1.9269,  1.4873],
         [ 1.4138, -0.3091],
         [ 1.1687, -0.6176],
         [ 0.8657, -0.8644],
         [ 0.5422, -0.3617],
         [ 0.3864, -0.5354],
         [ 0.2272, -0.5388],
         [ 0.1027, -0.3762]]),
 tensor([[ 1.9269,  1.4873],
         [ 1.4138, -0.3091],
         [ 1.1687, -0.6176],
         [ 0.8657, -0.8644],
         [ 0.5422, -0.3617],
         [ 0.3864, -0.5354],
         [ 0.2272, -0.5388],
         [ 0.1027, -0.3762]]))

### Why "Version 3" over the others?

- Well, we first start out with the weight matrix, wei, as all 0s;
- we then fill the weight matrix with -inf for all **future** tokens, since we don't want to consider any values from the future
- the weights in the wei matrix can be thought of as each token's affinity for another token. Here they are all 0, which gives a plain average, but in general they will be data dependent: some tokens will become interested in other tokens, and to varying amounts
- so when we normalise (softmax) and then take the weighted sum, we're aggregating their values based on how interesting they find each other
- this is **ATTENTION**
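To get a feel for what "data dependent" weights could look like, here is a rough sketch that swaps the all-zero wei for scores computed from the tokens themselves. Using the raw dot product between token vectors as the affinity is purely my assumption for illustration; the notebook has not yet defined how the affinities are computed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# ASSUMPTION: affinity between two tokens = dot product of their vectors
scores = x @ x.transpose(-2, -1)                        # (B, T, T)

tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float('-inf'))   # hide future positions
wei = F.softmax(scores, dim=-1)                         # each row sums to 1
out = wei @ x                                           # (B, T, C) weighted averages

print(wei[0])
print(out.shape)
```

The rows of wei are no longer uniform: each token now weights its history unevenly, and differently for every sequence in the batch, while the upper triangle stays at zero.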