# Building GPT From Scratch

[Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=11s)

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 05/11/2025   | Martin | Created   | Notebook created to build GPT from scratch | 
| 10/11/2025   | Martin | Update  | Working on self-attention mechanism | 

# Content

* [Load Data](#load-data)
* [Bigram Model](#bigram-model)
* [Self-Attention](#self-attention)

In [5]:
%load_ext watermark

In [6]:
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Load Data

Using the Tiny Shakespeare dataset found [here](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt).

Training a model on Shakespeare's text to predict the next character given a prompt.

In [7]:
with open('data/shakespeare.txt', 'r') as f:
  text = f.read()

In [8]:
print(f"Length of dataset in characters: {len(text)}")

Length of dataset in characters: 1115394


In [9]:
# First thousand characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [10]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(f"Number of unique characters: {vocab_size}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Number of unique characters: 65


In [11]:
# Create mappers 
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda x: [stoi[c] for c in x] # Convert strings to index
decode = lambda x: ''.join([itos[c] for c in x]) # Convert index to strings
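
A quick round-trip check can confirm that the two mappers invert each other. The sketch below rebuilds them on a short sample string, which is an assumption for illustration (the notebook builds them from the full text), and writes `encode`/`decode` as named functions rather than lambdas.

```python
# Sample text is a stand-in for the full dataset (assumption for illustration)
sample_text = "First Citizen:\nSpeak, speak."

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
  """Map a string to a list of integer token ids."""
  return [stoi[c] for c in s]

def decode(ids):
  """Map a list of integer token ids back to a string."""
  return ''.join(itos[i] for i in ids)

ids = encode("speak")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Encoding a character that is not in the vocabulary raises a `KeyError`, so prompts have to be restricted to characters seen in the training text.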

Convert the entire text into indices, then split it into training and validation sets

In [12]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [13]:
sp = 0.9
idx = int(sp * len(data))
train_data = data[:idx]
val_data = data[idx:]

Each training sequence actually contains multiple training instances. The context grows from a single character up to the full `block_size`, and at each length the model predicts the next character

In [14]:
block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
  inp = x[:t+1]
  out = y[t]
  print(f"When input is {inp}, the target is {out}")

When input is tensor([18]), the target is 47
When input is tensor([18, 47]), the target is 56
When input is tensor([18, 47, 56]), the target is 57
When input is tensor([18, 47, 56, 57]), the target is 58
When input is tensor([18, 47, 56, 57, 58]), the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58


In [15]:
# Creating minibatch
torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
  data = train_data if split == 'train' else val_data
  idx = torch.randint(len(data) - block_size, (batch_size,))
  # Slice from the selected split (previously this always read train_data)
  x = torch.stack([data[i:i+block_size] for i in idx])
  y = torch.stack([data[i+1:i+1+block_size] for i in idx])
  return x, y

X_batch, y_batch = get_batch("train")
print("inputs:")
print(X_batch.shape)
print(X_batch)
print("outputs:")
print(y_batch.shape)
print(y_batch)

print("----")

for b in range(batch_size):
  for t in range(block_size):
    context = X_batch[b, :t+1]
    target = y_batch[b, t]
    print(f"When input is {context.tolist()}, the target is {target}")
  print()

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
outputs:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
When input is [24], the target is 43
When input is [24, 43], the target is 58
When input is [24, 43, 58], the target is 5
When input is [24, 43, 58, 5], the target is 57
When input is [24, 43, 58, 5, 57], the target is 1
When input is [24, 43, 58, 5, 57, 1], the target is 46
When input is [24, 43, 58, 5, 57, 1, 46], the target is 43
When input is [24, 43, 58, 5, 57, 1, 46, 43], the target is 39

When input is [44], the target is 53
When input is [44, 53], the target is 56
When input is [44, 53, 56], the target is 1
When input is [44, 53, 56, 1], the target is 58
When input is [44, 53, 56, 1, 58]

---

# Bigram Model

In [16]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x29ba91b5330>

In [17]:
class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    # Each token directly reads off the logits for the next token from a lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets=None):
    # B - batch size
    # T - time (position within the context window, up to block_size)
    # C - channels (logits per token; equal to vocab_size in this model)
    logits = self.token_embedding_table(idx) # (B, T, C)

    # Reshape logits and targets to the shapes F.cross_entropy expects
    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape
      logits = logits.view(B*T, C)
      targets = targets.view(B*T)

      loss = F.cross_entropy(logits, targets)
    return logits, loss
  
  def generate(self, idx, max_new_tokens):
    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
      # get the predictions
      logits, loss = self(idx)
      # focus on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=1)
      # sample from the distribution
      idx_next = torch.multinomial(probs, num_samples=1)
      # append the sampled index to the existing sequence
      idx = torch.cat((idx, idx_next), dim=1)
    return idx
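
To make the "lookup table" comment concrete, the sketch below checks that calling an `nn.Embedding` on token ids is the same as indexing rows of its weight matrix. The variable names (`table`, `same_as_rows`) and the three example ids are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([[18, 47, 56]])  # (B=1, T=3) token ids
logits = table(idx)                 # (1, 3, 65): one row of logits per token
same_as_rows = torch.equal(logits, table.weight[idx])
print(logits.shape, same_as_rows)
```

Because each row depends only on the token at that position, the bigram model cannot use any earlier context, which is the limitation self-attention addresses.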

In [18]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(X_batch, y_batch)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
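
As a rough sanity check on the untrained loss: a model that spreads probability uniformly over the 65 characters would have a cross-entropy of $-\ln(1/65)$, about 4.17. The observed value is higher, which suggests the randomly initialised logits are not uniform. The snippet below only computes the reference number.

```python
import math

vocab_size = 65  # number of unique characters reported above
# Cross-entropy of a uniform prediction over the vocabulary
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")
```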


In [19]:
starting_idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(starting_idx, max_new_tokens=100)[0].tolist()))


SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


Create a PyTorch optimiser. For Adam, a learning rate between 1e-4 and 1e-3 is a common starting range

In [20]:
# create a PyTorch optimiser
optimiser = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [21]:
# Training loop
batch_size = 32
for _ in range(100_000):
  X_batch, y_batch = get_batch('train')

  optimiser.zero_grad(set_to_none=True)
  logits, loss = m(X_batch, y_batch)

  loss.backward()
  optimiser.step()

print(loss.item())

2.5319576263427734
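
The single value printed above comes from the last minibatch only, so it can be noisy. One common alternative is to average the loss over several batches from each split. The sketch below is self-contained: random token ids and a bare `nn.Embedding` stand in for the encoded text and the bigram model, and `estimate_loss`, `eval_iters` and `splits` are hypothetical names, not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 10, 8, 16
# Random token ids stand in for the encoded text (assumption for illustration)
data = torch.randint(vocab_size, (1000,))
splits = {'train': data[:900], 'val': data[900:]}

def get_batch(split):
  d = splits[split]
  ix = torch.randint(len(d) - block_size, (batch_size,))
  x = torch.stack([d[i:i+block_size] for i in ix])
  y = torch.stack([d[i+1:i+1+block_size] for i in ix])
  return x, y

model = nn.Embedding(vocab_size, vocab_size)  # stand-in for the bigram table

@torch.no_grad()
def estimate_loss(eval_iters=20):
  """Average the cross-entropy over several batches for each split."""
  out = {}
  for split in splits:
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
      xb, yb = get_batch(split)
      logits = model(xb)  # (B, T, C)
      losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    out[split] = losses.mean().item()
  return out

losses = estimate_loss()
print(losses)
```

Comparing the two averages also gives an early signal of overfitting, since the validation split is never used for gradient updates.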


In [22]:
print(decode(m.generate(starting_idx, max_new_tokens=100)[0].tolist()))



Ofows ht IUS:
S:

ING flvenje ssutefr,
M:
War cl igagimous pray whars:
Panalit I It aithit terised 


---

# Self-Attention

The _self-attention_ mechanism lets the current token use information from previous tokens to predict the next token. There are many variations, but the best known is the one introduced in the Transformer architecture. Here we step through the different approaches in increasing complexity

1. Iteratively find weights by looping
2. Matrix multiplication to aggregate
3. Softmax
4. Self-attention

1. Iteratively find through elementwise looping

In [71]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [72]:
# Inefficient method of computing average (elementwise)
xbow = torch.zeros((B, T, C))
for b in range(B):
  for t in range(T):
    xprev = x[b, :t+1]
    xbow[b, t] = torch.mean(xprev, 0)

In [73]:
# Toy example: a lower-triangular matrix computes running sums and averages in one matmul
torch.manual_seed(43)
a = torch.tril(torch.ones(3, 3))
b = torch.randint(0, 10, (3,2)).float()
print("a=")
print(a)
print("----------")
print("b=")
print(b)
print("----------")
print("sum=")
print(a @ b)
print("Above shows that using a bottom triangle matrix can get the sum of "
      "the weights of previous tokens")
print("----------")

c = a / torch.sum(a, 1, keepdims=True)
print(c)
print("mean=")
print(c @ b)
print("Here is the average")
print("----------")

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
----------
b=
tensor([[8., 8.],
        [5., 7.],
        [5., 0.]])
----------
sum=
tensor([[ 8.,  8.],
        [13., 15.],
        [18., 15.]])
Above shows that using a bottom triangle matrix can get the sum of the weights of previous tokens
----------
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
mean=
tensor([[8.0000, 8.0000],
        [6.5000, 7.5000],
        [6.0000, 5.0000]])
Here is the average
----------


2. Matrix multiplication

In [74]:
# Using matrix multiplication
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdims=True)
xbow2 = wei @ x # (T, T) @ (B, T, C) ---> (B, T, C)
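
The loop version and the matrix version should agree up to floating-point rounding. The sketch below recomputes both in one place and compares them; the `atol=1e-5` tolerance is an assumption chosen to absorb float32 rounding differences between the two computation orders.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
  for t in range(T):
    xbow[b, t] = x[b, :t+1].mean(0)

# Version 2: row-normalised lower-triangular matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x  # (T, T) broadcasts against (B, T, C)

same = torch.allclose(xbow, xbow2, atol=1e-5)
print(same)
```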

3. Softmax

In [75]:
# Using softmax to create the original weight matrix
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow3, xbow2)

True
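
The reason the softmax version reproduces the averaging matrix is that $e^{-\infty} = 0$: masked positions get zero weight, and the remaining zeros share the weight equally. The sketch below prints the resulting matrix for a small `T` (chosen here for readability) and compares it with the row-normalised `tril`.

```python
import torch
from torch.nn import functional as F

T = 4  # small size so the matrix is easy to read
tril = torch.tril(torch.ones(T, T))

# Start from all-zero affinities, hide future positions, then normalise
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
print(wei)

# Reference: uniform averaging weights over current and past positions
uniform = tril / tril.sum(1, keepdim=True)
```

This form matters for the next step: once the zeros are replaced by data-dependent affinities, the same mask-then-softmax code yields non-uniform weights.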

4. Self-attention

- `key` - What each token contains: a representation that the queries of other tokens are matched against
- `query` - What the current token is looking for: a request for information from other tokens
- `query @ key.T` - Works like a question-answer system: "Which tokens are relevant to the current token?" -> If the score is high, more of that token's value flows into the current token's output
- `value` - What a token communicates once it is attended to. The value projection is a learned linear layer shared across all tokens and batches

In [None]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Single Head of self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-1, -2) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros(T, T)
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
out.shape

torch.Size([4, 8, 16])
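
One way to check that the triangular mask really blocks information from later tokens is to perturb the last token and confirm that the outputs at earlier positions do not change. This is a sketch under assumptions: the `head` helper and the perturbation test are illustrative additions that wrap the same steps as the cell above.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C, head_size = 1, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

def head(x):
  """Single masked self-attention head (hypothetical helper)."""
  wei = query(x) @ key(x).transpose(-1, -2)        # (B, T, T) affinities
  wei = wei.masked_fill(tril == 0, float('-inf'))  # hide future positions
  wei = F.softmax(wei, dim=-1)
  return wei @ value(x)                            # (B, T, head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(C)  # perturb only the last token

with torch.no_grad():
  out, out_changed = head(x), head(x_changed)

earlier_same = torch.allclose(out[:, :-1], out_changed[:, :-1])
last_differs = not torch.allclose(out[:, -1], out_changed[:, -1])
print(earlier_same, last_differs)
```

Without the `masked_fill` line, every position would attend to the perturbed token and the earlier outputs would change as well.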

<u>Notes on Attention</u>

- Attention is a __communication mechanism__. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights
- There is no notion of space between the tokens. Attention acts over a set of vectors, which is why we need positional encodings
- Each example across the batch dimension is processed independently; examples never "talk" to each other
- In this implementation, we mask the tokens that come after the token in question because we are performing a text generation task, so tokens are not allowed to observe future values. However, in other tasks, e.g. sentiment analysis, this constraint can be removed
  - Removing the masking done with `tril` allows all tokens to communicate (an encoder block). The block here is called a decoder attention block because it has triangular masking, and it is usually used in autoregressive settings, e.g. language modeling
- _Self-attention_ means the `keys` and `values` are produced from the same source as the `queries`. In _cross-attention_ the queries are produced from the original data, but the `keys` and `values` come from some external source
- Scaled attention additionally multiplies the `wei` matrix by $1/\sqrt{\text{head\_size}}$. This way, when the inputs $Q$ and $K$ have unit variance, `wei` has unit variance too, and softmax stays diffuse instead of saturating
  - See below

In [77]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [79]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

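Softmax in the second cell has a much sharper distribution: multiplying the inputs by 8 pushes most of the weight onto the largest entry, which is what happens to attention scores when their variance is large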
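
The variance claim behind scaled attention can be checked numerically. The sketch below draws unit-variance queries and keys, then compares the variance of the raw dot products with the variance after scaling by `head_size**-0.5`. The sample count `n` and the variable names are assumptions for illustration; a large `n` is used only to make the variance estimate stable.

```python
import torch

torch.manual_seed(1337)
n, head_size = 1000, 16
q = torch.randn(n, head_size)  # unit-variance queries
k = torch.randn(n, head_size)  # unit-variance keys

wei_raw = q @ k.T                        # variance grows with head_size
wei_scaled = wei_raw * head_size**-0.5   # scaled dot products

raw_var = wei_raw.var().item()
scaled_var = wei_scaled.var().item()
print(raw_var, scaled_var)
```

The raw variance should come out close to `head_size` and the scaled variance close to 1, so the softmax inputs stay in the diffuse regime shown in the first cell above.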

In [None]:
%watermark