## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT. 

Modifications:
- Input is Don Quixote.
- Tokenizer is SentencePiece.


In [1]:
# read it in to inspect it
with open('don-quixote.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1714083


In [3]:
# let's look at the first 1000 characters
print(text[:1000])

VOLUME I.


CHAPTER I.

WHICH TREATS OF THE CHARACTER AND PURSUITS OF THE FAMOUS GENTLEMAN DON
QUIXOTE OF LA MANCHA


In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income. The rest of it
went in a doublet of fine cloth and velvet breeches and shoes to match
for holidays, while on week-days he made a brave figure in his best
homespun. He had in his house a housekeeper past forty, a niece under
twenty, and a lad for the field and market-place, who used to saddle the
hack as well as handle the bill-hook. The age of this gentleman of ours
was bordering on fifty; he was of a hardy habit, spare, gaunt-featured, a
very early riser and a gre

In [4]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text))) # sorted list of the unique characters in text
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !"'(),-.01246:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
71


Using SentencePiece (a subword tokenizer) instead of the character-level tokenizer.
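For contrast, the character-level tokenizer that this notebook replaces can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: `sample`, `stoi`, `itos`, `char_encode` and `char_decode` are hypothetical names, and the short sample string stands in for the full text.

```python
# Character-level tokenizer sketch (illustrative; the notebook uses SentencePiece instead).
sample = "In a village of La Mancha"  # stand-in for the full text

sample_chars = sorted(set(sample))                    # unique characters, sorted
stoi = {ch: i for i, ch in enumerate(sample_chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}              # integer id -> char

def char_encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[c] for c in s]

def char_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

char_ids = char_encode("La Mancha")
print(char_ids)
print(char_decode(char_ids))
print(len(char_ids), "tokens for", len("La Mancha"), "characters")
```

A character-level tokenizer always emits one token per character. In this notebook the corpus is 1,714,083 characters but 425,906 SentencePiece tokens, roughly four characters per token.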

In [5]:
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="don-quixote.txt",
    model_prefix="quixote",
    vocab_size=4000,  # tentative; may tune later
    character_coverage=1.0,  # cover every character in the corpus
    pad_id=3,  # reserve id 3 for a padding token
)


In [6]:
sp = spm.SentencePieceProcessor(model_file="quixote.model")

vocab_size = sp.get_piece_size()

encode = lambda s: sp.encode(s, out_type = int)
decode = lambda l: sp.decode(l)

In [7]:
test = "hello fellas"
print(encode(test))
print(decode(encode(test)))
# yes, the round trip works

[14, 187, 101, 634, 209]
hello fellas


In [8]:
print(f'Vocab size: {vocab_size}')

#see what some tokens look like
print('\nSample vocabulary pieces:')
for i in range(10, 30):
    print(f'  Token {i}: "{sp.id_to_piece(i)}"')

Vocab size: 4000

Sample vocabulary pieces:
  Token 10: "▁that"
  Token 11: "▁a"
  Token 12: "▁in"
  Token 13: "▁I"
  Token 14: "▁he"
  Token 15: ";"
  Token 16: "▁it"
  Token 17: "▁""
  Token 18: "."
  Token 19: "▁for"
  Token 20: "▁his"
  Token 21: "▁as"
  Token 22: "ed"
  Token 23: "▁be"
  Token 24: "▁not"
  Token 25: "▁is"
  Token 26: "▁was"
  Token 27: "▁him"
  Token 28: "ing"
  Token 29: "▁with"


In [9]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

print(f"Data shape: {data.shape}") # Shakespeare was ~1.1M tokens at character level
print(f"dtype: {data.dtype}")
print(f"first 20 tokens: {data[:20]}")
print(f'decoded back: "{decode(data[:20].tolist())}"')


Data shape: torch.Size([425906])
dtype: torch.int64
first 20 tokens: tensor([1078,  266,  503,  461, 2299,   13,   18,  299,   13,   18,  547, 2097,
         194,  214,  163,   45, 2208,  248,  341,  248])
decoded back: "VOLUME I. CHAPTER I. WHICH TREATS OF THE CHARA"


In [10]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
print(f"train tokens: {len(train_data)} val tokens: {len(val_data)}")

train tokens: 383315 val tokens: 42591


In [11]:
# the tutorial uses a smaller block size, but these subword tokens carry more meaning, so I'll use larger blocks
block_size = 64
print(f"first block_size+1 tokens: {train_data[:block_size+1]}")
print(f'decoded: "{decode(train_data[:block_size+1].tolist())}"')

first block_size+1 tokens: tensor([1078,  266,  503,  461, 2299,   13,   18,  299,   13,   18,  547, 2097,
         194,  214,  163,   45, 2208,  248,  341,  248,  600,  252,  995,  466,
         308,  461,  341,  194,  461,   76,  252,  194,  214,  163,  559, 3444,
         194, 3657,  550,  565,  214, 2185, 2601,  393,   11,  395,    8,  505,
         630,    4,    5,  221,    8,   52,   13,   34,   71,  465,    7,  364,
           7,  263,    4,   72, 1909])
decoded: "VOLUME I. CHAPTER I. WHICH TREATS OF THE CHARACTER AND PURSUITS OF THE FAMOUS GENTLEMAN DON QUIXOTE OF LA MANCHA In a village of La Mancha, the name of which I have no desire to call to mind, there lived"


In [12]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(min(8, block_size)):  # first 8 examples
    context = x[:t+1]
    target = y[t]
    print(f'input: {context.tolist()} ("{decode(context.tolist())}") -> target: {target.item()} ("{decode([target.item()])}")')

input: [1078] ("V") -> target: 266 ("O")
input: [1078, 266] ("VO") -> target: 503 ("L")
input: [1078, 266, 503] ("VOL") -> target: 461 ("U")
input: [1078, 266, 503, 461] ("VOLU") -> target: 2299 ("ME")
input: [1078, 266, 503, 461, 2299] ("VOLUME") -> target: 13 ("I")
input: [1078, 266, 503, 461, 2299, 13] ("VOLUME I") -> target: 18 (".")
input: [1078, 266, 503, 461, 2299, 13, 18] ("VOLUME I.") -> target: 299 ("CHAPTER")
input: [1078, 266, 503, 461, 2299, 13, 18, 299] ("VOLUME I. CHAPTER") -> target: 13 ("I")


In [13]:
torch.manual_seed(1337)
batch_size = 4

def get_batch(split):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print(f"inputs shape:  {xb.shape}")
print(f"targets shape: {yb.shape}")

inputs shape:  torch.Size([4, 64])
targets shape: torch.Size([4, 64])
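The key property of `get_batch` is that each target row is the input row shifted one position to the right. Below is a minimal pure-Python sketch of the same sampling logic, using a toy list in place of `train_data`. The names `toy_data`, `toy_block_size`, `toy_batch_size` and `toy_get_batch` are illustrative and not part of the notebook.

```python
import random

random.seed(1337)
toy_data = list(range(100))   # stand-in for the token tensor
toy_block_size = 8
toy_batch_size = 4

def toy_get_batch(data, block_size, batch_size):
    """Sample random windows; targets are the inputs shifted by one token."""
    # the largest valid start leaves room for block_size inputs plus one extra target
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

xb_toy, yb_toy = toy_get_batch(toy_data, toy_block_size, toy_batch_size)
for xrow, yrow in zip(xb_toy, yb_toy):
    print(xrow, "->", yrow)
```

Because the toy data is just `0..99`, the shift is easy to see by eye: every target row starts one value later than its input row.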


In [14]:
print(xb) # our input to the transformer

tensor([[ 113,    7, 2650, 1449,   80,  181,  482,   67,    7,  136, 1823,    4,
           19,    5, 1327,   25,   21,  578,  182, 2890,   12,  922,    4,    6,
          374,   32,    9,   97, 1274,  631,  381,   37,    6,    5, 3608,   22,
          109,    4,    6, 2871,  509, 2170,   96,   12,  445,  291,   18,  599,
          424, 1969,  758,   73,  415,  464,  998,   30,  887, 3012,   46,   12,
          286,    4,  569,    6],
        [ 266,  855,  194,  282,   76, 3992,  252,  589,   57,  700,   76, 1792,
          466, 2576, 1920,  194,  601, 1703,  598,  385,   76,  194,  728, 1111,
          194, 2935,  559, 1103,  163,  369, 1742,  248,  341,  369,  266,  779,
          728,  840,  555, 1111,  194,   76, 1499,  214, 2220, 2821,  299,  526,
         3992, 3992,   76,  214,  514, 1504,   45, 2936, 2503,  550,  565,  466,
         1178, 3057, 1226, 2210],
        [  20, 1870,    6, 2944,    7,   20,   92, 2962,   15,    5,  787,   75,
           14,  371,    7,   13,  102,   

In [15]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

"""
Same bigram model as the tutorial; the only change is that vocab_size is now 4000 instead of 65.
This means the embedding table is much larger (4000 x 4000 instead of 65 x 65).
"""
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx) # (B,T,C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(f"logits shape: {logits.shape}")
print(f"loss: {loss}")
# expected initial loss ≈ -ln(1/4000) ≈ 8.29
print(f"Expected initial loss: {-torch.log(torch.tensor(1.0/vocab_size)):.2f}")

print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

logits shape: torch.Size([256, 4000])
loss: 8.814183235168457
Expected initial loss: 8.29
 ⁇ witted quietly She neither Quintanona historian wouldst strip overwhelmed lance into carter always fifteen interest bearing Andalusia plaza victory imaginedry entrance dream frightened teach graciousguish body guest quietly madness memor farmer cheer dainty be attire angry France plight worth shepherds ugly weary placeate bone shepherdessen queenwhich concluded robb esquire montera more them eyebrows WAS travelling cousin fault duchessardut miracle LETTER station surname absent imagination caveers entertain plac sense dozen Master abundance grow Saragossa black bigches L ta doubten rash writing straight explaincatime asked presented describe misery slash
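The expected initial loss printed above comes from assuming a uniform prediction over the vocabulary, in which case cross-entropy is -ln(1/V) = ln(V). A quick check in plain Python, comparing the 4000-piece vocabulary with the tutorial's 65 characters, along with the number of entries in the (V, V) bigram table. The helper names `uniform_loss` and `bigram_table_entries` are illustrative.

```python
import math

def uniform_loss(vocab):
    """Cross-entropy of a uniform guess over `vocab` tokens: -ln(1/vocab)."""
    return math.log(vocab)

def bigram_table_entries(vocab):
    """nn.Embedding(vocab, vocab) stores one row of logits per token."""
    return vocab * vocab

for V in (65, 4000):
    print(f"V={V:5d}  uniform loss={uniform_loss(V):.2f}  "
          f"bigram table entries={bigram_table_entries(V):,}")
```

The measured initial loss (8.81) sits above the uniform value (8.29). A likely reason is that `nn.Embedding` weights are initialized from a standard normal, so the initial logits are not uniform.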


In [16]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [17]:
batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"loss after 1000 steps: {loss.item()}")


loss after 1000 steps: 7.879528999328613


In [18]:
print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200)[0].tolist()))

 ⁇ red dash rateForagnificence time spur have ke matchGood A exist finep HISTORY boobyCH laws lank ermine fallen sharp truth uncover fairThe shape terms defiance still derive cost wrote evident tempted withdrew fountain clothesot almost haste grievance prudent suspended credit carriesng slave French coast whole sorrow establishdependent lift send adroit circuit withdraw Pasamonte As girths wisdom Rodrigo slash stickTellter folly leave companion sought yard looking torture penance quality fault why feed conversation requisite conscience firmly before usual dear cleverINGk passion bare alarm very DULCINEA once E Montalvan devotion persuasion figure capital BEF Ragged follow trumpet covered manifest persecute misfortunewparent weaksoever bl ceremonyaz window witness reap outcryBe mother filled with amazement GENTLEMAN rest cleav knock Roland verily persuasion beg signature M twentyendWho together burned wickedand Camilla mishap lionSay he bad opposite inquir distant correct prepared Mirro

## The mathematical trick in self-attention

In [19]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


The takeaway is that each row of `a` "reads" past information:

- token 1: [1, 0, 0] only sees itself
- token 2: [1, 1, 0] sees one previous token and itself
- token 3: [1, 1, 1] sees itself and both previous tokens

What if the model could learn which previous tokens matter more? Then row 3 might look like [0.2, 0.7, 0.1] (weights that sum to 1).
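To make the idea of non-uniform weights concrete, here is a small plain-Python sketch that reuses the values of `b` from the toy example and compares a uniform running average with a hypothetical learned row [0.2, 0.7, 0.1], chosen so the weights sum to 1. The names `matmul`, `b_rows`, `uniform_w` and `learned_w` are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

b_rows = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]   # same values as b in the toy example

uniform_w = [[1.0, 0.0, 0.0],
             [1 / 2, 1 / 2, 0.0],
             [1 / 3, 1 / 3, 1 / 3]]             # plain running averages
learned_w = [[1.0, 0.0, 0.0],
             [0.5, 0.5, 0.0],
             [0.2, 0.7, 0.1]]                   # hypothetical: token 2 matters most to token 3

print("uniform row 3:", matmul(uniform_w, b_rows)[2])
print("learned row 3:", matmul(learned_w, b_rows)[2])
```

With the hypothetical weights, the third output row is pulled toward the second row of `b`, which is exactly the kind of data-dependent emphasis that self-attention learns.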



In [20]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

In [21]:
#we want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)

`xbow` stands for "x bag of words": each position averages itself and all previous tokens as its context.

In [26]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


False

Using -inf instead of 0 for masked positions: -inf becomes 0 after softmax (e^-inf = 0), and the remaining positions, which all hold the same value, get equal shares.
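The masking claim can be checked without PyTorch. The sketch below builds the same lower-triangular pattern with plain lists and applies a hand-written softmax; `softmax`, `T_toy`, `masked` and `weights` are illustrative names, not notebook variables.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

T_toy = 4
NEG_INF = float("-inf")
# row t: 0.0 for positions i <= t (visible), -inf for i > t (future, masked)
masked = [[0.0 if i <= t else NEG_INF for i in range(T_toy)] for t in range(T_toy)]
weights = [softmax(row) for row in masked]
for row in weights:
    print([round(w, 4) for w in row])
```

Each row comes out as equal shares over the visible positions and exact zeros over the masked ones, matching the pattern of the `wei` matrix in the notebook.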

In [None]:
wei = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x 
torch.allclose(xbow,xbow3)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [None]:
#version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) broadcasts to (B, T, T); (B, T, T) @ (B, T, C) -> (B, T, C)
torch.allclose(xbow, xbow2)

False

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


False
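The `False` results above deserve a second look, since all three versions compute the same running mean. One plausible cause (an assumption, not verified against this run) is float32 rounding combined with the default tolerances of `torch.allclose`, which tests `|a - b| <= atol + rtol * |b|` with `rtol=1e-5` and `atol=1e-8`. For averages that land near zero, the allowed error becomes tiny. The plain-Python sketch below reproduces that test on a hypothetical pair of values; `is_close`, `a_val` and `b_val` are illustrative names.

```python
def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Element-wise test documented for torch.allclose: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

# hypothetical pair: a near-zero average computed two ways,
# differing by an amount on the scale of float32 rounding
a_val = 1.00e-4
b_val = 1.00e-4 + 5e-8

print(is_close(a_val, b_val))              # default tolerances
print(is_close(a_val, b_val, atol=1e-6))   # looser absolute tolerance
```

If rounding is indeed the cause, a call such as `torch.allclose(xbow, xbow2, atol=1e-6)` would be expected to return `True`, and inspecting `(xbow - xbow2).abs().max()` is a direct way to confirm how large the discrepancy actually is.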

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [None]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Examples across the batch dimension are of course processed completely independently and never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally divides `wei` by sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` has unit variance too, so Softmax stays diffuse and does not saturate too much. Illustration below.

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(1.0918)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) #gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
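The variance argument behind the scaling can also be checked empirically in plain Python. The sketch below draws many pairs of unit-variance vectors and compares the variance of their raw dot products with the variance after multiplying by `head_dim ** -0.5`. The names `head_dim`, `n_samples`, `raw_scores` and `scaled_scores` are illustrative, and the sample count is an arbitrary choice.

```python
import random
import statistics

random.seed(0)
head_dim = 16       # plays the role of head_size
n_samples = 2000    # arbitrary; larger gives a tighter estimate

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

raw_scores, scaled_scores = [], []
for _ in range(n_samples):
    q_vec = [random.gauss(0.0, 1.0) for _ in range(head_dim)]
    k_vec = [random.gauss(0.0, 1.0) for _ in range(head_dim)]
    s = dot(q_vec, k_vec)
    raw_scores.append(s)
    scaled_scores.append(s * head_dim ** -0.5)

var_raw = statistics.variance(raw_scores)
var_scaled = statistics.variance(scaled_scores)
print(f"variance of q.k:                  {var_raw:.2f}")
print(f"variance of q.k / sqrt(head_dim): {var_scaled:.2f}")
```

The raw variance should land near `head_dim` and the scaled variance near 1. Unscaled scores therefore behave like the "times 8" softmax input above: large in magnitude and prone to producing near one-hot weights.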

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1): # momentum is unused; leftover from BatchNorm1d
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # per-example mean over features (dim 1)
    xvar = x.var(1, keepdim=True) # per-example variance over features (dim 1)
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

$d_k$ is the head size. Dividing by $\sqrt{d_k}$ is a normalization that keeps the variance of `wei` near 1 when Q and K have unit variance; without it, softmax saturates toward one-hot vectors.
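As a rough illustration of the formula, here is a single attention head written with plain Python lists. It is a sketch under stated assumptions rather than the notebook's implementation: it adds the causal (lower-triangular) mask used earlier in this notebook, which the formula itself does not include, and all names (`causal_attention`, `rand_matrix`, `Q_toy`, `K_toy`, `V_toy`) are hypothetical.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with future positions masked out.

    Q, K: lists of T vectors of length d_k; V: list of T vectors.
    Returns (outputs, weights).
    """
    T, d_k = len(Q), len(Q[0])
    weights = []
    for t in range(T):
        scores = []
        for i in range(T):
            if i <= t:
                scores.append(sum(q * k for q, k in zip(Q[t], K[i])) / math.sqrt(d_k))
            else:
                scores.append(float("-inf"))   # future position: masked
        weights.append(softmax(scores))
    outputs = [[sum(weights[t][i] * V[i][c] for i in range(T))
                for c in range(len(V[0]))]
               for t in range(T)]
    return outputs, weights

def rand_matrix(rows, cols):
    """Matrix of standard-normal samples as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
Q_toy, K_toy, V_toy = rand_matrix(4, 8), rand_matrix(4, 8), rand_matrix(4, 8)
out_toy, wei_toy = causal_attention(Q_toy, K_toy, V_toy)
for row in wei_toy:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its output is simply its own value vector; later rows spread their weight over the visible positions according to the query-key scores.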

In [None]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [None]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))
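The two checks above show that the normalization runs across the features of each example, not across the batch. The same per-row computation can be sketched in plain Python, with `gamma=1` and `beta=0` left out for brevity; `layer_norm_row` and `feat_row` are illustrative names, and the input distribution is an arbitrary choice.

```python
import random
import statistics

def layer_norm_row(row, eps=1e-5):
    """Normalize one example across its features (gamma=1, beta=0 omitted)."""
    mean = statistics.fmean(row)
    var = statistics.variance(row)   # unbiased (N-1), like torch's default x.var()
    return [(v - mean) / (var + eps) ** 0.5 for v in row]

random.seed(1337)
feat_row = [random.gauss(3.0, 5.0) for _ in range(100)]   # deliberately off-center features
normed_row = layer_norm_row(feat_row)
print(statistics.fmean(normed_row), statistics.stdev(normed_row))
```

Whatever the input mean and spread, each normalized row ends up with mean close to 0 and standard deviation close to 1. Note that `torch.nn.LayerNorm` uses the biased variance estimator, so its output standard deviation differs very slightly from the class above.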