## Development notebook 
Walking through the process step by step to build intuition 

In [1]:
# Tiny Shakespeare: Shakespeare's works concatenated into one ~1 MB text file 
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-06-01 18:30:23--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2025-06-01 18:30:23 (15.7 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [2]:
with open('input.txt', 'r', encoding='utf-8') as f: 
    text = f.read() 

In [None]:
# 1M chars roughly 
print("number of chars in dataset: ", len(text))

number of chars in dataset:  1115394


In [4]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [5]:
# get unique characters, sorted
# funny that 3 is the only number? 
chars = sorted(list(set(text))) 
vocab_size = len(chars) 
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
# mapping from characters to integers and back 
# this is tokenizing! 
# Google uses "SentencePiece" https://github.com/google/sentencepiece
# OpenAI uses tiktoken 
# Tradeoff is vocabulary size vs. sequence length: our character vocab is tiny (65),
# so the same text takes many more tokens than it would with a subword tokenizer 
stoi = {ch:i for i,ch in enumerate(chars)} # string to integer 
itos = {i:ch for i,ch in enumerate(chars)} # integer to string 

encode = lambda s: [stoi[c] for c in s] # encodes chars 
decode = lambda l: ''.join([itos[i] for i in l]) # decodes numbers 

In [7]:
print(encode("hellooo"))
print(decode(encode("hellooo")))

[46, 43, 50, 50, 53, 53, 53]
hellooo
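
To make the vocabulary-versus-sequence-length tradeoff concrete, here is a minimal sketch comparing the character-level scheme above with a naive whitespace word-level split. The sample string and the word-level tokenizer are illustrative assumptions only; SentencePiece and tiktoken use subword units, which this does not attempt to reproduce.

```python
# Toy comparison: character-level vs. naive word-level tokenization.
# `sample` and the whitespace split are illustrative assumptions only.
sample = "First Citizen: Before we proceed any further, hear me speak."

# Character level: one token per character, so the sequence is long.
char_vocab = sorted(set(sample))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_stoi[ch] for ch in sample]

# Word level: one token per whitespace-separated word, so the sequence is short,
# but the vocabulary keeps growing as more text is added.
words = sample.split()
word_vocab = sorted(set(words))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_stoi[w] for w in words]

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_ids)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_ids)}")
```

On a sample this short the word vocabulary still looks small; the point is the sequence length, and that the character vocabulary is capped at the 65 symbols found in the dataset no matter how much text is added.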


In [9]:
import torch # PyTorch. I had to use Python 3.8 here; my PyTorch install did not support 3.13 

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [10]:
# train validation split 
n = int(0.9*len(data))
train = data[:n]
val = data[n:] # held-out data, used to check that the model generalizes rather than memorizes 

In [11]:
# We want the transformer to get used to seeing contexts of any length, 
# from a single char up to block_size chars 
block_size = 8 # aka context size  
x = train[:block_size]
y = train[1:block_size+1]
for t in range(block_size): 
    context = x[:t+1]   
    target = y[t] # y[t] = x[t+1] by definition. always 1 step ahead 
    print(f"when input is {context} the target is: {target}")

when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


In [None]:
# different way of writing the above, more intuitive to me  
block_size = 8 
sample = train[:block_size+1]
for t in range(block_size): 
    context = sample[:t+1]   
    target = sample[t+1]  
    print(f"when input is {context} the target is: {target}")

when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


In [None]:
torch.manual_seed(1337) 
batch_size = 4 # how many text sequences we'll process in parallel 
block_size = 8 # max context length 

def get_batch(split):  
    data = train if split == 'train' else val 
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random start offsets into the chosen split
    x = torch.stack([data[i:i+block_size] for i in ix]) # stack = stack 1D tensors as rows
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x,y 

In [13]:
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape) # 4 sets of length-8 text, randomly sampled from data
print(xb) 

print('targets:')
print(yb.shape) # same as the above but offset by 1 (+1) 
print(yb)

print('-----')
# all of our 32 training examples / observations
for b in range(batch_size): # rows 
    for t in range(block_size): # examples within rows 
        context = xb[b, :t+1]
        target = yb[b,t] # yb[b,t] is the token right after xb[b,t] (same as xb[b,t+1], except at the last position)
        print(f"when input is {context.tolist()} then target is {target}")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
-----
when input is [24] then target is 43
when input is [24, 43] then target is 58
when input is [24, 43, 58] then target is 5
when input is [24, 43, 58, 5] then target is 57
when input is [24, 43, 58, 5, 57] then target is 1
when input is [24, 43, 58, 5, 57, 1] then target is 46
when input is [24, 43, 58, 5, 57, 1, 46] then target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] then target is 39
when input is [44] then target is 53
when input is [44, 53] then target is 56
when input is [44, 53, 56] then target is 1
when input is [44, 53, 56, 1] then target is 58
when input is [44, 53, 56, 1, 58]

### Starting with a bigram language model (the simplest baseline to build on) 

In [15]:
import torch 
import torch.nn as nn 
from torch.nn import functional as F 
torch.manual_seed(1337) 

# models subclass nn.Module so that PyTorch can track their parameters 
class BigramLanguageModel(nn.Module): 
    def __init__(self, vocab_size): 
        super().__init__() 
        # rows are the current token, columns are the possible next tokens 
        # nn.Embedding is just a lookup table: indexing with a token id returns that row 
        # the values are logits (unnormalized scores), not probabilities; softmax turns them into probabilities 
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None): 
        # this will be (B,T,C)
        # B = batch, T = time, C = channel 
        # in our case b is 4, t is 8, c is vocab size or 65 
        # i.e. will work for whole batch at once  
        logits = self.token_embedding_table(idx) 

        if targets is None: 
            loss = None 
        else: 
            # how well are we predicting next char based on logits?  
            # cross_entropy wants the class dimension second, i.e. (B,C,T), so we flatten to (B*T, C) instead 
            B, T, C = logits.shape 
            logits = logits.view(B*T, C) # stretching array out to be 2D 
            targets = targets.view(B*T) 
            loss = F.cross_entropy(logits, targets) 

        return logits, loss
    
    def generate(self, idx, max_new_tokens): 
        # idx = current context of some characters (some batch) 
        # goal: take (B,T) make it (B, T+1), (B, T+2), ... (B, T+max_new_tokens)
        # max_new_tokens = how many more chars we want to generate 
        for _ in range(max_new_tokens): 
            # get predictions
            # self(idx) calls forward(idx): nn.Module.__call__ dispatches to forward 
            logits, loss = self(idx) 
            # keep only the last time step: a bigram model predicts from the newest char alone 
            logits = logits[:, -1, :] # becomes (B, C) 
            # apply softmax to get probs 
            probs = F.softmax(logits, dim=-1) # still (B,C) 
            # sample from the distribution 
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) --> 1 for each batch
            # append sampled index to the running sequence 
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

        return idx 
    
m = BigramLanguageModel(vocab_size) 
logits, loss = m(xb,yb) 
print(logits.shape)
print(loss)

# With perfectly uniform predictions the loss would be -ln(1/65) = ln(65), about 4.17 
# We get 4.88, so the random initial logits are not quite uniform 

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
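
A small sketch to check two of the points in the cell above, assuming the same vocabulary size of 65: that `nn.Embedding` is a plain row lookup (which is why its output can be read as next-token logits), and that equal logits give a loss of ln(65), about 4.17. The names `table`, `ids` and `uniform_logits` are hypothetical stand-ins, not part of the model above.

```python
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 65  # matches the character vocabulary above

# 1) nn.Embedding is a lookup table: indexing with token ids returns rows of its weight.
table = nn.Embedding(vocab_size, vocab_size)  # hypothetical stand-in for token_embedding_table
ids = torch.tensor([[18, 47, 56]])            # (B=1, T=3) token ids
logits = table(ids)                           # (1, 3, 65): one row of scores per input token
same_as_indexing = torch.equal(logits, table.weight[ids])
print(logits.shape, same_as_indexing)

# 2) If every logit is equal, softmax is uniform and the loss is -ln(1/65) = ln(65).
uniform_logits = torch.zeros(1, vocab_size)
target = torch.tensor([0])
uniform_loss = F.cross_entropy(uniform_logits, target).item()
print(uniform_loss, math.log(vocab_size))
```

Any initial loss above ln(65) means the random logits already favor some wrong characters, which is what the 4.88 above reflects.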


In [None]:
# Trying out generate 

# creating a 1x1 tensor with just a 0 inside 
# this will be how we kick-off generation. It's a newline char
# reasonable place to start 
idx = torch.zeros((1,1), dtype = torch.long)

# asking for 100 new tokens 
# [0] selects the single batch row, because generate works at batch level 
# then we convert the tensor to a list and run it through our decode function 
print(decode(m.generate(idx=idx, max_new_tokens=100)[0].tolist())) 

# right now it's untrained :( 


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [17]:
# create a PyTorch optimizer
# only used stochastic gradient descent in makemore 
# need to research more 
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [24]:
batch_size = 32 
for steps in range(10000): 
    # sample a batch of data 
    xb, yb = get_batch('train')

    # evaluate the loss 
    logits, loss = m(xb, yb) 
    optimizer.zero_grad(set_to_none=True)
    loss.backward() 
    optimizer.step() 

print(loss.item())

2.4571714401245117


In [26]:
# still really bad but definite progress 
print(decode(m.generate(idx=idx, max_new_tokens=200)[0].tolist())) 



Fourthicus ithe
T:
TEROMiaree Wisursethared k I'sprs
Thofr bup indorther wee asthere t bofis p, ces
omave me:
I I tche:

Y l:
A wathigscay ain s sets u ar:
It
LIClorke ghe nd

Morean.
TESThind hert ai


### The mathematical trick in self-attention 
* We want token 5 not to talk to tokens 6, 7, 8; we do want it to talk to tokens 1-4 (and itself). 
* Info only flows in from the past, not the future, because the future is what we are trying to predict. 
* The simplest way to do that is to average the current token with the preceding tokens. 
* The average of the channels of tokens 1-5 then summarizes "me" (token 5) in the context of my history. 
* This is extremely lossy, since averaging throws away the order of the tokens, but we accept it for now. 

In [33]:
# toy example 
torch.manual_seed(1337) 
B,T,C = 4,8,2 # batch, time, channels 
x = torch.randn(B, T, C) 
x.shape

torch.Size([4, 8, 2])

In [34]:
xbow = torch.zeros((B,T,C)) # bow = "bag of words": averaging tokens without regard to their order 
for b in range(B): 
    for t in range(T): 
        xprev = x[b, :t+1] # everything up to and including this token 
        xbow[b,t] = torch.mean(xprev, 0)

In [35]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [36]:
# 1st row is same as 1st row above
# 2nd row is avg of 1st and 2nd rows above
# 3rd row is avg of 1-3 rows above etc
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [37]:
# Mathematical trick: we can do this very quickly with matrix multiplication 
# each column of c is the running (cumulative) sum of the same column of b 
# to get means rather than sums, just normalize the rows of a 
torch.manual_seed(42) 
a = torch.tril(torch.ones(3,3)) # lower triangle of ones 
b = torch.randint(0, 10, (3,2)).float() 
c = a @ b 
print('a=')
print(a) 
print('b=')
print(b) 
print('--')
print('c=')
print(c) 

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


In [38]:
a = a / torch.sum(a, 1, keepdim=True) 
c = a @ b 
print('a=')
print(a) 
print('b=')
print(b) 
print('--')
print('c=')
print(c) 

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Here wei is (T,T) and x is (B,T,C). 
PyTorch broadcasts the multiplication: the shapes are aligned from the right, so wei is treated as (B,T,T) and the same (T,T) matrix is applied to every batch element. 
Each (T,T) @ (T,C) product gives (T,C), so we get (B,T,C) out. 
Think of this as a weighted aggregation, with the weights defined by wei! 
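
To see the broadcasting explicitly, here is a minimal sketch that compares the one-line batched product against a loop applying the same (T,T) matrix to each batch element. The shapes mirror the toy example above; `out_broadcast` and `out_loop` are names chosen just for this illustration.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Row-normalized lower-triangular weights: row t averages positions 0..t.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)

# Broadcast version: (T,T) @ (B,T,C) -> (B,T,C)
out_broadcast = wei @ x

# Explicit version: apply the same (T,T) matrix to each batch element in turn.
out_loop = torch.stack([wei @ x[b] for b in range(B)])

print(out_broadcast.shape)
print(torch.allclose(out_broadcast, out_loop))
```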

In [None]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True) 
xbow2 = wei @ x 
torch.allclose(xbow, xbow2) # the same! 

True

In [None]:
# version 3: use Softmax 
tril = torch.tril(torch.ones(T, T))
# starting at 0 -- think of them as affinities 
# down the line, the affinities between certain characters will be stronger 
# "how interesting they find each other"
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) # this line is saying, future tokens cannot be used 
# remember this exponentiates then divides by sum
# so 0 --> 1, -inf --> 0 
wei = F.softmax(wei, dim=-1) 
xbow3 = wei @ x 
torch.allclose(xbow, xbow3) 
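
The reason the softmax version reproduces the plain average can be worked out by hand. The sketch below uses plain Python (no torch) on a 4x4 mask so the arithmetic is visible; the `softmax` helper is written here for illustration and is not the library function.

```python
import math

def softmax(row):
    """Plain-Python softmax: exponentiate, then divide by the sum."""
    exps = [math.exp(v) for v in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

T = 4
neg_inf = float("-inf")

# Row i keeps 0 for positions 0..i (the past and itself) and -inf for the future.
masked = [[0.0 if j <= i else neg_inf for j in range(T)] for i in range(T)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Each row i ends up with weight 1/(i+1) on every allowed position and 0 on the future ones, which is exactly the row-normalized lower triangle from version 2. Once the zeros are replaced by learned affinities, the same softmax produces a non-uniform weighting instead.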

Want to gather info from the past but in a data-dependent way 
* E.g. if I am a vowel, I may be interested in what consonants are in my past 
* And I want that data to flow to me 

HOW SELF-ATTENTION DOES THIS 

Every single token emits a query vector and a key vector (plus a value vector, covered in the last bullet below). 
Query vector: what am I looking for? 
Key vector: what do I contain? 

We take the dot product between each query and all of the keys 
* These dot products become wei 
* If a key and a query are aligned, their dot product is large, so that token gets a high weight and I learn more from it than from the others. 
* head_size = 16 means the queries and keys are 16-dimensional, so I believe there are 16 channels along which the tokens/nodes can communicate. One channel might say "I am a consonant!" etc. 
* Then, through softmax, the info that I'm interested in is aggregated and influences how we predict the next char 
* Note also that rather than aggregating the x's directly, we aggregate each token's "value" vector. You can think of x as private info and value as public info: here is what I will tell you if you find me interesting 
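
The query/key/value steps can be traced by hand on a tiny example. This is a sketch with made-up 2-dimensional vectors for three tokens, in plain Python so every number can be checked; in the real model these vectors come from learned linear layers, and the `softmax` and `dot` helpers are written here just for illustration.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up 2-d vectors for three tokens (purely illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # "what do I contain?"
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # "what I will tell you"
query = [2.0, 0.0]                                # third token: "what am I looking for?"

# Affinity of the query with every key. The third token may look at all three
# tokens, since they are all at or before its own position.
scores = [dot(query, k) for k in keys]
weights = softmax(scores)  # a larger score gives a larger weight

# Output = weighted sum of the value vectors.
out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

print("scores :", scores)
print("weights:", [round(w, 3) for w in weights])
print("out    :", [round(o, 3) for o in out])
```

The query lines up with the first and third keys and not the second, so the second token's value contributes little to the output.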

In [None]:
# version 4: self-attention 
torch.manual_seed(1337) 
B,T,C = 4,8,32
x = torch.randn(B,T,C) 

# single head performing self-attention 
head_size = 16 
key = nn.Linear(C, head_size, bias=False) 
query = nn.Linear(C, head_size, bias=False) 
k = key(x) # (B,T,16) 
q = query(x) # also (B,T,16) 
value = nn.Linear(C, head_size, bias=False) 
wei = q @ k.transpose(-2, -1) # transpose last 2 dimensions. (B,T,16) @ (B,16,T) = (B,T,T) 

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) 
wei = F.softmax(wei, dim=-1) 

v = value(x) # we use value rather than aggregating x exactly
# can think of x as being "private info" 
# v is "here is what I will communicate to you"
out = wei @ v # this is now 4x8x16 instead of 32 -- determined by head_size 
# out = wei @ x 

out.shape

torch.Size([4, 8, 16])

In [None]:
# now, every batch element has different vals in wei
wei 

tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
         [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
         [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
         [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1687, 0.8313, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2477, 0.0514, 0.7008, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4410, 0.0957, 0.3747, 0.0887, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0069, 0.0456, 0.0300, 0.7748, 0.1427, 0.0000, 0.0000, 0.0000],
         [0.0660, 0.089

Other notes on attention: 
* Attention lets nodes communicate with each other in a data-dependent manner. In principle, this can be applied to any directed graph, although in our case the graph has a specific shape (e.g. node 1 only receives from itself, node 2 only receives from node 1 and node 2, etc.)
* In some cases you want all nodes to talk to each other, with none of them off limits. E.g. for sentiment analysis, you would want every token in the piece of text to communicate. In these cases, you would use an encoder block of self-attention, which basically deletes the line where we fill wei with -inf. (Ours is a decoder block because it keeps that mask. The naming is overloaded, since we also have a separate encoder and decoder.)
* There is also no concept of space. By default the tokens don't know where they are in the sequence; that's why we create an embedding for their position (see positional embeddings in v2.py). This is distinct from convolutional neural networks, where filters act over space. 
* This is called self-attention because the keys, queries, and values all come from the same source: x. In principle, attention is much more general. E.g. you could have queries produced from x, and keys and values coming from a different source. This is called cross-attention. 
* "Scaled" attention adds an important normalization: divide wei by the square root of head_size before the softmax. If q and k have unit variance, this keeps the variance of wei near 1 instead of near head_size. If the numbers in wei are scaled up, softmax leans more and more toward the max, and at worst each node only learns from one other node. 
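
The effect of scaling can be checked numerically. The sketch below is plain Python with made-up scores and random unit-variance vectors; the sample count and seed are arbitrary choices for this illustration. It shows that larger scores push softmax toward one-hot, and that dividing dot products by the square root of head_size brings their variance back to about 1.

```python
import math
import random

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((v - mean) ** 2 for v in xs) / len(xs)

# 1) Scaling the scores up makes softmax lean toward the max.
scores = [0.1, -0.2, 0.3, -0.2, 0.5]
soft = softmax(scores)
sharp = softmax([s * 8 for s in scores])
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sharp])

# 2) Dot products of unit-variance vectors have variance near head_size;
#    dividing by sqrt(head_size) brings it back near 1.
random.seed(0)
head_size = 16
n_samples = 20000

raw = []
for _ in range(n_samples):
    q = [random.gauss(0, 1) for _ in range(head_size)]
    k = [random.gauss(0, 1) for _ in range(head_size)]
    raw.append(sum(qi * ki for qi, ki in zip(q, k)))
scaled = [r / math.sqrt(head_size) for r in raw]

print(round(variance(raw), 2), round(variance(scaled), 2))
```

In the self-attention cell above, the corresponding change would be multiplying `q @ k.transpose(-2, -1)` by `head_size**-0.5` before the mask and softmax.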