# Let's build GPT: from scratch, in code, spelled out

# The Dataset: Shakespeare

In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!curl -L https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -o input.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  1863k      0 --:--:-- --:--:-- --:--:-- 1861k


In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print('length of the dataset in characters: ', len(text))

length of the dataset in characters:  1115394


In [4]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


# Some words on tokenizing

Andrej said that tokenizing means converting the raw text into a sequence of integers according to some vocabulary. In my understanding, when people use the word "tokenizing", they refer to the process of breaking raw strings into chunks, where each chunk has a unique identifier. The breaking can follow different strategies or schemes. For instance, Google uses [SentencePiece](https://github.com/google/sentencepiece) to encode text into integers. It is a sub-word tokenizer: it does not encode entire words, and it does not go all the way down to individual characters either. It sits somewhere in the middle.

Since we are going to build a **character-level** language model, we are going to translate individual characters into integers. In other words, we break the raw text into a series of integers, where each integer represents a unique character in the vocabulary.

The following Python cell does the "encoding" and the "decoding".

In [6]:
# Tokenize - turn characters into numbers or vice versa
# Create a mapping from characters to integers.
stoi = { ch:i for i,ch in enumerate(chars)}
itos = { i:ch for i,ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s] # encoder: takes a string, outputs a list of integers.
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: takes a list of integers, outputs a string.

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
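
To convince myself that `encode` and `decode` really are inverses, here is a small self-contained sketch. It uses a made-up toy corpus (`sample_text`) instead of the Shakespeare file, and the `sample_*` names are mine, not from the video. It also shows a limitation of this kind of tokenizer: a character that never appears in the corpus has no integer id.

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration).
sample_text = "hii there, hello there"

# Same construction as above: the sorted unique characters define the vocabulary.
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def sample_encode(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Map each integer id back to its character."""
    return "".join(sample_itos[i] for i in ids)

ids = sample_encode("hello")
round_trip = sample_decode(ids)

# A character that never appeared in the corpus has no id, so encoding fails.
try:
    sample_encode("Z")
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(ids, round_trip, unknown_ok)
```

With the real dataset this is not a problem as long as we only encode text made of the 65 characters listed earlier.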


With the tokenizer in place, the following Python cell encodes the entire tiny Shakespeare dataset and wraps it in a torch tensor. This tensor is one long sequence of integers.

In [7]:
# Tokenize the entire text dataset
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [8]:
# Split the dataset into train and validation sets
n = int(0.9 * len(data)) # first 90% will be train, rest validation
train_data = data[:n]
val_data = data[n:]

# Data loader: batches of chunks of data

When training a transformer, we do **not** feed the entire training set at once; that would be computationally prohibitive. We work with chunks of the dataset instead.

When training, Andrej said, we sample random chunks out of the training set and train on those chunks. Each chunk has a maximum length. This length is called the **block size**, also known as the **context size**.

Let's consider the first block.

In [9]:
block_size = 8 # AKA "context" size
train_data[: block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

These are the first 9 characters in the training set. It's important to note that when you sample a chunk of data like this, there are multiple examples packed into it. That's because all those characters follow each other. In a chunk of 9 characters, there are **8 examples packed into it**. Let's see those examples using the following code:

In [10]:
# x and y are offset by one
x = train_data[:block_size] # The input to the transformer
y = train_data[1:block_size + 1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")


when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


Andrej said that it is not for nothing that we spelled out all the examples packed into the first block. Training on every context length, from a single token all the way up to `block_size`, teaches the model to predict with any context size between `1` and `block_size`. In other words, our transformer won't need a full `block_size` of context to be able to make a prediction.
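
The integer version is a bit hard to read, so here is the same idea on raw characters. This is my own sketch: `chunk` holds the first nine characters of the dataset as printed earlier, and the variable names are just for illustration.

```python
# First nine characters of the dataset, as printed earlier in the notebook.
chunk = "First Cit"
block_size = 8

# Inputs and targets are the same chunk, offset by one character.
x_chars = chunk[:block_size]
y_chars = chunk[1:block_size + 1]

pairs = []
for t in range(block_size):
    context = x_chars[:t + 1]   # everything up to and including position t
    target = y_chars[t]         # the character that follows the context
    pairs.append((context, target))
    print(f"context {context!r} -> target {target!r}")
```

Each of the eight `(context, target)` pairs is one training example, and the contexts grow from one character to `block_size` characters.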

We won't feed one chunk of text at a time into the transformer. Instead, we will feed batches of chunks. The chunks in a batch are processed independently of each other.

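**Side note: an RNN can also process a batch of independent chunks, but it has to step through the positions of each chunk one after the other. A transformer handles all the positions of a chunk in parallel during training.**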

The above code is generalized to include a batch dimension, and it looks like this:

In [11]:
torch.manual_seed(1337)
batch_size = 4 # How many independent chunks will we process in parallel?
block_size = 8 # What is the maximum context length for predictions?

def get_batch(split):
    # Selects the appropriate dataset (train_data or val_data) based on the value of the "split" parameter
    data = train_data if split == 'train' else val_data

    # Draws "batch_size" random integers in the range [0, len(data) - block_size).
    # The upper bound is "len(data) - block_size" so that the 'block_size' target tokens (shifted by one) stay within range.
    # (batch_size,) is a tuple defining the SHAPE of the output tensor. In our case it's a 1D tensor of "batch_size" elements.
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # 'x' is a stack. A stack of what? Of tensors of length 'block_size', one starting at EACH index in 'ix'.
    # If 'ix' contains 5, for instance, we go to index 5 in data and grab the tokens from index 5 up to (but excluding) index 5 + block_size.
    # We repeat the same process for each number in 'ix'.
    # You may see each number in 'ix' as a random starting point to grab a "chunk" from.
    x = torch.stack([data[i:i+block_size] for i in ix])

    # 'y' is a stack. A stack containing the targets of each training example in 'x'
    # Notice the offset by one.
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    # Both 'x' and 'y' have dimensions (batch_size x block_size)
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [12]:
print(xb) # our input to the transformer. It is made of 4 chunks. Each chunk has multiple examples packed into it.

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [13]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) -> (4, 8, 65)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


In [14]:
# create a pytorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [15]:
batch_size = 32
for steps in range(10000): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

2.382369041442871


In [16]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


lso br. ave aviasurf my, yxMPZI ivee iuedrd whar ksth y h bora s be hese, woweee; the! KI 'de, ulseecherd d o blllando;LUCEO, oraingofof win!
RIfans picspeserer hee tha,
TOFonk? me ain ckntoty ded. bo'llll st ta d:
ELIS me hurf lal y, ma dus pe athouo
BEY:! Indy; by s afreanoo adicererupa anse tecorro llaus a!
OLeneerithesinthengove fal amas trr
TI ar I t, mes, n IUSt my w, fredeeyove
THek' merer, dd
We ntem lud engitheso; cer ize helorowaginte the?
Thak orblyoruldvicee chot, p,
Bealivolde Th li


# The math trick in self-attention

In [17]:
# Consider the following toy example

torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

### Averaging - Naive

In the above example, the tokens do not "speak" to each other. They are not coupled. Andrej said we would like to couple them in a very specific way: a token at a given position should contain information about the tokens that came before it, NOT after, because we are trying to predict the token after it.

Andrej said the simplest (and also the weakest) way to couple tokens in this way is to average them. So the token at the 5th location becomes the average of all the tokens before it and itself.

As I said, averaging is the weakest form of coupling. The reason Andrej gave is that when we average the feature vectors of the tokens, we lose the spatial relationship between those tokens.

Let's write it in Python

In [18]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B): # For each ROW in the batch 'B' dimension
    for t in range(T): # For each feature vector in the time 'T' dimension
        xprev = x[b,:t+1] # (t+1,C) xprev holds the feature vectors of all tokens from the beginning up to and including the current one.
        xbow[b,t] = torch.mean(xprev, 0) # average over the 0th dim, i.e. over the t+1 rows.

In [19]:
x[0] # the first batch

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [20]:
xbow[0] # the first batch

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

Each line (i.e. feature vector) in `xbow` is the average of all of the previous lines including itself.
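
To make the "we lose spatial information" point concrete, here is a small sketch of my own (the names `tokens` and `perm` are just for illustration). It shuffles the order of the tokens in a sequence and checks that the average does not change, which means the average alone cannot tell us in which order the tokens appeared.

```python
import torch

torch.manual_seed(0)
T, C = 8, 2
tokens = torch.randn(T, C)           # one sequence of T token vectors

# Shuffle the order of the tokens (a random permutation of the T positions).
perm = torch.randperm(T)
shuffled = tokens[perm]

mean_original = tokens.mean(dim=0)
mean_shuffled = shuffled.mean(dim=0)

# The two averages match: the mean cannot tell the two orderings apart.
same_mean = torch.allclose(mean_original, mean_shuffled, atol=1e-6)
print(same_mean)
```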

### Averaging - Matrix multiplication

We can repeat the same operation much more efficiently using matrix multiplication. Let's consider another toy example.

In [21]:
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


The number `14` in `c` is obtained by computing the dot product of the first row of `a` and the first column of `b`. Since the first row of `a` is all ones, the dot product is simply the sum of all the elements in the first column of `b`: $2 + 6 + 6 = 14$. The same happens for every entry of `c`. But we can do something better using the `torch.tril` function.

In [22]:
torch.tril(torch.ones(3, 3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

What happens if, instead of a matrix of all ones, we use its lower-triangular version from `torch.tril`?

In [23]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


What does `c` contain? The matrix product of `a` and `b`, obviously.

But let's examine its first element, `2`. This result comes from the dot product of the first row of `a` with the first column of `b`. Because the first row of `a` is `[1, 0, 0]`, the second and third terms of the sum are zeroed out, leaving `2` on its own. The same process applies to the computation of `7`.

Now, let's carefully analyze each row of `c`:

- The first row in `c` is simply the first row of `b`, since there's nothing before it to add.

- The second row in `c` results from summing the second row of `b` with the first row of `b`.

- The third row in `c` comes from adding the third row of `b` to all the preceding rows of `b`.

In essence, each row in `c` is derived by adding the current row to all the rows that come before it.
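
Another way to say this is that multiplying by the lower-triangular matrix of ones is a running (cumulative) sum over the rows. The sketch below is my own check: it hardcodes the same `b` as in the output above and compares the matrix product against `torch.cumsum`.

```python
import torch

# Same values of b as in the toy example above.
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
a = torch.tril(torch.ones(3, 3))

c_matmul = a @ b                      # lower-triangular ones times b
c_cumsum = torch.cumsum(b, dim=0)     # running sum down the rows of b

matches = torch.equal(c_matmul, c_cumsum)
print(c_cumsum)
print(matches)
```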

The example above is a sum, but we can also do averages! We get averages by normalizing the rows of `a` so that each row sums to `1`.

In [24]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Now the second row of `c` contains the average of the first two rows of `b`. That's very convenient. So, let's use this to vectorize our initial example.

In [25]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C): 'wei' is broadcast to (B, T, T), so (B, T, T) @ (B, T, C) ----> (B, T, C)

torch.allclose(xbow, xbow2) # Both are equal

True

In [26]:
# Printing the first batch of xbow and xbow2. 
# They are identical. Using the for loop, and matrix multiplication gets us to the same results.
print(xbow[0])
print('\n')
print(xbow2[0])

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


### Averaging - Matrix multiplication and Adding Softmax

Andrej then rewrote this a third way, this time using softmax.

In [27]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T)) # Lower-triangular matrix of ones.
wei = torch.zeros((T,T)) # all zeros
wei = wei.masked_fill(tril == 0, float('-inf')) # For all elements where tril is 0, make them -inf
wei = F.softmax(wei, dim=-1) # take the softmax across each row of wei.
# REMEMBER: softmax normalizes as well. It first exponentiates each element in the row. Where the element is 0
# you get a one. Where the element is -inf, you get a zero. Then softmax divides each of those results
# by their sum, effectively normalizing the elements in the row.


xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

Andrej gave the reason why this last version is interesting for us. The `wei` matrix, although it starts out as zeros, tells us **how much of each previous token we want to aggregate into the current token**. In other words, `wei` represents the affinity between tokens, and the `-inf` entries say that tokens from the future must not be aggregated at all.
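
To see why the softmax form is more flexible than the plain average, here is a sketch of my own where one affinity is not zero. The value `2.0` at position `[3, 1]` is made up for illustration; in real self-attention these numbers come from the data.

```python
import torch
from torch.nn import functional as F

T = 4
tril = torch.tril(torch.ones(T, T))

# Hypothetical affinities (made up for illustration): token 3 "likes" token 1.
scores = torch.zeros(T, T)
scores[3, 1] = 2.0

# Mask the future, then normalize each row with softmax.
masked = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)

print(weights)
```

The rows still sum to one and the future is still ignored, but the last row is no longer a uniform average: token 1 gets a larger share than the others.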

# The CRUX of the video: Self-attention

Andrej said this section is the most important part of the video. We are going to implement a small self-attention block for a single "head". Let's start off where we were:

In [28]:
# version 4: self-attention!

torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T)) # initialize the 'affinity' between the tokens to zero
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # Applies softmax along the last dimension (dim=-1), so that each row of wei sums to 1
out = wei @ x

out.shape

torch.Size([4, 8, 32])

What the code does, for each token *in a batch element*, is compute a simple average of the past tokens and the current one. All this information is mixed together in an average.

But `wei` is supposed to be the affinity between tokens, and right now it's initialized to zeros, so every past token counts equally. We want those values to be data dependent. That's what self-attention solves.

Every single token emits two vectors: a query and a key. The query vector represents what the token is looking for, and the key vector represents what it contains. The way we get affinities between tokens in a sequence is **by doing a dot product between the keys and queries**: the query of a token is dotted with the keys of all the tokens in the sequence, including its own. This dot product, Andrej said, is `wei`.

Let's implement this:

In [37]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# At this point all the tokens in x have produced a "key" and a "query" vector in parallel.
# NO communication has happened yet.

# Now, every query is dotted with the keys of all tokens, including its own.
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# At this point, 'wei' contains the raw attention scores: the 'affinity' of all the tokens among
# themselves. That's why each batch element gets a (T, T) matrix.

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) # 'Masked' Self-attention
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v # 'v' are the elements we aggregate NOT 'x'. 'x' is in a sense private to the token.
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [35]:
wei.shape

torch.Size([4, 8, 8])

The `wei` matrix tells us *how much of each token in the past to aggregate in the current token*. The `wei` matrix contains the attention scores, and it has the shape $(4 \times 8 \times 8)$. In other words, `wei` contains **four $(8 \times 8)$ matrices** OR **ONE** $(8 \times 8)$ matrix for **EACH** sentence in the batch. 

Remember that when we matrix multiply `wei` and `v`, we are computing a weighted sum of the vectors representing the tokens in the sentence. The attention scores in `wei` scale the rows of `v` in that sum: they tell us how much of the information of each previous token we should aggregate into the current token. A sentence is represented by a $(T \times head\_size)$ matrix in `v`, that is, by an $(8 \times 16)$ matrix. Since we have $B$ sentences in a batch, `v` has shape $(4 \times 8 \times 16)$. In other words, `v` contains FOUR $(8 \times 16)$ matrices.

So matrix multiplying `wei` and `v` means:

- The first $(8 \times 8)$ matrix in `wei` is multiplied with the first $(8 \times 16)$ matrix in `v`.
- The second matrix $(8 \times 8)$ in `wei` is multiplied with the second $(8 \times 16)$ matrix in `v`.
- etc... For each batch dimension.

Once again, in each of these matrix multiplications, we are aggregating (computing a weighted sum of) the token vectors in the `v` matrix.

Let's consider what happens during one of those aggregations, say the one where the first matrix in `wei` is multiplied with the first matrix in `v`. The result of this operation is another matrix where

- The first vector (i.e. row) is JUST a weighted sum of itself.
- The second vector (i.e. row) is a weighted sum of itself AND the first vector.
- ...
- The last vector (i.e. row) is a weighted sum of itself AND all the previous vectors in the sequence.

So really, the matrix `out` just contains the results of those aggregations batched together. As a result, it has the shape $(4 \times 8 \times 16)$ just like matrix `v` where $4$ is the Batch, $8$ is the Time, and $16$ is the Channel dimension. In a SINGLE self-attention head.
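
Here is a small sketch of my own that checks the two claims above: that `wei @ v` is just one matrix multiplication per batch element, and that the first row of each result is the first value vector on its own. The random `scores` tensor stands in for `q @ k.transpose(-2, -1)`; it is an assumption made to keep the example short.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, head_size = 4, 8, 16
v = torch.randn(B, T, head_size)

# Random scores stand in for q @ k^T (assumption for illustration).
scores = torch.randn(B, T, T)
tril = torch.tril(torch.ones(T, T))
wei = F.softmax(scores.masked_fill(tril == 0, float('-inf')), dim=-1)

# Batched product versus an explicit loop over the batch dimension.
out_batched = wei @ v                                        # (B, T, head_size)
out_looped = torch.stack([wei[b] @ v[b] for b in range(B)])  # same thing, one b at a time

same = torch.allclose(out_batched, out_looped, atol=1e-6)

# The first token can only attend to itself, so its output is its own value vector.
first_row_is_own_value = torch.allclose(out_batched[:, 0], v[:, 0], atol=1e-6)
print(same, first_row_is_own_value)
```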

> But why are we doing `wei @ v` and not `wei @ x`? In other words, why are we multiplying by the 'value' vectors and not by the input directly?

Well, Andrej said, when performing the matrix multiplication we are not aggregating the tokens themselves, but rather the `value` of each token. This `value` (or `v`) is obtained by passing `x` through a linear layer. So, Andrej said, one may look at `x` as some kind of private information about the token.

Also, if we have multiple attention "heads", each head can re-use the same `x` and derive its own values from it. And that made sense to me.
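
To picture the multi-head remark, here is a rough sketch of my own; it goes slightly beyond what these notes have covered so far. The helper names `make_head` and `run_head` are hypothetical, and concatenating the head outputs along the channel dimension is the usual multi-head convention, which I am assuming here. The point is only that both heads read the same `x` and that `x` itself is left untouched.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
tril = torch.tril(torch.ones(T, T))

def make_head():
    """Create one (key, query, value) triple of linear projections (hypothetical helper)."""
    return (nn.Linear(C, head_size, bias=False),
            nn.Linear(C, head_size, bias=False),
            nn.Linear(C, head_size, bias=False))

def run_head(head, x):
    """Masked self-attention for one head, same steps as the cell above."""
    key, query, value = head
    wei = query(x) @ key(x).transpose(-2, -1)            # (B, T, T)
    wei = wei.masked_fill(tril == 0, float('-inf'))
    wei = F.softmax(wei, dim=-1)
    return wei @ value(x)                                # (B, T, head_size)

heads = [make_head(), make_head()]
x_before = x.clone()
outs = [run_head(h, x) for h in heads]                   # every head reads the same x
combined = torch.cat(outs, dim=-1)                       # (B, T, 2 * head_size)
print(combined.shape, torch.equal(x, x_before))
```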

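That's self-attention in a single head 🙂🤷🏾‍♂️. Up to the $\frac{1}{\sqrt{d_k}}$ scaling factor, which the cell above leaves out (see the notes that follow), the code implements the following formula from the paper: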

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

NOTES: (From Andrej's Colab)

- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Examples across the batch dimension are of course processed completely independently and never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally divides `wei` by sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

In [57]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [58]:
k.var()

tensor(1.0652)

In [59]:
q.var()

tensor(0.9575)

In [60]:
wei.var()

tensor(0.9137)
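
For contrast, here is a sketch of my own that also looks at the variance without the scaling. The sizes `B = 64` and `T = 64` are arbitrary; I picked them larger than in the notebook only to get a more stable estimate.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 64, 64, 16
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

unscaled = q @ k.transpose(-2, -1)          # no scaling
scaled = unscaled * head_size ** -0.5       # scaled as in the cell above

print(unscaled.var().item())  # on the order of head_size
print(scaled.var().item())    # on the order of 1
```

Each entry of `unscaled` is a sum of `head_size` products of unit-variance numbers, which is why its variance grows with `head_size` and why the scaling brings it back to roughly one.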

In [61]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [62]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # When the inputs to softmax take on extreme values, the output converges towards a one-hot vector: it sharpens around the highest value and becomes too peaky. This is not desirable, especially at initialization.

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
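
Putting the notes together, here is how I would sketch the full formula as one function. The function name `single_head_attention` and the `causal` flag are my own (hypothetical) choices: `causal=True` keeps the `tril` masking of a "decoder" block, and `causal=False` drops it as in an "encoder" block.

```python
import torch
from torch.nn import functional as F

def single_head_attention(q, k, v, causal=True):
    """Compute softmax(q k^T / sqrt(d_k)) v for tensors of shape (B, T, d_k)."""
    d_k = q.shape[-1]
    wei = q @ k.transpose(-2, -1) * d_k ** -0.5          # (B, T, T), scaled scores
    if causal:
        T = q.shape[-2]
        tril = torch.tril(torch.ones(T, T))
        wei = wei.masked_fill(tril == 0, float('-inf'))  # hide the future
    wei = F.softmax(wei, dim=-1)                         # each row sums to 1
    return wei @ v, wei

torch.manual_seed(0)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)

out_dec, wei_dec = single_head_attention(q, k, v, causal=True)
out_enc, wei_enc = single_head_attention(q, k, v, causal=False)
print(out_dec.shape)
print(wei_dec[0, 0])   # first token: all the weight is on itself
print(wei_enc[0, 0])   # without masking, the first token also sees later tokens
```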

# Wait, Let's talk about the $Q, K, V$ matrices a little more...

Despite Andrej's explanation, I was still trying to wrap my head around those concepts; they weren't clear to me. I feel like I then had a breakthrough in my understanding. The following is my attempt at putting my understanding into words. It's not that different from what Andrej expressed, but for some reason... I get it a little better. Maybe because I let my brain 🧠 cook, and it did its thing... I don't know.

I knew that the dot product between the Queries and the Keys matrices produces the attention scores matrix. When this attention matrix is multiplied with the values, we are aggregating the rows of the value matrix. That's what I kept in my head after hearing Andrej. I came across the information retrieval analogy for the $Q, K, V$ matrices multiple times, but I did not see the link. To me they were just matrices of numbers, so why name them like this?

That's where I started to rehash Andrej's words in my own way.

- Each token in the input sequence generates (or emits) its own "*Query*" vector, which is the token's **representation** *of what it is looking for*. In the context of sequence modeling, it represents the kind of token the current one is looking for or wants to attend to. The collection of "Query" vectors is stored in the "Queries" matrix, or $Q$.

- Each token in the input sequence also generates a "*Key*" vector, which acts as the token's identifier against which the queries are matched. Those vectors are, in a way, how a token "chooses" to advertise itself to the other tokens in the sequence. They contain information that helps the model understand the relationships between different tokens in the sequence. The collection of "Key" vectors is stored in the "Keys" matrix, or $K$.

> **Keys and queries are multiplied together to compute the attention scores**, indicating how much focus or importance should be placed on different parts of the input sequence. Put another way, the query vector of a given token is dotted with the keys of all the tokens in the same sentence (self-attention). If the dot product between a query and a key is high, the token will "attend" (or "pay attention") to the other token: there is affinity between the two.

- Each token also emits a "*Value*" vector, which this time is a representation of itself. "Value" vectors carry the actual information or content associated with each token in the sequence. This information is specific to each attention head. In the context of masked self-attention, tokens are only allowed to look at (i.e. attend to) past tokens and themselves. So, when the matrix containing the attention scores is multiplied with the "Value" vectors of a sequence of tokens, remember that a weighted sum is happening: the "Value" vectors of previous tokens are aggregated into the current token's output vector. The attention scores determine **how much of each "value" vector to aggregate** in the current token.

> Wait, the weights that produce the $Q, K, V$ matrices are randomly initialized, right? How come they contain some kind of representation that happens to be what a token is looking for, or how a token advertises itself to its peers, etc.?

Lol, they "learn". Really... 😆

Through forward & backward passes, those tokens "learn" to emit better query and key vectors, resulting in better attention scores. What I mean is that to emit those vectors, a token vector goes through a linear layer (the `key`, `query`, and `value` layers in the code above). If the representation happens to be bad, which will likely be the case early in training, it will hurt the network's overall performance. When the network updates its weights through backpropagation, the weights of those linear layers are also updated, so they produce better and better "query", "key", and "value" vectors.

Another way to put this is that through those forward & backward passes, tokens not only learn what to « look for » (queries), but also what to « present » to other tokens (keys). The result is that early in training a token might attend to one token, but a few epochs later the same token may start attending to another one… why? Because the network updated itself in a way that results in a smaller loss, and so better predictions.
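
Here is a tiny sketch of my own showing that "learning" mechanically: one forward pass, one backward pass, one optimizer step, and the weights that produce the queries have moved. The loss used here (mean of squared outputs) is a stand-in I made up for illustration; in the real model the loss is the cross-entropy on the next-token predictions. I also use plain SGD only to keep the example short.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Forward pass: scaled, masked self-attention for a single head.
wei = query(x) @ key(x).transpose(-2, -1) * head_size ** -0.5
tril = torch.tril(torch.ones(T, T))
wei = F.softmax(wei.masked_fill(tril == 0, float('-inf')), dim=-1)
out = wei @ value(x)

# Stand-in loss (an assumption for illustration): push the outputs toward zero.
loss = out.pow(2).mean()

optimizer = torch.optim.SGD(
    list(key.parameters()) + list(query.parameters()) + list(value.parameters()),
    lr=0.1,
)
query_before = query.weight.detach().clone()

# Backward pass and one update step.
optimizer.zero_grad()
loss.backward()
optimizer.step()

query_changed = not torch.equal(query_before, query.weight)
print(key.weight.grad is not None, query_changed)
```

The gradient reaches the `key` and `query` layers through `wei`, so the same update that lowers the loss also changes which tokens attend to which.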

# Transformer Architecture Implementation

I would like to describe each part of [NanoGPT](https://github.com/karpathy/nanoGPT), Andrej's implementation of the GPT model.