In [105]:
with open('../input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
print(text[:150])
print(f"Length of text: {len(text)}")

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

A
Length of text: 1115394


In [106]:
chars = sorted(list(set(text)))
n_vocab = len(chars)


print("".join(chars))
print(f"Vocabulary size: {n_vocab}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65


### Tokenizer

We have a custom tokenizer, its a character level tokenizer for the sake of simplicity

Some popular tokenizers includes tiktoken (byte pair encoding), sentencepiece (sub word unit encoding)

The above stated tokenizers have very large vocabulary (~50k tokens) but this results in much smaller sequences

in our case the char level token has only 65 tokens so the resulting sequence will be a one to one mapping of each character and length of sequence will scale linearly (which is bad)

> TODO use one of the popular tokenizers later while implementing to see the difference

In [107]:
char2idx = { ch: i for i, ch in enumerate(chars) }
idx2char = { i: ch for i, ch in enumerate(chars) }

encode = lambda string: [char2idx[char] for char in string]
decode = lambda tensor: "".join([idx2char[idx] for idx in tensor])

print(encode("hii there"))
print(decode(encode("hii there")))


[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Tokenize the dataset

In [108]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])
print(decode(data[:100].tolist()))

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Train - Validate Split

In [109]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))

1003854 111540


### hyperparameters

`block_size`

> we train the transformer on the above dataset as chunks, feeding in the entire dataset at once would be too computationally expensive, so we ranomly sample "chunks" of sequences from the dataset and train on them. The length of this sampled sequence is determined by block_size

`n_vocab`

> length of vocabulary, vocabulary is basically the number of unique tokens that our transformer will see and generate

In [110]:
block_size = 8
seed = 1337
batch_size = 4
n_embedding = n_vocab

In one of these sequences, there are multiple examples packed in it. in a sequence of length 8 there are 8 unique training examples

as such the `+1` is to accomodate a `y` for the last training sample, since `y` starts at an offset of `+1`

### Note

> The reason why multiple training samples are taken from a single sequence ranging from `1 - block_size` is not just to make it computationally efficient but to get the transformer used to seeing sequences of length in that range. `block_size` is essentially the `context_length` in transformers. During generation as well, when we keep appending generated tokens and during the next forward pass the transformer only sees the last `block_size` tokens

In [111]:
x = train_data[:block_size]
y = train_data[1:block_size + 1]

print(decode(x.tolist()))
print(x, y)

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    
    print(f"when input in: {context} the target: {target}")

First Ci
tensor([18, 47, 56, 57, 58,  1, 15, 47]) tensor([47, 56, 57, 58,  1, 15, 47, 58])
when input in: tensor([18]) the target: 47
when input in: tensor([18, 47]) the target: 56
when input in: tensor([18, 47, 56]) the target: 57
when input in: tensor([18, 47, 56, 57]) the target: 58
when input in: tensor([18, 47, 56, 57, 58]) the target: 1
when input in: tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input in: tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input in: tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [112]:
torch.manual_seed(seed)

def get_batch(split):
    
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    
    return x, y
    
xb, yb = get_batch('train')
print(f"inputs, {xb.shape}")
print(xb)
print(f"targets, {yb.shape}")
print(yb)

print('-' * 40)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context}, target is {target}")


inputs, torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets, torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----------------------------------------
when input is tensor([24]), target is 43
when input is tensor([24, 43]), target is 58
when input is tensor([24, 43, 58]), target is 5
when input is tensor([24, 43, 58,  5]), target is 57
when input is tensor([24, 43, 58,  5, 57]), target is 1
when input is tensor([24, 43, 58,  5, 57,  1]), target is 46
when input is tensor([24, 43, 58,  5, 57,  1, 46]), target is 43
when input is tensor([24, 43, 58,  5, 57,  1, 46, 43]), target is 39
when input is tensor([44]), target is 53
when input is tensor([44, 53]), target is 56
when input is tensor([44, 53, 56]), target

### Language Model

For the sake of simplicity we use the simplest form of neural network, the bigram language model

### Note

> idx has a shape of `(B, T)`. batch, time dimensions respectively

> output has a shape of `(B, T, C)`. where `C` is the embedding dimension

How output becomes that shape is basically, each token idx has an associated `(65, )` dimensional vector in the embedding table, since there are 8 tokens in a sequence (block size), there will be a corresponding embedding vector for each of those tokens. this is done for all sequences in the batch (4 sequence in a batch)

In [113]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(seed)

class BigramLangugeModel(nn.Module):
    
    def __init__(self, n_vocab, n_embedding):
        super().__init__()
        
        self.token_embedding_table = nn.Embedding(n_vocab, n_embedding)
        
    def forward(self, idx, targets=None):
        logits: torch.Tensor = self.token_embedding_table(idx)
        
        # logits is of shape (B, T, C) however cross entropy loss expects (B, C, T)
        
        if targets == None:
            loss = None
            
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx: torch.Tensor, max_new_tokens):
        
        for _ in range(max_new_tokens):
            logits, loss = self(idx) # (B, T, C)
            # since bigram language model, we only care about the 
            # token at previous time step
            logits = logits[:, -1, :] # last time step -1
            
            # softmax to calculate probabilities along the rows (time dimension)
            probs = F.softmax(logits, dim=1)
            
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) generate next token for each batch element
            
            # append predicted token to running sequence
            idx = torch.cat([idx, idx_next], dim=1) # (B, T + 1) add new token to each sequence in the batch
        return idx
    
m = BigramLangugeModel(n_vocab, n_embedding)
logits, loss = m(xb, yb)
print(logits.shape, loss.shape)

print(logits)
print(loss)
    
    
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), 100)[0].tolist()))

torch.Size([32, 65]) torch.Size([])
tensor([[-1.5101, -0.0948,  1.0927,  ..., -0.6126, -0.6597,  0.7624],
        [ 0.3323, -0.0872, -0.7470,  ..., -0.6716, -0.9572, -0.9594],
        [ 0.2475, -0.6349, -1.2909,  ...,  1.3064, -0.2256, -1.8305],
        ...,
        [-2.1910, -0.7574,  1.9656,  ..., -0.3580,  0.8585, -0.6161],
        [ 0.5978, -0.0514, -0.0646,  ..., -1.4649, -2.0555,  1.8275],
        [-0.6787,  0.8662, -1.6433,  ...,  2.3671, -0.7775, -0.2586]],
       grad_fn=<ViewBackward0>)
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


### Training Bigram Model

In [114]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
epochs = 100
batch_size = 32

for epoch in range(epochs):
    xb, yb = get_batch('train')
    logits, loss = m(xb, yb)
    loss.backward()
    optimizer.step()
print(loss.item())
    
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), 500)[0].tolist()))


4.509823799133301

xiKi-RJ:COpVuUa!U?qMH.uk!sCuMXvv!CJFfx;LgRyJknOEti.?I&-gPlLyulId?XlaInQ'q,lT$
3Q&sGlvHQ?mqSq-eON
x?SP fUAfCAuCX:bOlgiRQWN:Mphaw
tRLKuYXEaAXxrcq-gCUzeh3w!AcyaylgYWjmJM?Uzw:inaY,:C&OECW:vmGGJAn3onAuMgia!ms$Vb q-gCOcPcUhOnxJGUGSPJWT:.?ujmJFoiNYWA'DxY,prZ?qdT;hoo'dHooXXlxf'WkHK&u3Q?rqUi.kz;?Yx?C&u3Qbfzxlyh'Vl:zyxjKXgC?
lv'QKFiBeviNxO'm!Upm$srm&TqViqiBD3HevijuEOpmZJyF$Fwfy!PlvWPFC
&WDdP!Ko,px
x
tREOE;AJ.BeXkylOVD3KHp$e?nD,.SFbWWI'ubcL!q-tU;aXmJ&uGXHxJXI&Z!gHRpajj;l.
pTErIBjx;JKIgoCnLGXrJSP!Ac-rdbczR?


### Mathematical Trick in self-attention

In [115]:
torch.manual_seed(seed)

B, T, C = 4, 8, 2
x = torch.rand(B, T, C)
x[0]

tensor([[0.0783, 0.4956],
        [0.6231, 0.4224],
        [0.2004, 0.0287],
        [0.5851, 0.6967],
        [0.1761, 0.2595],
        [0.7086, 0.5809],
        [0.0574, 0.7669],
        [0.8778, 0.2434]])

using simple aggregation, we can simply average all channels of tokens in the past

It would become the feature vector that summarizes the particular token in the context of its previous tokens. however spacial arrangement information is lost, but for the sake of simplicity we can settle with this for now

In [116]:
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1] # (t, C)
        xbow[b, t] = torch.mean(xprev, dim=0)
xbow

tensor([[[0.0783, 0.4956],
         [0.3507, 0.4590],
         [0.3006, 0.3156],
         [0.3717, 0.4108],
         [0.3326, 0.3806],
         [0.3953, 0.4140],
         [0.3470, 0.4644],
         [0.4134, 0.4368]],

        [[0.6005, 0.7079],
         [0.5554, 0.5572],
         [0.6657, 0.4908],
         [0.7234, 0.6090],
         [0.5817, 0.6344],
         [0.6161, 0.6865],
         [0.5946, 0.7081],
         [0.5362, 0.6462]],

        [[0.2944, 0.3677],
         [0.3887, 0.5215],
         [0.4333, 0.5322],
         [0.4675, 0.4611],
         [0.4948, 0.5105],
         [0.4255, 0.5525],
         [0.4748, 0.4842],
         [0.4875, 0.5094]],

        [[0.9100, 0.7684],
         [0.8118, 0.4135],
         [0.7959, 0.5978],
         [0.8454, 0.6482],
         [0.6993, 0.6530],
         [0.5973, 0.6449],
         [0.5726, 0.5670],
         [0.5115, 0.5632]]])

using matrix multiplication for weighted aggregation for computational efficiency

In [117]:
a = torch.tril(torch.ones(3, 3))
a /= a.sum(dim=1, keepdim=True)
b = torch.randint(0, 10, (3, 2), dtype=torch.float)

print(a)
print(b)

print(a @ b)

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
tensor([[8., 6.],
        [5., 2.],
        [4., 4.]])
tensor([[8.0000, 6.0000],
        [6.5000, 4.0000],
        [5.6667, 4.0000]])


In [118]:
wei = torch.tril(torch.ones(T, T))
wei /= torch.sum(wei, dim=1, keepdim=True)
xbow2 = wei @ x

torch.allclose(xbow2, xbow)

True

### version 3

Rewriting the above logic

now lets look at what these stuff actually means

> `wei` - you can think of this as the interaction strength/affinity score of each token, each value in a row following up to `n-th` element would tell us how much information is flowing from `0 - n` elements to `n`

> `tril` - is used in masked_fill to make sure that a particular token at time-step t only interacts with its preceeding elements, we don't want the token to look into the future tokens. This is the core difference between an encoder and a decoder

> `softmax` - is a way to normalize

Now for this dummy case, the "Affinity scores", `wei` was initialized to 0, but in practice each of these time-steps would have different affinity scores

In [119]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))
print("before softmax\n", wei)
wei = F.softmax(wei, dim=-1)
print("after softmax\n", wei)

out = wei @ x
print(f"output {out.shape}\n", out)

before softmax
 tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
after softmax
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0

### version 4

self attention

Now the issue with version 3 is that, we're doing a simple average of all the preceeding tokens
this is because the affinity, `wei` is 0, the same for every token.

so the resultant `wei` after normalization is uniform and basically holds equal importance. But this is not the case in real life. because different tokens will find different other tokens more or less interesting, they have to be data dependent 

self attention solves this issue

#### How self-attention solves this

Every single token at each position emits two different vectors

> `Query (Q)` - tells us what the token is looking for 

> `Key (K)` - tells us what the particular token contains

> `Value (V)` - encodes what each token wants to share with others 

> `out` - contextualized (context aware) representation of each token: a weighted sum (Aggregation) of other tokens' values, where the weights are given by `wei`

Affinity between tokens in a sequence is achieved by performing a dot product between `Q` and `K`

The dot product between the `Q` of a particular token and `K` of every other token then gives us `wei`

In [120]:
torch.manual_seed(seed)

B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# lets see how a single head performs self-attention

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)

# now interact all query with every other keys

wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# we dont directly aggregate instead use another component `Value`
# x holds the identity of the token
# v basically tells us what information each token will communicate with us
# out = wei @ x

v = value(x)
out = wei @ v
out.shape

torch.Size([4, 8, 16])

not how the elements are not uniform, each of them have different affinity scores

In [121]:
out

tensor([[[-1.5713e-01,  8.8009e-01,  1.6152e-01, -7.8239e-01, -1.4289e-01,
           7.4676e-01,  1.0068e-01, -5.2395e-01, -8.8726e-01,  1.9068e-01,
           1.7616e-01, -5.9426e-01, -4.8124e-01, -4.8598e-01,  2.8623e-01,
           5.7099e-01],
         [ 6.7643e-01, -5.4770e-01, -2.4780e-01,  3.1430e-01, -1.2799e-01,
          -2.9521e-01, -4.2962e-01, -1.0891e-01, -4.9282e-02,  7.2679e-01,
           7.1296e-01, -1.1639e-01,  3.2665e-01,  3.4315e-01, -7.0975e-02,
           1.2716e+00],
         [ 4.8227e-01, -1.0688e-01, -4.0555e-01,  1.7696e-01,  1.5811e-01,
          -1.6967e-01,  1.6217e-02,  2.1509e-02, -2.4903e-01, -3.7725e-01,
           2.7867e-01,  1.6295e-01, -2.8951e-01, -6.7610e-02, -1.4162e-01,
           1.2194e+00],
         [ 1.9708e-01,  2.8561e-01, -1.3028e-01, -2.6552e-01,  6.6781e-02,
           1.9535e-01,  2.8073e-02, -2.4511e-01, -4.6466e-01,  6.9287e-02,
           1.5284e-01, -2.0324e-01, -2.4789e-01, -1.6213e-01,  1.9474e-01,
           7.6778e-01],
    

In [122]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In a **T x T attention score matrix** for a single batch, where **T** is the sequence length or block size, the matrix represents the attention scores computed during the self-attention mechanism (e.g., in a Transformer model). Here's what each component represents:

- **Rows**: Each row corresponds to a **query token** in the sequence. For a sequence of length **T**, there are **T** rows, where the \(i\)-th row represents the attention scores for the \(i\)-th token's query vector attending to all tokens (including itself).

- **Columns**: Each column corresponds to a **key token** in the sequence. The \(j\)-th column represents the contribution of the \(j\)-th token's key vector to the attention scores for all queries.

- **Elements**: Each element \(A_{i,j}\) in the matrix represents the **attention score** between the \(i\)-th query token and the \(j\)-th key token. This score indicates how much the \(i\)-th token attends to the \(j\)-th token when computing its output representation. Typically, these scores are computed as:
  \[
  A_{i,j} = \text{score}(Q_i, K_j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}
  \]
  (for scaled dot-product attention, before softmax), where \(Q_i\) is the query vector for the \(i\)-th token, \(K_j\) is the key vector for the \(j\)-th token, and \(d_k\) is the dimension of the key vectors. After softmax, \(A_{i,j}\) represents the normalized attention weight.

### Summary:
- **Row \(i\)**: Attention scores for the \(i\)-th token's query attending to all tokens.
- **Column \(j\)**: Contribution of the \(j\)-th token's key to all queries.
- **Element \(A_{i,j}\)**: Attention score/weight for how much the \(i\)-th token attends to the \(j\)-th token.

If you need further details (e.g., about softmax or masking), let me know!