In [1]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [3]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


We are building character level model

Vocab Preparation

In [4]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


# Tokenization

Very simple!, just using simple integers to map these characters

In [5]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


We have build very simple tokenizer,

Google uses SentencePiece (sub-word units)

OpenAI uses tiktoken.

let's now encode the entire text dataset and store it into a torch.Tensor

In [6]:
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:10]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])


## Let's now split up the data into train and validation sets

In [7]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

We are not going to feed all data at once, that would be computational expensive.

Instead we going to train on chunk of data which we'll create from our original data.


In [8]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In the block size of 8 (i.e, 9 characters), there are 8 examples individual packed in there.

example:

In [9]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


One reason to do this is that we want the transformer network to be used to see all the way from one to all the way up to block size.

We are done with time dimension, now let's handle batch dimension. We feed the data into the network in mini-batch just to keep GPUs busy, as they are good at parallel processing. Also each example in batch doesn't talk to each other.

In [10]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

in input of 4 x 8 there are total of 32 examples packed in

First, let's build very simple bigram model using NN. We will have simple embedding weight and will train it and use it for prediction.

Idea is very simple, we are not looking any context for prediction, we are saying every character has got something in it. By looking at a single character (that is itself), we can predict what follows next.

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [12]:
torch.manual_seed(1337)

n_embd = 32 # added extra linear now
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        tok_emd = self.token_embedding_table(idx) # (B,T,C)
        logits = self.lm_head(tok_emd) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self.forward(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [13]:
model = BigramLanguageModel(vocab_size)
logits, loss = model.forward(xb, yb)

print(model)
print(loss)

BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)
tensor(4.4922, grad_fn=<NllLossBackward0>)


In [14]:
# geneate from the model
print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


hYQRnbbmkMTUwbiu$?3KHvybsMEEFNLyb!SZgyGzRX$oNqTs!roUNLjMXM!EjT!hjmfH'ER3cOn.kvgAuau&e;m-CNLkfMW HT'R


# Adam vs AdamW?

Adam:
    Adam is an optimization algorithm that adapts the learning rate for each parameter. It’s an optimizer that combines:

    Momentum: A running average of past gradients.
    RMSprop: A running average of the squared gradients.

## What's the Weight Decay Problem in Adam?

🚨 Common practice before:

People used to implement L2 regularization by modifying the gradient:

```g_t += λ * θ_t```

This worked well with SGD, But Adam doesn't update directly with g_t — it scales the gradient with m_t, v_t etc.
So, if you add weight decay to the gradient, it gets distorted

This means:
- Weight decay is now mixed with the adaptive scaling, which ruins the intended uniform shrinking effect.
- So some weights shrink more, others less.
- This leads to bad generalization, especially in large models (like Transformers).

## How AdamW fixes this?

The solution proposed in AdamW paper (2019) is:
https://arxiv.org/abs/1711.05101

**Decouple weight decay from the gradient update.**

The weight decay step is independent of the gradient.

It uniformly shrinks parameters by a fixed ratio:

This keeps the regularization effect consistent, regardless of how gradients behave.

In [15]:
# let's train our model
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [16]:
batch_size = 32
for _ in range(1000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # forward pass
    logits, loss = model(xb, yb)

    # zero all of the gradients
    optimizer.zero_grad(set_to_none=True)

    # backward pass
    loss.backward()

    # update
    optimizer.step()

print(loss.item())

2.5488028526306152


In [17]:
# geneate from the model
print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=300)[0].tolist()))


Tpour thea!
She b.
Worr;
paet wh p'mou hof cod ive fondean hemen ossNCellld I beve ais
Pice w; hevees hinean cofo atralorine, mhl t t and pninsy, yo I ves se
Pind ays,ORhaipes t twaty e;
Whan sirloworth?
Mre, h ofise,
CThed
Whsplty;
Ktor sear f yy bd :
ABndou, beu


AYer'd spatus soo me I,
; cngoutl


Certainly not Shakespeare, but the model is making progress. This is the simplest model, because the tokens are not talking to each other.

Now we want that token to talk to each other and figure out what's in context to make a better prediction.



Each token at time t should only "see" tokens from time ≤ t (no peeking into the future).

A naive way: average all previous embeddings up to the current token.

# The mathematical trick in self-attention

In [18]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [19]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0)


In [20]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [21]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)


In [22]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x


### Masking with Softmax (for attention weights)
Instead of hard-coded ones in the lower triangle, allow learned attention weights.

Initialize scores (e.g., all zeros), mask future with -inf, then apply softmax:

# Why Softmax?
Allows dynamic (data-dependent) attention:

Instead of uniform averaging, future models will compute affinities between tokens.

Attention scores dictate how much each token cares about previous ones.

Future tokens are still masked to preserve causality.

Previously, we used a lower-triangular mask to perform uniform averaging over past tokens.

That was a static mechanism: each token equally weighted previous ones

# 🧠 Now the Goal:
Make the weighting data-dependent → certain tokens should attend more or less to others based on their content.

# Core Idea: Queries, Keys, and Dot Products
### 1. Every token emits 3 vectors:

Query 𝑄: Q → what I'm looking for

Key 𝐾: K → what do I contain

Value 𝑉: V → what I want to share

Each of these is computed as a linear projection of the input:
```
k = self.key(x)    # B x T x head_size
q = self.query(x)  # B x T x head_size
v = self.value(x)  # B x T x head_size
```

### 2. Affinity Matrix (Weights):
Dot product of each query with all keys

```
weights = q @ k.transpose(-2, -1)  # Shape: B x T x T
```

### 3. Causal Masking:
Prevent future tokens from being seen (for language modeling):
```
weights.masked_fill(triu_mask == 0, float('-inf'))
```

### 4. Softmax Normalization:
Convert affinities to probabilities:
```
weights = F.softmax(weights, dim=-1)
```

### 5. Weighted Aggregation:
Multiply attention weights by values:
```
output = weights @ v  # Shape: B x T x head_size
```
Each token aggregates information from previous tokens based on attention scores.

Vectors v are what's actually communicated—not raw x.

In [23]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

Here are the expanded and structured notes for the final part of the explanation, incorporating your bullet points with clear elaboration, visual intuition, and technical depth:

---

## 📌 Key Concepts Recap: Attention as Communication

---

### 🧠 Attention as Communication
- Think of each token as a **node in a directed graph**.
- Each node:
  - Holds its own vector (`x`)
  - Computes a **weighted sum** of other nodes' values (`v`)
  - Weights (affinities) are **data-dependent**, computed via dot product of `q` and `k`.

> ⚠️ Not fixed topology—data decides connectivity!

---

### 🧭 No Notion of Space
- Attention is **permutation invariant** by default.
- It treats all input vectors as a **set**, not a sequence or grid.
- That’s why we **must add position information** explicitly:
  ```python
  x = token_embedding + positional_embedding
  ```

---

### 🎒 Batch-Wise Isolation
- Each input in the **batch dimension** is processed **independently**:
  - No information is shared across batch entries.
  - Each batch behaves like a **separate graph** of tokens.

> Example: If batch size B = 4 and sequence length T = 8 → We have **4 disjoint 8-node graphs**.

---

### 🔁 Encoder vs Decoder Attention Blocks

| Component       | Encoder Block                          | Decoder Block                          |
|----------------|----------------------------------------|----------------------------------------|
| Masking        | ❌ No masking – full attention          | ✅ Masked – only attends to past tokens |
| Communication  | Bidirectional (all tokens see all)     | Autoregressive (future is masked)      |
| Application    | BERT, T5 encoder                       | GPT, T5 decoder                        |
| Masking Code   | *(none)*                               | `weights.masked_fill(triu == 0, -inf)` |

- **Decoder block** masks future tokens using `torch.tril` (lower triangular mask).
- **Encoder block** removes that line to allow all-to-all attention.

---

### 🔄 Self-Attention vs Cross-Attention

| Aspect         | Self-Attention                             | Cross-Attention                               |
|----------------|--------------------------------------------|------------------------------------------------|
| Queries        | Computed from input `x`                    | Computed from input `x`                        |
| Keys & Values  | Also computed from same input `x`          | Come from a different source (e.g., encoder)   |
| Application    | Language modeling, understanding           | Encoder-decoder models (e.g., translation)     |

> 🧩 Cross-attention is what lets a decoder attend to encoder outputs in translation.

---

## 🎯 Scaled Dot-Product Attention

- Original affinity scores are computed as:
  ```python
  weights = q @ k.transpose(-2, -1)
  ```

- But as vector size increases, these dot products can grow large, leading to **softmax saturation** (only one element dominates).
- Solution: **scale down the dot products**:
  ```python
  weights /= math.sqrt(head_size)
  ```

### 📐 Why scaling helps:
- Assume `q` and `k` have zero mean and unit variance.
- Then `q · k` will have variance ≈ `head_size`.
- Dividing by `sqrt(head_size)` ensures:
  - Dot product has **unit variance**
  - Softmax stays **stable and diffuse**, improving training dynamics

---

### 📈 Illustration of Scaling Effect (conceptual)

| Dot Product (Before Scaling) | After Scaling by √d | Softmax Output |
|------------------------------|---------------------|----------------|
| [3.2, 2.1, 0.7]              | [0.8, 0.5, 0.2]     | [0.46, 0.35, 0.19] |
| [10, 5, -3]                  | [2.5, 1.25, -0.75]  | [0.72, 0.23, 0.05] |

- Without scaling: softmax becomes too **peaky**
- With scaling: more **balanced**, less saturated gradients

---

## ✅ Final Summary

| Concept                | Core Idea                                                                 |
|------------------------|---------------------------------------------------------------------------|
| Attention              | Communication between tokens via data-driven weighted aggregation         |
| Queries & Keys         | Used to compute how "interested" tokens are in each other                 |
| Values                 | Actual information that gets passed/aggregated                            |
| Causal Masking         | Prevents tokens from accessing the future in language generation          |
| Scaling                | Keeps softmax stable by normalizing dot product magnitudes                |
| Self vs Cross Attention| Whether Q, K, V are from the same or different sources                    |
| Encoder vs Decoder     | Whether all tokens talk, or only attend to previous tokens                |

---

Would you like me to turn this into a visual flowchart or diagram next?

## 🧮 Scaled Dot-Product Attention

### Why scale?
- Prevent variance of attention weights from **blowing up** as the head size grows.
- Without scaling:
  - If Q, K are unit Gaussian, `Q·Kᵗ` will have variance ∝ `head_size`.
  - This can cause **softmax saturation** → output becomes near one-hot.
- With scaling:
  ```python
  scores = (Q @ K.transpose(-2, -1)) / sqrt(head_size)
  ```

### Effect on Softmax:
- Keeps softmax output **diffuse** and prevents **gradient vanishing or explosion**.
- At initialization, we want attention to be soft and spread, not overly focused.

> **Takeaway**: Scaling by √d is essential for **stable training** of attention models.


In [24]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [25]:
k.var()

tensor(1.0449)

In [26]:
q.var()

tensor(1.0700)

In [27]:
wei.var()

tensor(1.0918)

In [28]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [29]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


## 🧱 Implementation of Self-Attention Head

### `Head` class (in PyTorch):
- Takes `head_size` as input
- Creates:
  - Linear layers for `key`, `query`, `value` (no bias)
  - Lower-triangular mask (`torch.tril`) registered as a buffer
- Forward pass includes:
  - Key, Query, Value projections
  - Dot-product attention (with scaling)
  - Masking (for decoder block behavior)
  - Softmax + value aggregation

## 🔗 Multi-Head Attention

### Why multiple heads?
- Allows **parallel, independent attention channels**.
- Different heads can learn **different types of dependencies**:
  - Some may focus on vowels, others on specific positions, etc.

### Implementation:
- Create multiple `Head` instances in a list:
  ```python
  self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
  ```
- Forward:
  ```python
  x = torch.cat([h(x) for h in self.heads], dim=-1)
  ```

### Head Size:
- If `n_embed = 32` and `num_heads = 4`, then each head has `head_size = 8`.

> Think of it like **grouped convolutions**, where each group processes different channels independently.

### Training Outcome:
- Validation loss improved: **2.4 → 2.28**
- Generation still rough, but better communication depth

---

## ⚙️ Feedforward Network (Per Token)

### Why?
- After tokens "talk" via attention, they should **process what they’ve heard**.
- Add computation per token independently.

### Implementation:
```python
self.ffwd = nn.Sequential(
    nn.Linear(n_embed, n_embed),
    nn.ReLU()
)
```

- Applied **after self-attention**, token-wise:
  ```python
  x = self.head(x)
  x = self.ffwd(x)
  ```

### Analogy:
- Attention = group discussion  
- Feedforward = individual thinking on what was heard

### Effect:
- Validation loss improved further: **2.28 → 2.24**

---

## 🧱 Transformer Block Structure

Each block:
1. **Multi-head self-attention**
2. **Feedforward network**
3. Both are **per token**, and often include **residuals and layer norm** (to be added later)

```text
+-------------------+
| Multi-head Attn   | ← Communication
+-------------------+
| Feedforward (MLP) | ← Computation
+-------------------+
```

> These blocks are **stacked multiple times** in actual Transformer models (e.g., 12 for GPT-2).


# Full Code

In [32]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# # wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# with open('input.txt', 'r', encoding='utf-8') as f:
#     text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5091, val loss 2.5060
step 300: train loss 2.4199, val loss 2.4338
step 400: train loss 2.3499, val loss 2.3561
step 500: train loss 2.2965, val loss 2.3129
step 600: train loss 2.2405, val loss 2.2495
step 700: train loss 2.2053, val loss 2.2194
step 800: train loss 2.1625, val loss 2.1853
step 900: train loss 2.1248, val loss 2.1513
step 1000: train loss 2.1035, val loss 2.1305
step 1100: train loss 2.0709, val loss 2.1193
step 1200: train loss 2.0375, val loss 2.0789
step 1300: train loss 2.0246, val loss 2.0646
step 1400: train loss 1.9936, val loss 2.0371
step 1500: train loss 1.9720, val loss 2.0318
step 1600: train loss 1.9629, val loss 2.0479
step 1700: train loss 1.9422, val loss 2.0139
step 1800: train loss 1.9096, val loss 1.9963
step 1900: train loss 1.9080, val loss 1.9884
step 2000: train loss 1.8837, val loss 1.9943
step 2100: train loss 1.

In [33]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))




I dace I tracius, the for of of an Xord abblences: by blolib'd it,
I wave blouds thy shall the catt,
Outy art of Bajoy block: piss; shat
And struse!
for mund for Jecemence bin-Seraker, wroth it eyetice;
I my neat's That make him pit blest in gon you art,
Setile all furse the and ver that bagling tilkand thos-loving;
Adwe show to heave be berer gland eyes with his youUh:
Shepp this as I will for a comestel but groust
sill king to you sim. Then worthherd my ascome, at
There boy's temppert bowh strall makin the spon.

EXTMillow,
BOLI mad it my ferer, and, not while her a scent.
I what the never; be lest that. 
DUKE Strear: well be kill kings down?
Brines Rome.

RICHENTESTEM:
O be you head! Let, goid nother, but boott, withse dimp you likery can of you lays lutumes of to const!
Go not upong toth stume till brath
As queepoty hang what me heart, of fail my kings!

Cotherd Whath'd, a him takes of excoudit?

LARENCE:
Not mus'd mather, you his me; and blove in he,
Let not be, I loved now, no 

## 🔄 Embedding, Head Count, and Head Size

- You define:
  - `n_embed = 32`
  - `n_head = 4`
- This implies:
  - `head_size = n_embed / n_head = 8`
- Ensures that when you **concatenate** outputs from all heads, you get back to `n_embed`.

> **Analogy**: Like group convolution — smaller communication blocks working in parallel.

---

## 🔁 Repeating Blocks (Communication + Computation)

- A Transformer is built by **stacking blocks**:
  - Each block = **Self-Attention** → **Feedforward**
- We now want to repeat:
  ```python
  Block
  Block
  Block
  ...
  ```

> Deep networks improve model capacity, but also introduce optimization difficulties.

---

## ⚡ Solution 1: Residual Connections (a.k.a. Skip Connections)

- Origin: **ResNet** (2015)
- Each block is wrapped with:
  ```python
  x = x + block(x)
  ```
- Helps in optimization:
  - Enables **gradient flow** directly to early layers
  - Prevents vanishing gradients
  - At initialization, blocks do almost nothing → clean gradient path
- Visualization:
  ```
  Input
    │
  ┌─▼─────────────┐
  │    Block      │
  └─▲─────────────┘
    │       │
    └─── + ─┘ → Output
  ```

---

## ⚡ Solution 2: Output Projection After Multi-Head Attention

- After concatenating multi-head outputs, project back to `n_embed`:
  ```python
  self.proj = nn.Linear(n_embed, n_embed)
  out = self.proj(concat_heads_output)
  ```
- Ensures shape compatibility with the **residual path**

---

## ⚙️ Scaling Feedforward Network Width

- Paper (Vaswani et al., 2017) suggests:
  - Inner dimension = 4 × `n_embed`
- Updated feedforward:
  ```python
  nn.Sequential(
      nn.Linear(n_embed, 4 * n_embed),
      nn.ReLU(),
      nn.Linear(4 * n_embed, n_embed)
  )
  ```

> Adds expressive power while maintaining the same output shape.

---

## 🧪 Results after Residual + Projection + Scaling

- Validation loss improves: **2.24 → 2.08**
- Slight overfitting starts appearing (train loss drops faster than val loss)

---

## 🔃 Solution 3: Layer Normalization (LayerNorm)

- Normalizes features **per token** (across embedding dim)
- Very similar to BatchNorm, but:
  - Does not depend on batch size
  - No running stats needed
  - No distinction between train/test mode

### PyTorch:
```python
nn.LayerNorm(n_embed)
```

- Implemented **before** self-attention and feedforward (PreNorm)
  ```python
  x = x + sa(self.ln1(x))
  x = x + ff(self.ln2(x))
  ```

> Normalizing before transformation improves training stability

---

## ✅ Updated Transformer Block Structure

```
Input ──► LayerNorm ──► Multi-Head Self Attention ──► + ──► LayerNorm ──► Feedforward ──► + ──► Output
```

---

## 🛡️ Dropout for Regularization

- Introduced Dropout (p=0.2) at several points:
  - After self-attention before residual add
  - After feedforward before residual add
  - After softmax (drop connections randomly)

### Why Dropout?
- Randomly disables parts of the network → prevents overfitting
- Acts like an **ensemble** of subnetworks

---

## 📏 Scaling Up the Transformer

### New Hyperparameters:
```python
batch_size = 64
block_size = 256        # Context length
n_embed = 384
n_head = 6              # → head_size = 64
n_layer = 6
dropout = 0.2
```

### Other Training Adjustments:
- Learning rate lowered (larger network → more sensitive)
- Trained longer on A100 GPU (~15 min)

### Results:
- Validation loss improved from **2.07 → 1.48**
- Generated text resembles Shakespeare in style but remains semantically nonsensical

---

## 📘 Decoder-Only Transformer

### What's Missing?
- **No Encoder Block**
- **No Cross-Attention Block**

### Why?
- We're doing **unconditioned language modeling**
- Just need to predict next token based on context (causal)

### Masking:
- Uses **lower-triangular masking** to enforce autoregressive property
- No "cheating" by looking at the future

> This is exactly what a **GPT-style decoder-only Transformer** does.


Here are the **final detailed notes** wrapping up the full walkthrough of building a **decoder-only Transformer**, understanding how it's used in **GPT-style models**, and how it connects to **ChatGPT's training pipeline**. This ties everything together—from raw self-attention to real-world deployment stages like fine-tuning and reinforcement learning.

---

## 🎓 Final Concepts and Wrap-up Notes

---

## 🔁 Decoder-Only Transformers

### 🔹 What makes it a “Decoder”?
- Uses a **triangular (causal) mask** in attention.
- Enforces **auto-regressive generation**: each token only sees past tokens.
- Suitable for **language modeling** — predicting the next token given previous context.

### 🔹 When do you use a Decoder Only?
- **Unconditional generation**: "Just babble based on prior data."
- No external context (e.g., translation source, prompt), just a corpus (like Shakespeare).

---

## 🧩 Encoder-Decoder Transformers

### 🔹 Why do we need an encoder?
- **Used in tasks like translation** where output must depend on external input.
  - Input: Sentence in French
  - Output: Sentence in English

### 🔹 Architecture:

1. **Encoder**:
   - No masking (tokens can freely attend to each other).
   - Converts input (e.g., French) into contextual embeddings.

2. **Decoder**:
   - Predicts output (e.g., English) **one token at a time**.
   - Attends to:
     - Its own **past tokens** (via causal mask)
     - Full **encoder output** via **cross-attention**.

### 🔹 Cross-Attention:
- **Queries** from the decoder (current token)
- **Keys and Values** from encoder outputs
- Allows decoder to “look at” encoder context while generating

---

## 🧪 Summary of What We Built

| Module                 | Status             | Notes |
|------------------------|--------------------|-------|
| Embedding + Positional | ✅ Done             | Learnable embeddings |
| Self-Attention         | ✅ With scaling, masking, dropout |
| Multi-Head Attention   | ✅ Heads run in parallel |
| Feedforward Layer      | ✅ Width = 4 × embedding dim |
| Residual Connections   | ✅ On attention & feedforward |
| Layer Norm             | ✅ PreNorm style |
| Dropout                | ✅ For regularization |
| Stacked Blocks         | ✅ Parameterized by `n_layer` |
| Final Layer Norm       | ✅ Before output projection |
| Output Head            | ✅ Linear layer to vocabulary size |
| Model Type             | ✅ Decoder-only GPT-style |

> Architecture is essentially **identical to GPT models**, just much smaller.

---

## 🛠️ Scaling Up: Hyperparameters (Final Large Model)

| Hyperparameter     | Value       |
|--------------------|-------------|
| `n_layer`          | 6           |
| `n_head`           | 6           |
| `n_embed`          | 384         |
| `block_size`       | 256         |
| `dropout`          | 0.2         |
| `batch_size`       | 64          |
| `learning_rate`    | lower than earlier (due to deeper net) |

### 🔹 Results:
- Validation loss: **1.48** (down from 2.07)
- Text generation mimics Shakespearean structure (though not semantically coherent yet)

---

## 🧠 Walkthrough (Codebase Overview)

### 🔹 Two Key Files:
- `train.py`: Training loop, checkpointing, learning rate scheduling, DDP support
- `model.py`: Full Transformer model (similar to what we built)

### 🔹 Differences:
- Uses **batched multi-head attention** via 4D tensors
- Built for **efficiency and scaling**
- Uses GELU instead of ReLU in MLP (to match OpenAI's models)
- Supports loading pretrained checkpoints

---

## 🤖 From GPT to ChatGPT: What’s Different?

### 📌 GPT Pretraining:
- Large **decoder-only Transformer**
- Trained on internet-scale corpora (~300B tokens)
- Goal: **document continuation**
- Output: **unfiltered text completion**, not helpful Q&A

### 📌 ChatGPT Alignment Pipeline:

1. **Supervised Fine-Tuning (SFT)**:
   - Train on Q&A pairs (assistant-style formatting)
   - Teaches the model to start behaving like a helper

2. **Reward Model (RM) Training**:
   - Human raters **rank multiple completions**
   - Train a new model to **score responses**

3. **Reinforcement Learning (PPO)**:
   - Generate responses
   - Use RM to give reward
   - Tune the base model to maximize reward → better, safer responses

> The final ChatGPT pipeline is:
> ```
> GPT (Pretrained) → SFT → Reward Model → PPO → ChatGPT
> ```

---

## 📈 Our Tiny GPT vs GPT-3

| Attribute              | Our Model       | GPT-3             |
|------------------------|-----------------|-------------------|
| Parameters             | ~10M            | 175B              |
| Tokens Trained On      | ~300K           | 300B              |
| Vocab Type             | Character-level | Subword BPE (50K) |
| Architecture           | GPT-style       | GPT-style         |
| Training Time          | ~15 mins (A100) | Weeks (Thousands of GPUs) |

