# Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

## Load data

In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-11-15 04:10:14--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-11-15 04:10:14 (19.1 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [4]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## Tokenizer, Encoder, Decoder (simplified)

### Tokenizer
A tokenizer is a tool that splits a given input (like a sentence) into smaller units called tokens. These tokens can be words, characters, or subwords. In the provided code, the tokenizer implicitly treats each character as a token. This is evident from the fact that the mappings `stoi` (string to integer) and `itos` (integer to string) are created based on individual characters (`chars`) in the input.

### Encoder
An encoder converts the tokens into a format that a machine learning model can understand, typically numerical. In this code, the `encode` function is the encoder. It takes a string as input and converts each character in the string to its corresponding integer based on the `stoi` mapping. This process transforms the textual data into a list of integers, which is a format suitable for computational models.

### Decoder
A decoder performs the reverse operation of the encoder. It converts the machine-readable format (like integers) back into a human-readable format (like text). In the code, the `decode` function is the decoder. It takes a list of integers (the encoded format) and converts each integer back to its corresponding character using the `itos` mapping. Then, it joins these characters to form a string, reconstructing the original text or a translated version.

### Overall Workflow
The overall process in the code demonstrates a basic NLP pipeline:
1. **Tokenization**: The input string is implicitly tokenized into characters.
2. **Encoding**: Each character (token) is converted into an integer.
3. **Processing**: While not explicitly shown in the code, this step would typically involve some form of processing or transformation, like passing the encoded data through a machine learning model.
4. **Decoding**: The processed data is converted back into a human-readable format, reconstructing the original message or translating it into another form.

This kind of tokenizer, encoder, and decoder framework is fundamental in many areas of NLP and is particularly crucial in models like machine translation, text summarization, and language generation models.

In [6]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode([46, 47, 47, 1, 58, 46, 43, 56, 43]))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
hii there


In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      


## Train, Valid, Test


**Splitting the Data**:
   - `train_data = data[:n]`: This line creates the training dataset (`train_data`) by taking the first `n` elements from the `data`. Since `n` is 90% of the total data, this means `train_data` will have 90% of the entries from the original dataset.
   - `val_data = data[n:]`: This line creates the validation dataset (`val_data`) by taking the remaining data points in `data` from the `n`th index to the end. This subset will be the remaining 10% of the data not included in the training set.

### Importance of Training, Validation, and Testing Sets

#### Training Set
- **Purpose**: The training set is used to train the machine learning model. It is the primary dataset on which the model learns to make predictions or classifications.
- **Importance**: Without a training set, the model would have no examples to learn from, similar to how students need lectures or textbooks to learn a subject.

#### Validation Set
- **Purpose**: The validation set is used to evaluate the model during the training process. It helps in tuning the model's parameters and provides a check against overfitting.
- **Overfitting**: Overfitting is when a model performs well on the training data but poorly on new, unseen data. It's like a student who memorizes facts for a test but doesn't understand the concepts well enough to apply them in different situations.
- **Importance**: The validation set acts like a practice exam. It helps understand how well the model is learning and generalizing beyond the training data.

#### Testing Set
- **Purpose**: The testing set is used to evaluate the model's performance after the training is complete. It's a final check to see how well the model will perform on entirely new data.
- **Importance**: The test set is like a final exam taken after all the studies and revisions are done. It gives an unbiased evaluation of the model's performance in real-world scenarios.

#### Simplified Explanation for Students
Think of the whole machine learning process like preparing for a big exam:
- **Training Set**: This is your study material, like textbooks and notes, which you use to learn and understand the topics.
- **Validation Set**: These are like practice tests you take while studying. They help you see what you've learned well and what you need to study more. They also stop you from just memorizing the textbook (preventing overfitting).
- **Testing Set**: This is the final exam, taken after all your studies and practice tests. It shows how well you've learned the material and can apply it to new questions you've never seen before.

In [8]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

This code snippet is designed for batching in a decoder-only model, commonly used in tasks like language modeling where the goal is to predict the next token in a sequence. Let's first understand the code and then delve into the concept of a decoder-only model.

### Understanding the Code

1. **Setting the Seed and Parameters**:
   - `torch.manual_seed(1337)` sets a fixed seed for random number generation in PyTorch, ensuring reproducibility of results.
   - `batch_size` determines how many sequences are processed in parallel. Here, it's set to 4.
   - `block_size` is the maximum context length for predictions, set to 8.

2. **Batch Generation (`get_batch` function)**:
   - The function selects either `train_data` or `val_data` based on the `split` argument ('train' or 'val').
   - It generates indices (`ix`) for selecting random starting points in the data.
   - Inputs (`x`) and targets (`y`) are created based on these indices. Each input sequence in `x` is a sequence of `block_size` tokens from `data`, and each target sequence in `y` is the subsequent token in the sequence, used as the prediction target.

3. **Printing Batch Details**:
   - The shapes and contents of the input (`xb`) and target (`yb`) batches are printed.
   - The loop at the end iteratively prints each context and its corresponding target from the batch. This illustrates how the model should predict the next token based on the given context.

### Decoder-Only Model Explanation

In machine learning, particularly in natural language processing, a decoder-only model is structured to generate text or predict the next item in a sequence. Here’s why it’s useful and how it works:

1. **Purpose**: A decoder-only model is designed for tasks where the output is a continuation of the input, like text generation, where you predict the next word in a sentence.

2. **Functioning**:
   - **Context Understanding**: The model takes a sequence of tokens (words, characters, etc.) as input and understands the context.
   - **Next Token Prediction**: Based on this context, it predicts the next token in the sequence.

3. **Applications**: This model architecture is prominent in applications like language models (GPT series, for example), where the model generates text, completes sentences, or even writes code based on the given prompt.

4. **Training Process**: During training, the model learns by looking at a part of a text sequence and predicting the next part. This is what the provided code is preparing the data for. By repeatedly practicing this task, the model learns patterns in language and can generate coherent and contextually relevant text.

In simple terms, a decoder-only model in language tasks is like a skilled storyteller. You give it the beginning of a story (the context), and it learns to continue the story in a way that makes sense and is engaging, learning better storytelling techniques as it practices on more and more stories (training data).

In [9]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [10]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [11]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [12]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


This code defines a basic bigram language model using PyTorch, a popular deep learning library. The model predicts the next word in a sequence based on the current word. Let's break down the code and explain its components, including the cross-entropy loss and the shapes of the vectors involved.

### Cross-Entropy Loss
Cross-entropy loss is commonly used in classification tasks. It measures the difference between two probability distributions: the true distribution (from the target labels) and the predicted distribution (from the model's output). In this context, it's used to measure how well the model's predictions match the actual next words (targets).

### Shape of Vectors
- **`logits` Shape**: The shape of `logits` is (B, T, C), where B is the batch size, T is the sequence length (number of tokens in each sequence), and C is the number of classes (same as the vocabulary size).
- **`targets` Shape**: The `targets` tensor is reshaped to (B*T) to align with the flattened `logits` for loss calculation. This is necessary because cross-entropy loss in PyTorch expects inputs of shape (N, C) where N is the number of samples (here, B*T) and C is the number of classes.

### Overall Functionality
This model takes sequences of token indices as input and predicts the next token in each sequence. It uses an embedding layer to map tokens to vectors and calculates the cross-entropy loss to measure the accuracy of its predictions. The `generate` function allows the model to create new text sequences based on an initial input.

In [13]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)  # Setting a random seed for reproducibility

class BigramLanguageModel(nn.Module):
    # A language model based on bigrams (pairs of consecutive words)

    def __init__(self, vocab_size):
        super().__init__()
        # Embedding layer: maps each token to a vector with the same size as the vocabulary
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # Forward pass of the model
        # idx: input tensor containing indices of the current tokens

        logits = self.token_embedding_table(idx)  # Convert indices to logits

        if targets is None:
            loss = None  # No loss calculation if targets are not provided
        else:
            B, T, C = logits.shape  # Batch size (B), sequence length (T), and number of classes (C)
            logits = logits.view(B*T, C)  # Reshape logits for cross-entropy calculation
            targets = targets.view(B*T)  # Flatten the targets
            loss = F.cross_entropy(logits, targets)  # Calculate the cross-entropy loss

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Function to generate text from the model
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]  # Focus on the last set of logits
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample the next token
            idx = torch.cat((idx, idx_next), dim=1)  # Append the next token to the sequence

        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)  # Forward pass with inputs xb and targets yb
print(logits.shape)  # Print the shape of the logits tensor
print(loss)  # Print the loss
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
# Generate text starting with an initial index of zero and create 100 new tokens


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [14]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [15]:
batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


3.7218432426452637


In [16]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


olylvLLko'TMyatyIoconxad.?-tNSqYPsx&bF.oiR;BD$dZBMZv'K f bRSmIKptRPly:AUC&$zLK,qUEy&Ay;ZxjKVhmrdagC-bTop-QJe.H?x
JGF&pwst-P sti.hlEsu;w:w a BG:tLhMk,epdhlay'sVzLq--ERwXUzDnq-bn czXxxI&V&Pynnl,s,Ioto!uvixwC-IJXElrgm C-.bcoCPJ
IMphsevhO AL!-K:AIkpre,
rPHEJUzV;P?uN3b?ohoRiBUENoV3B&jumNL;Aik,
xf -IEKROn JSyYWW?n 'ay;:weO'AqVzPyoiBL? seAX3Dot,iy.xyIcf r!!ul-Koi:x pZrAQly'v'a;vEzN
BwowKo'MBqF$PPFb
CjYX3beT,lZ qdda!wfgmJP
DUfNXmnQU mvcv?nlnQF$JUAAywNocd  bGSPyAlprNeQnq-GRSVUP.Ja!IBoDqfI&xJM AXEHV&DKvRS


## The mathematical trick in self-attention

In [17]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [18]:
%%time
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

CPU times: user 1.67 ms, sys: 1.17 ms, total: 2.85 ms
Wall time: 7.42 ms


torch.Size([4, 8, 2])

In [19]:
%%time
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0)


CPU times: user 1.67 ms, sys: 869 µs, total: 2.54 ms
Wall time: 2.33 ms


In [20]:
%%time
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

CPU times: user 1.36 ms, sys: 0 ns, total: 1.36 ms
Wall time: 4.74 ms


False

In [21]:
%%time
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


CPU times: user 1.55 ms, sys: 0 ns, total: 1.55 ms
Wall time: 1.56 ms


False

In [22]:
%%time
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

CPU times: user 2.2 ms, sys: 921 µs, total: 3.12 ms
Wall time: 5.91 ms


torch.Size([4, 8, 16])

In [23]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

## More details - Notes
Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below


-------------------

The notes you've provided summarize key concepts related to attention mechanisms in neural networks, particularly in the context of models like transformers. Let's delve into these concepts with more detail and then provide a simplified explanation.

### Detailed Explanation

1. **Attention as a Communication Mechanism**:
   - In neural networks, attention allows different parts of the input data to 'communicate' with each other. Imagine attention as nodes in a graph where each node can 'look' at other nodes and gather information from them. The information aggregation is done through a weighted sum, where the weights are determined based on the relevance of other nodes' information to the current node.

2. **No Notion of Space and Need for Positional Encoding**:
   - Attention mechanisms operate on sets of vectors without inherent ordering or positioning. This means they don't naturally understand the order of words in a sentence or pixels in an image.
   - To address this, we use positional encoding. It adds extra information to each vector to indicate its position (or order) in the sequence, allowing the model to understand sequence order or spatial relationships.

3. **Independent Processing Across Batch Dimension**:
   - In batch processing, multiple examples (like sentences or images) are processed simultaneously for efficiency. However, each example in a batch is processed independently. That means information from one example in the batch does not influence the processing of another.

4. **Encoder vs. Decoder Attention Blocks**:
   - In attention-based models, there are typically two types of blocks: encoder and decoder blocks.
   - Encoder blocks allow each token (like a word in a sentence) to attend to all other tokens. There is no restriction on what each token can 'see'.
   - Decoder blocks, on the other hand, use triangular masking (with `tril`). This masking prevents a token from attending to future tokens, which is crucial in autoregressive tasks (like generating text, where the future words are not yet known).

5. **Self-Attention and Cross-Attention**:
   - Self-attention means the keys, queries, and values in the attention mechanism all come from the same input source. Essentially, the input data is interacting with itself, finding relationships within.
   - Cross-attention, however, uses queries from one source (like the current input in a decoder) and keys and values from another source (like the output of an encoder). This is common in tasks like machine translation, where the model needs to relate two different sequences (like sentences in two languages).

6. **Scaled Attention**:
   - In scaled attention, the attention weights are scaled down by a factor of 1/sqrt(head_size) (head size is a dimension of the key/query vectors). This scaling helps in stabilizing the learning process.
   - It ensures that the attention weights don't become too large, preventing the softmax function from becoming too 'sharp'. A sharp softmax would mean the model pays attention to very few tokens excessively and ignores the rest, which is not desirable for learning nuanced relationships in the data.

### Simplified Explanation

Imagine you're in a room full of people, and you're trying to figure out the most relevant conversations to pay attention to:

- **Attention Mechanism**: It's like you have a special ability to listen to multiple conversations at once and decide which ones are most important based on what's being said.
- **Positional Encoding**: Since people talk in sequences (one word after the other), you need to understand the order of words in each conversation. Positional encoding is like your ability to remember the sequence of words in each conversation.
- **Independent Processing**: Even if there are multiple groups in the room, what happens in one group doesn't affect your understanding of another group.
- **Encoder vs. Decoder**: If you're just listening (encoder), you can hear all parts of a conversation. But if you're also talking (decoder), you can't talk about parts of the conversation that haven't happened yet.
- **Self-Attention vs. Cross-Attention**: Self-attention is like understanding a conversation by relating different parts of the same conversation. Cross-attention is like using information from one conversation to understand another.
- **Scaled Attention**: This is akin to not focusing too much on a loud voice in the room (scaling down), so you can still understand the general noise level and not miss out on other important conversations.

In [24]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Defining the BigramLanguageModel class which inherits from nn.Module
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, device):
        super().__init__()
        # Initialize embeddings and network layers

        # Token embedding table, converting token indices into embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # Positional embedding table, encoding the position of each token in a sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # A sequence of transformer blocks for processing the embeddings
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])

        # Layer normalization applied to the output of the last block
        self.ln_f = nn.LayerNorm(n_embd)

        # Linear transformation to project the output back to the vocabulary space
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        # Forward pass of the model

        # Extracting batch size (B) and sequence length (T) from input indices
        B, T = idx.shape

        # Getting token embeddings for the input indices
        tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, C)

        # Generating positional embeddings for each position in the sequence
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # Shape: (T, C)

        # Adding token embeddings and positional embeddings
        x = tok_emb + pos_emb  # Shape: (B, T, C)

        # Passing the combined embeddings through the transformer blocks
        x = self.blocks(x)  # Shape: (B, T, C)

        # Applying layer normalization
        x = self.ln_f(x)  # Shape: (B, T, C)

        # Projecting the output to vocabulary size to get logits
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is not None:
            # Compute loss if targets are provided
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        else:
            # No loss computation if targets are not provided
            loss = None

        return logits, loss


In [25]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [26]:
k.var()

tensor(1.0449)

In [27]:
q.var()

tensor(1.0700)

In [28]:
wei.var()

tensor(1.0918)

In [29]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [30]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

### Simplified Explanation

Think of `LayerNorm1d` as a tool to make sure that the data being processed by a neural network is more standardized or normalized. Here's a simple analogy: Imagine you're a teacher grading a bunch of tests. Some tests are really tough, and some are very easy, so the scores are all over the place. To make them more comparable, you adjust the scores (normalization) so that each test has an average score of 0 and the spread of scores is consistent.

- **Initialization (`__init__`)**: This is like setting up your grading scheme. You decide on the basic adjustments (parameters `gamma` and `beta`) you'll use to standardize scores. `gamma` is like a multiplier to adjust the spread of scores, and `beta` is like an add-on to adjust the average score.

- **Applying Layer Normalization (`__call__`)**: Here, you take each test's scores, figure out how much they vary from the average (mean and variance), and then adjust them (normalize) using your grading scheme. This makes the scores on each test more comparable to the scores on other tests.

- **Parameters (`parameters`)**: This method is like keeping a record of your grading scheme, so you know how you've adjusted the scores.

- **Output**: After applying this layer normalization to a batch of data (like a bunch of tests), the result is a more standardized set of data that's easier for the neural network to work with, just like how normalized test scores are easier to compare and understand.

In [31]:
class LayerNorm1d:  # Custom class for Layer Normalization

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        # Constructor for initializing the layer
        self.eps = eps  # A small number to prevent division by zero
        self.gamma = torch.ones(dim)  # Scale parameter, initialized to ones
        self.beta = torch.zeros(dim)  # Shift parameter, initialized to zeros

    def __call__(self, x):
        # Method to apply layer normalization on input x
        xmean = x.mean(1, keepdim=True)  # Calculate mean of each batch
        xvar = x.var(1, keepdim=True)  # Calculate variance of each batch
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # Normalize inputs
        self.out = self.gamma * xhat + self.beta  # Apply scaling and shifting
        return self.out

    def parameters(self):
        # Returns parameters of the layer (gamma and beta) for optimization
        return [self.gamma, self.beta]

# Set the seed for reproducibility
torch.manual_seed(1337)

# Create an instance of LayerNorm1d for 100-dimensional input
module = LayerNorm1d(100)

# Generate a batch of 32 random 100-dimensional vectors
x = torch.randn(32, 100)

# Apply layer normalization to the batch
x = module(x)

# Check the shape of the output
x.shape


torch.Size([32, 100])

In [32]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [33]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))

In [34]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [38]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """Implements multi-head self-attention mechanism."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        # Initialize multiple attention heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # Linear projection layer
        self.proj = nn.Linear(n_embd, n_embd)

        # Dropout layer for regularization
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate the outputs from all attention heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)

        # Apply linear projection and dropout
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """Implements a feed-forward neural network as part of the Transformer block."""

    def __init__(self, n_embd):
        super().__init__()
        # Define a simple feed-forward network
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # Expand the input
            nn.ReLU(),                      # ReLU activation function
            nn.Linear(4 * n_embd, n_embd),  # Project back to original size
            nn.Dropout(dropout),            # Dropout layer for regularization
        )

    def forward(self, x):
        # Pass the input through the feed-forward network
        return self.net(x)


class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5091, val loss 2.5058
step 300: train loss 2.4197, val loss 2.4336
step 400: train loss 2.3501, val loss 2.3562
step 500: train loss 2.2963, val loss 2.3125
step 600: train loss 2.2407, val loss 2.2496
step 700: train loss 2.2054, val loss 2.2187
step 800: train loss 2.1633, val loss 2.1866
step 900: train loss 2.1241, val loss 2.1504
step 1000: train loss 2.1036, val loss 2.1306
step 1100: train loss 2.0698, val loss 2.1180
step 1200: train loss 2.0380, val loss 2.0791
step 1300: train loss 2.0248, val loss 2.0634
step 1400: train loss 1.9926, val loss 2.0359
step 1500: train loss 1.9697, val loss 2.0287
step 1600: train loss 1.9627, val loss 2.0477
step 1700: train loss 1.9403, val loss 2.0115
step 1800: train loss 1.9090, val loss 1.9941
step 1900: train loss 1.9092, val loss 1.9858
step 2000: train loss 1.8847, val loss 1.9925
step 2100: train loss 1.