# GPT From Scratch

**References**
- *Let's build GPT: from scratch, in code, spelled out: [Youtube Video](https://youtu.be/kCc8FmEb1nY?si=gFwgcL_pakPYfUtA), [nanoGPT Code](https://github.com/karpathy/nanoGPT), [Code from Video](https://github.com/karpathy/ng-video-lecture)*
- *Intro to Large Language Models: [Youtube Video](https://youtu.be/zjkBMFhNj_g?si=qKW-b5B0aHVpSLtY)*



## Imports

In [1]:
from typing import List, Tuple

import torch
import torch.nn as nn
from torch.nn import functional as F


## Data Exploration

In [2]:
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

with open ("data/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"length of the dataset: {len(text)}")

length of the dataset: 1115394


In [3]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# to get the unique characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"characters: {''.join(chars)}\nvocab size: {vocab_size}")

characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65


### Tokenizer

In [5]:
# since we are going to build a character level model, we create a map from chars to ints

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> List:
    """Take a string and output a list of integers"""
    return [stoi[c] for c in s]

def decode(list_: List) -> str:
    """Take a list of integers and output a string"""
    return "".join([itos[i] for i in list_])

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [6]:
# tokenize the entire dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Data Split and Batching

In [7]:
# splitting dataset
n = int(0.9 * len(data))

train_data = data[:n]
val_data = data[n:]

In [8]:
# block size to send into the transformer
block_size = 8
train_data[:block_size]

tensor([18, 47, 56, 57, 58,  1, 15, 47])

In [9]:
# showing what we consider as next token prediction.
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[: t+1]
    target = y[t]
    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [10]:
# code to generate batches
torch.manual_seed(1337)

batch_size = 4  # how many independent sequences will we process in parallel?
block_size = 8  # what is the maximum context length for predictions

def get_batch(split: str) -> Tuple:
    """Generate a small batch of data of inputs x and targets y."""
    data = train_data if split == 'train' else val_data
    # ix randomly returns four indexes from the dataset that is going to be in a batch.
    ix = torch.randint(len(data) - block_size, (batch_size,))  # ([ 76049, 234249, 934904, 560986])
    x = torch.stack([data[i: i+block_size] for i in ix])
    y = torch.stack([data[i+1: i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')

# ? Looking at this I wonder if we are batching randomly, how will it learn context
# ? from different paragraphs. Say you have multiple documents, create into a single doc
# ? And then batching it will pull things from different sources

print(f'inputs: {xb.shape}\n{xb}\n')
print("Decoded Inputs")
for x in xb:
    print(decode(x.tolist()))

print(f'\ntargets: {yb.shape}\n{yb}\n')
print("Decoded Outputs")
for y in yb:
    print(decode(y.tolist()))

print('\n----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs: torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

Decoded Inputs
Let's he
for that
nt that 
MEO:
I p

targets: torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

Decoded Outputs
et's hea
or that 
t that h
EO:
I pa

----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when inp

## Bigram Language Model

In [11]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    """All I am doing here is training an Embedding Layer on my vocab.
    A Bigram Language Model is a type of statistical language model that predicts 
    the probability of the next word in a sequence based solely on the previous word. 
    It focuses on pairs of consecutive words, also called bigrams.

    Training: The model is trained on a large text corpus. 
    It analyzes the frequency of word pairs appearing together.
    
    Prediction: Given a word, the model searches its database of bigrams and identifies 
    the most likely word to follow it based on the observed frequencies during training.
    """
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        # initially the weights of embedding is random
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)  # reshape this to satisfy cross entropy shape requirements.
            targets = targets.view(B*T)  # reshape this to satisfy cross entropy shape requirements.
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

print(f"vocab_size: {vocab_size}")
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(f"logits: {logits.shape}, loss: {loss}\n")

sample_generate = decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist())
print(f"sample 500 token generate:\n{sample_generate}")


vocab_size: 65
logits: torch.Size([32, 65]), loss: 4.878634929656982

sample 500 token generate:

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


In [19]:
# Looking to understand the embedding little better
# token_embedding_table = nn.Embedding(2, 2)  # randomly initialized
# print(f"token_embedding_table: {token_embedding_table}")
# idx = torch.zeros((1, 1), dtype=torch.long)
# print(f"idx: {idx}")
# logits = token_embedding_table(idx)
# print(f"logits: {logits}")

token_embedding_table: Embedding(2, 2)
idx: tensor([[0]])
logits: tensor([[[-0.7637, -0.4608]]], grad_fn=<EmbeddingBackward0>)


In [32]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

batch_size = 32
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"loss: {loss.item()}\n")
sample_generate = decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist())
print(f"sample 500 token generate:\n{sample_generate}")


loss: 2.3765861988067627

sample 500 token generate:

ORDUCKIVO:CI t
WI'd e d whe, thofre nesBYOFit t lloounroithe CELYond d WAnowe OMalot.
Fobl on?
TCar hed pl ssto huntl d uthavindad as chathe hemnde,
Clor
HEGoun.
Theredill t inthiteme o-t t Cound futscolg y nlatayon ofoamakerth cl d bell the gs f I'sloow t th bathtoel tht s rowe y itor
Fal e pst mete,

BYowillier wisallly t e,

DUS: tasissaresuthenis beg, r fise setong-witele on arothalau wn:
The, ay k t hofr tiWhyst o suilous 'de LLErerre tssu PAn my tharr gs mooua pedealabes;
Thew m.
Whifoff a


## Self Attention

### The mathematical trick in self-attention

In [12]:
# consider the following toy example
# version 1: brute force approach
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)
print(f"x: {x.shape}")

# We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]  # (t, C)
        xbow[b, t] = torch.mean(xprev, 0)

# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x  # (B, T, T) @ (B, T, C) ---> (B, T, C)
print(f"xbow=xbow2: {torch.allclose(xbow, xbow2)}")

x: torch.Size([4, 8, 2])
xbow=xbow2: True


In [13]:
print("the following illustrates version 2")
# the idea is to create something that summarizes myself in the context of my history - I am a word.
a = torch.tril(torch.ones(3, 3))  # extract the lower triangular part of 1x1 square matrix
a = a / torch.sum(a, 1, keepdim=True)  # average of a row (1D)
print(f'a=\n{a}')  # this shows how certain 'words' are just an avg of what came before
b = torch.randint(0, 10, (3, 2)).float()
print(f'b=\n{b}')
c = a @ b
print(f'c=\n{c}')

the following illustrates version 2
a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b=
tensor([[2., 1.],
        [9., 7.],
        [6., 8.]])
c=
tensor([[2.0000, 1.0000],
        [5.5000, 4.0000],
        [5.6667, 5.3333]])


In [14]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
print(f"wei original:\n{wei}\n")
wei = wei.masked_fill(tril == 0, float('-inf'))  # masked_fil picks all the elements where tril ==0
print(f"wei after masked_fill using trill:\n{wei}\n")
wei = F.softmax(wei, dim=-1)  # softmax across row acts as a normalization
print(f"wei after softmax:\n{wei}\n")
xbow3 = wei @ x
print(f"xbow=xbow3: {torch.allclose(xbow, xbow3)}")

wei original:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

wei after masked_fill using trill:
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

wei after softmax:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0

### self-attention

In [21]:
# version 4: self-attention
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
x = torch.randn(B, T, C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)  # (B, T, 16)
q = query(x)  # (B, T, 16)
wei = q @ k.transpose(-2, -1)  # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
print(f"raw wei between nodes:\n{wei[0]}")

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
# out = wei @ x

print(f"wei:\n{wei[0]}")
print(f"out: {out.shape}")

raw wei between nodes:
tensor([[-1.7629, -1.3011,  0.5652,  2.1616, -1.0674,  1.9632,  1.0765, -0.4530],
        [-3.3334, -1.6556,  0.1040,  3.3782, -2.1825,  1.0415, -0.0557,  0.2927],
        [-1.0226, -1.2606,  0.0762, -0.3813, -0.9843, -1.4303,  0.0749, -0.9547],
        [ 0.7836, -0.8014, -0.3368, -0.8496, -0.5602, -1.1701, -1.2927, -1.0260],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,  0.8638,  0.3719,  0.9258],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,  1.4187,  1.2196],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,  0.8048],
        [-1.8044, -0.4126, -0.8306,  0.5898, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)
wei:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides `wei` by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

In [24]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei_no_scaled = q @ k.transpose(-2, -1)
wei_scaled = q @ k.transpose(-2, -1) * head_size**-0.5

print(f"k: {k.var()}")
print(f"q: {q.var()}")
print(f"wei_no_scaled: {wei_no_scaled.var()}")
print(f"wei_scaled: {wei_scaled.var()}")

print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1))
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1))  # gets too peaky, converges to one-hot

k: 1.0966005325317383
q: 0.9415779113769531
wei_no_scaled: 16.10364532470703
wei_scaled: 1.0064778327941895
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


### LayerNorm

In [28]:
class LayerNorm1d:  # (reduced from BatchNorm1d)
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
    
    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True)  # batch mean
        xvar = x.var(1, keepdim=True)  # batch variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]

module = LayerNorm1d(100)
x = torch.randn(32, 100)  # batch size 32 of 100-dimensional vectors
x = module(x)
print(f"x: {x.shape}")
print(f"mean, std of one feature across all batch inputs: {x[:, 0].mean()}, {x[:, 0].std()}")
print(f"mean, std of a single input from the batch, of its features: {x[0, :].mean()}, {x[0, :].std()}")

x: torch.Size([32, 100])
mean, std of one feature across all batch inputs: -0.22315755486488342, 0.8365175127983093
mean, std of a single input from the batch, of its features: -1.4305114426349519e-08, 0.9999951720237732


## Full Code

In [58]:
# Bigram Model
batch_size = 16  # 64  # how many independent sequences will we process in parallel?
block_size = 32  # 256  # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100  # 500
learning_rate = 1e-3  # 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64  # 384
n_head = 4  # 6
n_layer = 4  # 6
dropout = 0.0  # 0.2

### Data Loader

In [None]:
torch.manual_seed(1337)

# read text file
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> List:
    """Take a string and output a list of integers"""
    return [stoi[c] for c in s]

def decode(list_: List) -> str:
    """Take a list of integers and output a string"""
    return "".join([itos[i] for i in list_])

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


### BigramLanguageModel

In [54]:
class BigramLanguageModel(nn.Module):
    """Bigram Model."""
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B, T, C)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


In [55]:
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTOrch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.7384, val loss 4.7349
step 500: train loss 4.5202, val loss 4.5179
step 1000: train loss 4.3157, val loss 4.3143
step 1500: train loss 4.1244, val loss 4.1249
step 2000: train loss 3.9484, val loss 3.9496
step 2500: train loss 3.7846, val loss 3.7893
step 3000: train loss 3.6349, val loss 3.6398
step 3500: train loss 3.4987, val loss 3.5031
step 4000: train loss 3.3745, val loss 3.3813
step 4500: train loss 3.2634, val loss 3.2718

B;?eWlzabuNl3 r Jkh3q?Ax-lRSPBOFLAysHinllgrjjShGramoMEHIKi fasshod ade,
AM: KwhteFFSVdE'Gr cUR:&NMA.
CisPjookpaverI't O-olXSeILZoOQS&TTuJKedbeXanuBe
SVredoMoGSS'swiois&XfaLycre
zSReGo LeCTIfo ymUo&Tway hiyocowLore $jhOGhPlxthpUP$C:U3 aver llndo3ucy



PEdsbenceadadMRng lZfFI.uUt w s y areaGs?-!
Ejcq?drRVIhknVShX-airon: apHUEu omUEhA.
PE&n.Uv.lTIsRzLevekn :$shRUa mNUp,
DUtA:'se. hetvDDK'beDrRCN p$&.hoO$fy,
Wera-woVUnofIxBmolqAcig ht -Yv'TwQVVoWIitiny,,y:&Jrjhpr dknsl
TofUfzo
;expustPjqfu cePleePJ


### Transformer Model Helper

In [None]:
class Head(nn.Module):
    """One Head of Self-Attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, C)
        q = self.query(x)  # (B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5  # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, C)
        out = wei @ v  # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out


class MultiHeadAttention(nn.Module):
    """Multiple Heads of Self-Attention in Parallel."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


class FeedForward(nn.Module):
    """A simple Linear Layer followed by a non-linearity."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
            nn.Dropout(dropout)
        )
    
    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: communication followed by computation."""
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


### GPTLanguageModel

In [56]:
class GPTLanguageModel(nn.Module):
    """GPT Model.
    This is a decoder only transformer and doesn't have the cross attention part.
    The reason we don't have encoder is because we are just generating text.
    The original attention paper has encoder and decoder because the paper is a 
    machine translation paper. 
    For example to translate something from french to english:
    
    <--------- ENCODE ------------------><--------------- DECODE ----------------->
    les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>
    """
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # encodes each word in the vocabulary
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # captures the position of each word in the sequence
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # (B, T, C) - Passes the combined embedding through a sequence of Block modules.
        x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


In [59]:
model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if (iter % eval_interval == 0) or (iter == max_iters - 1):
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.209729 M parameters
step 0: train loss 4.1731, val loss 4.1736
step 100: train loss 2.6061, val loss 2.6136
step 200: train loss 2.4711, val loss 2.4689
step 300: train loss 2.4015, val loss 2.4063
step 400: train loss 2.3552, val loss 2.3574
step 500: train loss 2.3224, val loss 2.3317
step 600: train loss 2.2640, val loss 2.2625
step 700: train loss 2.2086, val loss 2.2156
step 800: train loss 2.1640, val loss 2.1882
step 900: train loss 2.1440, val loss 2.1792
step 1000: train loss 2.0918, val loss 2.1322
step 1100: train loss 2.0515, val loss 2.1047
step 1200: train loss 2.0169, val loss 2.0700
step 1300: train loss 2.0089, val loss 2.0573
step 1400: train loss 1.9713, val loss 2.0360
step 1500: train loss 1.9565, val loss 2.0187
step 1600: train loss 1.9274, val loss 2.0163
step 1700: train loss 1.9125, val loss 2.0203
step 1800: train loss 1.9018, val loss 1.9868
step 1900: train loss 1.8875, val loss 1.9873
step 2000: train loss 1.8646, val loss 1.9785
step 2100: train loss 1.