# Let's build GPT
ChatGPT is a language model developed by OpenAI. It's designed to understand and generate human-like text based on the input it receives. You can use it for various natural language processing tasks, such as answering questions, having conversations, generating text, and more.

GPT stands for "Generative Pre-trained Transformer." It's a type of model that's trained on a large amount of text data to understand and generate human-like text. The "transformer" part refers to the model's architecture, which was introduced in the landmark 2017 paper "Attention Is All You Need".

In this notebook, we will build a small version of the GPT model that powers ChatGPT. For simplicity and speed...
- our model will use character tokens instead of subword tokens
- we will be using the small Tiny Shakespeare dataset instead of (a big chunk of) the entire internet
- our model will have far fewer parameters than the billions in production-scale GPT models

This notebook is based on Andrej Karpathy's YouTube video [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY).

## Data
We start with the Tiny Shakespeare dataset, which is a collection of Shakespeare's works concatenated into a single text file.

In [8]:
# download the file
!wget -q https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [11]:
# read the file in as utf8 
with open('input.txt', 'r', encoding="utf-8") as f:
    text = f.read()

In [12]:
# inspect length of characters
print("length of dataset in characters", len(text))

length of dataset in characters 1115394


In [13]:
# let's look at the data
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



### Vocabulary
Since we are using character tokens, our vocabulary is just the set of distinct characters that appear in our dataset.

Note that in the sorted output below, the first character is a newline and the second is a space, which is why the printed string appears to start with a blank line.

In [32]:
# print the unique characters in the file
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print('total chars:', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
total chars: 65


### Tokenizer
A "tokenizer" is a component used in natural language processing (NLP) to break text down into smaller units called "tokens". These tokens are typically represented as integers instead of raw strings. An "encoder" turns text into a list of integer tokens, and a "decoder" turns a list of integer tokens back into text.

We use a very simple character-level tokenizer. There are many other tokenizers, like Google's [SentencePiece](https://github.com/google/sentencepiece) or OpenAI's [tiktoken](https://github.com/openai/tiktoken). These operate at the subword level, so their vocabulary is much larger (there are many more possible subwords than characters) and a given text encodes to fewer tokens. The general idea remains the same: we are just turning strings into integers and vice versa.

In [15]:
# codebook for characters
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}

# encode and decode functions
encode = lambda x: [stoi[ch] for ch in x]
decode = lambda x: ''.join([itos[i] for i in x])

print(encode("hi there"))
print(decode(encode("hi there")))

[46, 47, 1, 58, 46, 43, 56, 43]
hi there


Since we have a character level encoder, every character will turn into a single integer. If we encode our dataset, we end up with the same number of integers as we had characters in our original text. 

In [16]:
# let's take a look at the encoded data
import torch 
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Train/Val Split
We will split our dataset into train (90%) and validation (10%).

In [17]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

### Context 
Our goal is to generate the next character given some context of previous characters. We use a "block size" (or "context length") of 8, which means our model will consider at most the 8 preceding characters as context.

In [18]:
# aka context length
block_size = 8 

When we sample our dataset, we grab a block of 8 characters of context plus one more character as the target. The target character is the one we learn from during training, try to predict during evaluation, and actually generate during inference.

In [19]:
# sample the first block of text
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

Each block actually contains 8 different examples, one for each prefix that starts at the first character. It is important to show our model examples with fewer than `block_size` characters, so that it learns to generate text from as little as one character of context.

In [20]:
# creating examples out of a block of text
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print("context", context, "--> target", target)

context tensor([18]) --> target tensor(47)
context tensor([18, 47]) --> target tensor(56)
context tensor([18, 47, 56]) --> target tensor(57)
context tensor([18, 47, 56, 57]) --> target tensor(58)
context tensor([18, 47, 56, 57, 58]) --> target tensor(1)
context tensor([18, 47, 56, 57, 58,  1]) --> target tensor(15)
context tensor([18, 47, 56, 57, 58,  1, 15]) --> target tensor(47)
context tensor([18, 47, 56, 57, 58,  1, 15, 47]) --> target tensor(58)


Let's arrange our data into batches of blocks that can be processed in parallel.

Since we have batch size 4 and block size 8, one batch will contain a `4x8` tensor X and a `4x8` tensor Y.
- Each row in the `4x8` tensor X contains 8 different example contexts, one for each possible sequence starting with the first character, for a total of 32 contexts. 
- Each element in the `4x8` tensor Y contains a single target, each corresponding to one of the 32 possible contexts in X.    

In [21]:
torch.manual_seed(1337)
batch_size = 4 # how many sequences to process in parallel
block_size = 8 # how many characters per sequence 

def get_batch(split):
    # generate a small batch of data of inputs x and targets y 
    data = train_data if split == 'train' else val_data
    # randomly sample a bunch of block_size length sequences
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # the sequence
    x = torch.stack([data[i:i+block_size] for i in ix]) 
    # the target (next character)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs')
print(xb.shape)
print(xb)
print('targets')
print(yb.shape)
print(yb)

print('------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print("context", context, "--> target", target)

inputs
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
------
context tensor([24]) --> target tensor(43)
context tensor([24, 43]) --> target tensor(58)
context tensor([24, 43, 58]) --> target tensor(5)
context tensor([24, 43, 58,  5]) --> target tensor(57)
context tensor([24, 43, 58,  5, 57]) --> target tensor(1)
context tensor([24, 43, 58,  5, 57,  1]) --> target tensor(46)
context tensor([24, 43, 58,  5, 57,  1, 46]) --> target tensor(43)
context tensor([24, 43, 58,  5, 57,  1, 46, 43]) --> target tensor(39)
context tensor([44]) --> target tensor(53)
context tensor([44, 53]) --> target tensor(56)
context tensor([44, 53, 56]) --> target tensor(1)
context 

## Model
Let's start with the simplest model possible, which is a bigram language model.

### Forward pass
Below we implement the bigram language model using an embedding table of shape `vocab_size x vocab_size`. Embedding a single integer between `0` and `vocab_size-1` returns a tensor of length `vocab_size`. The embedding acts like a lookup table: the integer is used as a row index, and the corresponding row of length `vocab_size` is returned.

If we pass in a multi-dimensional tensor of integers as input, the embedding returns a tensor with the same leading dimensions, except that each integer is replaced by a vector of length `vocab_size`. For example, if we pass in an input with dimensions `BxT`, then the output will have dimensions `BxTxC`.
- `B` is the "batch" dimension, indicating which sequence in the batch we are in, equal to `batch_size`
- `T` is the "time" dimension, indicating our position in the sequence, equal to `block_size`
- `C` is the "channel" dimension, indicating which neuron we are talking about, equal to `vocab_size`
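To make the lookup-table picture concrete, here is a minimal sketch in plain Python (no PyTorch) with a made-up vocabulary of 3 tokens. The names `table` and `embed` and all the numbers are illustrative assumptions, not part of the notebook. Each integer in a `BxT` input is replaced by its row of the table, producing a `BxTxC` output.

```python
# Toy lookup table standing in for nn.Embedding(vocab_size, vocab_size).
# The values are arbitrary; in the real model they are learned parameters.
vocab_size = 3
table = [
    [0.1, 0.2, 0.7],  # row 0: scores for the token that follows token 0
    [0.5, 0.4, 0.1],  # row 1: scores for the token that follows token 1
    [0.3, 0.3, 0.4],  # row 2: scores for the token that follows token 2
]

def embed(idx):
    """Replace every integer in a (B, T) nested list with its table row."""
    return [[table[token] for token in sequence] for sequence in idx]

idx = [[0, 2], [1, 1]]   # B = 2 sequences, T = 2 positions
logits = embed(idx)      # nested list of shape (B, T, C) with C = vocab_size

B, T, C = len(logits), len(logits[0]), len(logits[0][0])
print(B, T, C)           # 2 2 3
print(logits[0][1])      # the row for token 2: [0.3, 0.3, 0.4]
```

Because each position is looked up independently, the prediction at a position depends only on the single token at that position, which is exactly what makes this a bigram model.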

In [43]:
import torch 
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        # idx and targets are both (B,T) tensor of integers
        # B = batch size, T = block size, C = vocab size
        logits = self.token_embedding_table(idx) # (B, T, C)

        # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
        # PyTorch always expects C as the 2nd dimension
        # so we combine B*T into the 1st dimension for both logits and targets
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))

        return logits, loss

Our loss is terrible, because we haven't yet trained our model. 

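We would expect a loss of `-ln(1/65)`, or about `4.1744`, if the model assigned equal probability to all 65 characters. We actually get a somewhat worse loss of `4.8786` because the embedding weights are initialized randomly, so the initial predictions are not uniform and are sometimes confidently wrong.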

In [44]:

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

torch.Size([4, 8, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
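As a sanity check on the uniform baseline, the following sketch (plain Python, standard library only) computes the cross-entropy of a perfectly uniform prediction over 65 characters. The helper `cross_entropy` is a hypothetical stand-in for what `F.cross_entropy` computes on a single example, not the PyTorch implementation.

```python
import math

vocab_size = 65

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target]

uniform_logits = [0.0] * vocab_size   # equal logits -> uniform probabilities
baseline = cross_entropy(uniform_logits, target=0)
print(round(baseline, 4))             # 4.1744, i.e. -ln(1/65)
```

Any loss above this value means the model is doing worse than guessing uniformly at random, which is what we see before training.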


### Generate 
Let's add the ability to generate characters to our model. 

Generate takes some context and uses it to generate `max_new_tokens` more characters. 

For each new token up to `max_new_tokens`:
- call the forward pass with the given context `idx` (without targets) to get the logits
- "pluck out" the logits for just the last position in dimension T (since our forward pass acts on all `BxT` inputs and returns `BxTxC`)
- apply softmax to the last position (`BxC`) to transform the logits into probabilities
- sample from the probability distribution to generate the next character
- append the generated character to the running context (the bigram model only uses the last character, so the growing context is never cropped here)
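The softmax-then-sample steps in the list above can be sketched in plain Python for a single sequence. This is only an illustration under stated assumptions: `softmax`, `sample_next`, and the 3-token logits are made up, and the logits are held fixed across steps, whereas the real model recomputes them from the latest context each time.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Sample one token index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(1337)                # seeded so the sketch is repeatable
context = [0]                            # start from token 0
last_position_logits = [2.0, 1.0, 0.1]   # made-up logits for a 3-token vocabulary

for _ in range(5):
    next_token = sample_next(last_position_logits, rng)
    context.append(next_token)           # the context grows by one token per step

print(context)
```

Sampling, rather than always taking the most likely token, is what lets the model produce varied text from the same starting context.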

In [50]:
import torch 
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)
        # don't compute loss if targets not given (used for generation)
        if targets is None:
            loss = None 
        else:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context 
        for _ in range(max_new_tokens):
            # get the predictions 
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence 
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

To generate text, we just have to pass in an initial context of a single newline character (pretty reasonable) and then decode the output from integers back into character strings.

Our output is terrible (just like our loss) because we haven't actually trained our model yet.

In [91]:
m = BigramLanguageModel(vocab_size)

# initial context is just a 0 (newline character) with shape 1x1 (1 batch, 1 character)
idx = torch.zeros((1, 1), dtype=torch.long)

# generate 100 new tokens
res = m.generate(idx, max_new_tokens=100)

# since generate returns a batch of sequences, we just take the first one
res0 = res[0]

# decode the sequence of indices into characters
print(decode(res0.tolist()))


yXkkf&K?-?wa&TeHDI.m3k-SyoEnwDnN..JMckUw LMT3tqIgQrPt$JUtenuuUEwRuKtF3s''!rSN.K&AjG3soHYcCbSfk-.NljT


## Optimization
Now let's set up our optimization routine. We will use the AdamW optimizer.

- SGD (Stochastic Gradient Descent): A fundamental optimization algorithm used in machine learning and deep learning. It updates model parameters by computing gradients using randomly selected small batches of data, making it "stochastic." It's widely used for training neural networks and other machine learning models.

- Adam (Adaptive Moment Estimation): A popular optimization algorithm that improves convergence and training speed compared to traditional SGD. It maintains moving averages of gradients and adapts learning rates for each parameter. It's known for its efficiency in practice.

- AdamW: A modification of Adam that decouples weight decay from the gradient-based update. In plain Adam, weight decay implemented as an L2 penalty is folded into the gradient and then rescaled by the adaptive learning rates; AdamW instead shrinks the weights directly at each step. This tends to make regularization behave more predictably, and it is a common default for training transformers.
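To show what "decoupled weight decay" means in practice, here is a minimal sketch of one AdamW update for a single scalar parameter, written in plain Python. The function `adamw_step` is hypothetical, and the default values are assumed to match the defaults of `torch.optim.AdamW`; the notebook itself just calls the PyTorch optimizer.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter (t is the 1-based step count)."""
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding weight_decay * theta to the gradient.
    theta = theta * (1 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step: the gradient is rescaled by its recent magnitude.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
print(theta)  # slightly below 1.0: about one learning-rate-sized step plus a small decay
```

Note that on the first step the update size is close to `lr` regardless of the gradient's magnitude, which is the per-parameter adaptivity that distinguishes Adam-style optimizers from plain SGD.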

We set the learning rate to 1e-3 which is a decent setting for small networks. 


In [92]:
# create a pytorch optimizer 
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

Below is a typical training loop. After 10,000 steps the loss is about `2.41`, well below the uniform baseline of about `4.17`.

In [93]:
batch_size = 32 
for steps in range(10000):
    # sample a batch of data 
    xb, yb = get_batch('train')
    # forward pass
    logits, loss = m(xb, yb)
    # backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())

2.414367914199829


Generated text also looks much better.

In [94]:
idx = torch.zeros((1, 1), dtype=torch.long)
res = m.generate(idx, max_new_tokens=300)
print(decode(res[0].tolist()))


E:

OLume hety u?'s R:
Fr y
f t Burgowe mak RAnofre l.
LABeduthoupr co am, prrin ay,
TI inghenatofofrsoreth!-cawhe fon le;
Y;
IVe atlQNotorit.
DO:
Li'dste fobe menore isous
pee.

ckl ld mayavee, or,
ADWemerds wed t be nds st to wod we&?riste God wisteiVI I y inet l tis MEYcokl:

KIn, kealow inthanom


In [38]:
# consider the following toy example
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels 
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [40]:
# we want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0)

In [42]:
# version 2 
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C) --> (B, T, C), wei is broadcast across the batch
torch.allclose(xbow, xbow2)

True

In [43]:
# version 3 
tril = torch.tril(torch.ones(T, T)) 
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
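The version 3 matrix above comes out as a table of running averages for a simple reason: after masking, each row holds zeros at the allowed positions and `-inf` at the future positions, and since `exp(0) = 1` and `exp(-inf) = 0`, softmax spreads the weight evenly over the allowed positions. A minimal plain-Python sketch of that arithmetic follows; the helper `masked_softmax_row` is hypothetical, and `T = 4` is used instead of 8 only to keep the output short.

```python
import math

T = 4  # a shorter sequence than the notebook's T = 8, to keep the output small

def masked_softmax_row(t, T):
    """Softmax over a row of zeros where positions after t are set to -inf."""
    scores = [0.0 if i <= t else float("-inf") for i in range(T)]
    exps = [math.exp(s) for s in scores]   # exp(0) = 1, exp(-inf) = 0
    total = sum(exps)
    return [e / total for e in exps]

wei = [masked_softmax_row(t, T) for t in range(T)]
for row in wei:
    print([round(w, 4) for w in row])
```

Row `t` puts weight `1/(t+1)` on each of positions `0..t` and zero on later positions. Starting the scores at zero gives a plain average; letting them depend on the data is what turns this into the self-attention weights in version 4.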

In [46]:
torch.backends.mps.is_available()

True

In [36]:
from torch.nn import functional as F

# version 4: self-attention!
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels 
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention 
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-1, -2) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T)) 
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v 
out.shape

torch.Size([4, 8, 16])

In [49]:
wei 

tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
         [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
         [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
         [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1687, 0.8313, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2477, 0.0514, 0.7008, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4410, 0.0957, 0.3747, 0.0887, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0069, 0.0456, 0.0300, 0.7748, 0.1427, 0.0000, 0.0000, 0.0000],
         [0.0660, 0.089

In [16]:
import torch
import torch.nn as nn
l1 = nn.Linear(1, 1)
input = torch.randn(1, 1)
output = l1(input)
print(input)
print(output)

tensor([[-1.9445]])
tensor([[-0.1746]], grad_fn=<AddmmBackward0>)


In [33]:
l1 = nn.Linear(1, 1)
l1.weight

Parameter containing:
tensor([[-0.8377]], requires_grad=True)