> Toy ChatGPT

Imagine building a tiny *ChatGPT* of your own (small enough to understand in one sitting, but real enough to see the magic of text prediction come alive). In this notebook we'll follow [Andrej Karpathy](https://karpathy.ai/)'s approach: write the code from scratch, keep it simple, and train a little model that can learn to generate text one character at a time. We're not aiming for power or speed; the goal is to demystify how these models actually work under the hood. By the end, you'll have a hands-on sense of how a GPT-like system can be built step by step, and you'll be able to play with it, experiment, and make it your own.

This is based on Karpathy's [ng-video-lecture](https://github.com/karpathy/ng-video-lecture/tree/master) and [minGPT](https://github.com/karpathy/minGPT) repositories, and the [companion video tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY).

# Setup

In principle, you could run the notebook either in *Colab* or locally. Is the notebook running in *Colab*?

In [None]:
try:
    import google.colab
    running_in_colab = True
except ImportError:
    running_in_colab = False

running_in_colab

## GPU

In order to run this notebook in a reasonable time, we will make use of the [Graphics Processing Unit](https://en.wikipedia.org/wiki/Graphics_processing_unit) (GPU) provided by the *Colab* environment. To enable it, in the top right-hand corner of the *Colab* interface, click `Connect`, then `Change runtime type`, select `T4 GPU`, and click `Save`. The `.to(device)` calls scattered throughout the notebook are meant to *move* tensors to the GPU (if available).

If you are not running in *Colab* and several GPUs are available, you might want to choose one of them. Ignore this if you are running in *Colab*.

In [None]:
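if not running_in_colab:
    import os
    # expose only the GPU with this index to CUDA ("0" is the first one);
    # this must be set before CUDA is initialized
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"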

## Python libraries

Required `import`s are centralized here.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

Is there a GPU available?

In [None]:
torch.cuda.is_available()

# Data curation

Let us download some text. The code below downloads a collection of Shakespeare's works, but you can plug in essentially any text you like (a book, a webpage...).

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

The *whole* file is read into memory.

In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print('number of characters read: ', len(text))

Let us take a look at our dataset.

<font color='red'>TO-DO</font>: Show the first 200 characters.

## Vocabulary

As usual, in order to turn text into numbers we need a **vocabulary**: a set of pieces from which the whole dataset can be assembled. For the sake of simplicity, let those pieces be the individual characters that show up in the text.

In [None]:
chars = sorted(set(text))
vocab_size = len(chars)
# outer double quotes so that the inner '' is valid on Python versions before 3.12
print(f"Vocabulary ({vocab_size} elements) is: {''.join(chars)}")

We will associate an index (`i`) with every element (string, `s`) in the vocabulary. Any mapping is fine, so the easiest thing is to associate every character with its position in the sorted `list`.

In [None]:
stoi = { ch:i for i, ch in enumerate(chars) }

(`stoi` as in "string to integer")

You can use it to find out the index of any character you like, e.g.,

<font color='red'>TO-DO</font>: What is the index of character `k`?

We would also like the inverse mapping, i.e., from index to character. Since the indices are just positions in `chars`, the `list` itself does the job.

In [None]:
itos = chars

<font color='red'>TO-DO</font>: What is the character associated with index `12`?

We exploit the above mappings (a `dict` and a `list`) to build functions that operate on whole **sequences**: of characters (for *encoding*)...

In [None]:
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
print(encode("hii there"))

...and numbers (for *decoding*)

In [None]:
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
print(decode(encode("hii there")))
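
One consequence of building the vocabulary from the text itself is that `encode` only works on characters that actually appear in it. The sketch below illustrates this in plain Python on a tiny made-up string. All the `toy_*` names and the sample text are hypothetical and kept separate from the notebook's variables: the round trip recovers the input, while an unseen character raises a `KeyError`.

```python
# Toy text standing in for the dataset (an assumption, not the Shakespeare file).
toy_text = "hello world"

# Same recipe as above: the sorted unique characters form the vocabulary.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = toy_chars  # a list already maps index -> character


def toy_encode(s):
    """Turn a string into a list of vocabulary indices."""
    return [toy_stoi[c] for c in s]


def toy_decode(indices):
    """Turn a list of vocabulary indices back into a string."""
    return "".join(toy_itos[i] for i in indices)


roundtrip = toy_decode(toy_encode("hello"))
print(roundtrip)

# 'z' never appears in toy_text, so it has no index.
try:
    toy_encode("z")
    unknown_failed = False
except KeyError:
    unknown_failed = True
print("unknown character rejected:", unknown_failed)
```

This is worth keeping in mind when choosing a prompt for a trained model: every character in it must belong to the vocabulary.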

Let us `encode` (characters to numbers) the dataset into a *torch* `Tensor`

In [None]:
data = torch.tensor(encode(text), dtype=torch.long)

It has the same number of elements as `text` above, and each element is a 64-bit integer (`torch.int64`).

In [None]:
assert len(text) == len(data)
print(data.dtype)

Let us print the first characters, now represented as numbers

In [None]:
print(data[:200])

## Training/validation split

Data are split into *training* and *validation* sets, so that we have a way of knowing how well the model is generalizing.

In [None]:
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

## Block size

Since we cannot process *all* the data at once (unless you have a very small dataset, which would probably end up producing a crappy model), we need to *chunk* it. Let us consider chunks of size

In [None]:
block_size = 8

[GPT-4](https://en.wikipedia.org/wiki/GPT-4), for instance, has a block size (aka *context length*) of tens of thousands of *tokens*, each of them typically spanning more than one character. Keep in mind, then, that we are looking at a toy example.

Let us look at the first block

In [None]:
train_data[:block_size]

When processing a block, the goal is to predict each character given *all* the previous ones: to predict the 2nd character we only use the 1st one; to predict the 3rd character we use the 1st and 2nd; and so on. In principle, this would mean `block_size - 1` predictions per block. An extra character, the `(block_size+1)`-th, is always included so that we get exactly `block_size` predictions. Hence, the above block yields the following prediction tasks:

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'When input is {context.tolist()}, the target: {target}')

Ultimately, this is a *multiclass* classification problem: each prediction is not just a single label, but a **full probability distribution** over all possible classes. In our case, it reflects how likely each character in the vocabulary is to be the next one.
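
To make "a full probability distribution" concrete, here is a minimal sketch in plain Python (no PyTorch, so the arithmetic is visible). The helper names `softmax` and `cross_entropy`, the example scores, and the vocabulary size of 65 are assumptions chosen for illustration. The model outputs one raw score (*logit*) per class, softmax turns the scores into probabilities, and the loss is the negative log-probability assigned to the correct class.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


# Hypothetical scores for a 4-character vocabulary.
example_logits = [2.0, 1.0, 0.1, -1.0]
example_probs = softmax(example_logits)
print(example_probs, sum(example_probs))

# The loss is small when the target is the most likely class...
print(cross_entropy(example_logits, target=0))
# ...and large when it is an unlikely one.
print(cross_entropy(example_logits, target=3))

# A model that knows nothing spreads probability uniformly; with an
# ASSUMED vocabulary of 65 characters its loss would be ln(65).
uniform_loss = cross_entropy([0.0] * 65, target=0)
print(uniform_loss)
```

The last value is a useful reference point: an untrained model should start with a loss close to the logarithm of the vocabulary size, and training should push it below that.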

## Batching
Since we want to get the most out of GPUs (parallel processing), we will actually pack together and process several blocks at the same time...as many as

In [None]:
batch_size = 4

Let us fix the *seed* of the pseudo-random number generator (PRNG) so that we always get the same results (when requesting "random" numbers below).

In [None]:
torch.manual_seed(1337)

An auxiliary function to assemble a random batch, either from the *training* or from the *validation* set (depending on the value of the `split` parameter).

In [None]:
def get_batch(split: str):
    
    # either the training or validation set
    data = train_data if split == 'train' else val_data

    # a random index (for each block in the batch) that is followed by at least `block_size` characters so that we can extract a full block
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # notice the `stack`ing
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    
    return x, y

Let us request a batch from the *training* set

In [None]:
xb, yb = get_batch('train')

...and take a look at inputs and outputs.

- **Input** and its dimensions (`batch_size`, `block_size`)

In [None]:
print(xb)
print(xb.shape)

- **Target** (output) and its dimensions (`batch_size`, `block_size`)

In [None]:
print(yb)
print(yb.shape)

<font color='red'>TO-DO</font>: show the *text* represented by `yb` (the outputs or targets)

Notice

In [None]:
xb.shape == yb.shape

The above batch poses the following prediction problems:

In [None]:
# for every sequence in the batch...
for b in range(batch_size):

    print(f'{b}-th element in the batch:')
    
    # for every element in the sequence...
    for t in range(block_size):
        
        # every character in the sequence up to and including (hence the `+1`) t
        context = xb[b, :t+1]
        
        # by construction (above), `yb[b,t]` is the target for the sequence up to and including t
        target = yb[b,t]
        
        print(f"When input is {context.tolist()}, the target: {target}")

    print('-'*5)

What must go into the neural network (NN) are actually `tensor`s and **not** `list`s of *variable* size. The input to the NN will be

In [None]:
xb

and the corresponding output (*target*).

In [None]:
yb

That yields `batch_size` $\times$ `block_size` (the dimensions of `xb` and `yb`) *independent* predictions for the model to learn ([Karpathy's explanation](https://youtu.be/kCc8FmEb1nY?t=1281)). They are all processed simultaneously.
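
The count of prediction problems can be checked with a small plain-Python sketch. The `toy_*` names and the stand-in dataset (the integers 0 to 99) are assumptions made for illustration. It mimics the sampling of start offsets, builds inputs and shifted targets, and enumerates one (context, target) pair per row and position.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch repeatable

toy_data = list(range(100))  # stand-in for the encoded dataset
toy_block_size = 8
toy_batch_size = 4

# Random start offsets that leave room for a full block plus one extra element.
starts = [random.randrange(len(toy_data) - toy_block_size) for _ in range(toy_batch_size)]
toy_xb = [toy_data[i:i + toy_block_size] for i in starts]
toy_yb = [toy_data[i + 1:i + toy_block_size + 1] for i in starts]

# One (context, target) pair per row b and position t.
pairs = [
    (toy_xb[b][:t + 1], toy_yb[b][t])
    for b in range(toy_batch_size)
    for t in range(toy_block_size)
]
print(len(pairs))  # toy_batch_size * toy_block_size pairs

# Because toy_data is 0, 1, 2, ..., each target is the last context element plus one.
print(all(target == context[-1] + 1 for context, target in pairs))
```

Using consecutive integers as data makes the shift between inputs and targets easy to verify by eye.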

# Training

## Parameters

Let us set some (hyper)parameters we will actually be using for training the model

- Block and batch sizes (seen above, now with larger values)

In [None]:
block_size = 32
batch_size = 16

- How many (random) batches should the model be trained on?

In [None]:
# max_iters = 5000
max_iters = 500

- Some architecture-specific parameters

In [None]:
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

- In connection with training

In [None]:
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

## Model

This is the definition of the NN (based on *transformers*). **Skip** it for now: in this course we are not yet interested in the implementation details, even though, as you can see, the code is actually not that long. If you do delve into the code (again, not required), keep in mind that it was written with an educational frame of mind, and that it contains some very questionable programming practices.

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# Renamed from https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/bigram.py#L61
class ToyLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # lookup tables mapping each token, and each position in the block, to an embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
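
For readers who do look into `Head.forward`: the `masked_fill` with `-inf` followed by `softmax` is what prevents a position from looking at later characters. The following is a minimal sketch in plain Python. The function `causal_attention_weights` is a hypothetical helper, not part of the notebook, and it leaves out the key/query/value projections, the scaling, and dropout.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][j], with future positions j > t masked out."""
    weights = []
    for t, row in enumerate(scores):
        # Mask: position t may only look at positions 0..t.
        masked = [s if j <= t else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With all-zero scores, every visible position gets the same weight.
T = 4
uniform = causal_attention_weights([[0.0] * T for _ in range(T)])
for row in uniform:
    print([round(w, 3) for w in row])
```

Each row sums to one and has zeros to the right of the diagonal, so the output at position `t` is a weighted average of information from positions `0..t` only.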

All the above mumbo-jumbo is just meant to define a (huge) function, instantiated as

In [None]:
model = ToyLanguageModel().to(device)

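Let us evaluate `model` on the above batch, `xb`. It was built with the earlier, smaller values of `block_size` and `batch_size`, which is still fine because the model accepts any sequence no longer than the current `block_size`.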

In [None]:
yb_est = model(xb.to(device))

<font color='red'>TO-DO</font>: What do you get? Explain the dimensions of any `Tensor`.

## Training loop

A function to estimate the *loss*, a measure of how well we are doing.

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X.to(device), Y.to(device))
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

*Boilerplate* for training the model. Most of it should already be understandable to you. In any case, just focus on knowing "what's going on" at a high level.

In [None]:
model = ToyLanguageModel()

m = model.to(device)

# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb.to(device), yb.to(device))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
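
The parameter count printed at the start of the training cell can be cross-checked by hand. The sketch below adds up the sizes of the layers defined in the "Model" section. The function `count_parameters` is a hypothetical helper, and `vocab_size=65` is an assumption about the downloaded text (use the value printed in the "Vocabulary" section if it differs).

```python
def count_parameters(vocab_size, n_embd, n_head, n_layer, block_size):
    """Count trainable parameters of the architecture, layer by layer."""
    head_size = n_embd // n_head

    # token and position embedding tables
    embeddings = vocab_size * n_embd + block_size * n_embd

    # key, query and value projections (no bias) for every head
    attention = n_head * 3 * n_embd * head_size
    # output projection of the multi-head attention: weights + bias
    projection = n_embd * n_embd + n_embd
    # two linear layers (with bias) of the feed-forward network
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two LayerNorms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * n_embd)
    block = attention + projection + feed_forward + layer_norms

    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size  # weights + bias

    return embeddings + n_layer * block + final_norm + lm_head


# vocab_size=65 is an ASSUMPTION; the other values are the hyperparameters
# set in the "Parameters" section.
total = count_parameters(vocab_size=65, n_embd=64, n_head=4, n_layer=4, block_size=32)
print(total / 1e6, "M parameters")
```

Under these assumptions the total is 209,729 parameters (about 0.21 M). Most of them sit in the feed-forward layers, not in the attention heads.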

<font color='red'>TO-DO</font>: Is it guaranteed that you have trained on the *whole* dataset (i.e., that every character has been used)?

<font color='red'>TO-DO</font>: Take a look at the dimensions of `logits` (from the last iteration in the training loop), and explain them.

In order to exploit the trained model to generate some new text, we need to set up a context (sort of a starting point).

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)

<font color='red'>TO-DO</font>: What is the text associated with the context?

Let us generate some new text (up to 2,000 characters) using the above context

In [None]:
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
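
What `generate` does can be summarized as "predict, sample, append, repeat". The sketch below reproduces that loop in plain Python. The names `sample_next`, `toy_generate` and `toy_logits` are hypothetical, and `toy_logits` is a hand-written stand-in for a trained model (it simply prefers the next integer modulo 5), so only the structure of the loop carries over.

```python
import math
import random


def sample_next(logits, rng):
    """Sample an index from the softmax distribution defined by `logits`."""
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_generate(logits_fn, context, max_new_tokens, block_size, rng):
    """Autoregressive loop: predict, sample, append, repeat."""
    sequence = list(context)
    for _ in range(max_new_tokens):
        window = sequence[-block_size:]  # the model only sees the last block_size items
        logits = logits_fn(window)       # scores for the next item
        sequence.append(sample_next(logits, rng))
    return sequence


def toy_logits(window, vocab_size=5):
    """Stand-in for a trained model: strongly prefers (last + 1) % vocab_size."""
    preferred = (window[-1] + 1) % vocab_size
    return [10.0 if i == preferred else 0.0 for i in range(vocab_size)]


rng = random.Random(0)  # arbitrary seed for repeatability
generated = toy_generate(toy_logits, context=[0], max_new_tokens=12, block_size=4, rng=rng)
print(generated)
```

Because the next item is *sampled* and not simply taken as the most likely one, the output mostly follows the preferred pattern but may occasionally deviate from it. The same mechanism is what makes the text generated by the trained model vary from run to run.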

# Experiments

- <font color='red'>TO-DO</font>: Train the model for a shorter time (i.e., on a smaller number of batches, say 10), and try generating text. What do you observe? How does the `loss` at the end of training compare against that of the first model you trained?

- <font color='red'>TO-DO</font>: Try different block sizes. Can you get better results?

- <font color='red'>TO-DO</font>: Try on a different, smaller, dataset (other than Shakespeare's). You could, for instance, *inject* the lyrics of your favorite song (that would be a *tiny* dataset) in the `text` variable up above. What do you observe? Compare the values of the loss for *training* and *validation*.

# Sample questions

## What does the *batch size* control during training?
- [ ] How many layers the model has  
- [ ] The number of training epochs
- [ ] How many small data chunks are processed in parallel before updating model parameters
- [ ] The maximum number of characters generated

---

## What is the model trying to learn during training?
- [ ] The meaning of words and sentences  
- [ ] The grammatical rules of English  
- [ ] The emotional tone of Shakespeare’s plays
- [ ] The probability of the next character given the previous ones