# NanoGPT Notebook

Based on Andrej Karpathy's NanoGPT, with elements from "Attention Is All You Need"

References:



*   [Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy)
*   Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).





Run the cell below to upload a training dataset. I used the Tiny Shakespeare [corpus](https://github.com/intchrkl/nanogpt/blob/main/input.txt).

In [17]:
from google.colab import files
uploaded = files.upload()

Saving input.txt to input (1).txt


Run the cell below to load the model and helpers.

In [22]:
import torch
import torch.nn as nn
import math
from torch.nn import functional as F

def get_model_and_optimizer(config, vocab_size):
    class Head(nn.Module):
        # A self-attention head

        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(config['n_embd'], head_size, bias=False)
            self.query = nn.Linear(config['n_embd'], head_size, bias=False)
            self.value = nn.Linear(config['n_embd'], head_size, bias=False)
            self.register_buffer('tril', torch.tril(torch.ones(config['block_size'], config['block_size'])))
            self.dropout = nn.Dropout(config['dropout'])

        def forward(self, x):
            B, T, C = x.shape
            k = self.key(x) # (B, T, C)
            q = self.query(x) # (B, T, C)

            # From Attention Is All You Need paper:
            wgt = q @ k.transpose(-2, -1) / math.sqrt(C) # QK^T/sqrt(d_k)
            wgt = wgt.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
            wgt = F.softmax(wgt, dim=-1)
            wgt = self.dropout(wgt)

            v = self.value(x)
            out = wgt @ v
            return out

    class MultiHeadAttention(nn.Module):
        # A collection of self-attention heads in parallel

        def __init__(self, num_heads, head_size):
            super().__init__()
            # Multiple heads in parallel
            heads_list = [Head(head_size) for _ in range(num_heads)]
            self.heads = nn.ModuleList(heads_list)
            self.proj = nn.Linear(num_heads * head_size, config['n_embd'])
            self.dropout = nn.Dropout(config['dropout'])


        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            out = self.dropout(self.proj(out))
            return out

    class FeedForward(nn.Module):

        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(config['dropout'])
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):

        def __init__(self, n_embd, n_heads):
            super().__init__()
            head_size = n_embd // n_heads
            self.self_attn = MultiHeadAttention(n_heads, head_size)
            self.feed_fwd = FeedForward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.self_attn(self.ln1(x))
            x = x + self.feed_fwd(self.ln2(x))
            return x

    class GPT(nn.Module):

        def __init__(self):
            super().__init__()
            self.token_embedding_table = nn.Embedding(vocab_size, config['n_embd'])
            self.position_embedding_table = nn.Embedding(config['block_size'], config['n_embd'])
            self.blocks = nn.Sequential(*[Block(config['n_embd'], config['n_head']) for _ in range(config['n_layer'])])
            self.ln_final = nn.LayerNorm(config['n_embd'])
            self.lm_head = nn.Linear(config['n_embd'], vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape

            tok_embd = self.token_embedding_table(idx)
            pos_embd = self.position_embedding_table(torch.arange(T, device=config['device']))

            x = tok_embd + pos_embd
            x = self.blocks(x)
            x = self.ln_final(x)
            logits = self.lm_head(x)

            if targets is None:
                loss = None
            else:
                B, T, C = logits.shape
                logits = logits.view(B * T, C)
                targets = targets.view(B * T)
                loss = F.cross_entropy(logits, targets)

            return logits, loss

        def generate(self, idx, max_new_tokens):
            for _ in range(max_new_tokens):
                idx_and_context = idx[:, -config['block_size']:]
                logits, _ = self.forward(idx_and_context)
                logits = logits[:, -1, :]
                probs = F.softmax(logits, dim=-1)
                idx_next = torch.multinomial(probs, num_samples=1)
                idx = torch.cat((idx, idx_next), dim=1)
            return idx

    model = GPT().to(config['device'])
    optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
    return model, optimizer

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Simple encoding for each letter (token)
char_to_int = { c:i for i, c in enumerate(chars)}
int_to_char = { i:c for i, c in enumerate(chars)}

# Encode a string S
def encode(S):
    return [char_to_int[c] for c in S]

# Decode a list of integers L
def decode(L):
    return ''.join([int_to_char[i] for i in L])

# Encoding the entire corpus and storing it in a Tensor
data = torch.tensor(encode(text), dtype=torch.long)

# Partition data into train and validation sets
percent_train = 0.9 # portion of data to be used as training set, rest is val
train_data = data[:int(0.9 * len(data))]
val_data = data[int(0.9 * len(data)):]

# Samples a mini-batch of sequence from either the training or validation set
def get_batch(split, config):
    data = train_data if split == 'train' else val_data
    start_idxs = torch.randint(0, len(data) - config['block_size'], (config['batch_size'],))
    input_tensors = torch.stack([data[i:i + config['block_size']] for i in start_idxs])
    target_tensors = torch.stack([data[i + 1:i + config['block_size'] + 1] for i in start_idxs])
    return input_tensors.to(config['device']), target_tensors.to(config['device'])

@torch.no_grad()
def estimate_loss(model, config):
    model.eval()
    losses = {'train': 0, 'val': 0}
    for split in ['train', 'val']:
        loss_total = 0
        for _ in range(config['eval_iters']):
            xb, yb = get_batch(split, config)
            _, loss = model(xb, yb)
            loss_total += loss.item()
        losses[split] = loss_total / config['eval_iters']
    model.train()
    return losses

def train(model, optimizer, config):
  for step in range(config['max_iters']):
    if step % config['eval_interval'] == 0:
        losses = estimate_loss(model, config)
        print(f"Step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    xb, yb = get_batch('train', config)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

  print("Training completed.")

def parse_output(generated):
  return decode(generated[0].tolist())

Run the cell below to train using the hyperparameters specified in `config`

In [23]:
config = {
    'batch_size' : 16, # no. of independent sequences to process in parallel
    'block_size' : 32, # max no. of tokens to use for context in predictions
    'max_iters' : 5000,
    'eval_iters' : 200,
    'eval_interval' : 100,
    'learning_rate' : 1e-3,
    'n_embd' : 64,
    'n_head' : 4,
    'n_layer' : 4,
    'dropout' : 0.0,
    'device' : 'cuda' if torch.cuda.is_available() else 'cpu',
    'seed': 18916
}

torch.manual_seed(config['seed'])
model, optimizer = get_model_and_optimizer(config, vocab_size)
train(model, optimizer, config)

Step 0: train loss 4.3885, val loss 4.3838
Step 100: train loss 2.6421, val loss 2.6538
Step 200: train loss 2.5216, val loss 2.5563
Step 300: train loss 2.4412, val loss 2.4612
Step 400: train loss 2.3608, val loss 2.3685
Step 500: train loss 2.3232, val loss 2.3375
Step 600: train loss 2.2529, val loss 2.2728
Step 700: train loss 2.2138, val loss 2.2312
Step 800: train loss 2.1830, val loss 2.2324
Step 900: train loss 2.1269, val loss 2.1610
Step 1000: train loss 2.1065, val loss 2.1292
Step 1100: train loss 2.0682, val loss 2.1196
Step 1200: train loss 2.0474, val loss 2.0977
Step 1300: train loss 2.0248, val loss 2.0710
Step 1400: train loss 1.9929, val loss 2.0548
Step 1500: train loss 1.9700, val loss 2.0378
Step 1600: train loss 1.9457, val loss 2.0304
Step 1700: train loss 1.9334, val loss 2.0108
Step 1800: train loss 1.9188, val loss 2.0038
Step 1900: train loss 1.9048, val loss 1.9874
Step 2000: train loss 1.8856, val loss 1.9782
Step 2100: train loss 1.8704, val loss 1.9792


The model converges quite well, but we can try to tune the hyperparameters even further.

View the output \(first 2000 tokens\) generated by the model.


In [24]:
context = torch.zeros((1, 1), dtype=torch.long, device=config['device'])
generated = model.generate(context, max_new_tokens=2000)
print(parse_output(generated))


Is now to is we his brother to to mady ques heaven?

That have him mader:
Were the dold, Looth, be it!
Made, here on yours, as Stear him some that Dead,
Befitedil now to that is fortusing dreath,
To her honour thoight
By for of branished minen master.

FRIAR MARDINTA:
Not is servines at I was be ven light the grace,
Thate to burn the madaber to mind.
Yeas, Farewell hence befores, of my wound kind
Now, my love lies our forrow heavens.

StOLANNES:
What,--'is well
Stens your on will bidt Bariemy swaget,
Mareth, being fark i counteny taken,
Of with daless think his promieve any and that
yet abours I aw KI's now will is assencharvence,
These to been affer's teart,'d weepon mised;
Before obgatingn Dry, balieve aliancisent the maides,
here on to there-seath affurnc'd I colforce.

CAMILLO:
Here safe it.

StREN MERVOLI:
Mastish, yet, beshand, kings, all,
This abouther evill: they sispretaking from worshiold, brog dy ttio miseece we love:
I do sland to caught bloteres, Spored,
To me atort man k

Increasing the `block_size` from `32` to `256` to give the model more context when predicting tokens.

In [25]:
config = {
    'batch_size' : 16, # no. of independent sequences to process in parallel
    'block_size' : 256, # max no. of tokens to use for context in predictions
    'max_iters' : 5000,
    'eval_iters' : 200,
    'eval_interval' : 100,
    'learning_rate' : 1e-3,
    'n_embd' : 64,
    'n_head' : 4,
    'n_layer' : 4,
    'dropout' : 0.0,
    'device' : 'cuda' if torch.cuda.is_available() else 'cpu',
    'seed': 18916
}

torch.manual_seed(config['seed'])
model, optimizer = get_model_and_optimizer(config, vocab_size)
train(model, optimizer, config)

Step 0: train loss 4.3071, val loss 4.3002
Step 100: train loss 2.6442, val loss 2.6537
Step 200: train loss 2.5335, val loss 2.5395
Step 300: train loss 2.4934, val loss 2.5009
Step 400: train loss 2.4637, val loss 2.4804
Step 500: train loss 2.4314, val loss 2.4455
Step 600: train loss 2.3847, val loss 2.3958
Step 700: train loss 2.3186, val loss 2.3406
Step 800: train loss 2.2490, val loss 2.2689
Step 900: train loss 2.2053, val loss 2.2406
Step 1000: train loss 2.1573, val loss 2.1956
Step 1100: train loss 2.1159, val loss 2.1609
Step 1200: train loss 2.0722, val loss 2.1300
Step 1300: train loss 2.0301, val loss 2.0987
Step 1400: train loss 1.9958, val loss 2.0764
Step 1500: train loss 1.9672, val loss 2.0494
Step 1600: train loss 1.9395, val loss 2.0274
Step 1700: train loss 1.9042, val loss 2.0042
Step 1800: train loss 1.8754, val loss 1.9772
Step 1900: train loss 1.8524, val loss 1.9644
Step 2000: train loss 1.8238, val loss 1.9437
Step 2100: train loss 1.8044, val loss 1.9249


This slightly improves the loss, and generates slightly more interpretable output.

In [26]:
context = torch.zeros((1, 1), dtype=torch.long, device=config['device'])
generated = model.generate(context, max_new_tokens=2000)
print(parse_output(generated))


I enemy, yields his brothed to to medy?

BUSTOHAS:
Yet are heriks, I perpuate see: That entime,
Even, sir, what I last I speak Stand him boy?

CLARDIO:

Butife?

Nurse:
Hence is four sinuce heir, how woundels, thought
You by natural the best noble seen.

LADWONTES:
Go convilasting, my thy was be venol not the is

LADY ANer to we the amagaret to my lords any,
And sweetn From thyself you wound the bick.

CAPULEY:
Yor a your of her.

QUEEN MARGARET:
Gal, I'll you my my faice on will bitt Barke
Trews buy angerad her, behold, count
Where she barws thou guest up white bhose,
Hath arreted in their moune with state with is
With againster, forsorce at his armounted that her from
The for of do graciour repest in their once
Marggele, see, here of their that other,
How 'my father, nor minds.

Provost:
Confersst it to imans; no is disciton.

Servixd OF AUMP Spritizer:
I she is give of is retacher from worship,
Strudg dyst accuse her we angal her offf
Both caught all saw if horse,
To me hourt can q

Increasing the model size: `n_embd` from `64` to `128`, `n_layer` from `4` to `8`. The model takes much longer to train (~10 min).

In [27]:
config = {
    'batch_size' : 16, # no. of independent sequences to process in parallel
    'block_size' : 256, # max no. of tokens to use for context in predictions
    'max_iters' : 5000,
    'eval_iters' : 200,
    'eval_interval' : 100,
    'learning_rate' : 1e-3,
    'n_embd' : 128,
    'n_head' : 4,
    'n_layer' : 8,
    'dropout' : 0.0,
    'device' : 'cuda' if torch.cuda.is_available() else 'cpu',
    'seed': 18916
}

torch.manual_seed(config['seed'])
model, optimizer = get_model_and_optimizer(config, vocab_size)
train(model, optimizer, config)

Step 0: train loss 4.3153, val loss 4.3099
Step 100: train loss 2.5195, val loss 2.5239
Step 200: train loss 2.4634, val loss 2.4792
Step 300: train loss 2.4036, val loss 2.4231
Step 400: train loss 2.2503, val loss 2.2721
Step 500: train loss 2.1366, val loss 2.1719
Step 600: train loss 2.0100, val loss 2.0822
Step 700: train loss 1.9273, val loss 2.0185
Step 800: train loss 1.8418, val loss 1.9619
Step 900: train loss 1.7773, val loss 1.9186
Step 1000: train loss 1.7268, val loss 1.8688
Step 1100: train loss 1.6749, val loss 1.8314
Step 1200: train loss 1.6374, val loss 1.8008
Step 1300: train loss 1.6070, val loss 1.7752
Step 1400: train loss 1.5767, val loss 1.7648
Step 1500: train loss 1.5544, val loss 1.7426
Step 1600: train loss 1.5224, val loss 1.7275
Step 1700: train loss 1.5113, val loss 1.6978
Step 1800: train loss 1.4859, val loss 1.6838
Step 1900: train loss 1.4737, val loss 1.6817
Step 2000: train loss 1.4561, val loss 1.6634
Step 2100: train loss 1.4484, val loss 1.6480


Loss is much lower but the model takes much longer to generate a response. The response also better resembles Shakespeare. However, we may be overfitting to this training set.

In [28]:
context = torch.zeros((1, 1), dtype=torch.long, device=config['device'])
generated = model.generate(context, max_new_tokens=2000)
print(parse_output(generated))



Gentle-main; which brow'd! I will tought, the foul friar; and
his mane-puase semblance, he must all of it.

GLOUCESTER:
You say this you mistre, I have foried
itself to the good suits out recity.

KOME EORD OVERD:

NORTHUMBERLAND:
I will plen my issue.

LUCENTIO:
Thou art last you may slaves be venominy.

Musician:
You entertainted amazened my mind.

LUCIO:
No, sir, nor before you humband?

DUCHESS OF YORK:
You shall for you must show't, and we call,--
Boy! good my lord. How farest that it!

DUKE OF YORK:
Ay, in yea, wi come?

DUCHESS OF YORK:


KING RICHARD II:
This that I know you welll againe with state.

KING RICHARD II:
I' the war too.

KING EDWARD IV:
'Twas sure is be? who dost about on man?

BUCKINGHAM:
I are groan, my mirth grave, all my groan,
Harry, I comfort, and myself in have month.

ROMIO:
Tell him for all the better to didst all any that shame.
If it thou sister'st but sad-worn in this diad's pace, and
my young most to langually against
that if wory goldeness-or action

Generating with a prompt

In [30]:
prompt = "First Citizen:"
context = torch.tensor([encode(prompt)], dtype=torch.long, device=config['device'])
generated = model.generate(context, max_new_tokens=2000)
print(parse_output(generated))

First Citizen:
Comfort, sir, to think, unkner here come.

Second Murderer:
Be sit off:
Ere he intend is in villain, if you cannot too mine!

MONTAGUE:
I players, then, bery on all is now:
What's, and season him?

Shepherd:
Which Mifles, and keep thou find the man
To put thy salicate this; then but do make some myself!

PAULINA:
O, thy queen!
I should say, and mine revenge: I will promise thou
be to shins: if thou refuse our birds, flesh,
Thou art art well as thy Clifford,
Till unstand thy companied! If you king to him.
Thou loving from his darefic broke his
Widoward's glassing actions and walk'd of him
This angainst indeed-though of them,
And womed thou he not with disdain.

KING RICHARD III:
How, more indeed?

Second to Roman.

QUEEN ELIZABETH:
I have much rather when the rest, adieu.

QUEEN ELIZABETH:
Go to make them, for thou hast his mourn.

KING RICHARD II:
Thou doest me, for she stands upt man o'er.

Keeper:
Either, but not bosour his rapired!

KING HENRY VI:
Perhaps, 'tis he ima