# Description

The full code for this notebook is available [here](https://github.com/moritzpail/gpt). It starts from the config that Karpathy uses in his [tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY) and changes three things: the model's hidden dimension, the optimizer, and the activation in the feed-forward network (GELU instead of ReLU). In particular, we
- Use a hidden dimension of 128 instead of 384 to keep training manageable with the available resources (a free GPU on Google Colab).
- Use GELU instead of ReLU in the feed-forward network of the GPT, as I read that this might improve results.
- Use the Lion optimizer, as I read some [evidence](https://github.com/lucidrains/lion-pytorch) that this might also lead to efficiency gains.
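
To make the last two changes concrete, here is a minimal plain-Python sketch of GELU next to ReLU, and of a single Lion update for one scalar parameter. The function names (`relu`, `gelu`, `lion_step`) are illustrative only and do not come from the repository. The sketch assumes the exact (erf-based) form of GELU, while the model itself may use a different approximation, and it follows the update rule described for Lion: step in the direction of the sign of an interpolation between momentum and gradient, then update the momentum.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(param, grad, momentum, lr=3e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single scalar parameter (illustrative sketch)."""
    # The update direction is only the sign of the interpolated value,
    # so every parameter moves by exactly lr (plus weight decay).
    direction = sign(beta1 * momentum + (1 - beta1) * grad)
    new_param = param * (1 - lr * weight_decay) - lr * direction
    # The momentum is updated afterwards, with a second coefficient.
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return new_param, new_momentum


# GELU lets small negative values through, ReLU cuts them off.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")

# A single Lion step on a toy parameter.
p, m = lion_step(param=1.0, grad=0.5, momentum=0.0, lr=0.1)
print(p, m)
```

Because Lion only uses the sign of the update, the step size is set entirely by the learning rate, which is why the choice of learning rate matters more than with AdamW.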

# Imports

In [90]:
# Add line so we don't have to reload notebooks for changes in imported modules
%load_ext autoreload
%autoreload 2

In [91]:
import torch
from lion_pytorch import Lion

from helpers.load_data import load_data
from helpers.get_batch import get_batch
from helpers.estimate_loss import estimate_loss
from helpers.tokenizer import Tokenizer
from models.gpt import GPT

In [92]:
torch.manual_seed(13)

<torch._C.Generator at 0x11cd56e10>

# Globals

In [93]:
BATCH_SIZE = 32
BLOCK_SIZE = 8
EVAL_INTERVAL = 100
LEARNING_RATE = 3e-4
EVAL_ITERS = 500
MAX_ITERS = 3000
N_EMBED = 128
N_HEADS = 6
N_LAYERS = 6
DROPOUT_RATE = 0.2
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
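
One thing worth checking with these values is how the embedding is split across attention heads. The sketch below assumes the model derives the per-head size as `N_EMBED // N_HEADS`, as in Karpathy's tutorial. I have not verified this against `models.gpt`, so treat it as an assumption.

```python
N_EMBED = 128
N_HEADS = 6

# Assumed convention: each head gets an equal integer share of the embedding.
head_size = N_EMBED // N_HEADS        # 21
concat_width = head_size * N_HEADS    # 126, two short of N_EMBED

# Head counts up to 8 that divide the embedding size evenly.
even_head_counts = [h for h in range(1, 9) if N_EMBED % h == 0]

print(head_size, concat_width, even_head_counts)
```

With 6 heads the concatenated head outputs are 126 wide rather than 128. Whether that matters depends on how the output projection is defined. A head count of 4 or 8 would avoid the question.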

# Prepare Data

In [94]:
# Load data
text: str = load_data()

tokenizer = Tokenizer(text)
data = torch.tensor(tokenizer.encode(text))

# Create train and test sets
n = int(len(data) * 0.9)  # split on the encoded tensor that is actually sliced
train_data = data[:n]
test_data = data[n:]
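
The `Tokenizer` helper is not shown in this notebook. As a rough idea of what a character-level tokenizer with the same interface (`vocab_size`, `encode`, `decode`) could look like, here is a small plain-Python sketch. The class name `CharTokenizer` and its internals are hypothetical and may differ from `helpers.tokenizer`.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer: one integer id per unique character."""

    def __init__(self, text: str):
        self.chars = sorted(set(text))
        self.vocab_size = len(self.chars)
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s: str) -> list:
        """Map a string to a list of integer ids."""
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list) -> str:
        """Map a list of integer ids back to a string."""
        return "".join(self.itos[i] for i in ids)


sample = "hello world"
tok = CharTokenizer(sample)
ids = tok.encode(sample)

# Same 90/10 split as above, applied to the encoded sequence.
n_split = int(len(ids) * 0.9)
train_ids, test_ids = ids[:n_split], ids[n_split:]

print(tok.vocab_size, len(train_ids), len(test_ids))
print(tok.decode(ids))
```

With one id per character, the encoded sequence has the same length as the text, so splitting by position keeps the train and test parts contiguous.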

# GPT

In [95]:
gpt = GPT(
    vocab_size=tokenizer.vocab_size,
    n_embed_size=N_EMBED, 
    block_size=BLOCK_SIZE, 
    device=DEVICE,
    n_heads=N_HEADS,
    n_layers=N_LAYERS,
    dropout_rate=DROPOUT_RATE
).to(DEVICE)

## Training loop

In [97]:
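# optimizer = torch.optim.AdamW(gpt.parameters(), lr=LEARNING_RATE)
# Note: the Lion authors suggest a learning rate roughly 3-10x smaller than
# the one used for AdamW, so LEARNING_RATE may be worth lowering here.
optimizer = Lion(gpt.parameters(), lr=LEARNING_RATE)

for step in range(MAX_ITERS):  # `step` avoids shadowing the built-in `iter`

    # Periodically estimate the train and validation loss
    if step % EVAL_INTERVAL == 0:
        train_loss, val_loss = estimate_loss(
            model=gpt,
            train_data=train_data,
            valid_data=test_data,
            block_size=BLOCK_SIZE,
            batch_size=BATCH_SIZE,
            eval_iters=EVAL_ITERS,
            device=DEVICE
        )
        print(f"Step {step}, Train loss: {train_loss:.4f}, Val loss: {val_loss:.4f}")

    # Sample a batch of inputs and targets
    xb, yb = get_batch(
        train_data,
        BLOCK_SIZE,
        BATCH_SIZE,
        device=DEVICE
    )

    # Forward pass, then backpropagate and update the weights
    logits, loss = gpt(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()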

Step 0, Train loss: 4.3080, Val loss: 4.3146


KeyboardInterrupt: 
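
As a quick sanity check on the step-0 numbers: a model that spreads its probability uniformly over all tokens has a cross-entropy loss of ln(V), where V is the vocabulary size. The sketch below assumes V = 65, the character vocabulary of the Tiny Shakespeare dataset used in Karpathy's tutorial. The actual value is `tokenizer.vocab_size`, which this notebook does not print.

```python
import math

# Assumption: 65 unique characters, as in the tutorial's dataset.
ASSUMED_VOCAB_SIZE = 65

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(ASSUMED_VOCAB_SIZE)

# Train loss printed at step 0 above.
reported_loss = 4.3080
gap = reported_loss - uniform_loss

print(f"uniform: {uniform_loss:.4f}, reported: {reported_loss:.4f}, gap: {gap:.4f}")
```

Under that assumption the reported loss sits slightly above the uniform baseline, which is what one would expect from a freshly initialized model.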

In [None]:
# Start generation from a single token with index 0
start_token = torch.zeros((1, 1), dtype=torch.long, device=DEVICE)
sequence = gpt.generate(start_token, max_len=500, block_size=BLOCK_SIZE)[0].tolist()
print(tokenizer.decode(sequence))


FwJNnjT,M XnmyNwSMAN:n.-oCBTnuoOuhBHBnAVhTi:$boYONn!zRmtIGOcPCwAKWzPVwGWFOO3OId 
actV?.wu?Icxwn?j!L?pIqqKCBIgO,UqiTj,qECo?XdVMk:tVM-:IPpHZVM&IwTIsqaOE?M$pjdMBwy,:!k'dqjOK,3m:oIOMz
Ivc3Wcww?OY?NuPt3JNVDwVIlpAauwEkKDn:MwZ:F.;sq:MIwbNIlnZnpJwhzM-qhCjybYWjN$GISIIWjB
HZXNnwIqYfP?OZwHCiJ;!XHcpABhOKVW?a'u
b'IQ
!KMCP?KJ
l-q?Z,jrzmPa3 wAIKOGaZwwiekWPwNpORauN;;sDYdNkXW$Fuj&!wiIG!?PHAjBlUjCsQDzY!h
aI:ChO3BPCUIq NPVuxEuWB&pM3eaWCAo r PdFI!y.YD'qhTTCXBnkduVskCZBMGBAD ZbwMTcF$HMuKjkK&WBcu&I&I:NT?MnpK.MNWnXjUy
