# Description

You can find the full code for this notebook [here](https://github.com/moritzpail/gpt). It follows the setup Karpathy uses in his [tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY), but changes the model's hidden dimension, swaps the optimizer, and uses GELU instead of ReLU in the model's feed-forward network. In particular, we
- Use a hidden dimension of 64 (`N_EMBED` in the Globals cell) instead of 384 to keep training manageable with the available resources (a free GPU on Google Colab). We also use smaller values for `N_LAYERS`, `N_HEADS`, and `BLOCK_SIZE`.
- Use GELU instead of ReLU in the feed-forward network of the GPT, as I read that this might improve results.
- Use the Lion optimizer, as I read some [evidence](https://github.com/lucidrains/lion-pytorch) that it might also lead to efficiency gains.
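
To make the ReLU/GELU difference concrete, here is a minimal sketch in plain Python that evaluates both activations on a few scalars. It uses the exact GELU definition, x times the standard normal CDF of x. The function names are illustrative and not taken from the repository, where the swap presumably amounts to using `torch.nn.GELU` in place of `torch.nn.ReLU`.

```python
import math


def relu(x: float) -> float:
    """ReLU: pass positive inputs through, zero out the rest."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative inputs through with a small negative output, while behaving almost like the identity for large positive inputs.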

# Imports

In [1]:
!git clone https://github.com/moritzpail/gpt.git

Cloning into 'gpt'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 42 (delta 11), reused 37 (delta 6), pack-reused 0 (from 0)
Receiving objects: 100% (42/42), 450.89 KiB | 13.26 MiB/s, done.
Resolving deltas: 100% (11/11), done.


In [2]:
# Auto-reload imported modules so that code changes take effect without restarting the notebook
%load_ext autoreload
%autoreload 2

In [3]:
%cd gpt

/content/gpt


In [5]:
!pip install lion_pytorch

Collecting lion_pytorch
  Downloading lion_pytorch-0.2.2-py3-none-any.whl.metadata (618 bytes)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.6->lion_pytorch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.6->lion_pytorch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.6->lion_pytorch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.6->lion_pytorch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.6->lion_pytorch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch>=1.6->l

In [6]:
import torch
from lion_pytorch import Lion

from helpers.load_data import load_data
from helpers.get_batch import get_batch
from helpers.estimate_loss import estimate_loss
from helpers.tokenizer import Tokenizer
from models.gpt import GPT

In [7]:
torch.manual_seed(13)

<torch._C.Generator at 0x7bda4efe68f0>

# Globals

In [18]:
BATCH_SIZE = 32
BLOCK_SIZE = 128
EVAL_INTERVAL = 100
LEARNING_RATE = 3e-4
EVAL_ITERS = 500
MAX_ITERS = 5000
N_EMBED = 64
N_HEADS = 4
N_LAYERS = 4
DROPOUT_RATE = 0.2
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
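
A few quantities follow directly from these settings. The sketch below recomputes them in plain Python. It assumes the usual split of the embedding across heads (`head_size = N_EMBED // N_HEADS`) and that each evaluation draws `EVAL_ITERS` batches from each of the two splits, as in the tutorial; neither is confirmed by the repository code shown here.

```python
# Values copied from the Globals cell above.
BATCH_SIZE = 32
BLOCK_SIZE = 128
EVAL_INTERVAL = 100
EVAL_ITERS = 500
MAX_ITERS = 5000
N_EMBED = 64
N_HEADS = 4

# Assumption: each attention head works on an equal slice of the embedding.
head_size = N_EMBED // N_HEADS

# Tokens consumed by one optimizer step and by the whole run.
tokens_per_step = BATCH_SIZE * BLOCK_SIZE
tokens_total = tokens_per_step * MAX_ITERS

# Evaluation runs at steps 0, 100, ..., 4900.
# Assumption: every evaluation draws EVAL_ITERS batches from each of two splits.
num_evals = len(range(0, MAX_ITERS, EVAL_INTERVAL))
eval_forward_passes = num_evals * EVAL_ITERS * 2

print(head_size, tokens_per_step, tokens_total, num_evals, eval_forward_passes)
```

Under those assumptions the run spends 50,000 forward passes on loss estimation against 5,000 training steps, so lowering `EVAL_ITERS` or raising `EVAL_INTERVAL` is likely the cheapest way to shorten a run on a free Colab GPU.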

# Prepare Data

In [19]:
# Load data
text: str = load_data()

tokenizer = Tokenizer(text)
data = torch.tensor(tokenizer.encode(text))

# Create train and test sets (90/10 split, measured on the encoded tensor)
n = int(len(data) * 0.9)
train_data = data[:n]
test_data = data[n:]
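
The `Tokenizer` helper lives in the repository and is not shown in this notebook. As a rough illustration of what such a helper might do, here is a hypothetical character-level tokenizer in plain Python, in the spirit of the tutorial. The class name `CharTokenizer` and its internals are assumptions; only `encode`, `decode`, and `vocab_size` mirror names used above.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer: one integer id per distinct character."""

    def __init__(self, text: str) -> None:
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self._stoi = {ch: i for i, ch in enumerate(chars)}
        self._itos = {i: ch for ch, i in self._stoi.items()}

    def encode(self, text: str) -> list[int]:
        return [self._stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self._itos[i] for i in ids)


sample = "to be, or not to be"
tok = CharTokenizer(sample)
ids = tok.encode(sample)

# Same 90/10 split as above, taken on the encoded sequence.
n = int(len(ids) * 0.9)
train_ids, held_out_ids = ids[:n], ids[n:]

print(tok.vocab_size, len(train_ids), len(held_out_ids))
print(tok.decode(ids) == sample)
```

With a character-level tokenizer the encoded sequence has exactly one id per character, which is why the split point can be computed from either the text or the tensor.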

# GPT

In [20]:
gpt = GPT(
    vocab_size=tokenizer.vocab_size,
    n_embed_size=N_EMBED,
    block_size=BLOCK_SIZE,
    device=DEVICE,
    n_heads=N_HEADS,
    n_layers=N_LAYERS,
    dropout_rate=DROPOUT_RATE
).to(DEVICE)

## Training loop
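
The cell below swaps AdamW for Lion. For context, here is a minimal sketch of the Lion update rule for a single scalar parameter, written in plain Python rather than with `lion_pytorch`. The function name `lion_step` is illustrative, and the default betas are my understanding of the library defaults rather than something this notebook confirms.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(param, grad, momentum, lr=3e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single scalar parameter.

    Returns the updated (param, momentum) pair.
    """
    # Decoupled weight decay, applied directly to the parameter.
    param = param * (1.0 - lr * weight_decay)
    # Direction: sign of an interpolation between momentum and the new gradient.
    direction = sign(beta1 * momentum + (1.0 - beta1) * grad)
    param = param - lr * direction
    # The momentum itself is tracked with a second, slower coefficient.
    momentum = beta2 * momentum + (1.0 - beta2) * grad
    return param, momentum


# Toy problem: minimise f(w) = (w - 1)^2, whose gradient is 2 * (w - 1).
w, m = 0.0, 0.0
for _ in range(1000):
    w, m = lion_step(w, 2.0 * (w - 1.0), m, lr=1e-2)
print(w)
```

Because only the sign of the direction is used, every parameter moves by exactly `lr` per step regardless of gradient magnitude. The Lion authors suggest a learning rate roughly 3-10x smaller than one tuned for AdamW, so reusing the same `LEARNING_RATE` for both optimizers may not be a like-for-like comparison.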

In [21]:
# optimizer = torch.optim.AdamW(gpt.parameters(), lr=LEARNING_RATE)
optimizer = Lion(gpt.parameters(), lr=LEARNING_RATE)

for step in range(MAX_ITERS):

    # Periodically estimate the loss on both splits
    if step % EVAL_INTERVAL == 0:
        train_loss, val_loss = estimate_loss(
            model=gpt,
            train_data=train_data,
            valid_data=test_data,
            block_size=BLOCK_SIZE,
            batch_size=BATCH_SIZE,
            eval_iters=EVAL_ITERS,
            device=DEVICE
        )
        print(f"Step {step}, Train loss: {train_loss:.4f}, Val loss: {val_loss:.4f}")

    # Sample a batch of inputs and next-token targets
    xb, yb = get_batch(
        train_data,
        BLOCK_SIZE,
        BATCH_SIZE,
        device=DEVICE
    )

    # Forward pass, backpropagation, and parameter update
    logits, loss = gpt(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 0, Train loss: 4.3478, Val loss: 4.3510
Step 100, Train loss: 2.9397, Val loss: 2.9733
Step 200, Train loss: 2.6640, Val loss: 2.6738
Step 300, Train loss: 2.5418, Val loss: 2.5442
Step 400, Train loss: 2.4791, Val loss: 2.4823
Step 500, Train loss: 2.4309, Val loss: 2.4435
Step 600, Train loss: 2.3884, Val loss: 2.4036
Step 700, Train loss: 2.3458, Val loss: 2.3656
Step 800, Train loss: 2.2910, Val loss: 2.3174
Step 900, Train loss: 2.2429, Val loss: 2.2815
Step 1000, Train loss: 2.1979, Val loss: 2.2414
Step 1100, Train loss: 2.1585, Val loss: 2.2091
Step 1200, Train loss: 2.1202, Val loss: 2.1753
Step 1300, Train loss: 2.0739, Val loss: 2.1384
Step 1400, Train loss: 2.0297, Val loss: 2.0966
Step 1500, Train loss: 1.9927, Val loss: 2.0689
Step 1600, Train loss: 1.9575, Val loss: 2.0347
Step 1700, Train loss: 1.9176, Val loss: 2.0069
Step 1800, Train loss: 1.8895, Val loss: 1.9879
Step 1900, Train loss: 1.8600, Val loss: 1.9754
Step 2000, Train loss: 1.8327, Val loss: 1.9567
Step
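
The losses in the log are easier to interpret as perplexities. The sketch below converts a few of the logged values, assuming the loss is a cross-entropy measured in nats (the PyTorch default); the notebook does not show the loss computation, so treat that as an assumption.

```python
import math

# (step, train loss, validation loss) copied from the log above.
log = [
    (0, 4.3478, 4.3510),
    (1000, 2.1979, 2.2414),
    (2000, 1.8327, 1.9567),
]

for step, train_loss, val_loss in log:
    # Cross-entropy in nats -> perplexity (effective number of equally likely choices).
    perplexity = math.exp(val_loss)
    # A growing gap between validation and training loss hints at overfitting.
    gap = val_loss - train_loss
    print(f"step {step:>4}: val perplexity {perplexity:6.2f}, val-train gap {gap:.4f}")
```

Read this way, the model goes from hesitating between several dozen characters at step 0 to roughly seven at step 2000, while the validation-train gap widens steadily over the same span.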

In [23]:
# Start from a single token with id 0 and sample 500 more tokens
start_token = torch.zeros((1, 1), dtype=torch.long, device=DEVICE)
sequence = gpt.generate(start_token, max_len=500, block_size=BLOCK_SIZE)[0].tolist()
print(tokenizer.decode(sequence))


EXETER:
You hunk good.
Your 'sount her by his duke
With feart, not homing rofe, make!

LEONTES:
And choose is est's make it that other. Slay's far? Thing.

THARDY I sreak not by: Pecarity her; in ruder
of madamianeton, vorery comesterly be eartain with
of his grands fair sold away forth.

LUCIO:
Bray, if I that sirment prince sented; I.
I would, his greath, an hopes toward,-sish,
Roim must a has had recure ye himself owerence
Whert that oun such revard more than which forten,-
Thy works; all not
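
`gpt.generate` is defined in the repository and not shown here. Sampling loops of this kind usually crop the context to the last `block_size` tokens, ask the model for a next-token distribution, sample from it, and append the result. The sketch below shows that pattern in plain Python with a toy stand-in for the model. `generate`, `next_token_probs`, and `toy_model` are illustrative names, not the repository's API.

```python
import random
from typing import Callable, List, Sequence


def generate(
    next_token_probs: Callable[[Sequence[int]], List[float]],
    start: List[int],
    max_len: int,
    block_size: int,
    rng: random.Random,
) -> List[int]:
    """Autoregressively extend `start` by `max_len` sampled tokens."""
    tokens = list(start)
    for _ in range(max_len):
        # A model with learned position embeddings only covers block_size
        # positions, so condition on the most recent block_size tokens.
        context = tokens[-block_size:]
        probs = next_token_probs(context)
        # Sample instead of taking the argmax, so repeated runs give varied text.
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def toy_model(context: Sequence[int]) -> List[float]:
    """Stand-in for the GPT: strongly prefers (last token + 1) mod 3."""
    probs = [0.05, 0.05, 0.05]
    probs[(context[-1] + 1) % 3] = 0.9
    return probs


sampled = generate(toy_model, start=[0], max_len=10, block_size=4, rng=random.Random(13))
print(sampled)
```

In the notebook, the role of `next_token_probs` would be played by a forward pass through the GPT followed by a softmax over the last position's logits. If dropout is still active at that point, calling `gpt.eval()` before sampling is the usual PyTorch practice.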
