# All about LLMs

This notebook walks you through creating your own Generative Pretrained Transformer (GPT). First we explore tokens, the atomic units of large language models. We then move on to data preparation, training a model, and finally generating text with the trained model.

# Tokens

In [3]:
!pip install tiktoken
!pip install transformers
!pip install datasets
!pip install tqdm


In [5]:
import tiktoken

enc = tiktoken.get_encoding("gpt2")
phrase = "The quick brown fox jumped over the lazy dog."
tokens = enc.encode_ordinary(phrase)
print(tokens)

ModuleNotFoundError: No module named 'tiktoken'

In [4]:
for token in tokens:
    print(f'"{enc.decode([token])}", ',end='')

"The", " quick", " brown", " fox", " jumped", " over", " the", " lazy", " dog", ".", 
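The GPT-2 encoding used above is a byte-pair encoding (BPE): it starts from raw bytes and repeatedly merges the most frequent adjacent pair into a new token. The following is only a toy sketch of a single merge step, not tiktoken's implementation; the helper names `most_frequent_pair` and `merge_pair` and the sample sentence are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("the cat then sat".encode("utf-8"))  # start from raw bytes
pair = most_frequent_pair(ids)
merged = merge_pair(ids, pair, 256)  # 256 = first id after the 256 byte values
print(len(ids), "->", len(merged))
```

Repeating this step many times over a large corpus is what produces multi-character tokens such as `" quick"` in the output above.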

# Tokenize Training Data

We will use the Tiny Shakespeare dataset to train a model. See the [shakespeare.txt](data/shakespeare.txt) file.

In [37]:
import os
import requests
import tiktoken
import numpy as np

# tiny shakespeare dataset
input_file_path = 'data/shakespeare.txt'
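if not os.path.exists(input_file_path):
    os.makedirs(os.path.dirname(input_file_path), exist_ok=True) # create data/ if it does not exist yet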
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r') as f:
    data = f.read()
n = len(data)

# Use 90% for training and 10% for validation
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile('data/train.bin')
val_ids.tofile('data/val.bin')

train has 301,966 tokens
val has 36,059 tokens
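Storing the ids as `uint16` works because the GPT-2 vocabulary has 50,257 entries, which fits below the 16-bit limit of 65,535, and it halves the file size compared to 32-bit integers. A minimal sketch of the write/read round trip follows; the scratch file path is an assumption used here in place of `data/train.bin`.

```python
import os
import tempfile
import numpy as np

GPT2_VOCAB_SIZE = 50257  # ids run from 0 to 50256
assert GPT2_VOCAB_SIZE - 1 <= np.iinfo(np.uint16).max  # 65535

ids = np.array([5962, 22307, 25, 50256], dtype=np.uint16)

# hypothetical scratch file, used here instead of data/train.bin
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
ids.tofile(path)

# memmap reads the raw 2-byte values back without loading the whole file
loaded = np.memmap(path, dtype=np.uint16, mode="r")
print(loaded.tolist(), os.path.getsize(path), "bytes")
```

Note that `tofile` writes no header, so the reader has to pass the same `dtype` that was used when writing.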


In [27]:
print(len(train_ids))
train_ids

301966


array([ 5962, 22307,    25, ...,   508,  2058,   994], dtype=uint16)

In [19]:
enc.decode(train_ids[:5])

'First Citizen:\nBefore'
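During training the model learns to predict the next token, so each input window is paired with the same window shifted one position to the right. This is the slicing that `get_batch` performs in the training cell. A small sketch with made-up ids and a block size of 4 (`make_pair` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def make_pair(data, i, block_size):
    """Input window and its next-token targets, both of length block_size."""
    x = data[i : i + block_size]
    y = data[i + 1 : i + 1 + block_size]
    return x, y

data = np.arange(10, 20, dtype=np.uint16)  # stand-in for train_ids
x, y = make_pair(data, i=2, block_size=4)
print(x.tolist())  # [12, 13, 14, 15]
print(y.tolist())  # [13, 14, 15, 16]
```

Each position in `y` is the token that follows the corresponding position in `x`, so one window of length `block_size` yields `block_size` prediction targets.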

# GPT Model Class

We will use the GPT model class in the `model.py` file, which was created by Andrej Karpathy. It was modeled after the one created by OpenAI.

GPT, as the name implies, is based on the Transformer technology presented by Google in the paper, "Attention Is All You Need." https://arxiv.org/pdf/1706.03762
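The core operation of the Transformer is scaled dot-product attention. In a GPT-style model it is causal, meaning each position may only attend to itself and earlier positions. The following is a minimal single-head sketch in NumPy under illustrative sizes; it is not the implementation in `model.py`, and the function name and dimensions are assumptions.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (T, C) token embeddings; w_q, w_k, w_v: (C, D) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(k.shape[-1])        # (T, T) similarity scores
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # hide later positions
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
T, C, D = 5, 8, 4  # illustrative sizes, not the notebook's
x = rng.normal(size=(T, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape, np.round(weights[0], 2))
```

The first row of `weights` puts all of its mass on position 0, since the first token has nothing earlier to attend to.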

References:
* Great Video by Andrej Karpathy - _Let's build GPT: from scratch, in code, spelled out._ - https://www.youtube.com/watch?v=kCc8FmEb1nY
* nanoGPT - https://github.com/karpathy/nanoGPT
* OpenAI GPT-2 - https://github.com/openai/gpt-2/blob/master/src/model.py

# Train a Model

In [42]:
import os
import time
import math
from contextlib import nullcontext

import numpy as np
import torch
from model import GPTConfig, GPT
# pylint: disable=invalid-name

# parameters
out_dir = 'out'
eval_interval = 100
log_interval = 1
eval_iters = 200
always_save_checkpoint = True # save a checkpoint after each eval_interval

# data
dataset = 'data' # directory where the data is stored
gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 32 # context window size (maximum number of tokens the model sees at once)

# model
n_layer = 4
n_head = 4
n_embd = 64
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?

# adamw optimizer
learning_rate = 6e-4 # max learning rate
max_iters = 2000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0

# learning rate decay settings
decay_lr = True # whether to decay the learning rate
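warmup_iters = 2000 # how many steps to warm up for (equal to max_iters here, so the whole run is warmup)
lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla (far larger here, so cosine decay never starts)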
min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla

# system
device = 'cuda' # examples: 'cpu', 'cuda', or try 'mps' on macbooks
dtype = 'bfloat16' # 'float32', 'bfloat16', or 'float16'

# capture settings / parameters to save in model checkpoint
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys} 

# various inits, derived attributes, I/O setup
seed = 1337
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

# place to store the model we create (checkpoint)
os.makedirs(out_dir, exist_ok=True)

# seed the RNG for reproducibility and allow TF32 on supported GPUs
torch.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# Load the data
data_dir = 'data/'
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')

# get a batch from the data
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

# init
iter_num = 0
best_val_loss = 1e9

model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout) # vocab_size is filled in below

# check to make sure checkpoint does not exist
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
if os.path.exists(ckpt_path):
    ans = input("WARNING: Existing checkpoint found, overwrite? (y/N)")
    if ans.lower()[:1] != "y":
        exit()

# init a new model from scratch
print("Initializing a new model from scratch")
print("Using vocab_size of GPT-2 of 50304 (50257 rounded up for efficiency)")
model_args['vocab_size'] = 50304
gptconf = GPTConfig(**model_args)
model = GPT(gptconf)

# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size # so that the checkpoint will have the right value
model.to(device)

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
checkpoint = None # free up memory

# compile the model
print("compiling the model... (takes a ~minute)")
unoptimized_model = model
model = torch.compile(model) # requires PyTorch 2.0

# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# training loop
X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model
running_mfu = -1.0
while True:

    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 or iter_num >= max_iters:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        if iter_num % 100 == 0:
            print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break


tokens per iteration will be: 15,360




Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 3.42M
num decayed parameter tensors: 18, with 3,418,112 parameters
num non-decayed parameter tensors: 9, with 576 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 10.8197, val loss 10.8165
iter 0: loss 10.8140, time 10577.74ms, mfu -100.00%
step 100: train loss 10.5333, val loss 10.5307
saving checkpoint to out
iter 100: loss 10.5426, time 1459.30ms, mfu 0.40%
step 200: train loss 10.0636, val loss 10.0735
saving checkpoint to out
iter 200: loss 10.0690, time 1778.58ms, mfu 0.40%
step 300: train loss 9.3313, val loss 9.3641
saving checkpoint to out
iter 300: loss 9.3369, time 1377.69ms, mfu 0.41%
step 400: train loss 8.4176, val loss 8.4931
saving checkpoint to out
iter 400: loss 8.4212, time 1397.11ms, mfu 0.40%
step 500: train loss 7.5251, val loss 7.6176
saving checkpoint to out
iter 500: loss 7.6322, t
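The `get_lr` schedule in the training cell ramps the learning rate up linearly and then follows a cosine curve down to `min_lr`. Re-running the same function on its own, with the values copied from the configuration above, suggests that this particular run never leaves the warmup ramp: because `warmup_iters` equals `max_iters`, the peak rate is reached only at the final step. The sample iteration numbers below are chosen for illustration.

```python
import math

# values copied from the training cell's configuration
learning_rate, min_lr = 6e-4, 6e-5
warmup_iters, lr_decay_iters, max_iters = 2000, 600000, 2000

def get_lr(it):
    # 1) linear warmup
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, stay at the minimum
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 500, 1000, 2000):
    print(it, f"{get_lr(it):.2e}")
```

To exercise the cosine phase within 2,000 steps, one option would be a shorter warmup and `lr_decay_iters` set close to `max_iters`; any specific values would need to be tuned and are not tested here.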

# Inference

In [2]:
import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT

# inference parameters
max_tokens = 500 # number of tokens generated in each sample
temperature = 1.0 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1337
device = 'cuda'
dtype = 'bfloat16'

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# load model saved in a specific directory
ckpt_path = 'out/ckpt.pt'
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
model.to(device)
enc = tiktoken.get_encoding("gpt2")

# set prompt
prompt = "First Citizen\n"
start_ids = enc.encode(prompt, allowed_special={"<|endoftext|>"})
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

# run generation
with torch.no_grad():
    with ctx:
        y = model.generate(x, max_tokens, temperature=temperature, top_k=top_k)
        output = y[0].tolist()
        for w in output:
            if w >= 50257: # valid GPT-2 token ids are 0..50256; higher ids are vocabulary padding
                continue
            else:
                # print(f"[{w}]", end='')
                text = enc.decode([w])
                if text == '\n':
                    print()
                else:
                    print(text, end='')
                
print("")


number of parameters: 3.42M
First Citizen
DUKE VINCENTIO:
I think so, I'll bear no heart.

GRUMIO:
I'll have you an more news.

CLARENCE:
My woman's friend, and to the life-merTHUMBERLAND:
By they would not take it by my husband,
My death to give thee as every hour or
Which from the Duke of Lancaster's comfort,
And thou thou call me in his master than unshock
We are none of't. To me's hence!

LEONTES:
My mother'st, this rest, to see
Where him: but once more would you know the sin
The time now you have done as ever show.

KING EDWARD IV:
Here say our course of me? O God, I warrant,
That thou do, I should be thy eye, nor the hour
To command me much of the body.

KING HENRY's one, I will not mine pardon him.

SICINIUS:
It is't speak.

VOLIO:
For, the world be far in joy!

ROMEO:
What time did it be the boy duke have, here upon thee:
My wife's proud times is, to hear which such king
But that be the king, but they tell this night
On VerLY: my sovereign, by the people or end
The word with al
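The `temperature` and `top_k` parameters in the inference cell control how the next token is drawn from the model's output scores. The sketch below shows the usual recipe in NumPy, assuming `model.generate` works along these lines; `sample_next` and the logits are made up for illustration.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from raw logits using temperature and top-k."""
    if rng is None:
        rng = np.random.default_rng()
    # lower temperature sharpens the distribution, higher flattens it
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        k = min(top_k, logits.size)
        cutoff = np.sort(logits)[-k]                  # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())             # softmax, stabilized
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))

rng = np.random.default_rng(1337)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
picks = [sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
print(sorted(set(picks)))
```

With `top_k=2` only the two highest-scoring ids can ever be drawn, which is why a small `top_k` makes generated text more conservative.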