This notebook contains solutions to the exercises from Chapter 5.

## Exercise 5.1: Pizza sampling :)

Use the print_sampled_tokens function to print the sampling frequencies of the
softmax probabilities scaled with the temperatures shown in figure 5.14. How often
is the word pizza sampled in each case? Can you think of a faster and more accurate
way to determine how often the word pizza is sampled?

In [5]:
import torch

vocab = {
    "closer": 0, "every": 1, "effort": 2, "forward": 3,
    "inches": 4, "moves": 5, "pizza": 6, "toward": 7, "you": 8
}
inverse_vocab = {v: k for k, v in vocab.items()}
next_token_logits = torch.tensor([4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79])

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(vocab))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq:3d} x {inverse_vocab[i]}")

for T in [0.1, 1.0, 5.0]:
    print(f"\nSampling at temperature {T}:")
    probas = softmax_with_temperature(next_token_logits, T)
    print_sampled_tokens(probas)


Sampling at temperature 0.1:
  0 x closer
  0 x every
  0 x effort
985 x forward
  0 x inches
  0 x moves
  0 x pizza
 15 x toward
  0 x you

Sampling at temperature 1.0:
 73 x closer
  0 x every
  0 x effort
582 x forward
  2 x inches
  0 x moves
  0 x pizza
343 x toward
  0 x you

Sampling at temperature 5.0:
165 x closer
 75 x every
 42 x effort
239 x forward
 71 x inches
 46 x moves
 32 x pizza
227 x toward
103 x you
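
The counts above are noisy estimates from 1,000 draws. A faster and more accurate route is to read the probability of `pizza` directly from the temperature-scaled softmax. The sketch below recomputes the softmax with the standard library only, so that it runs on its own; the helper name `pizza_probability` is illustrative and not part of the chapter code. In the notebook, the same number should be available as `softmax_with_temperature(next_token_logits, T)[vocab["pizza"]]`.

```python
import math

vocab = {
    "closer": 0, "every": 1, "effort": 2, "forward": 3,
    "inches": 4, "moves": 5, "pizza": 6, "toward": 7, "you": 8
}
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def softmax(logits, temperature):
    # Subtract the maximum before exponentiating for numerical stability
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pizza_probability(temperature):
    # Hypothetical helper: exact probability of sampling "pizza"
    return softmax(next_token_logits, temperature)[vocab["pizza"]]

for T in [0.1, 1.0, 5.0]:
    p = pizza_probability(T)
    print(f"T={T}: P(pizza) = {p:.3e}, expected count in 1,000 draws = {1000 * p:.2f}")
```

At temperature 5.0 this gives roughly 4.3% (about 43 expected draws, against 32 observed above). At temperature 1.0 it is about 0.01%, which is why `pizza` does not show up in 1,000 samples.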


## Exercise 5.2: Adjusting temperature and top-k settings

Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are
desired? Likewise, can you think of applications where higher temperature and top-k
settings are preferred? (It’s recommended to also revisit this exercise at the end of
the chapter after loading the pretrained weights from OpenAI.)

In [6]:
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
masked_logits = torch.where(
    next_token_logits < top_logits[-1],
    torch.tensor(float('-inf')),
    next_token_logits
)
top_k_probs = torch.softmax(masked_logits, dim=0)
print_sampled_tokens(top_k_probs)

 73 x closer
  0 x every
  0 x effort
583 x forward
  0 x inches
  0 x moves
  0 x pizza
344 x toward
  0 x you
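
To compare settings without relying on sampled counts, the sketch below computes the next-token distribution for a few temperature and top-k combinations and reports how concentrated each one is. It uses plain Python rather than PyTorch so that it runs on its own; `next_token_distribution` and `entropy` are illustrative names, and the masking rule (keep every logit at or above the k-th largest) follows the `torch.where` cell above.

```python
import math

next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def next_token_distribution(logits, temperature=1.0, top_k=None):
    # Keep only the top_k largest logits (ties at the threshold are kept)
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else -math.inf for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probas):
    # Shannon entropy in bits; 0 means fully deterministic
    return -sum(p * math.log2(p) for p in probas if p > 0)

for temperature, top_k in [(0.1, 3), (1.0, 3), (1.0, None), (5.0, 3), (5.0, None)]:
    probas = next_token_distribution(next_token_logits, temperature, top_k)
    print(f"T={temperature}, top_k={top_k}: "
          f"max prob = {max(probas):.3f}, entropy = {entropy(probas):.2f} bits")
```

Low temperature and small top-k concentrate the probability mass, which tends to suit tasks where consistent, predictable output matters. Higher values spread the mass out, which tends to suit more open-ended generation.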


## Exercise 5.3: Finding combinations of parameters

What are the different combinations of settings for the `generate` function that force
deterministic behavior, that is, that disable random sampling so that it always produces the same output, similar to the `generate_text_simple` function?

In [7]:
import torch
import tiktoken
from previous_chapters import GPTModel
from gpt_generate import generate, text_to_token_ids, token_ids_to_text
from previous_chapters import generate_text_simple

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", weights_only=True))
model.eval()

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print(token_ids_to_text(token_ids, tokenizer))

torch.save(model.state_dict(), "model.pth")

Every effort moves you rentingetic wasnم refres RexMeCHicular stren Mortgage TT remember gard ACTIONSussedOND Land Engeleddedemate breaths proxies GalaxyForm
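
Two kinds of settings should make generation deterministic: disabling temperature scaling (`temperature=0.0`, with or without top-k), and `top_k=1` with any temperature. The sketch below illustrates why on the toy logits from Exercise 5.1. `pick_next_token` is a hypothetical stand-in written in plain Python; it assumes that the token-selection step applies top-k filtering first, samples only when the temperature is positive, and otherwise falls back to the argmax, as the `temperature=0.0` call above suggests.

```python
import math
import random

next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

def pick_next_token(logits, temperature=0.0, top_k=None, rng=random):
    # Hypothetical stand-in for the token-selection step of a generate function
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else -math.inf for x in logits]
    if temperature > 0.0:
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        weights = [math.exp(x - m) for x in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    # Assumed greedy fallback: no sampling when temperature is 0
    return max(range(len(logits)), key=lambda i: logits[i])

greedy = pick_next_token(next_token_logits, temperature=0.0, top_k=None)

# Both settings below should reproduce the greedy choice on every call
for settings in [dict(temperature=0.0, top_k=None), dict(temperature=2.0, top_k=1)]:
    picks = {pick_next_token(next_token_logits, **settings) for _ in range(100)}
    print(settings, "->", picks, "greedy:", greedy)
```

With `top_k=1`, only one logit survives the mask (barring exact ties), so the softmax assigns it probability 1 and sampling has nothing left to choose from.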


## Exercise 5.4: Pretraining the model for one more epoch

After saving the weights, load the model and optimizer in a new Python session or
Jupyter notebook file and continue pretraining it for one more epoch using the train_model_simple function.

In [8]:
import torch
import tiktoken
from previous_chapters import GPTModel, create_dataloader_v1
from gpt_train import train_model_simple

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tiktoken.get_encoding("gpt2")

model = GPTModel(GPT_CONFIG_124M).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

split = int(0.9 * len(text))
train_loader = create_dataloader_v1(text[:split], batch_size=2, max_length=256, stride=256, drop_last=True, shuffle=True, num_workers=0)
val_loader = create_dataloader_v1(text[split:], batch_size=2, max_length=256, stride=256, drop_last=False, shuffle=False, num_workers=0)

train_model_simple(model, train_loader, val_loader, optimizer, device,
                   num_epochs=1, eval_freq=5, eval_iter=5,
                   start_context="Every effort moves you", tokenizer=tokenizer)

# Save model and optimizer state
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict()
}, "model_and_optimizer.pth")

Ep 1 (Step 000000): Train loss 9.960, Val loss 10.150
Ep 1 (Step 000005): Train loss 8.139, Val loss 8.340
Every effort moves you.                                                 


In [9]:
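checkpoint = torch.load("model_and_optimizer.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()  # switch back to training mode before continuing
train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs=1, eval_freq=5,
                   eval_iter=5, start_context="Every effort moves you", tokenizer=tokenizer)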

## Exercise 5.5: Calculating losses on Verdict

Calculate the training and validation set losses of the GPTModel with the pretrained
weights from OpenAI on the “The Verdict” dataset.

In [16]:
import torch
import tiktoken
from previous_chapters import GPTModel, create_dataloader_v1
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt
from gpt_train import calc_loss_loader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")
settings, params = download_and_load_gpt2("124M", models_dir="gpt2")

GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": True
}

gpt = GPTModel(GPT_CONFIG)
gpt.eval()
load_weights_into_gpt(gpt, params)
gpt.to(device)

# Load and tokenize the dataset
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Tokenize text (instead of slicing raw characters)
text_ids = tokenizer.encode(text)
split_idx = int(0.9 * len(text_ids))

train_ids = text_ids[:split_idx]
val_ids = text_ids[split_idx:]

# Decode token ids back to text
train_text = tokenizer.decode(train_ids)
val_text = tokenizer.decode(val_ids)

# Create DataLoaders
train_loader = create_dataloader_v1(train_text, batch_size=2, max_length=1024, stride=1024, drop_last=True, shuffle=True)
val_loader = create_dataloader_v1(val_text, batch_size=2, max_length=1024, stride=1024, drop_last=True, shuffle=False)

# Calculate losses (no gradients are needed for evaluation)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, gpt, device)
    val_loss = calc_loss_loader(val_loader, gpt, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe
Training loss: 3.5319101810455322
Validation loss: nan
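
The `nan` validation loss is consistent with a validation loader that yields no batches. The sketch below counts how many windows and batches a sliding-window dataset would produce. It rests on two assumptions about the helper functions: that `create_dataloader_v1` starts windows at `range(0, n_tokens - max_length, stride)`, and that `calc_loss_loader` reports `nan` when the loader is empty. The token count is an illustrative placeholder, not a measured value; `len(val_ids)` gives the real one.

```python
def num_windows(n_tokens, max_length, stride):
    # Each window needs max_length input tokens plus one shifted target token
    return len(range(0, n_tokens - max_length, stride))

def num_batches(n_windows, batch_size, drop_last):
    # drop_last=True discards a final incomplete batch
    if drop_last:
        return n_windows // batch_size
    return -(-n_windows // batch_size)  # ceiling division

# Illustrative size of a small validation split (assumption)
val_tokens = 500

for max_length, drop_last in [(1024, True), (256, True), (256, False)]:
    windows = num_windows(val_tokens, max_length, stride=max_length)
    batches = num_batches(windows, batch_size=2, drop_last=drop_last)
    print(f"max_length={max_length}, drop_last={drop_last}: "
          f"{windows} windows, {batches} batches")
```

Under these assumptions, a validation split shorter than `max_length + 1` tokens produces no windows at all, and a single window is still discarded when `batch_size=2` is combined with `drop_last=True`. A shorter `max_length` (such as 256) together with `drop_last=False` for the validation loader would yield at least one batch.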


In [15]:
print(val_text)

 Stroud--he just lay there quietly watching, and on his lips, through the gray beard, I seemed to hear the question: 'Are you sure you know where you're coming out?'

"If I could have painted that face, with that question on it, I should have done a great thing. The next greatest thing was to see that I couldn't--and that grace was given me. But, oh, at that minute, Rickham, was there anything on earth I wouldn't have given to have Stroud alive before me, and to hear him say: 'It's not too late--I'll show you how'?

"It _was_ too late--it would have been, even if he'd been alive. I packed up my traps, and went down and told Mrs. Stroud. Of course I didn't tell her _that_--it would have been Greek to her. I simply said I couldn't paint him, that I was too moved. She rather liked the idea--she's so romantic! It was that that made her give me the donkey. But she was terribly upset at not getting the portrait--she did so want him 'done' by some one showy! At first I was afraid she wouldn't

## Exercise 5.6: Experimenting on different GPT-2 models

Experiment with GPT-2 models of different sizes (for example, the largest 1,558 million
parameter model) and compare the generated text to the 124 million model.

In [18]:
import torch
import tiktoken
from previous_chapters import GPTModel
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt, generate, text_to_token_ids, token_ids_to_text

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)

tokenizer = tiktoken.get_encoding("gpt2")

GPT_BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.1,
    "qkv_bias": True
}

model_configs = {
    "124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

for size in ["124M", "1558M"]:   # You can add "355M", "774M" if you want
    print(f"\n=== Generating with GPT-2 {size} ===\n")

    settings, params = download_and_load_gpt2(model_size=size, models_dir="gpt2")

    # Build config
    config = GPT_BASE_CONFIG.copy()
    config.update(model_configs[size])

    # Initialize model
    gpt = GPTModel(config)
    gpt.eval()
    load_weights_into_gpt(gpt, params)
    gpt.to(device)

    # Prepare input
    start_text = "Every effort moves you"
    idx = text_to_token_ids(start_text, tokenizer).to(device)  # keep input on the model's device

    # Generate output
    output_ids = generate(
        model=gpt,
        idx=idx,
        max_new_tokens=50,
        context_size=config["context_length"],
        temperature=1.0,
        top_k=50
    )

    output_text = token_ids_to_text(output_ids, tokenizer)
    print(output_text)


=== Generating with GPT-2 124M ===

File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe
Every effort moves you along the way, but in the end it's all about how you grow up" — John Steinbeck, "The Prisoner's Dilemma"

"It's never too late to learn not to be a liar" is what we

=== Generating with GPT-2 1558M ===

File already exists and is up-to-date: gpt2/1558M/checkpoint
File already exists and is up-to-date: gpt2/1558M/encoder.json
File already exists and is up-to-date: gpt2/1558M/hparams.json


model.ckpt.data-00000-of-00001: 100%|██████████| 6.23G/6.23G [09:32<00:00, 10.9MiB/s] 
model.ckpt.index: 100%|██████████| 20.7k/20.7k [00:00<00:00, 543kiB/s]
model.ckpt.meta: 100%|██████████| 1.84M/1.84M [00:00<00:00, 5.11MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 2.08MiB/s]


Every effort moves you towards the unknown...I'm always looking for an edge that I can exploit." - John

I believe the key to success is a strong belief in yourself. You can come across as arrogant if you have the desire to be one, but most
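
The size labels in `model_configs` can be checked with a short calculation. The sketch below counts parameters from the configuration values alone, assuming the standard GPT-2 layout: biases on all linear layers, a feed-forward expansion factor of 4, learned positional embeddings, and an output layer that shares its weights with the token embedding. The function `count_parameters` is illustrative; if the `GPTModel` class keeps a separate output head, its actual count would be larger by `vocab_size * emb_dim`.

```python
GPT_BASE_CONFIG = {"vocab_size": 50257, "context_length": 1024}

model_configs = {
    "124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

def count_parameters(vocab_size, context_length, emb_dim, n_layers, n_heads):
    d = emb_dim
    # Query, key, value, and output projections, each with a bias
    attention = 4 * (d * d + d)
    # Two linear layers with a 4x hidden expansion, each with a bias
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * 2 * d
    per_block = attention + feed_forward + layer_norms
    # Token and positional embeddings; the output layer is assumed to be tied
    embeddings = vocab_size * d + context_length * d
    final_norm = 2 * d
    # n_heads only splits emb_dim across heads and does not change the count
    return n_layers * per_block + embeddings + final_norm

for size, cfg in model_configs.items():
    total = count_parameters(**GPT_BASE_CONFIG, **cfg)
    print(f"{size}: {total:,} parameters")
```

Most of the growth comes from the `12 * emb_dim**2` term per transformer block, so the largest model is dominated by its 48 blocks rather than by its embeddings.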
