# Unsupervised Pre-Training of GPT-Style Model

In today's notebook, we'll be working through an example of how to do unsupervised pre-training of a GPT-style model.

The base model we'll use is Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT).

All of the model code can be found in the [`model.py`](https://github.com/karpathy/nanoGPT/blob/master/model.py) file!

> NOTE: We will not be leveraging the parallized training strategy in this notebook - you can find all the required code in the provided repository.

## Data Selection

For the notebook today, we'll be using a toy dataset called `tinyshakespeare`. Feel free to use your own corpus here, just make sure it's contained within a single `.txt` file.

You could extend this example to use the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset, which was used to pre-train GPT-2.

> NOTE: Training LLMs can take a very long time - in order to get results similar to the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) you will need 8xA100s and train for ~4-5 days using a pararellized strategy (DDP) on the OpenWebText Corpus.

Let's start by grabbing our source repository for the day!

In [9]:
%pwd

'/home/paperspace/llm-engineering-course/hw1'

In [10]:
# !git clone https://github.com/karpathy/nanoGPT.git

Next, we'll need to grab some dependencies.

`cohere` and `openai` are recent dependencies of `tiktoken`, but we will not be leveraging them today.

In [11]:
# %pip install tiktoken requests cohere openai -q

First things first - let's download our dataset!

We'll leverage the `requests` library to do this - and then we will split our resultant data into a `train` and `val` set. We want ~90% of our data to be training, and ~10% to be validation.

In [12]:
import os
from pathlib import Path
import requests
import tiktoken
import numpy as np

# base_path = Path("/Users/neil/Projects/llm-engineering-course/hw1") 
base_path = Path("/home/paperspace/llm-engineering-course/hw1") 
data_path = base_path / "data/shakespeare"
input_file_path = data_path / "input.txt"

In [13]:
current_path = base_path / "data/shakespeare"
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

if not os.path.exists(current_path):
    os.makedirs(current_path)

# download the tiny shakespeare dataset
# input_file_path = os.path.join(os.path.dirname(current_path), 'input.txt')
input_file_path = current_path / "input.txt"
if not os.path.exists(input_file_path):

    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r') as f:
    data = f.read()

n = len(data)
print(n)

1115394


In [14]:
i = int(np.ceil(0.9*n))
train_data = data[:i]
val_data = data[i:]
n == len(train_data) + len(val_data)

True

Now let's get our `tokenizers` dependency so we can train a tokenizer on our data.

In [15]:
#%pip install tokenizers -qU

We will be training a "byte-pair-encoding" or "BPE" tokenizer. If you'd like to read more, you can find it [here](https://en.wikipedia.org/wiki/Byte_pair_encoding).

Let's work through an example of what Byte-Pair Encoding (BPE) is doing, exactly, from this wonderful example provided by [Hugging Face](https://huggingface.co/docs/transformers/main/tokenizer_summary#byte-pair-encoding-bpe).





### What is BPE?

First, we need to do a step called "pre-tokenization", which is - as it sounds - a tokenization step that occurs before we tokenize.

The essential idea of BPE is that we need to understand common words and "byte-pairs" in them. So, in order to find "common words" we first need to find...words!

Let's take the following text and break it apart into its word components.


```
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
```

A naive way to do this would just be by splitting on spaces...and that is indeed what technique was used in GPT-2.

In [16]:
input_text = """
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
"""

naive_word_list = input_text.split()

Now we can count our words and get their frequency.

In [17]:
from collections import defaultdict

vocab_and_frequencies = defaultdict(int)

for word in naive_word_list:
  vocab_and_frequencies[" ".join(list(word))] += 1

sorted(vocab_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[('t h e', 8), ('a', 4), ('o f', 4), ('v o c a b u l a r y', 4), ('h a s', 3)]

Let's find our "base vocabulary", which is going to be each symbol present in our original dataset.

In [18]:
from typing import Dict, Tuple, List, Set

def find_vocabulary_size(current_vocab: Dict[str, int]) -> int:
  vocab = set()

  for word in current_vocab.keys():
    for subword in word.split():
      vocab.add(subword)

  return len(vocab)

In [19]:
find_vocabulary_size(vocab_and_frequencies)

34

As we can see, there are ~34 symbols in our base vocabulary. Let's convert our data into a form where we can capture each symbol separately.

Now we can start constructing our pairs. We will look at all the pairs of symbols as they appear and take into consideration their frequency in our corpus.

In [13]:
def find_pairs_and_frequencies(current_vocab: Dict[str, int]) -> Dict[str, int]:
  pairs = {}

  for word, frequency in current_vocab.items():
    symbols = word.split()

    for i in range(len(symbols) - 1):
      pair = (symbols[i], symbols[i + 1])
      current_frequency = pairs.get(pair, 0)
      pairs[pair] = current_frequency + frequency

  return pairs

In [14]:
pairs_and_frequencies = find_pairs_and_frequencies(vocab_and_frequencies)

In [15]:
sorted(pairs_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[(('t', 'h'), 11),
 (('i', 'n'), 10),
 (('r', 'e'), 8),
 (('h', 'e'), 8),
 (('a', 't'), 7)]

Now that we have the frequent pairs - we can merge those pairs into a single token.

Let's see how this process looks in code.

In [16]:
import re

def merge_vocab(most_common_pair: Tuple[str], current_vocab: Dict[str, int]) -> Dict[str, int]:
  vocab_out = {}

  pattern = re.escape(' '.join(most_common_pair))
  replacement = ''.join(most_common_pair)

  for word_in in current_vocab:
      word_out = re.sub(pattern, replacement, word_in)
      vocab_out[word_out] = current_vocab[word_in]

  return vocab_out

In [17]:
new_vocab_and_frequencies = merge_vocab(
    sorted(pairs_and_frequencies.items(), key = lambda x: x[1], reverse=True)[0][0],
    vocab_and_frequencies
)

In [18]:
sorted(new_vocab_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[('th e', 8), ('a', 4), ('o f', 4), ('v o c a b u l a r y', 4), ('h a s', 3)]

After one merge, we can see that `t h` has been converted to `th`!

Let's see how that impacted our vocabulary.

In [19]:
find_vocabulary_size(new_vocab_and_frequencies)

35

We can see that our vocabulary has increased by 1 as we've added the `th` symbol to it!

In essence, BPE will continue to do this process until your desired vocabulary size (a hyper-parameter) is met!

## Training Our Tokenizer

Now that we have some background on how BBPE works, lets move on to training our tokenizer for our model!

Let's walk through the steps we'll take:

1. Initialize our `Tokenizer` with a `BPE` model. Be sure to include the `unk_token`.

  - [`Tokenizer`](https://huggingface.co/docs/tokenizers/api/tokenizer#tokenizer)
  - [`Models`](https://huggingface.co/docs/tokenizers/api/models#models)

2. We'll include a normalizer, applied at the sequence level, and we'll use `NFD()` to do so. More reading on Unicode Normalization Forms [here](https://unicode.org/reports/tr15/#Normalization_Forms_Table).

  - [`NFD()`](https://huggingface.co/docs/tokenizers/api/normalizers#tokenizers.normalizers.NFD)

3. We'll also add our `ByteLevel()` pre-tokenizer, and our `ByteLevelDecoder()` decoder.

  - [`ByteLevel()`](https://huggingface.co/docs/tokenizers/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel)
  - [`ByteLevelDecoder()`](https://huggingface.co/docs/tokenizers/api/decoders#tokenizers.decoders.ByteLevel)

In [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFD, Sequence
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = Sequence([
    NFD()
])
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

We'll want to add some special tokens to our tokenizer to ensure it has access to common token patterns.

Let's use the following:

- `"<s>"`    : bos_token - beginning of sequence token
- `"</s>"`   : eos_token - end of sequence token
- `"<pad>"`  : padding_token - token used to pad sequences
- `"<unk>"`  : unk_token - token used to represent unknown tokens.
- `"<mask>"` : mask_token - token used to mask parts of our sequence

We're also going to set a target vocabulary of 50,000 tokens.

In [21]:
trainer = BpeTrainer(
    vocab_size=50000,
    show_progress=True,
    special_tokens=[
      "<unk>", "<s>", "<pad>", "<mask>", "</s>"
    ]
)

Nothing left to do but point it at our data-source and let it train!

We'll use the `.train()` method to accomplish this task.

> NOTE: Pay attention to the desired inputs of the `.train()` method.

- [`Tokenizer.train()`](https://huggingface.co/docs/tokenizers/api/tokenizer#tokenizers.Tokenizer.train)

In [22]:
tokenizer.train(
    files=[str(input_file_path)],
)






Now we can save our tokenizer - and then load it as a `GPT2Tokenizer` through the Hugging Face Library!

In [23]:
# save_path = '/content/tokenizer'
save_path = base_path / 'tokenizer'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.model.save(str(save_path))

['/home/paperspace/llm-engineering-course/hw1/tokenizer/vocab.json',
 '/home/paperspace/llm-engineering-course/hw1/tokenizer/merges.txt']

In [24]:
#!pip install transformers -qU

In [25]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(save_path, unk_token="[UNK]")

Let's see how it tokenizes our inputs!

In [26]:
input_sentence = "Hark, my name be Romeo! I am but a beautiful summer's day!"

In [27]:
tokenized_sentence = ### ??
tokenized_sentence

SyntaxError: invalid syntax (1588958192.py, line 1)

In [28]:
encoded_tokens = tokenizer.encode(input_sentence)
encoded_tokens

[12072, 4, 119, 632, 116, 821, 0, 82, 290, 214, 67, 9108, 2994, 136, 506, 0]

In [29]:
decoded_tokens = tokenizer.decode(encoded_tokens)
decoded_tokens

"Hark, my name be Romeo! I am but a beautiful summer's day!"

## Tokenizing Dataset

Now that we have trained our tokenizer - let's create a dataset we can leverage with the `nanoGPT` library.

We'll simply encode our training and validation data - and then save them in binary files for later!

> NOTE: Pay attention to the format you want your dataset in. We want ids, which means we want to use the [`.encode()`](https://huggingface.co/docs/tokenizers/api/tokenizer#tokenizers.Tokenizer.encode) method of our tokenizer.

In [30]:
train_ids = tokenizer.encode(train_data)
val_ids = tokenizer.encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

train has 291,285 tokens
val has 34,222 tokens


In [31]:
# export to bin files
# data_path = "/data/shakespeare/"
data_path = base_path / "data/shakespeare/"

train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(data_path / 'train.bin')
val_ids.tofile(data_path / 'val.bin')
# train_ids.tofile(os.path.join(os.path.dirname(data_path), 'train.bin'))
# val_ids.tofile(os.path.join(os.path.dirname(data_path), 'val.bin'))

## Training The Model

Now that we have our tokenized dataset, let's get to training our model!

We have a lot of set-up to do before we click "`.train()`", so let's jump right into it!

First, let's literally jump into the `nanoGPT` repository we cloned earlier.

In [20]:
%cd nanoGPT

/home/paperspace/llm-engineering-course/hw1/nanoGPT


  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


We'll do some critical imports.

In [22]:
import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch

# from the local repo
from model import GPTConfig, GPT

### Hyper-Parameters

We have a laundry list of hyper-parameters to set up - let's walk through them and what they mean.

#### I/O

- `out_dir` - simple enough, this is the output directory where our checkpoints are saved

In [23]:
out_dir = base_path / 'out'

#### Initialization

Since we're training from scratch, we'll use `init_from = 'scratch'`.

In [26]:
init_from = 'scratch'

#### Eval and Logging

- `eval_interval` - this is the number of steps between evaluation stages, we'll want to see this ~`250`. Our model will be incredibly prone to over-fitting, and this will let us monitor with relative frequency.
- `log_interval` - this is how often our training progress will log. You can set this ~`10`. It's dealer's choice, really.
- `eval_iters` - this is how *many* iterations we want to evaluate for.
- `eval_only` - this would evaluate our model - but not train it. We'll leave this as `False` for now.
- `always_save_checkpoint` - this will always save our most recent checkpoint, regardless of metrics. For this example, we'll set this to `True`.

In [27]:
eval_interval = 250
eval_iters = 10 
log_interval = 10 
eval_only = False
always_save_checkpoint = True

#### Dataset

We can set our dataset here - we'll use the one we created earlier!

In [28]:
dataset = 'shakespeare'

#### Typical Hyper-Parameters

- `gradient_accumulation_steps` - we can use gradient accumulation to "simulate" larger batch sizes by combining multiple different optimization steps together, without needing the additional memory for large batch sizes. We don't need to worry so much about this for the toy problem - but this hyper-parameter can be configured for larger training runs. [Here](https://lightning.ai/blog/gradient-accumulation/) is some great reading on the topic.
- `batch_size` - Typical batch_size - the larger the merrier (up to a point) we'll be using `16` to ensure we do not exceed the memory quota of our GPU.
- `block_size` - this can be thought of as another term for the `context window` of our model. Since our model cannot take variable length inputs - we use this to set all inputs to our desired size. We'll use a value of `512` to ensure speedy training.

In [29]:
gradient_accumulation_steps = 1
batch_size = 128 
block_size = 512 

#### Model Architecture

- `n_layer` - this is the number of decoder layers we will use in our model. More would be considered better (up to a point) and the original GPT-2 paper uses `12`, but we will be using a truncated `6` for ease and speed of training.
- `n_head` - this is the number of attention heads in each decoder layer!
- `n_embd` - this is the embedding dimension of our model, this is analagous to our `model_d` from the previous notebook. A default value of ~`500` should do the trick!
- `dropout` - this sets our dropout value, since our model is small and going to be extremely prone to overfitting, consider setting this at a fairly aggresive level (`0.2` was used in the example training found in the notebook`).
- `bias` - wether or not to use bias inside the LayerNorm/Linear layers.

In [30]:
n_layer = 12
n_head =  12 
n_embd = n_head * 64
dropout = 0.4
bias = False

#### ❓ QUESTION:

What condition must be true as it relates to the `n_embd` and `n_head`?

`n_embd % n_head == 0`; not exactly sure why.

#### Optimizer Hyper-Parameters

Basic Optimizer Hyper-Parameters:

- `learning_rate` - it's our learning rate! We'll want to set this fairly high ~`1e-3` since we're training on such a small dataset.
- `max_iters` - how many iterations do we train for. More iters means longer training times. Feel free to tinker with this value! `5000` is a great place to start.

Learning Rate Decay Settings:

- `decay_lr` - set decay flag
- `weight_Decay` - how much to decay lr by
- `lr_decay_iters` - should be set to ~max_iters.
- `min_lr` - the minimum lr, should be ~ lr / 10

Clipping and Warmup:

- `grad_clip` - value to clip gradients to. useful for preventing vanishing gradients.
- `warmup_iters` - how many iterations to warmup for. Warmup is useful to allow your training to slowly warmup. It will use a low lr for a number of steps to avoid any massive initial spikes. Since we're training a very small model - we can avoid using many wamrup steps.

> NOTE: Many learnings taken from the [Chincilla paper](https://arxiv.org/pdf/2203.15556.pdf) for selecting default or appropriate values.

In [31]:
# adamw optimizer
learning_rate = 1e-3
max_iters = 5000
beta1 = 0.9
beta2 = 0.99

# lr decay settings
decay_lr = True
weight_decay = 1.0
lr_decay_iters = max_iters
min_lr = learning_rate/10

# clipping and warmup
grad_clip = 1.0
warmup_iters = 100

#### ❓ QUESTION:

Given a Learning Rate of `1e-4` and a maximum iteration cap of `10,000`: What should `lr_decay_iters` be, and what should `min_lr` be?

`min_lr = 1e-4/10`

`lr_decay_iters = 10000`

These hyper-parameters are necessary to set given the task we're training and given the environment we're training in.

In [43]:
backend = 'nccl'
device = 'cuda'
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
compile = True
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys}
# -----------------------------------------------------------------------------
master_process = True
seed_offset = 0
ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")
os.makedirs(out_dir, exist_ok=True)

tokens per iteration will be: 65,536


### Torch Settings

We need to set a few `torch` settings, including the seed, to allow us to train correctly on our GPU.

Not much is required for us to understand here - these are just necessary lines of code. Boilerplate.

In [44]:
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

### Dataloader

This block will:

1. Set the data path
2. Load the dataset we tokenized earlier from the `.bin` we saved
3. Define a `get_batch` function that will return us a random section of our data as well as a the corresponding "label" for that data and move it to the GPU for easy use inside our training loop.

In [45]:
data_dir = os.path.join('/data', dataset)
data_dir = base_path / "data" / dataset
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

#### ❓ Question:

What can you tell us about the way the labels are generated?
1.  First a random set of size `batch_size` of indices from the interval `[a, b)` is chosen
where `a = 1` and `b = len(data) - block_size`.  We subtract `block_size` so that we
do not "fall off the end" of the data.
2.  To get a data point, `x`, choose an index, `i`, from the set above; then get a `block_size` of
tokens from data starting at position `i` (`data[i:i+block_size]`).
3.  To get a label correspond to the data point from 2., shift right one token (`data[i+1:i+1+block_size]`).

I think this is done so that the model is pre-trained by "trying" to predict the next token given
previous token.

Please produce an example of a single x and y pair.

In [46]:
i = 10
x = train_data[i:i+block_size].astype(np.int64)
y = train_data[i+1:i+1+block_size].astype(np.int64)
print(f"x: {x[:5]}")
print(f"y: {y[:5]}")

x: [2087    4  491  131  428]
y: [  4 491 131 428   6]


### Simple Initialization of Model

Here we init our number of iterations as 0, and our best val loss as a very high number.

In [47]:
iter_num = 0
best_val_loss = 1e9

Obtain our vocab size from our trained tokenizer.

In [48]:
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = tokenizer.vocab_size
meta_vocab_size

20094

Create our model args dict.

Use the following as a guide: [Here](https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L109)

In [49]:
model_args = dict( 
    block_size = block_size,
    vocab_size = meta_vocab_size,
    n_layer = n_layer,
    n_head = n_head,
    n_embd = n_embd,
    dropout = dropout,
    bias = True 
)

Instantiate our model with the provided `model_args`.

These are derived from the hyper-parameters we set above.

In [50]:
if init_from == 'scratch':
    print("Initializing a new model from scratch")
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)

Initializing a new model from scratch
number of parameters: 100.49M


There we go! If you used the default values - you should have a model with 29.55M parameters!

Let's set our block_size to the correct size as determined in our configuration steps.

In [51]:
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size

Now we can look at our model in all its glory!

In [52]:
model.to(device)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(20094, 768)
    (wpe): Embedding(512, 768)
    (drop): Dropout(p=0.4, inplace=False)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.4, inplace=False)
          (resid_dropout): Dropout(p=0.4, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.4, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=768, out_features=20094, bias=False)
)

We'll set up our GradScaler - more information on this process [here](https://pytorch.org/docs/stable/amp.html#gradient-scaling).

In [53]:
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

Let's set up our optimizer below. Be sure to include the correct values. You can check the `model.py` file for more information on what is expected in the `configure_optimizers` method [here](https://github.com/karpathy/nanoGPT/blob/eba36e84649f3c6d840a93092cb779a260544d08/model.py#L263C85-L263C85).

In [54]:
optimizer = model.configure_optimizers(
    weight_decay,
    learning_rate,
    (beta1, beta2),
    device_type
)

checkpoint = None

num decayed parameter tensors: 50, with 100,760,064 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True


Now we can compile our model!

If you're using the T4 or V100 instance of Colab - this will not provide a signficant speed-up, but if you're using Ampere architecture (A100) you should notice a significant difference between the compiled and uncompiled model.

Read more about `torch.compile()` [here](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).

In [55]:
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

compiling the model... (takes a ~minute)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


We'll set up our loss estimation function here, which will help us estimate an arbitrarily accurate loss over either training or validation data by using many batches.

You'll notice that we quickly convert the model into `.eval()` model and then back to `.train()` mode.

In [56]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

### Creating our LR Scheduler

Beyond just slowly reducing our learning rate over time - we can use an LR Scheduler to allow us to move our learning according to a desired pattern.

We will use a "cosine with warmup" schedule and our learning rate, thusly, will follow this pattern:

![img](https://i.imgur.com/KoFEl0b.png)

There are many different schedulers, and many different ways to handle learning rate, and you can read about just a few of them [here](https://d2l.ai/chapter_optimization/lr-scheduler.html)!

In [57]:
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

We need to set some specific values in our env to allow training in Colab.

In [58]:
# !export LC_ALL="en_US.UTF-8"
# !export LD_LIBRARY_PATH="/usr/lib64-nvidia"
# !export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
# !ldconfig /usr/lib64-nvidia

## The Training Loop

Now we can finally grab our first batch and set our initial time to calculate how long our iterations are taking!

In [59]:
X, Y = get_batch('train')
t0 = time.time()
local_iter_num = 0
raw_model = model
running_mfu = -1.0 # model flops utilization

while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

step 0: train loss 10.0417, val loss 10.0452
iter 0: loss 10.0393, time 37416.80ms, mfu -100.00%
iter 10: loss 7.9366, time 321.43ms, mfu 43.10%
iter 20: loss 6.6990, time 321.98ms, mfu 43.09%
iter 30: loss 5.9097, time 321.80ms, mfu 43.09%
iter 40: loss 5.6520, time 322.53ms, mfu 43.08%
iter 50: loss 5.4340, time 322.93ms, mfu 43.06%
iter 60: loss 5.2163, time 323.04ms, mfu 43.04%
iter 70: loss 5.0008, time 323.61ms, mfu 43.02%
iter 80: loss 4.8323, time 323.53ms, mfu 43.00%
iter 90: loss 4.6810, time 323.91ms, mfu 42.98%
iter 100: loss 4.5772, time 325.00ms, mfu 42.94%
iter 110: loss 4.4567, time 324.04ms, mfu 42.92%
iter 120: loss 4.3930, time 324.32ms, mfu 42.90%
iter 130: loss 4.1994, time 325.17ms, mfu 42.87%
iter 140: loss 4.1690, time 325.06ms, mfu 42.85%
iter 150: loss 4.0846, time 324.30ms, mfu 42.83%
iter 160: loss 3.9569, time 323.92ms, mfu 42.83%
iter 170: loss 3.9564, time 324.75ms, mfu 42.81%
iter 180: loss 3.8598, time 324.10ms, mfu 42.80%
iter 190: loss 3.7840, time 32

It is overfitting quite a bit.  I tried increasing the batch size and the dropout to
no avail.  What other options are there?

## Generating Outputs with our New Model

Now we can leverage the `sample.py` file to generate outputs from our model!

### Generation Set Up and Model Loading

In [35]:
import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT

# -----------------------------------------------------------------------------
init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
out_dir = base_path / 'out' # ignored if init_from is not 'resume'
start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
num_samples = 10 # number of samples to draw
max_new_tokens = 500 # number of tokens generated in each sample
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1337
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = False # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

In [36]:
# model
if init_from == 'resume':
    # init from a model saved in a specific directory
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)

number of parameters: 100.49M


In [37]:
model.eval()
model.to(device)
if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

In [43]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(base_path / "tokenizer", unk_token="[UNK]")

enc = tokenizer
encode = lambda s: enc.encode(s)
decode = lambda l: enc.decode(l)

### Generation!

In [44]:
# encode the beginning of the prompt
if start.startswith('FILE:'):
    with open(start[5:], 'r', encoding='utf-8') as f:
        start = f.read()
start_ids = encode(start)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

# run generation
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')



Shepherd:
None, sir; I have no pheasant, cock nor hen.

AUTOLYCUS:
How blessed are we that are not simple men!
Yet nature might have made me as these are,
Therefore I will not disdain.

Clown:
This cannot be but a great courtier.

Shepherd:
His garments are rich, but he wears
them not handsomely.

Clown:
He seems to be the more noble in being fantastical:
a great man, I'll warrant; I know by the picking
on's teeth.

AUTOLYCUS:
The fardel there? what's i' the fardel?
Wherefore that box?

Shepherd:
Sir, there lies such secrets in this fardel and box,
which none must know but the king; and which he
shall know within this hour, if I may come to the
speech of him.

AUTOLYCUS:
Age, thou hast lost thy labour.

Shepherd:
Why, sir?

AUTOLYCUS:
The king is not at the palace; he is gone aboard a
new ship to purge melancholy and air himself: for,
if thou beest capable of things serious, thou must
know the king is full of grief.

Shepard:
So 'tis said, sir; about his son, that should have
married