This notebook estimates the largest batch size that fits in GPU memory when sampling from and training a GPT-2 model, using the estimate_max_batch_size() function.

For an NVIDIA RTX 3060 with 12GB, using model.dtype='bfloat16' (2 bytes per parameter), these batch sizes were estimated:
|Model (Params)|Sample|Train|
|------|------|------|
|gpt2 (124M)|70|16-10|
|gpt2-medium (350M)|70|6-4|
|gpt2-large (774M)|50|3-1|
|gpt2-xl (1558M)|50|0|

Each Train value is a pair separated by a dash: the batch size that fits when shared GPU memory is also used, then the batch size that fits in dedicated GPU memory alone (faster).
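As a rough sanity check on the table, the weight footprint alone can be worked out from the parameter counts and the 2 bytes per parameter of bfloat16. The sketch below is plain arithmetic and not part of GPTBench. The helper name `weights_gib` is made up for illustration, and activations, gradients and optimizer state are deliberately left out.

```python
# Parameter counts (in millions) taken from the table above.
PARAMS_M = {
    "gpt2": 124,
    "gpt2-medium": 350,
    "gpt2-large": 774,
    "gpt2-xl": 1558,
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 2 bytes


def weights_gib(params_millions, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Return the memory taken by the weights alone, in GiB (hypothetical helper)."""
    return params_millions * 1e6 * bytes_per_param / 2**30


for name, params in PARAMS_M.items():
    print(f"{name:12s} {weights_gib(params):5.2f} GiB of weights")
```

The weights are only the floor: whatever memory is left after them is what the batch itself can use.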

GPTBench includes gradient accumulation, so the values obtained here can be used for the model.accum_size config setting, while the actual model.batch_size can be larger.
Also, since build 0.2, the GPT model class supports Flash Attention, which reduces memory needs.
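The gradient accumulation mentioned above relies on a simple property: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The sketch below shows that property in plain Python on a one-parameter model. It is an illustration of the general technique, not GPTBench's implementation, and all names in it are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)**2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of consecutive micro-batches of equal size."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for micro in micro_batches:
        # Scale each micro-batch gradient by 1/num_micro_batches, mirroring
        # the usual division of the loss by the number of accumulation steps.
        total += grad_mse(w, micro) / len(micro_batches)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)              # one batch of 4 samples
accum = accumulated_grad(0.5, data, 2)  # two micro-batches of 2 samples
print(full, accum)
```

Only one micro-batch has to be resident in memory at a time, which is why the size estimated in this notebook limits the micro-batch rather than the effective batch.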

The numbers above were obtained with the memory-hungry AdamW optimizer; using an SGD optimizer instead would allow larger batch sizes.
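To see why the optimizer choice matters, here is a back-of-the-envelope estimate of the per-parameter training state. It assumes the gradients and the optimizer state are stored in the same 2-byte dtype as the weights, that AdamW keeps two running averages per parameter, and that SGD is used without momentum. Activations are ignored, and `training_state_gib` is a hypothetical helper, not a GPTBench function.

```python
def training_state_gib(params_millions, optimizer, bytes_per_value=2):
    """Rough GiB for weights + gradients + optimizer state (activations excluded)."""
    # Extra per-parameter tensors kept by the optimizer (assumption):
    # AdamW tracks two running averages; plain SGD without momentum tracks none.
    extra_tensors = {"adamw": 2, "sgd": 0}[optimizer]
    tensors_per_param = 1 + 1 + extra_tensors  # weights + gradients + state
    return params_millions * 1e6 * tensors_per_param * bytes_per_value / 2**30


for opt in ("adamw", "sgd"):
    print(f"gpt2-xl with {opt}: {training_state_gib(1558, opt):.1f} GiB")
```

Under these assumptions, gpt2-xl with AdamW needs roughly 11.6 GiB before any activations are counted, which is consistent with the 0 in the Train column. Plain SGD would need about half of that.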

The process is somewhat messy: the Jupyter notebook sometimes holds on to allocated memory, and only restarting the kernel frees it.

It also happened that a batch size that used to work would later give an out of memory error. Keep in mind that the GPU is shared with other software, so values near full memory occupation might not work every time.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

from gptbench import Train, GPT, Conf, empty_config

In [2]:
model_type = 'gpt2' # 'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'

dataset_path='../data/shakespeare.txt'

# set model config settings
model_config = GPT.get_config_from_type(model_type)
model_config.dtype='bfloat16' # not 'float32'
print(model_config)

device=auto
dtype=bfloat16
n_layer=12
n_head=12
n_embd=768
vocab_size=50257
block_size=1024
dropout=0.1
flash_attn=True


In [3]:
# estimate max batch_size for sampling
batch_size = Train.estimate_max_batch_size(model_config, None, starting_size=80, delta_size=-5, times=2)
batch_size

Creating model
Trying batch_size 80... Out of memory
Trying batch_size 75... Out of memory
Trying batch_size 70... Fits
Enough memory for batch_size 70


70
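The log above suggests a simple descending search: start at `starting_size`, step by `delta_size`, and stop at the first size that fits. A minimal sketch of such a search follows. It is not the actual code of `estimate_max_batch_size()`, whose internals are not shown here. The `fits` callback is a hypothetical stand-in for a real memory check, and the `times` argument is not modelled because its meaning is not described in this notebook.

```python
def find_max_batch_size(fits, starting_size, delta_size):
    """Step down from starting_size by delta_size until fits(size) is True.

    `fits` is a hypothetical callback; a real one would build a batch,
    run the model and return False when CUDA reports out of memory.
    Returns 0 when no positive size fits.
    """
    if delta_size >= 0:
        raise ValueError("delta_size must be negative for a descending search")
    size = starting_size
    while size > 0:
        if fits(size):
            return size
        size += delta_size  # delta_size is negative, so this shrinks the batch
    return 0


# Stand-in for a real memory check: pretend anything up to 72 fits.
print(find_max_batch_size(lambda n: n <= 72, starting_size=80, delta_size=-5))
```

With a step of -5 the result is only accurate to within 5, so a second pass with a smaller step, starting just above the value found, can tighten the estimate.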

In [4]:
# confirm that we can sample at this batch_size by calling measure_perplexity()
ben = Train(seed=0xb0ccacc10)
# set config settings
cfg = empty_config()

cfg.dataset.set(class_name='gpt2', 
                train_path=dataset_path, 
                train_split=0.9)
cfg.model = model_config

ben.init_pretrained(model_type, cfg)

print(f"Sampling with batch_size={batch_size}")
ppl = ben.measure_perplexity(ben.val_dataset, stride=-1, max_batch_size=batch_size)
print(f"Measured perplexity={ppl}")

# clean up
del ben
torch.cuda.empty_cache()

Initializing model from gpt2
Dataset: encoding utf-8 to tokens
Dataset: loading uint16 tokens
Dataset: loading uint16 tokens
Dataset train_path: ../data/shakespeare.txt, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 124.44M
Sampling with batch_size=70
Measured perplexity=69.01852867010916
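For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition. Whether `measure_perplexity()` computes it exactly this way is an assumption, and the loss values are made up purely to show the calculation.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Made-up per-token losses, only to show the calculation.
print(perplexity([4.1, 4.3, 4.2]))
```

Read the other way, a perplexity of about 69 corresponds to a mean loss of about 4.23 nats per token under this definition.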


In [5]:
# now estimate for training using AdamW optimizer
batch_size = Train.estimate_max_batch_size(model_config, 'adamw', starting_size=10, delta_size=-2, times=2)
batch_size

Creating model
Creating optimizer adamw
Trying batch_size 10... Fits
Enough memory for batch_size 10


10

In [8]:
# try training with the estimated batch_size
ben = Train(seed=0xb0ccacc10)

# set config settings
cfg = empty_config()

cfg.dataset.set(class_name='gpt2', 
                train_path=dataset_path, 
                train_split=0.9)
cfg.model = model_config
cfg.trainer.batch_size=batch_size
cfg.trainer.optimizer='adamw'

ben.init_pretrained(model_type, cfg)

Initializing model from gpt2
Dataset: encoding utf-8 to tokens
Dataset: loading uint16 tokens
Dataset: loading uint16 tokens
Dataset train_path: ../data/shakespeare.txt, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 124.44M


In [9]:
ben.train(iter_count=20)
ben.sample("Marcus")

Training
.........
Iter 10 loss=4.0000, iter_dt=6828.22ms
..........
Iter 20 loss=3.6875, iter_dt=5771.74ms
.Marcus.

RENEWAL: It's been quite long, Mrs. Holland.

REID: Glad to hear. There are many compliments upon you all, though.

REID: I have learnt a lot from Consome Gable, who was once perhaps your best horticulturist; yet he said to me that the principal causes of the dyspeptic are exercised mainly by laziness itself, and sometimes weakness, that is fumes, and put the fumes into
