Let's train the gpt2 model on Shakespeare's plays, this time using a larger trainer.batch_size setting together with gradient accumulation, so that training fits in consumer GPU memory sizes.

We'll train the 'gpt2-medium' model (350M params).
On a 12GB RTX 3060 GPU, gpt2-medium can be trained with a (non-accumulating) batch_size of 4. If it doesn't fit in your GPU's memory, switch to the smaller model: just modify 'gpt2-medium' in the line:
```python
ben.init_pretrained('gpt2-medium', cfg)
```
Change to 'gpt2' (124M).

See notebook shakespeare_gpt2 for the non-gradient-accumulating version.
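
To see why gradient accumulation lets a large batch fit in limited memory, here is a minimal sketch in plain Python. It is a hypothetical toy example (a one-parameter linear model with a mean squared error loss), not gptbench's actual implementation, and the helper names `grad_mse` and `accumulated_grad` are made up for illustration. It shows that summing suitably scaled gradients from small micro-batches gives the same gradient as processing the whole batch at once, while only one micro-batch needs to be held at a time.

```python
# Hypothetical toy example, not gptbench code:
# a 1-parameter linear model y = w * x with a mean squared error loss.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)**2) with respect to w over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, accum_size):
    """Same gradient, accumulated over micro-batches of accum_size samples."""
    assert len(batch) % accum_size == 0, "batch must be a multiple of accum_size"
    micro_steps = len(batch) // accum_size
    total = 0.0
    for i in range(micro_steps):
        micro = batch[i * accum_size:(i + 1) * accum_size]
        # scale each micro-batch gradient so the sum equals the full-batch mean
        total += grad_mse(w, micro) / micro_steps
    return total

# 16 samples (like batch_size=16), split into 8 micro-batches of 2
batch = [(float(x), 3.0 * x + 1.0) for x in range(16)]
full = grad_mse(0.5, batch)
accum = accumulated_grad(0.5, batch, accum_size=2)
print(abs(full - accum) < 1e-9)
```

In a real training loop the optimizer step would run once per full batch, after all micro-batch gradients have been accumulated.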

In [1]:
from gptbench import Train, empty_config

In [2]:
ben = Train('gpt2-accum', seed=0xC1A551C)

# Let's set the shakespeare.txt data:
# train_split=1 means no validation dataset, to maximize training data
ben.set_datasets(class_name='gpt2', # GPT2TokensDataset class
                 train_path='../data/shakespeare.txt', 
                 train_split=1.) 

cfg = empty_config()

# use 16-bit floats for half the storage per param
cfg.model.dtype='bfloat16'

# use a batch_size of 16, accumulated in smaller batches of 2 (accum_size)
cfg.trainer.set(batch_size=16, accum_size=2)

# if you get an out of memory error, change to 'gpt2':
ben.init_pretrained('gpt2-medium', cfg)

Initializing model from gpt2-medium
Dataset: encoding utf-8 to tokens
Dataset: loading uint16 tokens
Dataset train_path: ../data/shakespeare.txt, val_path: None, train_split: 1.0, vocab_size: 50257
Model params: 354.82M
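
As a rough check of the "half the storage per param" comment above, the following back-of-the-envelope sketch estimates the storage taken by the weights alone. It assumes the reported 354.82M means 354.82 million parameters; `weights_mib` is a hypothetical helper, and the estimate leaves out gradients, optimizer state and activations, so memory use during training will be considerably higher.

```python
# Back-of-the-envelope estimate: storage for the model weights only.
# Assumption: "Model params: 354.82M" means 354.82e6 parameters.

def weights_mib(param_count, bytes_per_param):
    """Size of the weights in MiB for a given per-parameter storage size."""
    return param_count * bytes_per_param / 2**20

params = 354.82e6
for dtype, size in {"float32": 4, "bfloat16": 2}.items():
    print(f"{dtype}: {weights_mib(params, size):.0f} MiB")
```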


In [3]:
# set training log periods
ben.set_train_log_periods(sample_period=20, dot_period=1, loss_period=10)

# and train for 100 iters
ben.train(iter_count=100)

Training
Iter 0 (0.000 epoch): loss train=3.5583, val=inf, eval->3.5583
==> Saving model at iter=0, eval loss->3.5583 
.Sampling:  peasants and childless farms having gone before, in arms with their innovators. The barren silent neighbourhood was a purerage of monied units, the roads bereft of their lines of defence, blown away in vast reasons, completely vacated of their established nuisances, defectionations shew'd before them how far the time of fighting had gone, and benefited in the end, in the snares of asphalt and planking, and military fashion, how widespread and uniform' was all the destruction
CUDA max memory used: 9292.57M
.........
Iter 10 loss=3.4023, iter_dt=4271.40ms
..........
Iter 20 loss=3.4355, iter_dt=4313.52ms
.Sampling: sha, the forest's Because he want himself but occasioned 2Evil royal frenzies of bodily crimes. And moreover (for he is cowardly and dissolute) (working in humble business),it will mean defeat to any unarmed approach from him, for to themselves he 

In [9]:
# evaluate loss after 100 training iterations:
config = ben.get_config()
ben.estimate_loss(ben.train_dataset, None, 4, 25)

[3.2143750190734863, None]

A loss of 3.21 at 100 training iters is better than the 3.53 of the smaller GPT2 model (shakespeare_gpt2 notebook) at 100 iters. This is probably because this model is larger, so it learns faster.

In [10]:
ben.sample("So it goes")

So it goes - I have done quite well.

SCARABIC:
Seven is a good one and I shall hope to get both
in any mischance
Upon your supposed existence: look you,
I cannot read beyond one point - some one pleas
devil to trace and distort, yet all
shall find it the wont of my purpose. Sir
sir Sanctuaries have the best marvellous sport
In snowy Trobriand since they escaped
Five last years


In [11]:
ben.sample("Bermuda")

Bermuda's (Massachusett) supply of timber is another of the humbling impediments to his present insurrection. The wages of common soldiers are all of their allowance; of few they cannot be increased, They must make their answers into regard which hurt four halfpence to one pound; Of three tens there must stand the rent of privation. Their profits would not till put off;—but they presently bid them break their face! Of them, Dadlebubble!
I have proper liberty
