Train a transformer model to convert decimal numbers to Roman numerals, for example:
56=LVI

We'll pass '56=' as the starting text and look at what the model outputs after the '=' character.

To create the training and validation data, run this in the ../dataprep folder:
```
python prepare_roman.py ../data/decimal2roman10k.txt 10000 --sep=\n
```
The script creates decimal2roman10k.txt with entries in the form decimal=roman, one per line.
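
Note that the entries shown in this notebook write thousands by repeating M (for example 5913=MMMMMCMXIII) rather than stopping at 3999. The following is a minimal sketch of a conversion that reproduces that form. It is not the code of prepare_roman.py, which is not shown here, and the function name `to_roman_repeated_m` is hypothetical:

```python
def to_roman_repeated_m(n: int) -> str:
    """Convert a positive integer to a Roman numeral, using one 'M' per thousand."""
    if n <= 0:
        raise ValueError("n must be a positive integer")

    # Value/symbol pairs in descending order, including the subtractive forms.
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]

    parts = []
    for value, symbol in pairs:
        # Take as many copies of this symbol as fit, then continue with the remainder.
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


# Print a few entries in the decimal=roman form used by the dataset.
for number in (56, 2209, 5913, 9999):
    print(f"{number}={to_roman_repeated_m(number)}")
```

Because 1000 is simply the largest entry in the table, `divmod` emits as many M characters as there are thousands, which is what makes the output for numbers above 3999 so long.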

About roman numerals: https://en.wikipedia.org/wiki/Roman_numerals

In [6]:
from gptbench import Train, empty_config

In [3]:
ben = Train('dec2roman', seed=0xbeebaca)

# set training log periods to avoid cluttering the training output
ben.set_train_log_periods(sample_period=500, dot_period=1, loss_period=0)

# set datasets: shuffle the datasets before splitting
ben.set_datasets(class_name='charline', 
                 train_path='../data/decimal2roman10k.txt', 
                 train_split=(9000-1)/10000,
                 pre_shuffle=True)

# set config settings
cfg = empty_config()
cfg.model.set(n_layer=6, n_head=6, n_embd=90, block_size=32)
cfg.trainer.set(batch_size=128)
cfg.sample.set(top=1, max_batch_size=256) # top_k(1) - always pick the best item

# and init a new model with config
ben.init_new(cfg)

Initializing new model dec2roman
Dataset train_path: ../data/decimal2roman10k.txt, val_path: None, train_split: 0.8999, vocab_size: 19
Model params: 0.59M


In [4]:
# a peek at the validation dataset
ben.val_dataset.get_data()[:20]

['2209=MMCCIX',
 '5913=MMMMMCMXIII',
 '507=DVII',
 '8029=MMMMMMMMXXIX',
 '3685=MMMDCLXXXV',
 '7422=MMMMMMMCDXXII',
 '8805=MMMMMMMMDCCCV',
 '8390=MMMMMMMMCCCXC',
 '4128=MMMMCXXVIII',
 '7937=MMMMMMMCMXXXVII',
 '4076=MMMMLXXVI',
 '8075=MMMMMMMMLXXV',
 '5783=MMMMMDCCLXXXIII',
 '6607=MMMMMMDCVII',
 '3620=MMMDCXX',
 '6623=MMMMMMDCXXIII',
 '651=DCLI',
 '2822=MMDCCCXXII',
 '7117=MMMMMMMCXVII',
 '9709=MMMMMMMMMDCCIX']

In [5]:
# let's train for this many iters:
ben.train(iter_count=3000)

Training
Iters per epoch: 70
Iter 0 (0.000 epoch): loss train=2.6152, val=2.6174, eval->2.6174
==> Saving model at iter=0, eval loss->2.6174 
Sampling: D
CUDA max memory used: 331.81M
...................................................................................................
Iter 100 (1.422 epoch): loss train=1.1048, val=1.1043, eval->1.1043
==> Saving model at iter=100, eval loss->1.1043 
...................................................................................................
Iter 200 (2.845 epoch): loss train=0.7043, val=0.7023, eval->0.7023
==> Saving model at iter=200, eval loss->0.7023 
...................................................................................................
Iter 300 (4.267 epoch): loss train=0.5250, val=0.5249, eval->0.5249
==> Saving model at iter=300, eval loss->0.5249 
...................................................................................................
Iter 400 (5.690 epoch): loss train=0.4119, val=0.4125, eval->0.41

In [9]:
# Let's load the best saved checkpoint. Train and validation losses are almost equal, which is good.
ben.load()
ben.state

Loading checkpoint from ./checkpoints/dec2roman/
Checkpoint: iter=2900 (41.249 epoch), loss train=0.2199 val=0.2202 eval->0.2202
Dataset train_path: ../data/decimal2roman10k.txt, val_path: None, train_split: 0.8999, vocab_size: 19
Model params: 0.59M


{'n_samples': 371200,
 'train_loss': 0.2198736071586609,
 'val_loss': 0.22023971378803253,
 'eval_loss': 0.22023971378803253}
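
The figures in the logs above are consistent with each other, which can be checked with a little arithmetic. This is only a sketch: it assumes the dataset has 10000 lines (the count passed to prepare_roman.py) and that iterations per epoch is the number of training entries divided by the batch size, rounded down. How gptbench computes these values internally is not shown here.

```python
total_entries = 10000   # assumed: the count passed to prepare_roman.py
train_entries = 8999    # train_split = (9000-1)/10000
batch_size = 128        # cfg.trainer.set(batch_size=128)

val_entries = total_entries - train_entries
iters_per_epoch = train_entries // batch_size     # assumed: rounded down
epoch_at_iter_100 = 100 * batch_size / train_entries
samples_at_iter_2900 = 2900 * batch_size
epoch_at_iter_2900 = samples_at_iter_2900 / train_entries

print(val_entries)                   # size of the validation split
print(iters_per_epoch)               # compare with "Iters per epoch" in the log
print(round(epoch_at_iter_100, 3))   # compare with the epoch shown at iter 100
print(samples_at_iter_2900)          # compare with 'n_samples' in ben.state
print(round(epoch_at_iter_2900, 3))  # compare with the epoch of the loaded checkpoint
```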

In [10]:
# To capture the entries that fail (or pass) the accuracy test, we could simply pass a log_list.
# But we can also pass a custom test function, which can capture those entries as well:
ds = ben.val_dataset
q,a=ds.get_data_split(0, len(ds), sep='=', sep_included=-1)

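errs = []
def test(q,a,g):
    # q: question (e.g. '331='), a: expected answer, g: generated answer.
    # Returns 1.0 for an exact match, else 0.0, and records mismatches in the global errs list.
    global errs
    
    res = float(a == g)
    if not res:
        errs.append(f"{q}{a} != {g}")
    return res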
    
print(ben.measure_accuracy(q,a, test_fn=test))
print(f'{len(errs)}/{len(ds)} errors: {errs[:20]}')

0.998001998001998
2/1001 errors: ['331=CCCXXXI != CCCXXI', '4=IV != I']


In [11]:
# Almost 100% accuracy: 2 errors out of 1001 entries. Not perfect, but not bad for validation data, which was unseen during training.
# What about the train dataset's accuracy?
ds = ben.train_dataset
q,a=ds.get_data_split(0, len(ds), sep='=', sep_included=-1)

errs = []
print(ben.measure_accuracy(q,a, test_fn=test))
print(f'{len(errs)}/{len(ds)} errors: {errs[:20]}')

0.9958884320480054
37/8999 errors: ['831=DCCCXXXI != DCCCXXI', '37=XXXVII != XXVII', '21=XXI != XII', '79=LXXIX != LXIX', '381=CCCLXXXI != CCCLXXI', '33=XXXIII != XXIII', '36=XXXVI != XXVI', '881=DCCCLXXXI != DCCCLXXI', '39=XXXIX != XXIX', '96=XCVI != XVI', '9=IX != I', '3=III != II', '989=CMLXXXIX != CMLXXIX', '89=LXXXIX != XXXIX', '31=XXXI != XXII', '26=XXVI != XVI', '46=XLVI != XVIV', '38=XXXVIII != XXVIII', '75=LXXV != LXV', '481=CDLXXXI != CDLXXI']
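
Most of the errors listed above look like the model dropping a single character, often one of a repeated run (XXX generated as XX). A minimal sketch to check this on error strings in the `q=a != g` format built by the test function; the helper names `parse_err` and `is_single_deletion` are hypothetical and not part of gptbench:

```python
def parse_err(entry: str):
    """Split an error string like '831=DCCCXXXI != DCCCXXI' into (question, expected, generated)."""
    question_answer, generated = entry.split(" != ")
    question, expected = question_answer.split("=", 1)
    return question, expected, generated


def is_single_deletion(expected: str, generated: str) -> bool:
    """True if generated equals expected with exactly one character removed."""
    if len(expected) != len(generated) + 1:
        return False
    return any(
        expected[:i] + expected[i + 1:] == generated
        for i in range(len(expected))
    )


# A few entries copied from the error list above.
sample_errs = [
    "831=DCCCXXXI != DCCCXXI",
    "21=XXI != XII",
    "96=XCVI != XVI",
    "46=XLVI != XVIV",
]

for entry in sample_errs:
    question, expected, generated = parse_err(entry)
    kind = "single deletion" if is_single_deletion(expected, generated) else "other"
    print(f"{question}: {kind}")
```

Running the same check over the full `errs` list would show how much of the remaining error is of this one kind.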


In [13]:
# Also near 100% accuracy.
# Let's take a few samples:
ben.sample('17=')
ben.sample('225=')
ben.sample('999=')
ben.sample('9999=')

17=XVII
225=CCXXV
999=CMXCIX
9999=MMMMMMMMMCMXCIX


Would more training get us to 100% accuracy?

Also see the roman2decimal notebook for the inverse mapping.