Can the model learn to add two 2-digit numbers?

How well will it generalize to unseen sequences?

In [1]:
from gptbench import Train, empty_config, LogFlag

To create the data from which we'll build the train and validation datasets, run the following in the ../dataprep folder:
```
python prepare_addition.py ../data/add2.txt 2 --sep="\n"
```
The script creates add2.txt with entries in the form a+b=c, one per line.
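
For reference, here is a minimal sketch of what such a generator could look like. This is not the actual prepare_addition.py (its source isn't shown here); `make_additions` is a hypothetical helper that only reproduces the format and ordering seen in the samples printed below.

```python
def make_additions(digits=2, sep="\n"):
    """Return all 'a+b=c' entries for operands with up to `digits` digits.

    Hypothetical stand-in for prepare_addition.py, not its real code.
    """
    limit = 10 ** digits  # 100 for 2-digit operands
    entries = [f"{a}+{b}={a + b}" for a in range(limit) for b in range(limit)]
    return sep.join(entries) + sep


text = make_additions(2)
lines = text.splitlines()
print(len(lines), lines[0], lines[-1])
```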

In [2]:
# Opening the data samples - the first 100 chars
with open('../data/add2.txt', 'r', newline=None) as f:
    data = f.read()
print(data[:100])

0+0=0
0+1=1
0+2=2
0+3=3
0+4=4
0+5=5
0+6=6
0+7=7
0+8=8
0+9=9
0+10=10
0+11=11
0+12=12
0+13=13
0+14=14



In [3]:
# and the last 100:
print(data[-100:])

99+90=189
99+91=190
99+92=191
99+93=192
99+94=193
99+95=194
99+96=195
99+97=196
99+98=197
99+99=198



All entries are in the form a+b=c. We'll split the data into train and validation datasets:
- Train includes all samples from 0+0=0 to 89+99=188
- Validation includes all samples from 90+0=90 to 99+99=198

Note that training never sees sums where the first number is 90 or above, but it does see 90 and above as the second term, in additions like 10+95=105. From this, will the model be able to learn 90+10?
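
A quick way to see where a 0.9 split lands, assuming the dataset is split by line count in file order (which the validation samples shown further down are consistent with). The data is regenerated inline here rather than read through gptbench.

```python
# Rebuild the 10,000 entries in the same order as add2.txt
lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]

split_at = len(lines) * 9 // 10  # first 9000 lines go to training
train, val = lines[:split_at], lines[split_at:]

print(train[-1])  # last training sample
print(val[0])     # first validation sample

# 90..99 never appear as the first term in training, but do appear as the second
print(all(int(s.split("+")[0]) < 90 for s in train))
print("10+95=105" in train)
```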

We'll load the samples via the CharLineDataset class: each line read is stored in a 16-character block, padded at the end.
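
To picture the fixed-size blocks, here is a rough sketch of end-padding a line to 16 characters. The pad character and the `pad_line` helper are assumptions made for illustration; CharLineDataset may use a different padding token and layout.

```python
BLOCK_SIZE = 16
PAD = "\n"  # assumed padding character, chosen only for this sketch


def pad_line(line, block_size=BLOCK_SIZE, pad=PAD):
    """Right-pad one sample line so it fills exactly block_size characters."""
    if len(line) > block_size:
        raise ValueError(f"line longer than block_size: {line!r}")
    return line + pad * (block_size - len(line))


block = pad_line("99+99=198")  # longest entry: 9 characters
print(repr(block), len(block))
```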

In [4]:
# create the GPTBench object - we'll name this model add2
ben = Train('add2', seed=0xADD2BEA7)

# set training log periods to avoid cluttering the output below
ben.set_train_log_periods(sample_period=500, dot_period=1, loss_period=0)

# set train and validation datasets
ben.set_datasets(class_name='charline', 
                 train_path='../data/add2.txt', 
                 train_split=9000/10000) # split at the start of line with 90+..

# set config settings that will override the default values
cfg = empty_config()
cfg.model.set(n_layer=6, n_head=6, n_embd=90, block_size=16) # our model parameters - block_size is big enough for aa+bb=ccc
cfg.sample.set(top=1, max_batch_size=256) # note the top_k(1) - always pick the best item
cfg.trainer.set(batch_size=128)

# and init a new model with config
ben.init_new(cfg)

Initializing new model add2
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M
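
The 0.59M figure is roughly what a GPT-2-style layout gives for these settings. The estimate below assumes that layout (per block: fused QKV, output projection, a 4x MLP and two LayerNorms, all with biases) and leaves out the output head; gptbench's exact accounting isn't shown here and may differ slightly.

```python
def estimate_gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Rough parameter count for an assumed GPT-2-style decoder (no output head)."""
    d = n_embd
    attn = (3 * d * d + 3 * d) + (d * d + d)      # QKV + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)   # expand to 4d, project back
    norms = 2 * (2 * d)                           # two LayerNorms (weight + bias)
    per_block = attn + mlp + norms
    embeddings = vocab_size * d + block_size * d  # token + position tables
    final_norm = 2 * d
    return n_layer * per_block + embeddings + final_norm


total = estimate_gpt_params(n_layer=6, n_embd=90, vocab_size=13, block_size=16)
print(f"{total / 1e6:.2f}M")
```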


In [5]:
# confirm that validation dataset only has entries where the first addition term is 90..99
ben.val_dataset.get_data()[:10], ben.val_dataset.get_data()[-10:]

(['90+0=90',
  '90+1=91',
  '90+2=92',
  '90+3=93',
  '90+4=94',
  '90+5=95',
  '90+6=96',
  '90+7=97',
  '90+8=98',
  '90+9=99'],
 ['99+90=189',
  '99+91=190',
  '99+92=191',
  '99+93=192',
  '99+94=193',
  '99+95=194',
  '99+96=195',
  '99+97=196',
  '99+98=197',
  '99+99=198'])

In [6]:
# Let's train for 3000 batch iterations. 
# Each dot means a batch was trained.
# Train and validation losses are evaluated every 100 iterations (iters).
# Also, every 500 iters a random sample is taken.
ben.train(iter_count=3000)

Training
Iters per epoch: 70
Iter 0 (0.000 epoch): loss train=2.2523, val=2.2685, eval->2.2685
==> Saving model at iter=0, eval loss->2.2685 
Sampling: 3
CUDA max memory used: 164.88M
...................................................................................................
Iter 100 (1.422 epoch): loss train=1.0242, val=1.0641, eval->1.0641
==> Saving model at iter=100, eval loss->1.0641 
...................................................................................................
Iter 200 (2.844 epoch): loss train=0.8386, val=0.9072, eval->0.9072
==> Saving model at iter=200, eval loss->0.9072 
...................................................................................................
Iter 300 (4.267 epoch): loss train=0.7672, val=0.8868, eval->0.8868
==> Saving model at iter=300, eval loss->0.8868 
...................................................................................................
Iter 400 (5.689 epoch): loss train=0.7230, val=0.8593, eval->0.85
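
The epoch figures in the log follow from the dataset and batch sizes: 9000 training lines at 128 per batch give about 70 iterations per epoch. A small check, assuming an epoch is counted as samples seen divided by the training set size:

```python
TRAIN_SAMPLES = 9000
BATCH_SIZE = 128


def epochs_at(iteration):
    """Fractional epochs completed after `iteration` batches."""
    return iteration * BATCH_SIZE / TRAIN_SAMPLES


print(TRAIN_SAMPLES // BATCH_SIZE)  # whole iterations per epoch
for it in (100, 200, 300, 400):
    print(it, f"{epochs_at(it):.3f}")
```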

In [7]:
# No point in training much longer: the train loss keeps going down while the validation loss
# keeps going up, so the model is overfitting rather than generalizing.
# Let's compare the current state loss info:
ben.state

{'n_samples': 383872,
 'train_loss': 0.4554595947265625,
 'val_loss': 0.9535066485404968,
 'eval_loss': 0.9535066485404968}

In [8]:
# The last saved checkpoint info - the best performing model we got.
ben.last_saved_state

{'n_samples': 204800,
 'train_loss': 0.5231900215148926,
 'val_loss': 0.8066989779472351,
 'eval_loss': 0.8066989779472351}
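
The idea of keeping only the best checkpoint can be sketched as follows. The (iteration, validation loss) pairs are taken from the log and state outputs above, with iterations for the two states derived as n_samples / batch_size; `best_checkpoint` is a hypothetical helper, not part of gptbench.

```python
# (iteration, validation loss) pairs read from the outputs above
evals = [
    (0, 2.2685),
    (100, 1.0641),
    (200, 0.9072),
    (300, 0.8868),
    (204800 // 128, 0.8067),  # last saved checkpoint
    (383872 // 128, 0.9535),  # state at the end of training
]


def best_checkpoint(history):
    """Return the (iteration, loss) pair with the lowest validation loss."""
    return min(history, key=lambda pair: pair[1])


print(best_checkpoint(evals))
```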

In [9]:
# the last saved checkpoint has the lower validation loss, which suggests better generalization, so let's load it
ben.load()
ben.state

Loading checkpoint from ./checkpoints/add2/
Checkpoint: iter=1600 (22.756 epoch), loss train=0.5232 val=0.8067 eval->0.8067
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M


{'n_samples': 204800,
 'train_loss': 0.5231900215148926,
 'val_loss': 0.8066989779472351,
 'eval_loss': 0.8066989779472351}

In [10]:
# take a few samples: are the sums correct?
ben.sample('1+1=')
ben.sample('34+7=')
ben.sample('78+99=')

1+1=1
34+7=49
78+99=177


In [11]:
# Ugh - only the third sum is right!
# Let's measure the accuracy on the training dataset -
# this should be mostly memorization, as the model was trained on these data
train_ds = ben.train_dataset

# split each a+b=c line into a prompt 'a+b=' and an answer 'c'
q,a=train_ds.get_data_split(0, len(train_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['0+0=', '0+1=', '0+2=']
['0', '1', '2']
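
The same prompt/answer split can be reproduced in plain Python, which helps when checking what get_data_split returns. `split_prompt_answer` is a hypothetical helper for illustration; it keeps the separator on the prompt side, matching the output above.

```python
def split_prompt_answer(lines, sep="="):
    """Split each 'a+b=c' line into a prompt 'a+b=' and an answer 'c'."""
    prompts, answers = [], []
    for line in lines:
        head, found, tail = line.partition(sep)
        if not found:
            raise ValueError(f"separator {sep!r} missing in {line!r}")
        prompts.append(head + sep)
        answers.append(tail)
    return prompts, answers


q_demo, a_demo = split_prompt_answer(["0+0=0", "0+1=1", "0+2=2"])
print(q_demo)
print(a_demo)
```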


In [12]:
# Measure the accuracy - how good was the memorization?
# This may take a while (and give different results than the number below, if you changed the initial seed)
ben.measure_accuracy(q,a)

0.6481111111111111
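
Accuracy here is presumably an exact-match rate: the fraction of prompts whose generated completion equals the expected answer. A sketch under that assumption, with `generate` standing in for whatever produces the model's completion (both function names are hypothetical):

```python
def exact_match_accuracy(prompts, answers, generate):
    """Fraction of prompts for which generate(prompt) equals the expected answer."""
    if len(prompts) != len(answers):
        raise ValueError("prompts and answers must have the same length")
    if not prompts:
        return 0.0
    hits = sum(generate(p) == a for p, a in zip(prompts, answers))
    return hits / len(prompts)


# Toy "model": adds correctly unless the units digits produce a carry
def toy_generate(prompt):
    a, b = map(int, prompt.rstrip("=").split("+"))
    return str(a + b) if (a % 10 + b % 10) < 10 else "?"


acc = exact_match_accuracy(["1+1=", "34+7=", "78+99="], ["2", "41", "177"], toy_generate)
print(acc)
```

Under this definition, 0.6481 over the 9000 training samples would correspond to 5833 exact matches.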

In [13]:
# Not good: about 64%. Further training could improve accuracy, 
# but the model would be overfitting and memorizing the given samples.
# What about the accuracy of the validation dataset, on which the model never trained?
val_ds = ben.val_dataset

# split each a+b=c line into a prompt 'a+b=' and an answer 'c'
q,a=val_ds.get_data_split(0, len(val_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['90+0=', '90+1=', '90+2=']
['90', '91', '92']


In [14]:
# Remember that the validation dataset only has sums whose first term is 90..99, for example 90+2=92.
# The model did, however, see the reversed additions with 90..99 as the second term, for example 2+90=92.
# Did it somehow learn the commutative property of addition?
ben.measure_accuracy(q,a)

0.151

In [15]:
# Terrible: about 15%!
# How is the model failing - let's see some incorrect answers:

wrongs = []
ben.measure_accuracy(q,a, log_list=wrongs, log_cond=-0.5)

# first column is the start_text, second is the right answer, third is the generated text
wrongs[40:50]

[('90+41=', '131', '111'),
 ('90+42=', '132', '122'),
 ('90+43=', '133', '123'),
 ('90+44=', '134', '124'),
 ('90+45=', '135', '126'),
 ('90+46=', '136', '127'),
 ('90+47=', '137', '127'),
 ('90+48=', '138', '128'),
 ('90+49=', '139', '128'),
 ('90+50=', '140', '120')]

In [16]:
# In many cases the generated entries are off by around -10 from the right answer...
wrongs[200:210]

[('92+43=', '135', '125'),
 ('92+44=', '136', '126'),
 ('92+45=', '137', '127'),
 ('92+46=', '138', '128'),
 ('92+47=', '139', '129'),
 ('92+48=', '140', '129'),
 ('92+49=', '141', '120'),
 ('92+50=', '142', '123'),
 ('92+51=', '143', '133'),
 ('92+52=', '144', '134')]
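
To quantify that offset, the differences between generated and expected sums can be tallied. The tuples below are copied from the output above:

```python
from collections import Counter

# (prompt, right answer, generated) copied from wrongs[200:210] above
sample_wrongs = [
    ("92+43=", "135", "125"), ("92+44=", "136", "126"),
    ("92+45=", "137", "127"), ("92+46=", "138", "128"),
    ("92+47=", "139", "129"), ("92+48=", "140", "129"),
    ("92+49=", "141", "120"), ("92+50=", "142", "123"),
    ("92+51=", "143", "133"), ("92+52=", "144", "134"),
]

# generated minus expected, counted per distinct offset
offsets = Counter(int(got) - int(right) for _, right, got in sample_wrongs)
print(offsets.most_common())
```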

In [17]:
# let's try increasing model dropout from its 0.1 default, to improve generalization

# set config settings that will override existing values - only dropout changes
cfg = empty_config()
cfg.model.set(dropout=0.2)

# init a new model with config
ben.init_new(cfg, name='add2drop')

# list total config:
print(ben.get_config().dump(1))

Initializing new model add2drop
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M
seed: -1
sample: 
    max_len: 100
    count: 1
    start_text: None
    start_text_sep: |
    emit_start: True
    emit_after: None
    emit_before: None
    flush: True
    eot_stop: 0
    top: 1.0
    temp: 1.0
    max_batch_size: 256
    multiline_prompt: False
train: 
    eval_period: 100
    eval_type: 1.0
    eval_iters: 100
    eval_save_checkpt: 1
    eval_save_loss: csv,tensorboard
dataset: 
    class_name: charline
    train_path: ../data/add2.txt
    train_split: 0.9
    val_path: None
    params: None
model: 
    device: auto
    dtype: float32
    n_layer: 6
    n_head: 6
    n_embd: 90
    vocab_size: 13
    block_size: 16
    dropout: 0.2
trainer: 
    n_workers: 0
    batch_size: 128
    max_samples: None
    grad_norm_clip: 1.0
    optimizer: adamw
    learning_rate: 0.0001
    adamw_beta1: 0.9
    adamw_beta2: 0.95
    adamw_weigh

In [18]:
# train for a bit more this time - 5000 batch iterations
ben.train(iter_count=5000)

Training
Iters per epoch: 70
Iter 0 (0.000 epoch): loss train=2.2122, val=2.2529, eval->2.2529
==> Saving model at iter=0, eval loss->2.2529 
Sampling: 7
CUDA max memory used: 164.88M
...................................................................................................
Iter 100 (1.422 epoch): loss train=1.0401, val=1.0829, eval->1.0829
==> Saving model at iter=100, eval loss->1.0829 
...................................................................................................
Iter 200 (2.844 epoch): loss train=0.8437, val=0.9237, eval->0.9237
==> Saving model at iter=200, eval loss->0.9237 
...................................................................................................
Iter 300 (4.267 epoch): loss train=0.7785, val=0.8864, eval->0.8864
==> Saving model at iter=300, eval loss->0.8864 
...................................................................................................
Iter 400 (5.689 epoch): loss train=0.7375, val=0.8506, eval->0.85

In [19]:
# What's the loss of the best saved state?
ben.last_saved_state

{'n_samples': 204800,
 'train_loss': 0.5531212091445923,
 'val_loss': 0.818952739238739,
 'eval_loss': 0.818952739238739}

In [20]:
# Let's measure accuracy with training data first
train_ds = ben.train_dataset

# split each a+b=c line into a prompt 'a+b=' and an answer 'c'
q,a=train_ds.get_data_split(0, len(train_ds), sep='=', sep_included=-1)

ben.measure_accuracy(q,a)

0.9847777777777778

In [21]:
# Not bad, it's now over 98% (from 64% above)
# And now with validation data:
val_ds = ben.val_dataset

# split each a+b=c line into a prompt 'a+b=' and an answer 'c'
q,a=val_ds.get_data_split(0, len(val_ds), sep='=', sep_included=-1)

ben.measure_accuracy(q,a)

0.737

In [22]:
# Validation accuracy jumped to over 73% (from 15%). You may see a different accuracy in your own run.
# Let's get an idea of which cases are giving the model a hard time in the validation data:
wrongs = []
ben.measure_accuracy(q,a, log_list=wrongs, log_cond=-0.5)
wrongs

[('90+4=', '94', '93'),
 ('90+30=', '120', '110'),
 ('90+40=', '130', '120'),
 ('90+47=', '137', '127'),
 ('90+48=', '138', '128'),
 ('90+50=', '140', '130'),
 ('90+57=', '147', '137'),
 ('90+58=', '148', '138'),
 ('90+60=', '150', '140'),
 ('91+2=', '93', '92'),
 ('91+3=', '94', '93'),
 ('91+4=', '95', '94'),
 ('91+9=', '100', '90'),
 ('91+29=', '120', '110'),
 ('91+39=', '130', '120'),
 ('91+40=', '131', '121'),
 ('91+46=', '137', '127'),
 ('91+47=', '138', '128'),
 ('91+49=', '140', '130'),
 ('91+57=', '148', '138'),
 ('92+0=', '92', '91'),
 ('92+1=', '93', '92'),
 ('92+2=', '94', '93'),
 ('92+3=', '95', '94'),
 ('92+4=', '96', '95'),
 ('92+5=', '97', '96'),
 ('92+6=', '98', '97'),
 ('92+8=', '100', '90'),
 ('92+28=', '120', '110'),
 ('92+38=', '130', '120'),
 ('92+39=', '131', '121'),
 ('92+46=', '138', '128'),
 ('92+48=', '140', '130'),
 ('92+49=', '141', '131'),
 ('93+0=', '93', '92'),
 ('93+1=', '94', '93'),
 ('93+2=', '95', '94'),
 ('93+3=', '96', '95'),
 ('93+4=', '97', '96'),

Many errors occur when the second number is a single digit, and also when it is between 30 and 60.

The single-digit errors could be explained by the model seeing relatively few such examples: only 10% of the samples have a single-digit second number.
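
The 10% figure can be checked by counting training entries whose second number has a single digit. The pairs are regenerated inline, assuming the training portion covers first terms 0..89 as described earlier:

```python
# Training portion: first term 0..89, second term 0..99
train_pairs = [(a, b) for a in range(90) for b in range(100)]

single_digit_second = sum(1 for _, b in train_pairs if b < 10)
share = single_digit_second / len(train_pairs)
print(single_digit_second, len(train_pairs), share)
```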

But the errors in the 30..60 range are harder to explain.

### More...

Even though dropout reduces overfitting, the validation loss is still quite bad.

The training dataset (first number between 0 and 89) and the validation dataset (90 to 99) are cut sharply at the 90 boundary, so they represent different distributions of the data. This likely increases overfitting to the training data: the model is trained on a subset whose distribution differs from that of the whole data, validation set included. In other words, the training set is not representative of the overall distribution.

See add_two_digits_shuffled for how shuffling improves the model considerably.
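
One way such a shuffled split could look, as a sketch only: the add_two_digits_shuffled notebook may do it differently, and the seed and the `shuffled_split` helper are arbitrary choices made here.

```python
import random


def shuffled_split(lines, train_fraction=0.9, seed=0):
    """Shuffle a copy of the lines, then split into train and validation parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]
train, val = shuffled_split(lines)

# After shuffling, sums with a first term of 90..99 show up in training too
print(sum(int(s.split("+")[0]) >= 90 for s in train))
```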

Perhaps a zero-padded data format, like 82+07=089, would allow better accuracy?
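
A sketch of that zero-padded format, in which every entry has the same length and digits of equal weight sit at fixed positions (`format_padded` is a hypothetical helper):

```python
def format_padded(a, b, digits=2):
    """Format a+b with operands zero-padded to `digits` and the sum to digits+1."""
    operand_width = digits
    sum_width = digits + 1  # the sum can need one more digit than the operands
    return f"{a:0{operand_width}d}+{b:0{operand_width}d}={a + b:0{sum_width}d}"


print(format_padded(82, 7))
print(format_padded(0, 0))
```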

From other experiments, I noted that a smaller model.block_size (than the 16 we've used) increases overall loss, which is weird. It should be the opposite, because there are then fewer characters in immediate memory to handle.