Can the model learn to add two 2-digit numbers? How well will it generalize to unseen sequences?

In [2]:
from gptbench import Train, empty_config, LogFlag

To create the data from which we'll build the train and validation datasets, run the following in the ../dataprep folder:
```
python prepare_addition.py ../data/add2.txt 2 --sep="\n"
```
The script creates add2.txt with entries in the form a+b=c, one per line.
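
The script itself is not shown here. As a rough sketch of what it might do (the function name `make_addition_lines` is hypothetical, not part of gptbench or of the actual script), the following builds the same kind of entries in memory:

```python
def make_addition_lines(digits=2):
    """Return every 'a+b=c' line for a and b in 0 .. 10**digits - 1."""
    limit = 10 ** digits
    # a is the outer loop, so lines are ordered by the first term
    return [f"{a}+{b}={a + b}" for a in range(limit) for b in range(limit)]

lines = make_addition_lines(2)
print(len(lines))           # 10000
print(lines[0], lines[-1])  # 0+0=0 99+99=198
```

Writing `"\n".join(lines)` to a file would give one entry per line, matching the listing shown in the next cells.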

In [3]:
# Opening it - the first 100 chars
with open('../data/add2.txt', 'r', newline=None) as f:
    data = f.read()
print(data[:100])

0+0=0
0+1=1
0+2=2
0+3=3
0+4=4
0+5=5
0+6=6
0+7=7
0+8=8
0+9=9
0+10=10
0+11=11
0+12=12
0+13=13
0+14=14



In [4]:
# and the last 100:
print(data[-100:])

99+90=189
99+91=190
99+92=191
99+93=192
99+94=193
99+95=194
99+96=195
99+97=196
99+98=197
99+99=198



So the file holds every a+b=c with a and b between 0 and 99. We'll split this data into a train and a validation dataset:
- Train with all lines from 0+0=0 till 89+99=188
- Validation with 90+0=90 till 99+99=198

Note that training never sees sums whose first term is 90 or above, but it does see numbers 90 and above as the second term, as in 10+95=105. Will the model learn 90+10?
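
Because the lines are ordered by the first term, a 90% split lands exactly on the first 90+... line. A minimal sketch of that split, rebuilding the lines in memory instead of reading the file (an assumption made to keep the example self-contained):

```python
lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]

split_at = len(lines) * 9000 // 10000  # same ratio as train_split
train, val = lines[:split_at], lines[split_at:]

print(train[-1], val[0])  # 89+99=188 90+0=90

# first terms >= 90 never appear in training, second terms >= 90 do
first_terms = {int(line.split('+')[0]) for line in train}
second_terms = {int(line.split('+')[1].split('=')[0]) for line in train}
print(max(first_terms), max(second_terms))  # 89 99
```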

We'll load the samples via CharLineDataset: each line read is stored in a 16-character block, padded at the end.
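
As an illustration of that fixed-size layout only (this is not the CharLineDataset code; the pad character `.` and the helper name `to_block` are assumptions made for the sketch):

```python
BLOCK_SIZE = 16
PAD = '.'  # hypothetical pad symbol, chosen only to make the padding visible

def to_block(line, block_size=BLOCK_SIZE, pad=PAD):
    """Right-pad one sample line to exactly block_size characters."""
    if len(line) > block_size:
        raise ValueError(f"line longer than block: {line!r}")
    return line.ljust(block_size, pad)

print(to_block('0+0=0'))      # 0+0=0...........
print(to_block('99+99=198'))  # 99+99=198.......
```

The longest entry, 99+99=198, has 9 characters, so a 16-character block leaves room to spare.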

In [5]:
# create the GPTBench object - we'll name this model add2
ben = Train('add2', log_mask=LogFlag.ALL)
ben.set_seed(0xADD2BEA7)

# set train and validation datasets
ben.set_datasets(class_name='charline', 
                 train_path='../data/add2.txt', 
                 train_split=9000/10000) # split at the start of line with 90+..

# set config settings that will override the default values
cfg = empty_config()
cfg.train.log_period=0
cfg.model.set(n_layer=6, n_head=6, n_embd=90, block_size=16) # our model parameters - block_size is big enough for aa+bb=ccc
cfg.sample.set(top=1, max_batch_size=256) # note the top_k(1) - always pick the best item
cfg.train.set(sample_period=-5)
cfg.trainer.set(batch_size=128)

# and init a new model with config
ben.init_new(cfg)

Initializing new model add2
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M
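
The reported 0.59M can be cross-checked by hand. The sketch below assumes a standard GPT-2 style block (fused QKV projection, 4x MLP, biases, two LayerNorms per layer, learned position embeddings) and leaves out the output head. The exact layout used by gptbench is not shown here, so treat this as an estimate:

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Rough parameter count for a GPT-2 style decoder (assumed layout)."""
    d = n_embd
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # expand + contract
    norms = 2 * (2 * d)                           # two LayerNorms (weight, bias)
    per_layer = attn + mlp + norms
    embeddings = vocab_size * d + block_size * d  # token + position tables
    final_norm = 2 * d
    return n_layer * per_layer + embeddings + final_norm

total = gpt_param_count(n_layer=6, n_embd=90, vocab_size=13, block_size=16)
print(total, f"{total / 1e6:.2f}M")  # 593010 0.59M
```

Almost all of the weight sits in the six transformer blocks; with a vocabulary of 13 symbols the embedding tables are negligible.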


In [6]:
# validation dataset only has entries where the first addition term is 90..99
ben.val_dataset.get_data()[:10], ben.val_dataset.get_data()[-10:]

(['90+0=90',
  '90+1=91',
  '90+2=92',
  '90+3=93',
  '90+4=94',
  '90+5=95',
  '90+6=96',
  '90+7=97',
  '90+8=98',
  '90+9=99'],
 ['99+90=189',
  '99+91=190',
  '99+92=191',
  '99+93=192',
  '99+94=193',
  '99+95=194',
  '99+96=195',
  '99+97=196',
  '99+98=197',
  '99+99=198'])

In [7]:
# Let's train for 3000 batch iterations.
# Each dot means a batch was trained.
# Train and validation losses are evaluated every 100 iterations (iters).
# Also, every 500 iters a random sample is taken.
ben.train(iter_count=3000)

Training
Batches per epoch: 70
iter 0 (0.000 epoch): loss train=2.2523, val=2.2685, eval->2.2685
==> Saving model at iter=0, eval loss->2.2685 
3
CUDA max memory used: 164.88M
....................................................................................................iter 100 (1.422 epoch): loss train=1.0242, val=1.0641, eval->1.0641
==> Saving model at iter=100, eval loss->1.0641 
....................................................................................................iter 200 (2.844 epoch): loss train=0.8386, val=0.9072, eval->0.9072
==> Saving model at iter=200, eval loss->0.9072 
....................................................................................................iter 300 (4.267 epoch): loss train=0.7672, val=0.8868, eval->0.8868
==> Saving model at iter=300, eval loss->0.8868 
....................................................................................................iter 400 (5.689 epoch): loss train=0.7230, val=0.8593, eval->0.8593
==> S

In [8]:
# No point in training much longer: the train loss keeps going down (overfitting) while the validation loss goes up (the model is not generalizing)
# Let's compare the current state loss info:
ben.state

{'n_samples': 383872,
 'train_loss': 0.4554595947265625,
 'val_loss': 0.9535066485404968,
 'eval_loss': 0.9535066485404968}

In [9]:
# The last saved checkpoint info - the best performing model we got.
ben.last_saved_state

{'n_samples': 204800,
 'train_loss': 0.5231900215148926,
 'val_loss': 0.8066989779472351,
 'eval_loss': 0.8066989779472351}

In [10]:
# last saved checkpoint has lower validation loss, which means more generalization: let's load it
ben.load()
ben.state

Loading checkpoint from ./models/add2/
Checkpoint: iter=1600 (22.756 epoch), loss train=0.5232 val=0.8067 eval->0.8067
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M


{'n_samples': 204800,
 'train_loss': 0.5231900215148926,
 'val_loss': 0.8066989779472351,
 'eval_loss': 0.8066989779472351}
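
The logged numbers are consistent with each iteration consuming one batch of 128 samples and an epoch being the 9000 training lines. A quick check of that arithmetic (the formulas are inferred from the log output, not taken from the gptbench source):

```python
train_lines = 9000   # 0+0 .. 89+99
batch_size = 128

print(train_lines // batch_size)          # 70 full batches per epoch

iters = 1600                              # iteration of the loaded checkpoint
n_samples = iters * batch_size
print(n_samples)                          # 204800
print(round(n_samples / train_lines, 3))  # 22.756
```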

In [11]:
# take a few samples: are the sums correct?
ben.sample('1+1=')
ben.sample('34+7=')
ben.sample('78+99=')

1+1=1
34+7=49
78+99=177


In [12]:
# Ugh! Let's measure accuracy on the training dataset - this should be mostly memorization, as the model was trained on this data
train_ds = ben.train_dataset

# split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=train_ds.get_data_split(0, len(train_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['0+0=', '0+1=', '0+2=']
['0', '1', '2']


In [13]:
# Measure the accuracy - how good was the memorization?
# This may take a while (and give different results than the number below, if you changed the initial seed)
ben.measure_accuracy(q,a)

0.6481111111111111
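
What such an accuracy figure means can be sketched as an exact-match average. This is not the gptbench implementation: `exact_match_accuracy` and `generate` are hypothetical names, the "model" is a lookup of the three completions sampled earlier, and only the `test_fn(q, a, g)` argument order mirrors how this notebook uses it.

```python
def exact_match_accuracy(questions, answers, generate, test_fn=None):
    """Average score of generated answers against the expected ones."""
    if test_fn is None:
        test_fn = lambda q, a, g: float(a == g)
    scores = [test_fn(q, a, generate(q)) for q, a in zip(questions, answers)]
    return sum(scores) / len(scores)

# stand-in for the model: the completions it produced for the three samples
sampled = {'1+1=': '1', '34+7=': '49', '78+99=': '177'}

qs = ['1+1=', '34+7=', '78+99=']
expected = ['2', '41', '177']
acc = exact_match_accuracy(qs, expected, sampled.get)
print(acc)  # one of three correct
```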

In [14]:
# Not good: 64%. Further training would improve accuracy, 
# but the model would be overfitting and memorizing the given samples.
# What about the accuracy of the validation dataset, on which the model never trained?
val_ds = ben.val_dataset

#split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=val_ds.get_data_split(0, len(val_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['90+0=', '90+1=', '90+2=']
['90', '91', '92']


In [15]:
# The validation dataset has sums with first terms 90..99, for example 90+2=92.
# The model did, however, see the reversed additions with 90..99 as the second term, for example 2+90=92.
# Did it somehow learn the commutative property of addition?
ben.measure_accuracy(q,a)

0.151
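
How much could commutativity help at best? Counting the validation sums whose swapped form b+a sits in the training split gives an upper bound. A small sketch (pure counting, no model involved):

```python
train_pairs = {(a, b) for a in range(90) for b in range(100)}
val_pairs = [(a, b) for a in range(90, 100) for b in range(100)]

# a validation sum a+b is "covered" if b+a was seen during training
covered = sum((b, a) in train_pairs for a, b in val_pairs)
print(covered, len(val_pairs))  # 900 1000
```

So 90% of the validation sums have a mirrored training example, yet the measured accuracy is about 15%, which suggests the model is not exploiting that symmetry.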

In [17]:
# How is the model failing - let's see some incorrect answers:

wrongs = []
def test(q,a,g):
    global wrongs
    res = float(a == g)
    if not res: wrongs += [f"{q}{a} != {g}"]
    return res

ben.measure_accuracy(q,a, test_fn=test)

wrongs[40:50]

['90+41=131 != 111',
 '90+42=132 != 122',
 '90+43=133 != 123',
 '90+44=134 != 124',
 '90+45=135 != 126',
 '90+46=136 != 127',
 '90+47=137 != 127',
 '90+48=138 != 128',
 '90+49=139 != 128',
 '90+50=140 != 120']

In [18]:
wrongs[200:210]

['92+43=135 != 125',
 '92+44=136 != 126',
 '92+45=137 != 127',
 '92+46=138 != 128',
 '92+47=139 != 129',
 '92+48=140 != 129',
 '92+49=141 != 120',
 '92+50=142 != 123',
 '92+51=143 != 133',
 '92+52=144 != 134']

In [None]:
# In many cases it's off by around -10...
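
To put a number on that impression, the error strings can be parsed and the differences tallied. The sketch uses the ten entries printed for wrongs[40:50]; the helper name `error_delta` is illustrative:

```python
from collections import Counter

sample_wrongs = [
    '90+41=131 != 111', '90+42=132 != 122', '90+43=133 != 123',
    '90+44=134 != 124', '90+45=135 != 126', '90+46=136 != 127',
    '90+47=137 != 127', '90+48=138 != 128', '90+49=139 != 128',
    '90+50=140 != 120',
]

def error_delta(entry):
    """Return generated minus expected for a 'q+b=a != g' string."""
    left, generated = entry.split(' != ')
    expected = left.split('=')[1]
    return int(generated) - int(expected)

deltas = Counter(error_delta(w) for w in sample_wrongs)
print(deltas.most_common())  # [(-10, 5), (-20, 2), (-9, 2), (-11, 1)]
```

Passing the full `wrongs` list instead of `sample_wrongs` would give the distribution over all validation errors.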

In [19]:
# let's try increasing dropout from its 0.1 default to improve generalization

# set config settings that will override existing values - only dropout
cfg = empty_config()
cfg.model.set(dropout=0.2)

# init a new model with config
ben.init_new(cfg, name='add2drop')

# see total config:
print(ben.get_config().dump(1))

Initializing new model add2drop
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M
seed: -1
sample: 
    max_len: 100
    count: 1
    start_text: None
    start_text_sep: |
    emit_start: True
    emit_after: None
    emit_before: None
    flush: True
    eot_stop: 0
    top: 1.0
    temp: 1.0
    max_batch_size: 256
    multiline_prompt: False
train: 
    eval_period: 100
    eval_type: 1.0
    eval_iters: 100
    eval_save_checkpt: 1
    eval_save_loss: csv,tensorboard
    sample_period: -5.0
    log_period: 0.0
dataset: 
    class_name: charline
    train_path: ../data/add2.txt
    train_split: 0.9
    val_path: None
    params: None
model: 
    device: auto
    dtype: float32
    n_layer: 6
    n_head: 6
    n_embd: 90
    vocab_size: 13
    block_size: 16
    dropout: 0.2
trainer: 
    n_workers: 0
    batch_size: 128
    max_samples: None
    grad_norm_clip: 1.0
    optimizer: adamw
    learning_rate: 0.0001
    adamw_beta

In [20]:
# train for a bit more - 5000 batch iterations
ben.train(iter_count=5000)

Training
Batches per epoch: 70
iter 0 (0.000 epoch): loss train=2.2124, val=2.2532, eval->2.2532
==> Saving model at iter=0, eval loss->2.2532 
7
CUDA max memory used: 164.88M
....................................................................................................iter 100 (1.422 epoch): loss train=1.0394, val=1.0746, eval->1.0746
==> Saving model at iter=100, eval loss->1.0746 
....................................................................................................iter 200 (2.844 epoch): loss train=0.8433, val=0.9148, eval->0.9148
==> Saving model at iter=200, eval loss->0.9148 
....................................................................................................iter 300 (4.267 epoch): loss train=0.7767, val=0.8772, eval->0.8772
==> Saving model at iter=300, eval loss->0.8772 
....................................................................................................iter 400 (5.689 epoch): loss train=0.7380, val=0.8253, eval->0.8253
==> S

In [21]:
# What's the loss of the best saved state?
ben.last_saved_state

{'n_samples': 332800,
 'train_loss': 0.4711717665195465,
 'val_loss': 0.7789766192436218,
 'eval_loss': 0.7789766192436218}

In [22]:
# Let's measure accuracy with training data
train_ds = ben.train_dataset

#split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=train_ds.get_data_split(0, len(train_ds), sep='=', sep_included=-1)

ben.measure_accuracy(q,a)

0.985

In [23]:
# Not bad, it's now over 98%
# And now with validation data
val_ds = ben.val_dataset

#split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=val_ds.get_data_split(0, len(val_ds), sep='=', sep_included=-1)

ben.measure_accuracy(q,a)

0.886

In [27]:
# Validation accuracy jumped to 88% (from 15%). 
# Let's get an idea of which cases are giving the model a hard time in the validation data:
wrongs = []
ben.measure_accuracy(q,a, test_fn=test)
wrongs

['90+2=92 != 91',
 '90+3=93 != 92',
 '91+0=91 != 90',
 '91+1=92 != 91',
 '91+2=93 != 92',
 '91+3=94 != 93',
 '91+4=95 != 94',
 '91+6=97 != 96',
 '91+9=100 != 90',
 '91+39=130 != 120',
 '91+49=140 != 130',
 '91+59=150 != 140',
 '91+69=160 != 150',
 '92+0=92 != 91',
 '92+1=93 != 92',
 '92+2=94 != 93',
 '92+3=95 != 94',
 '92+4=96 != 95',
 '92+5=97 != 96',
 '92+6=98 != 97',
 '92+8=100 != 90',
 '92+9=101 != 90',
 '92+48=140 != 130',
 '92+58=150 != 140',
 '92+68=160 != 150',
 '93+0=93 != 92',
 '93+1=94 != 93',
 '93+2=95 != 94',
 '93+4=97 != 96',
 '93+7=100 != 90',
 '93+8=101 != 90',
 '93+9=102 != 91',
 '94+1=95 != 96',
 '94+2=96 != 97',
 '94+3=97 != 98',
 '94+6=100 != 90',
 '94+7=101 != 91',
 '94+8=102 != 91',
 '94+9=103 != 92',
 '94+36=130 != 120',
 '94+39=133 != 123',
 '95+1=96 != 98',
 '95+2=97 != 99',
 '95+3=98 != 90',
 '95+4=99 != 90',
 '95+5=100 != 90',
 '95+6=101 != 92',
 '95+7=102 != 92',
 '95+8=103 != 93',
 '95+9=104 != 95',
 '95+38=133 != 123',
 '95+39=134 != 124',
 '96+1=97 != 98'

One pattern is that most errors occur when the second number has a single digit. The model struggles with these cases, perhaps because it sees relatively few of them: only 10 of the 100 possible second terms are single-digit numbers.
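
One way to check this is to bucket the failures by the number of digits in the second term. The sketch below uses a handful of entries copied from the list above, so its counts are only illustrative; the helper name `second_term_digits` is hypothetical:

```python
from collections import Counter

sample_wrongs = [
    '90+2=92 != 91', '91+9=100 != 90', '91+39=130 != 120',
    '92+8=100 != 90', '92+48=140 != 130', '93+0=93 != 92',
    '94+6=100 != 90', '95+38=133 != 123',
]

def second_term_digits(entry):
    """Digit count of b in an 'a+b=c != g' string."""
    b = entry.split('=')[0].split('+')[1]
    return len(b)

by_digits = Counter(second_term_digits(w) for w in sample_wrongs)
print(by_digits[1], by_digits[2])  # 5 3
```

In the (truncated) list printed above, 42 of the 53 visible entries have a single-digit second term.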

### More...

Even if dropout reduces overfitting, the validation loss is still well above the training loss (0.78 vs 0.47 at the best saved checkpoint).

The training dataset (first term 0 to 89) and the validation dataset (first term 90 to 99) are cut sharply at the 90 boundary and represent different distributions of the data. This likely increases overfitting, because the model is trained on a subset whose distribution differs from that of the whole data (and of the validation set). In other words, the training set is not representative of the overall distribution. See add_two_digits_shuffled for how shuffling improves the model by a lot.
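
A minimal sketch of shuffling before splitting, so that both splits draw from the same distribution. This is a generic illustration using the standard library with an arbitrary seed; it is not taken from the add_two_digits_shuffled notebook:

```python
import random

lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]

rng = random.Random(0)   # fixed seed so the split is reproducible
shuffled = lines[:]
rng.shuffle(shuffled)

split_at = len(shuffled) * 9 // 10
train, val = shuffled[:split_at], shuffled[split_at:]

# the validation split is no longer restricted to first terms 90..99
val_first_terms = {int(line.split('+')[0]) for line in val}
print(len(train), len(val), len(val_first_terms))
```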

Perhaps a zero-padded data format, like 82+07=089, would allow better accuracy?
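
Such a format is easy to produce with format specifiers. A sketch (the helper name `padded_sum` is illustrative):

```python
def padded_sum(a, b, digits=2):
    """Format a+b=c with fixed-width, zero-padded fields."""
    return f"{a:0{digits}d}+{b:0{digits}d}={a + b:0{digits + 1}d}"

print(padded_sum(82, 7))   # 82+07=089
print(padded_sum(0, 0))    # 00+00=000
print(padded_sum(99, 99))  # 99+99=198
```

With this layout every entry has the same length, so each digit always sits at the same position in the block.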

Why would a smaller block_size (than the 16 we've used) increase overall loss? One would expect the opposite, because there are now fewer characters in immediate memory.