Can the model learn how to add two 2-digit numbers from a shuffled dataset?

The "add_two_digits" notebook sharply divided the data into a training dataset (first addend in 0..89) and a validation dataset (first addend in 90..99). This hurts generalization, because the training data is not representative of the whole data: it never sees additions whose first addend is in 90..99.

So here we shuffle before splitting, hoping that the model is now trained on a more representative distribution of the entire data.
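
To make the idea concrete, here is a minimal sketch of shuffling before splitting, written in plain Python. It is not gptbench's implementation: the helper name `shuffled_split` and its seed are hypothetical, chosen only for illustration.

```python
import random

def shuffled_split(lines, train_fraction=0.9, seed=0):
    """Shuffle a copy of `lines`, then cut it into train and validation parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# All 100 x 100 additions, in the same order as add2.txt
lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]
train, val = shuffled_split(lines)

# Count the distinct first addends that appear in each part
train_first = {line.split("+")[0] for line in train}
val_first = {line.split("+")[0] for line in val}
print(len(train), len(val), len(train_first), len(val_first))
```

Without the shuffle, the last 10% of the lines would be exactly the additions whose first addend is 90..99, which is the split the previous notebook used.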

In [1]:
from gptbench import Train, empty_config, LogFlag

We'll load ../data/add2.txt from the add_two_digits notebook, which can be created by running in the dataprep folder:
```
python prepare_addition.py ../data/add2.txt 2 --sep="\n"
```
This creates add2.txt.
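
For readers without the dataprep script at hand, the sketch below produces lines in the same format as the ones printed in the next cell. It is only an approximation of what prepare_addition.py might do; the function `addition_lines` is a hypothetical stand-in, not part of gptbench.

```python
def addition_lines(digits=2):
    """Return every 'a+b=c' line for operands with up to `digits` digits."""
    limit = 10 ** digits
    return [f"{a}+{b}={a + b}" for a in range(limit) for b in range(limit)]

samples = addition_lines(2)
print(len(samples))   # 10000
print(samples[:3])    # ['0+0=0', '0+1=1', '0+2=2']
print(samples[-1])    # 99+99=198
```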

In [2]:
# Open it and print the first and last 100 chars
with open('../data/add2.txt', 'r', newline=None) as f:
    data = f.read()
print("first:", data[:100])
print("last:", data[-100:])

first: 0+0=0
0+1=1
0+2=2
0+3=3
0+4=4
0+5=5
0+6=6
0+7=7
0+8=8
0+9=9
0+10=10
0+11=11
0+12=12
0+13=13
0+14=14

last: 99+90=189
99+91=190
99+92=191
99+93=192
99+94=193
99+95=194
99+96=195
99+97=196
99+98=197
99+99=198



In [3]:
# We'll load these data samples into two CharLineDatasets, taking care to shuffle the data before splitting

In [4]:
# create the GPTBench object - we'll name this model add2_shuffled
ben = Train('add2_shuffled', log_mask=LogFlag.ALL)
ben.set_seed(0xADD2B055)

# set train and validation datasets
ben.set_datasets(class_name='charline', # id for the PaddedLineCharDataset class
                 train_path='../data/add2.txt', 
                 train_split=0.9,
                 pre_shuffle=True)

# set config settings that will override the default values
cfg = empty_config()
cfg.train.log_period=0
cfg.model.set(n_layer=6, n_head=6, n_embd=90, block_size=16) # our model parameters - block_size is big enough for aa+bb=ccc
cfg.sample.set(top=1, max_batch_size=256) # note the top_k(1) - always pick the best item
cfg.train.set(sample_period=-5)
cfg.trainer.set(batch_size=128)

# and init a new model with config
ben.init_new(cfg)

Initializing new model add2_shuffled
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M
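
The 0.59M figure can be roughly reproduced by hand. The sketch below assumes a standard GPT-2 style block (biased linear layers, a 4x MLP expansion, two LayerNorms per block, an untied output head); whether gptbench counts parameters in exactly this way is an assumption.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Parameter count of a GPT-2 style model (assumed architecture, see above)."""
    d = n_embd
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # qkv projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x expansion
    norms = 2 * 2 * d                             # two LayerNorms, weight + bias each
    per_block = attn + mlp + norms
    embeddings = vocab_size * d + block_size * d  # token + position embeddings
    final_norm = 2 * d
    lm_head = d * vocab_size                      # output projection, no bias
    return n_layer * per_block + embeddings + final_norm + lm_head

total = gpt_param_count(n_layer=6, n_embd=90, vocab_size=13, block_size=16)
print(f"{total / 1e6:.2f}M")   # 0.59M
```

Almost all of the parameters sit in the six transformer blocks; with a vocabulary of 13 characters the embeddings are negligible.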


In [5]:
# both train and validation datasets use shuffled data from the add2.txt source file
ben.train_dataset.get_data()[:10], ben.val_dataset.get_data()[:10]

(['25+47=72',
  '57+16=73',
  '3+59=62',
  '24+18=42',
  '53+3=56',
  '2+3=5',
  '28+67=95',
  '72+13=85',
  '54+52=106',
  '26+21=47'],
 ['18+25=43',
  '9+72=81',
  '64+75=139',
  '74+21=95',
  '54+37=91',
  '18+74=92',
  '42+11=53',
  '48+57=105',
  '31+41=72',
  '5+38=43'])

In [6]:
# Let's train for 10000 batch iterations. 
# Each dot means a batch was trained.
# Train and validation losses are evaluated every 100 iterations (iters).
# Also, every 500 iters a random sample is taken.
ben.train(iter_count=10000)

Training
Batches per epoch: 70
iter 0 (0.000 epoch): loss train=2.1426, val=2.1427, eval->2.1427
==> Saving model at iter=0, eval loss->2.1427 
0=
CUDA max memory used: 164.88M
....................................................................................................iter 100 (1.422 epoch): loss train=1.0543, val=1.0541, eval->1.0541
==> Saving model at iter=100, eval loss->1.0541 
....................................................................................................iter 200 (2.845 epoch): loss train=0.8524, val=0.8539, eval->0.8539
==> Saving model at iter=200, eval loss->0.8539 
....................................................................................................iter 300 (4.267 epoch): loss train=0.7820, val=0.7822, eval->0.7822
==> Saving model at iter=300, eval loss->0.7822 
....................................................................................................iter 400 (5.690 epoch): loss train=0.7295, val=0.7294, eval->0.7294
==> 

In [7]:
# The current state loss info:
ben.state

{'n_samples': 1279872,
 'train_loss': 0.43727046251296997,
 'val_loss': 0.43912574648857117,
 'eval_loss': 0.43912574648857117}

In [8]:
# The last saved checkpoint info - the best performing model we got. Both train and val losses are thus lower than above.
ben.last_saved_state

{'n_samples': 1190400,
 'train_loss': 0.4368865191936493,
 'val_loss': 0.43814781308174133,
 'eval_loss': 0.43814781308174133}

In [9]:
# last saved checkpoint has a bit lower validation loss: let's load it
ben.load()
ben.state

Loading checkpoint from ./models/add2_shuffled/
Checkpoint: iter=9300 (132.267 epoch), loss train=0.4369 val=0.4381 eval->0.4381
Dataset train_path: ../data/add2.txt, val_path: None, train_split: 0.9, vocab_size: 13
Model params: 0.59M


{'n_samples': 1190400,
 'train_loss': 0.4368865191936493,
 'val_loss': 0.43814781308174133,
 'eval_loss': 0.43814781308174133}
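
The epoch and n_samples figures in the logs follow from the dataset and batch sizes. This is a small arithmetic check, assuming one epoch is a full pass over the 9000 training lines (90% of the 10000 additions) and that n_samples counts iterations times batch size; both assumptions agree with the logged values.

```python
total_lines = 100 * 100                   # every a+b with a, b in 0..99
train_size = int(total_lines * 0.9)       # train_split=0.9
batch_size = 128

print(train_size // batch_size)           # 70, as in "Batches per epoch: 70"

def epochs_at(iteration):
    """Fraction of training-set passes completed after `iteration` batches."""
    return iteration * batch_size / train_size

print(round(epochs_at(100), 3))           # 1.422
print(round(epochs_at(9300), 3))          # 132.267
print(9300 * batch_size)                  # 1190400, the checkpoint's n_samples
```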

In [10]:
# take a few samples:
ben.sample('1+1=')
ben.sample('34+7=')
ben.sample('78+99=')

1+1=2
34+7=41
78+99=177
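
These samples are deterministic because the config sets top=1. The following is a generic sketch of top-k sampling, not gptbench's code; the function `sample_top_k` and the probability values are made up for illustration. With k=1 only the most likely token survives the filter, so sampling reduces to picking the best item.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Sample a token from the k most likely entries of `probs` (token -> probability)."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-character distribution after the prompt '1+1='
next_char_probs = {"2": 0.90, "3": 0.06, "1": 0.04}
print(sample_top_k(next_char_probs, k=1))   # 2
```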


In [11]:
# Much better now - all are correct
# Let's measure the accuracy on the training dataset - this should be mostly memorization, as the model was trained on this data
train_ds = ben.train_dataset

#split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=train_ds.sample_split(0, len(train_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['96+30=', '91+85=', '75+11=']
['126', '176', '86']
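
Judging from the output above, the split keeps the '=' separator at the end of the prompt. A plain-Python sketch of the same operation could look like this; `split_prompt_answer` is a hypothetical helper, not the gptbench method.

```python
def split_prompt_answer(line, sep="="):
    """Split 'aa+bb=cc' into the prompt 'aa+bb=' and the answer 'cc'."""
    prompt, answer = line.split(sep, 1)
    return prompt + sep, answer

pairs = [split_prompt_answer(s) for s in ["96+30=126", "91+85=176", "75+11=86"]]
questions = [prompt for prompt, _ in pairs]
answers = [answer for _, answer in pairs]
print(questions)   # ['96+30=', '91+85=', '75+11=']
print(answers)     # ['126', '176', '86']
```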


In [12]:
# Measure the accuracy - how good was the memorization? This may take a while and give different results than the number below
ben.measure_accuracy(q,a)

1.0
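
An accuracy of this kind is usually an exact-match rate: the fraction of prompts whose generated completion equals the expected answer. The sketch below assumes that is what is being measured; `exact_match_accuracy` and the stand-in `perfect_adder` are hypothetical and do not call the model.

```python
def exact_match_accuracy(generate, questions, answers):
    """Fraction of questions for which generate(question) equals the expected answer."""
    correct = sum(generate(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def perfect_adder(prompt):
    """Stand-in for the model: parse 'aa+bb=' and return the sum as text."""
    left, right = prompt.rstrip("=").split("+")
    return str(int(left) + int(right))

print(exact_match_accuracy(perfect_adder, ["96+30=", "91+85="], ["126", "176"]))   # 1.0
```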

In [13]:
# Perfect accuracy!
# What about the accuracy of the validation dataset, on which the model never trained?
val_ds = ben.val_dataset

#split each aa+bb=cc into a prompt: 'aa+bb=' and an answer 'cc'
q,a=val_ds.sample_split(0, len(val_ds), sep='=', sep_included=-1)

print(q[:3])
print(a[:3])

['31+19=', '80+54=', '96+68=']
['50', '134', '164']


In [14]:
# The validation dataset is now a random 10% of all additions (see the samples above), not only sums starting in 90+..99+.
# For most of these samples the model probably saw the reversed addition during training, for example 19+31=50 for 31+19=50.
# Did it somehow learn the commutative property of addition?
ben.measure_accuracy(q,a)

1.0
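
One way to put the commutativity question in context is to estimate how many validation samples have their reversed form (b+a for a+b) in the training data. The sketch below uses its own shuffle with an arbitrary seed, not gptbench's, so the fraction it prints is only indicative of what a random 90/10 split looks like.

```python
import random

# Rebuild all additions and make an illustrative 90/10 shuffled split
lines = [f"{a}+{b}={a + b}" for a in range(100) for b in range(100)]
shuffled = list(lines)
random.Random(0).shuffle(shuffled)   # arbitrary seed, differs from the notebook's
train, val = set(shuffled[:9000]), shuffled[9000:]

def reversed_line(line):
    """Turn 'a+b=c' into 'b+a=c'."""
    operands, result = line.split("=")
    a, b = operands.split("+")
    return f"{b}+{a}={result}"

seen_reversed = sum(reversed_line(line) in train for line in val)
fraction = seen_reversed / len(val)
print(f"{fraction:.1%} of validation samples have their reversed form in training")
```

If that fraction is high, perfect validation accuracy is compatible both with learning addition in general and with reusing the reversed examples, so this check alone cannot tell the two apart.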

In [20]:
# Also perfect accuracy - it's generalizing!
# What about three digit sums?
ben.sample('101+120=')

101+120=


Nothing comes out of it for three digits. Perhaps a new project: three-digit addition?