Encountering error mid-training #100

Closed

renziver opened this issue Jun 23, 2020 · 10 comments

Comments

@renziver

renziver commented Jun 23, 2020

Describe the bug
Model training is encountering an error after a few steps. The training log is shown below:

Logged Output

```
2020-06-23 03:52:05,935 Epoch 1 Step: 3900 Batch Loss: 4.291916 Tokens per Sec: 2948, Lr: 0.000300
2020-06-23 03:52:16,157 Epoch 1 Step: 4000 Batch Loss: 5.365633 Tokens per Sec: 2889, Lr: 0.000300
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/renz/joeynmt/joeynmt/__main__.py", line 41, in <module>
    main()
  File "/home/renz/joeynmt/joeynmt/__main__.py", line 29, in main
    train(cfg_file=args.config_path)
  File "/home/renz/joeynmt/joeynmt/training.py", line 653, in train
    trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
  File "/home/renz/joeynmt/joeynmt/training.py", line 378, in train_and_validate
    batch_type=self.eval_batch_type
  File "/home/renz/joeynmt/joeynmt/prediction.py", line 88, in validate_on_data
    for valid_batch in iter(valid_iter):
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 236, in process
    padded = self.pad(batch)
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 254, in pad
    max_len = max(len(x) for x in minibatch)
ValueError: max() arg is an empty sequence
```

Config file used

```yaml
data:
    level: bpe
    max_sent_length: 80
    ...

training:
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "plateau"
    patience: 5
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 512
    batch_type: "token"
    eval_batch_size: 256
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 100
    validation_freq: 4000
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/one2many"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        hidden_size: 512
        ff_size: 2048
        dropout: 0.1
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        hidden_size: 512
        ff_size: 2048
        dropout: 0.1
```

I tried different token batch sizes, e.g. 4096, 2048, 1028, etc., but I keep encountering the same error. I checked the dataset I used and it has been properly preprocessed according to the Sockeye paper, so I am not sure where the error is coming from.

@juliakreutzer
Collaborator

juliakreutzer commented Jun 23, 2020

Hi @renziver,

it seems like the validation batch is empty — could you please double-check that the path to the validation data is valid? The training log should report the size of the validation data; maybe that can help us debug.
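For context, the `ValueError` in the traceback is exactly what Python's `max()` raises on an empty sequence, so an empty validation batch reproduces it directly (a minimal, standalone sketch — not joeynmt code):

```python
# Minimal reproduction of the ValueError above: torchtext's Field.pad()
# calls max() over the minibatch, which fails when the batch is empty.
minibatch = []  # an empty validation batch, e.g. from a missing dev file

try:
    max(len(x) for x in minibatch)
except ValueError as err:
    print(err)  # max() arg is an empty sequence
```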

@renziver
Author

Hi @juliakreutzer,

I checked the config file to see if the path is valid and it is indeed correct as verified by the training log:
[screenshot of the training log, 2020-06-24]
I checked the files as well to make sure that they aren't empty and they're not.

@juliakreutzer
Collaborator

Thanks, @renziver, I'll take a look. Maybe something broke with the last batch-multiplier update. Could you please try with `(eval_)batch_type: "sentence"` and a `batch_size` of around 64?
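Concretely, the suggestion maps onto the config above roughly as follows (a fragment; only these keys change, and the `eval_batch_size` value is an assumption):

```yaml
training:
    batch_size: 64
    batch_type: "sentence"
    eval_batch_size: 64
    eval_batch_type: "sentence"
```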

@renziver
Author

Hi @juliakreutzer,

I changed the batch type to sentence, adjusted the batch size as well, and a new error showed up:

```
2020-06-24 04:33:01,547 Epoch 1 Step: 3900 Batch Loss: 4.006904 Tokens per Sec: 6760, Lr: 0.000300
2020-06-24 04:33:12,622 Epoch 1 Step: 4000 Batch Loss: 3.290356 Tokens per Sec: 6911, Lr: 0.000300
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/__main__.py", line 41, in <module>
    main()
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/__main__.py", line 29, in main
    train(cfg_file=args.config_path)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/training.py", line 653, in train
    trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/training.py", line 378, in train_and_validate
    batch_type=self.eval_batch_type
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/prediction.py", line 98, in validate_on_data
    batch, loss_function=loss_function)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/model.py", line 133, in get_loss_for_batch
    trg_mask=batch.trg_mask)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/model.py", line 80, in forward
    trg_mask=trg_mask)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/model.py", line 117, in decode
    trg_mask=trg_mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/decoders.py", line 510, in forward
    x = self.pe(trg_embed)  # add position encoding to word embedding
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/renz_baliber_senti_com_ph/joeynmt/joeynmt/transformer_layers.py", line 159, in forward
    return emb + self.pe[:, :emb.size(1)]
RuntimeError: The size of tensor a (5141) must match the size of tensor b (5000) at non-singleton dimension 1
```

I checked whether it has something to do with my validation pair, but both files have the same number of lines:

```
renz-iver:~/joeynmt$ wc -l data/one2many/valid.bpe.src
5600 data/one2many/valid.bpe.src
renz-iver:~/joeynmt$ wc -l data/one2many/valid.bpe.tgt
5600 data/one2many/valid.bpe.tgt
```

@bastings
Collaborator

Hi! Do you have a sentence that is longer than 5000 tokens? The position embeddings might be limited to 5000 positions. If that's the issue, you can make them go up to 6000 or so.

@renziver
Author

Hi @bastings

Should I do that by increasing the embedding dimensions / hidden size of the transformer?

@bastings
Collaborator

Hi,
Please change 5000 here to 10000 or so:
https://github.com/joeynmt/joeynmt/blob/master/joeynmt/transformer_layers.py#L131

And let us know if that helps.
(Also, are you really feeding a sequence that long?)
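The 5000 in that line is the length of the precomputed position table, which is sliced to the sequence length at every forward pass. A torch-free sketch of that mechanic (not joeynmt's actual implementation, which uses a tensor buffer, but the same shape logic as `emb + self.pe[:, :emb.size(1)]`):

```python
import math


def positional_encoding(size: int, max_len: int = 5000):
    """Precompute a sinusoidal position table of max_len rows, in the
    spirit of joeynmt's PositionalEncoding (transformer_layers.py).
    Sketch only; the real implementation uses a torch buffer."""
    pe = [[0.0] * size for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, size, 2):
            angle = pos / (10000.0 ** (i / size))
            pe[pos][i] = math.sin(angle)
            if i + 1 < size:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positions(emb, pe):
    # Mirrors `emb + self.pe[:, :emb.size(1)]`: only the first len(emb)
    # rows of the table are used, so a sequence longer than max_len
    # cannot be encoded and triggers a size-mismatch error.
    if len(emb) > len(pe):
        raise RuntimeError(
            f"sequence length {len(emb)} exceeds max_len {len(pe)}")
    return [[e + p for e, p in zip(erow, prow)]
            for erow, prow in zip(emb, pe)]
```

With `max_len` raised to 10000 as suggested, a 5141-token sequence fits (at the cost of a larger precomputed table); with the default 5000 it fails exactly as in the traceback.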

@renziver
Author

Hi @bastings, I will take another look at the data to check why the filtering step didn't work. I included filtering of sentences longer than 100 tokens, so I'll have to double-check it. Thank you for the help.
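For a length filter on parallel data, both sides of each pair have to be dropped together so the files stay line-aligned — a sketch of such a filter (the function name is hypothetical; note that with `level: bpe` the limit applies to BPE tokens, so filtering raw text by word count can still leave longer BPE sequences through):

```python
def filter_parallel(src_lines, trg_lines, max_len=100):
    """Keep only sentence pairs where both sides have at most max_len
    whitespace tokens; dropping only one side would desynchronize
    the parallel corpus."""
    kept_src, kept_trg = [], []
    for src, trg in zip(src_lines, trg_lines):
        if len(src.split()) <= max_len and len(trg.split()) <= max_len:
            kept_src.append(src)
            kept_trg.append(trg)
    return kept_src, kept_trg
```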

@renziver
Author

It was indeed an error in my filtering step. Training is now working with a sentence-type batch. Thank you @juliakreutzer and @bastings!

@tongye98

Hi @renziver, I am hitting the same runtime error in training:
RuntimeError: The size of tensor a (12805) must match the size of tensor b (5000) at non-singleton dimension 1.
In my config YAML I set max_sent_length: 300. I want to know how you found the error in your filtering step.
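One way to locate offending sentences is to scan the tokenized files for lines above the limit (a sketch; the helper name is hypothetical). Keep in mind that `max_sent_length` may only filter the training split, so the dev/test files can still contain much longer sentences:

```python
def find_long_lines(path, max_len=300, encoding="utf-8"):
    """Return (line_number, token_count) for every line whose
    whitespace-tokenized length exceeds max_len."""
    too_long = []
    with open(path, encoding=encoding) as f:
        for lineno, line in enumerate(f, 1):
            n_tokens = len(line.split())
            if n_tokens > max_len:
                too_long.append((lineno, n_tokens))
    return too_long
```

Running this over every split (src and trg) should surface the 12805-token line behind the error above.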
