Encountering error mid-training #100
Comments
Hi @renziver, it seems like the validation batch is empty. Could you please double-check that the path to the validation data is valid? The training log should report the size of the validation data; maybe that can help us debug.
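For example, a quick sanity check along these lines can confirm the files exist and are non-empty (the paths below are placeholders — substitute whatever dev files your config points to):

```python
import os

# Placeholder paths -- substitute the dev paths from your joeynmt config.
dev_src = "data/dev.bpe.src"
dev_trg = "data/dev.bpe.trg"

for path in (dev_src, dev_trg):
    assert os.path.isfile(path), f"missing file: {path}"
    with open(path, encoding="utf-8") as f:
        n_lines = sum(1 for _ in f)
    print(f"{path}: {n_lines} lines")
    assert n_lines > 0, f"empty validation file: {path}"
```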
Hi @juliakreutzer, I checked the config file to see if the path is valid, and it is indeed correct, as verified by the training log:
Thanks, @renziver, I'll take a look. Maybe something broke through the last batch-multiplier update. In the meantime, could you please try with `batch_type: "sentence"` (and a correspondingly smaller `batch_size`)?
Hi @juliakreutzer, I changed the batch type to sentence and adjusted the batch size as well, and a new error showed up:

2020-06-24 04:33:01,547 Epoch 1 Step: 3900 Batch Loss: 4.006904 Tokens per Sec: 6760, Lr: 0.000300

I checked whether it has something to do with my validation pair, but the source and target files have equal numbers of instances:
Hi! Do you have a sentence that is longer than 5000 tokens? The position embeddings might be limited to 5000 positions. If that's the issue, you can make them go up to 6000 or so.
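A quick scan like the one below reports the longest sentence in each file (the paths are placeholders; lengths are counted in whitespace-separated BPE tokens, which is what the position embeddings see):

```python
# Placeholder paths -- point these at your BPE-encoded source/target files.
for path in ("data/train.bpe.src", "data/dev.bpe.src"):
    with open(path, encoding="utf-8") as f:
        lengths = [len(line.split()) for line in f]
    print(f"{path}: max length = {max(lengths)}, "
          f"sentences over 5000 tokens = {sum(l > 5000 for l in lengths)}")
```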
Hi @bastings, should I do that by increasing the embedding dimensions / hidden size of the transformer?
Hi, no, the embedding dimension and hidden size are a separate matter; the limit is on how many positions the position embeddings cover. Check your data for sentences that slipped past the length filter (or raise that position limit). And let us know if that helps.
Hi @bastings, I will take another look at the data to check why the filtering step didn't work. I included a step that filters out sentences longer than 100 tokens, so I will have to double-check it. Thank you for the help.
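For reference, a minimal sketch of such a parallel-corpus length filter might look like this (the file names and the 100-token cap are assumptions taken from the comment above; note that both sides of a pair must pass the check, and that the cap must be applied to the same tokenization the model sees):

```python
MAX_LEN = 100  # token cap described in the comment above

# Placeholder file names for the unfiltered parallel corpus.
with open("train.src", encoding="utf-8") as f_src, \
     open("train.trg", encoding="utf-8") as f_trg, \
     open("train.filtered.src", "w", encoding="utf-8") as out_src, \
     open("train.filtered.trg", "w", encoding="utf-8") as out_trg:
    kept = dropped = 0
    for src, trg in zip(f_src, f_trg):
        # Drop the pair if *either* side exceeds the cap -- filtering only
        # one side is a common way overly long sentences slip through.
        if len(src.split()) <= MAX_LEN and len(trg.split()) <= MAX_LEN:
            out_src.write(src)
            out_trg.write(trg)
            kept += 1
        else:
            dropped += 1
print(f"kept {kept} pairs, dropped {dropped}")
```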
It was indeed an error in my filtering step. Training is now working with sentence-type batches. Thank you, @juliakreutzer and @bastings!
Hi @renziver, I am running into the same runtime error in training:
Describe the bug
Model training crashes after a few thousand steps, at the first validation run (validation_freq: 4000). The training log is shown below:
Logged Output
2020-06-23 03:52:05,935 Epoch 1 Step: 3900 Batch Loss: 4.291916 Tokens per Sec: 2948, Lr: 0.000300
2020-06-23 03:52:16,157 Epoch 1 Step: 4000 Batch Loss: 5.365633 Tokens per Sec: 2889, Lr: 0.000300
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/renz/joeynmt/joeynmt/__main__.py", line 41, in <module>
    main()
  File "/home/renz/joeynmt/joeynmt/__main__.py", line 29, in main
    train(cfg_file=args.config_path)
  File "/home/renz/joeynmt/joeynmt/training.py", line 653, in train
    trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
  File "/home/renz/joeynmt/joeynmt/training.py", line 378, in train_and_validate
    batch_type=self.eval_batch_type
  File "/home/renz/joeynmt/joeynmt/prediction.py", line 88, in validate_on_data
    for valid_batch in iter(valid_iter):
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 236, in process
    padded = self.pad(batch)
  File "/opt/conda/lib/python3.7/site-packages/torchtext/data/field.py", line 254, in pad
    max_len = max(len(x) for x in minibatch)
ValueError: max() arg is an empty sequence
Config file used
data:
    level: bpe
    max_sent_length: 80
    ...
training:
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "plateau"
    patience: 5
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 512
    batch_type: "token"
    eval_batch_size: 256
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 100
    validation_freq: 4000
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/one2many"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3
model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        hidden_size: 512
        ff_size: 2048
        dropout: 0.1
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        hidden_size: 512
        ff_size: 2048
        dropout: 0.1
I tried different token batch sizes, e.g. 4096, 2048, 1028, etc., but I keep encountering the same error. I checked the dataset I used, and it has been preprocessed according to the Sockeye paper, so I am not sure where the error is coming from.