Various InvalidArgumentError in Evaluation #103
It would be great if someone running into this could provide full log files covering both training and evaluation.
Hi Denny Britz, here is one example log file covering both training and evaluation. The dataset I used was the WMT En->Fr translation task; I tokenized it using BPE, as mentioned in the seq2seq tutorial. Also attached is my run script: sample_seq2seq_run_script_sh.txt. Thanks,
Thanks for the log. Everything in the log seems fine to me, unfortunately. Could you try reducing the batch size and see if the error still happens?
I just tried smaller batch sizes of 16 and 8, but I'm getting the same error about the first dimension of the logits and the labels shape. The log files are almost identical to the one attached above.
@dennybritz, thanks for seq2seq and for looking at this. fr-en.log2.txt As you can see below, the reported logits shape keeps increasing with subsequent failing runs. It does reset again when running on a different GPU.
InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [1280,36240] and labels shape [6272]
InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [2304,36240] and labels shape [6272]
InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [4480,36240] and labels shape [6272]
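For context, the failing check is the shape contract of TensorFlow's sparse cross-entropy op: the first dimension of the logits must equal the length of the labels vector. A minimal sketch of that contract (not this repo's actual loss code):

```python
import tensorflow as tf

# Shapes mirror the logs above: logits [batch * time, vocab], labels [batch * time].
logits = tf.zeros([6272, 36240])
labels = tf.zeros([6272], dtype=tf.int32)

# The op requires logits.shape[0] == labels.shape[0]; a mismatch such as
# logits [1280, 36240] vs. labels [6272] produces the error shown above.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
```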
Yes, I think it's a GPU memory sharing issue. It seems likely that it's a bug in Tensorflow. This code doesn't do anything special and just uses the tf.learn estimator that handles all the model construction. Thanks for running these experiments. The fact that the shapes are increasing is very suspicious. I'll create a Tensorflow issue.
Opened a TF issue: tensorflow/tensorflow#8701
Reducing the number of validation records (50 for example) solved the problem for me.
@amirj
Just decrease the number of samples in the validation set (for example the first 50 lines!).
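For anyone unsure how to apply that workaround, a minimal sketch (the file names are placeholders for your own parallel dev-set files):

```python
from itertools import islice

# Keep only the first 50 parallel lines of each dev-set file (paths are placeholders).
N = 50
for name in ["dev.sources.txt", "dev.targets.txt"]:
    with open(name) as src:
        head = list(islice(src, N))
    with open("small." + name, "w") as dst:
        dst.writelines(head)
```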
I also got a similar problem when following the documentation tutorial exactly. GPU: Titan X Pascal. BTW, amirj's suggestion temporarily solves the problem, thanks (it still outputs the error sometimes). See rihardsk's suggestion below, which works in my case.
Using a smaller validation set didn't help for me. I tried decreasing it down to just 10 sentences with no luck. The only way to get the training going was to set the
Removing the buckets line (buckets: 10,20,30,40) from example_configs/train_seq2seq.yml seems to solve the InvalidArgumentError for me.
@chenb67 That's an interesting finding. It's strange because bucketing is disabled during eval anyway and only enabled during training. It could be a scoping issue. I added an extra scope for the input function (#126); can you see if that fixes it with bucketing enabled?
Hi @dennybritz,
Removing buckets didn't work for me. I'm currently running with evaluation disabled and performing a manual run of translation and evaluation every once in a while 😆
I just added a few GPU memory options to the training script in #137. Could you try the following:
See https://www.tensorflow.org/tutorials/using_gpu for what these options do. Does this solve the issue?
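For reference, the TensorFlow session options that flags like these control look roughly as follows; this is a sketch of the standard TF 1.x API, not necessarily the exact flags added in #137:

```python
import tensorflow as tf

session_config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing it all up front.
session_config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may allocate.
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
```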
For everyone having this issue, please answer the following to help debug:
And operating system as well, please.
I tested it with the flags from #137; it didn't help.
And it runs fine on a CPU 😮
One thing I noticed was that the evaluation phase will crash in GPU mode if the BLEU scorer throws an error due to a complete mismatch (see #106). I was able to at least stop it from crashing by setting
I still get a bunch of warnings at evaluation time though:
These are normal Tensorflow/tf.learn warnings and completely OK. It just means that the input data queue is exhausted and you've iterated over all the data.
@electroducer That's very strange. Just to make sure, you were seeing the same error as the other people (a shape mismatch) when the BLEU script didn't exit normally? I have a hard time imagining how these could be related. Maybe it is something related to subprocess and GPUs.
This is the train.py file I used:
Unfortunately my model crashed again after 62,000 training steps with the error message below. Do you happen to know what caused it? Is it due to continuous_train_and_eval perhaps?
Here is a reliable reproduction of this bug. @SwordYork's branch seems to fix it. This is the output from the failure. The last few lines are:
Steps to reproduce:
git clone https://github.com/coventry/seq2seq-replication
cd seq2seq-replication
sudo bash install-nvidia-docker-build-and-run.sh
After much downloading and building, this will start a
This branch runs @SwordYork's fix, which at least runs the evaluation step for several minutes (whereas the default
While this bug is being handled, what's the correct way to train a model without a validation set? I just want to track the loss on the training set and stop the process after a limited number of iterations.
Thanks everyone for helping debug this. As this seems to be an issue with Tensorflow/tf.learn, I'm also not sure what the best way to fix this is. I may override the Estimator class as per @SwordYork's suggestion and just put it in this code for now until tensorflow/tensorflow#8701 is resolved.
Seems ok to me.
#173 has the patch.
Do you happen to know whether it is possible that this training schedule affects the performance of the system? When I used the previous repo, before this fix, I managed to get a BLEU of around 15 with a small model (200,000 training sentences, 2,000-sentence dev set, 2,000-sentence test set, batch size 32, 178,500 training steps). Whenever I have tried to replicate this experiment with the exact same configuration afterwards, I have only managed to get something around 5 BLEU. The only difference in configuration is this new training schedule.
@milanv1 I have not run into this issue. Could you please try the new repo on CPU?
Thanks for your quick reply. The low BLEU scores were achieved after using the new repo :/
@milanv1 I tried using tf 1.0 and tf 1.1 to reproduce the training procedure in the NMT tutorial, but the model always started to overfit at a BLEU score of around ~7.5. I also tried different versions of train.py (before and after the PatchedExperiment commit), and nothing changed. (I didn't train on CPU because that would be way too slow.) So it seems the low BLEU score did not result from the PatchedExperiment commit. @SwordYork Could you please tell me if you have successfully reproduced the tutorial result?
@milanv1, @kyleyeung I think it may be related to the GPU, because @milanv1 could train properly using the previous repo on CPU but fails to train with the new repo on GPU.
I have faced a similar problem while implementing image captioning code in tf 1.1+
@SwordYork Thanks for your contributions.
@liyi193328 You may refer to https://www.tensorflow.org/deploy/distributed; I have replied to you by email.
@SwordYork I may not have expressed myself clearly.
It now runs normally, whether in a truly distributed environment or in a fake distributed environment on a single machine. Lastly, I had misunderstood data-batch parallelism across multiple GPUs on one machine: I thought tf.contrib.estimator would do it automatically, but after reading the tf.contrib.estimator source code, it's clear that we need to do batch parallelism across multiple GPUs ourselves, averaging gradients and losses manually. Anyway, it runs now, thanks.
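For illustration, manual gradient averaging across GPU towers typically looks like the sketch below; this is the generic TF 1.x multi-tower pattern with hypothetical names, not code from this repo:

```python
import tensorflow as tf

def average_gradients(tower_grads):
    """Average per-tower (grad, var) lists; tower_grads[i] comes from GPU i."""
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars], axis=0)
        shared_var = grads_and_vars[0][1]  # variables are shared across towers
        averaged.append((tf.reduce_mean(grads, axis=0), shared_var))
    return averaged
```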
My error is as follows, and I solved this problem by reducing the evaluation size to fewer than 100 examples. `W tensorflow/core/framework/op_kernel.cc:993] Out of range: Reached limit of 1
@dennybritz @SwordYork I'm receiving the following error, which seems related, but distinct:
However, I'm not using
Machine Specifications: macOS Sierra, Version 10.12.5
@RylanSchaeffer Did you modify the input pipeline? There may be some problem with the
@SwordYork I found my bug. Thank you though :)
I have encountered the same issue when I try to create a
@RylanSchaeffer I have run into the same problem as you. Would you mind telling me what the cause of your issue was and how you solved it?
Sadly I don't remember. It was probably some small, trivial mistake.
@sysuzyx @RylanSchaeffer I'm facing a similar issue on CPU as well as GPU. Would you be able to say how you solved it?
I think this codebase is busted (i.e., it has not been working correctly since the edits of April 17th). I'd recommend using Tensor2Tensor or OpenNMT-py. Cheers.
The error manifests in multiple ways, but always during evaluation and always as some kind of shape error:
Possible causes:
It could be related to these Tensorflow issues:
For everyone having this issue, please answer the following to help debug: