Train loss is nan or inf #10
When did it turn into nan or inf? At the beginning of training, or in the middle? Could you please upload the training log here? Thanks!
At the beginning of the training (pruned_loss_scaled = 0) the loss turned into nan. After 10000 num_updates, pruned_loss_scaled was set to 0.1 and the loss turned into inf. Sorry, something went wrong when I uploaded the log.
Do you have any sequences that are too long, i.e. where the number of tokens U is greater than the number of frames T?
The subsampling rate is 4, due to 2 max-pooling layers, so the number of tokens U is unlikely to be greater than T. I put some logs here: epoch 3; loss inf; num updates 16100; lr 0.000704907
At what iteration did the loss become inf, and what kind of model were you using?
The loss became inf at epoch 2, where pruned_loss_scaled is set to 0.1. The ConformerTransducer model is configured as follows:
Other configurations of the joiner are as follows: pruned_loss_scaled = 0 if num_updates <= 10000
Can you dump the input of the batches that lead to the inf loss?
@pkufool perhaps it was not obvious to him how to do this?
@Butterfly-c Suppose you used the pruned loss in the usual way; you can then dump the bad cases as follows.
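(The code snippets from this comment were not preserved in this copy of the thread. Below is a sketch following the pruned-loss pipeline from the fast_rnnt README, with assumed shapes and a stand-in additive joiner; the dump file name matches the pruned_bad_case.pt discussed later.)

```python
import torch
import fast_rnnt

# Assumed dimensions for illustration: batch, frames, symbols, vocab size.
B, T, S, C = 8, 100, 20, 88
blank_id = 0

am = torch.randn(B, T, C)       # encoder output (projected to vocab dim)
lm = torch.randn(B, S + 1, C)   # prediction-network output
symbols = torch.randint(1, C, (B, S))
boundary = torch.zeros(B, 4, dtype=torch.int64)
boundary[:, 2] = S              # per-sequence symbol lengths
boundary[:, 3] = T              # per-sequence frame lengths

# Simple loss; return_grad=True also returns the gradients used to
# derive the pruning bounds.
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_smoothed(
    lm=lm, am=am, symbols=symbols, termination_symbol=blank_id,
    lm_only_scale=0.25, am_only_scale=0.0,
    boundary=boundary, reduction="sum", return_grad=True,
)

# For each frame, keep s_range symbols around the best alignment.
ranges = fast_rnnt.get_rnnt_prune_ranges(
    px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=5,
)

am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
logits = am_pruned + lm_pruned  # stand-in for a real joiner network

pruned_loss = fast_rnnt.rnnt_loss_pruned(
    logits=logits, symbols=symbols, ranges=ranges,
    termination_symbol=blank_id, boundary=boundary, reduction="sum",
)

# Dump the offending batch so it can be replayed for debugging.
if not torch.isfinite(pruned_loss):
    torch.save(
        {"am": am.cpu(), "lm": lm.cpu(), "symbols": symbols.cpu(),
         "boundary": boundary.cpu(), "ranges": ranges.cpu()},
        "pruned_bad_case.pt",
    )
```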
Thanks for your kind reply!
Thanks for your suggestion. I'm trying to upload the pruned_bad_case.pt for you to debug the inf issue. It'll take me some time.
We have compared two models, trained with warp-transducer and fast_rnnt separately, but the GPU usage does not decrease significantly. Intuitively, the training time of the two models is as follows: the models above are both trained on V100-32G 4-GPU machines × 2 (i.e. 8 GPUs).
What do you want to express?
If the sentence is broken into BPE tokens, it is "too long" if the number of BPE tokens is larger than the number of acoustic frames (after subsampling) of this sentence.
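For instance, a minimal check (the function name is hypothetical; the subsampling factor of 4 comes from the two max-pooling layers mentioned above):

```python
def is_too_long(num_frames: int, num_bpe_tokens: int, subsampling: int = 4) -> bool:
    # "Too long": more BPE tokens (U) than acoustic frames after
    # subsampling (T), which leaves no valid RNN-T alignment.
    return num_bpe_tokens > num_frames // subsampling
```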
Some configuration of my environment is as follows:
1. The vocabulary size is 8245, which contains 6726 Chinese characters, 1514 BPE subwords, and 5 special symbols.
Finally, I have another question about the training time. As shown in this paper (https://arxiv.org/abs/2206.13236), the training time per batch of optimized_transducer is over 4 times that of fast_rnnt, but the training time per epoch of optimized_transducer is just 2 times that of fast_rnnt. I really appreciate your reply.
I think the comparisons in the paper may have been just for the core RNN-T loss. They do not count the neural-net forward, which would not be affected by speedups in the loss computation.
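A rough illustration of why a per-batch loss speedup shrinks at the epoch level (the timings below are made up, not measurements from this thread):

```python
# Per-batch time = neural-net forward/backward + loss computation.
t_nn, t_loss = 0.30, 0.20                 # seconds per batch (assumed)
t_total = t_nn + t_loss                   # slower loss implementation
t_fast = t_nn + t_loss / 4                # loss kernel sped up 4x
print(f"overall speedup: {t_total / t_fast:.2f}x")  # ~1.43x, not 4x
```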
Thanks for your reply, which resolved my confusion.
Based on your suggestion, I saved some bad cases. What's interesting is that most of the 'ranges' are all-zero tensors. For example, when the training sample is background music, the label is only one symbol. The lm.shape and am.shape are as follows: Will the training loss become inf when the input and output are unbalanced (i.e. the output is far smaller than the input)?
After filtering the training data as follows, the inf problem has decreased:
Due to network limitations, I will share the pruned_bad_case.pt later.
Does only one sequence have only one symbol, or do all the sequences in one batch have only one symbol?
Based on 40 pruned_bad_case.pt files, all of the bad cases are ones where all the sequences in one batch have only one symbol, and the 'ranges' tensors are all zeros.
OK, thanks! That's it. I think our code did not handle the case where the label sequences are that short (fewer symbols than s_range).
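Until the upstream fix landed, one possible guard (my sketch, not something from this thread) would be to detect and skip such degenerate batches, given a (B,)-shaped tensor of label lengths such as boundary[:, 2]:

```python
import torch

def is_degenerate_batch(symbol_lens: torch.Tensor) -> bool:
    # True when every sequence in the batch has at most one symbol --
    # the pattern seen in all 40 pruned_bad_case.pt files above.
    return bool((symbol_lens <= 1).all())

# In the training loop (sketch):
#   if is_degenerate_batch(boundary[:, 2]):
#       continue  # skip, or fall back to the simple (unpruned) loss
```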
@Butterfly-c If you have problems uploading your bad cases to GitHub, could you send them to me via email (wkang.pku@gmail.com)? I need them to test my fixes. Thanks!
Due to data permissions, I can't share the bad-case information until I get permission; the permission is on the way.
OK, I think there won't be any characters or waveforms in your bad cases, only float and integer numbers. I hope you can get the permissions; I am testing with randomly generated bad cases. Thanks.
OK, I will contact you as soon as I get the permission.
After updating fast-rnnt to the "fix_s_range" version, the "inf" problem has been fixed. Thanks!
After using the fast_rnnt loss in my environment, the training loss always fell into nan or inf.
The configuration of my ConformerTransducer environment is as follows:
- optimizer: adam
- pruned_loss_scaled = 0 if num_updates <= 10000
- pruned_loss_scaled = 0.1 if 10000 < num_updates <= 20000
- pruned_loss_scaled = 1 if num_updates > 20000
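In code form, the warmup schedule above is (the function name is mine; the combined loss in the trailing comment is the usual simple-plus-pruned form and is an assumption about this setup):

```python
def pruned_loss_scale(num_updates: int) -> float:
    # Warmup schedule for the pruned-loss weight described above.
    if num_updates <= 10000:
        return 0.0
    if num_updates <= 20000:
        return 0.1
    return 1.0

# e.g. loss = simple_loss + pruned_loss_scale(num_updates) * pruned_loss
```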
Finally, 6k hours of training data are used to train the RNN-T model. At the warmup stage (i.e. pruned_loss_scaled = 0), the loss always fell into nan; also, when pruned_loss_scaled is set to 0.1, the loss always fell into inf.
Are there any suggestions to solve this problem?