Yes, this is intentional. Basically, there are two options that make sense for CTCLoss:
1. "mean" — average over both sequence length and batch (note that this is the default behavior for PyTorch).
2. Sum losses over sequence length, then average over the batch.
We found empirically that option (2) works best. Longer sequences do have a greater impact on the gradient in this case, but keep in mind that in our setup: (1) we randomly shuffle examples, and (2) we cap the max duration at 16.7 seconds.
But perhaps we should expose "mean" as an option.
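For reference, the two reductions can be sketched directly with `torch.nn.CTCLoss` (a minimal example with made-up shapes; the tensor sizes and random inputs here are illustrative, not from NeMo):

```python
import torch
import torch.nn as nn

# Illustrative shapes: T time steps, N batch elements, C classes (blank = 0)
T, N, C = 50, 4, 20
torch.manual_seed(0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, 11, (N,), dtype=torch.long)

# Option 1: PyTorch's default "mean" — each example's loss is divided by its
# target length, then the result is averaged over the batch.
mean_loss = nn.CTCLoss(reduction="mean")(
    log_probs, targets, input_lengths, target_lengths
)

# Option 2: sum over the sequence, average over the batch only.
# reduction="none" returns one loss per example, already summed over time.
per_example = nn.CTCLoss(reduction="none")(
    log_probs, targets, input_lengths, target_lengths
)
batch_mean = per_example.mean()

# The two differ exactly by the per-example normalization by target length:
assert torch.allclose(mean_loss, (per_example / target_lengths).mean())
```

Option (2) gives longer targets proportionally more weight in the gradient, which is the behavior discussed above.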
Yeah, I wonder whether longer sequences indeed provide more reliable gradients. If that's not the case, then raising the learning rate should have a somewhat similar effect.
In https://github.com/NVIDIA/NeMo/blob/master/collections/nemo_asr/nemo_asr/losses.py#L51-L53 it is mentioned that NeMo does not divide by target length (which also makes losses less comparable between sequences of different sizes), effectively scaling up the learning rate for longer sequences, if I understand correctly.
Could you please comment on this choice (e.g., versus normalizing by sequence length and increasing the learning rate)? Thank you!