
Question: why not divide by target length in CTC loss #68

Closed

vadimkantorov opened this issue Oct 24, 2019 · 2 comments

Comments

@vadimkantorov
Contributor

In https://github.com/NVIDIA/NeMo/blob/master/collections/nemo_asr/nemo_asr/losses.py#L51-L53 it is mentioned that NeMo does not divide by target length. This also makes losses less comparable between sequences of different lengths and, if I understand correctly, effectively scales up the learning rate for longer sequences.

Could you please comment on this choice (e.g. versus normalizing by sequence length and raising the learning rate)? Thank you!

@okuchaiev
Member

Yes, this is intentional. Basically, there are two options that I think make sense for CTCLoss:

  1. "mean": average everything across sequence length and batch (note that this is the default behavior for PyTorch).
  2. Sum losses over sequence length and then average over the batch.

We found empirically that option (2) works best. Longer sequences do have a greater impact in this case, but keep in mind that in our setup: (1) we randomly shuffle examples and (2) cap the max duration at 16.7 seconds.

But, perhaps, we should expose (1) as an option.
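
For concreteness, here is a minimal sketch of the two reductions using PyTorch's built-in CTC loss. The shapes and random tensors below are illustrative placeholders, not NeMo's actual code:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: T = input (time) steps, B = batch size,
# C = number of classes including the blank at index 0, S = max target length.
T, B, C, S = 50, 4, 20, 10
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, S), dtype=torch.long)  # no blanks in targets
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (B,), dtype=torch.long)

# Option (1), PyTorch's default: each per-sample loss is divided by its
# target length, then the result is averaged over the batch.
loss_mean = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                       blank=0, reduction='mean')

# Option (2): keep the per-sample losses (CTC already sums the negative
# log-likelihood over the sequence) and only average over the batch --
# no normalization by target length.
per_sample = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                        blank=0, reduction='none')
loss_sum_then_mean = per_sample.mean()
```

With option (2), a sample with a longer target contributes proportionally more to the gradient, which is the trade-off discussed above.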

@vadimkantorov
Contributor Author

vadimkantorov commented Oct 24, 2019

> (1) we randomly shuffle examples

Don't you sort by duration by default (so that durations are similar within a batch)?
https://github.com/NVIDIA/NeMo/blob/master/collections/nemo_asr/nemo_asr/parts/manifest.py#L129

> But, perhaps, we should expose ("mean") as an option.

Yeah, I wonder if longer sequences indeed provide more reliable gradients. If that's not the case, then raising the learning rate should have a somewhat similar impact.
