
disable_ngram_loss fix for prophetnet #8554

Merged
merged 7 commits into huggingface:master on Nov 19, 2020

Conversation

Zhylkaaa
Contributor

What does this PR do?

This PR fixes disable_ngram_loss behaviour for ProphetNetForConditionalGeneration and is related to #8553

Fixes #8553

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

I guess @patrickvonplaten has been using this model (I saw models on the Hub); sorry if I'm wrong, but there is no one listed to tag for ProphetNet.

@patrickvonplaten
Contributor

Hey @Zhylkaaa,

Thanks a lot for your PR! @qiweizhen - could you maybe take a look and give your opinion on this PR? I don't have much experience with training ProphetNet

@qiweizhen
Contributor

qiweizhen commented Nov 16, 2020

Hi @patrickvonplaten, thanks for informing me. It seems this is still related to the padding tokens (default -100 or padding_idx), which should not contribute to the loss.
In the old version of the code, expend_targets is filled with self.padding_idx, but padding_idx is not passed to the loss function as the ignore index, so the loss is computed incorrectly.

Here @Zhylkaaa makes them consistent. I suggest that 1) the external data-preprocessing padding function, 2) expend_targets here, and 3) the loss function all be kept consistent.

If Hugging Face Transformers uses -100 by default for padding in all the NLG models, then this code can be merged. If it instead uses self.padding_idx by default for all the NLG models, then this code should not be merged; padding_idx should be fed into the loss function instead.
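
For illustration, a minimal sketch of the mismatch with toy tensors (not the actual ProphetNet code): if the expanded targets are filled with padding_idx but the loss function is not told to ignore that index, the padded positions are counted in the loss; filling the ignored positions with -100, PyTorch's default ignore_index, makes them drop out.

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, 4 target positions, the last one is padding.
pad_idx = 0
lprobs = F.log_softmax(torch.randn(4, 10), dim=-1)

# Old behaviour (sketch): ignored positions filled with padding_idx,
# but no ignore_index is passed -> the pad position is counted in the loss.
targets_pad = torch.tensor([5, 2, 7, pad_idx])
wrong_loss = F.nll_loss(lprobs, targets_pad, reduction="sum")

# Fixed behaviour (sketch): ignored positions filled with -100, which is
# nll_loss's default ignore_index, so they are excluded from the loss.
targets_ignored = torch.tensor([5, 2, 7, -100])
right_loss = F.nll_loss(lprobs, targets_ignored, reduction="sum")

print(wrong_loss, right_loss)  # the first value includes the padding position
```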

@Zhylkaaa
Contributor Author

Thanks @qiweizhen for reviewing,
To my knowledge, Hugging Face Transformers uses -100 to indicate tokens ignored during loss calculation. I also wanted to ask whether the reduction strategy matters: Transformers uses reduction='mean' as far as I know, but here it is set to 'sum'.
I also noticed that I haven't changed this behaviour in the other ProphetNet model; I will check whether that is necessary and commit the changes, and I will also write some tests for this behaviour in the near future.

@qiweizhen
Contributor

> Thanks @qiweizhen for reviewing.
> To my knowledge, Hugging Face Transformers uses -100 to indicate tokens ignored during loss calculation. I also wanted to ask whether the reduction strategy matters: Transformers uses reduction='mean' as far as I know, but here it is set to 'sum'.
> I also noticed that I haven't changed this behaviour in the other ProphetNet model; I will check whether that is necessary and commit the changes, and I will also write some tests for this behaviour in the near future.

Thank you for pointing out this "mean" or "sum" problem! This line of code was converted from the Fairseq version of ProphetNet, which uses a sum loss here to be consistent with the [Fairseq Transformer](https://github.com/pytorch/fairseq/blob/v0.9.0/fairseq/criterions/label_smoothed_cross_entropy.py#L26-L27). The reason is that in the Fairseq training pipeline the "mean" operation is done in the trainer: the model returns the sum loss together with sample_size, and Fairseq then computes sum loss / sample_size (the mean).
So I agree that here we should use "mean" as you suggested. Thank you @Zhylkaaa!
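
As a small sketch of the equivalence being discussed (toy tensors, not the library code): returning a summed loss plus a sample size and dividing later, as Fairseq does, gives the same value as letting the loss function average over the non-ignored tokens directly.

```python
import torch
import torch.nn.functional as F

lprobs = F.log_softmax(torch.randn(6, 10), dim=-1)
targets = torch.tensor([1, 4, 2, -100, 3, -100])  # -100 marks ignored positions

# Fairseq-style: return the summed loss and the sample size,
# then divide in the trainer to obtain the mean.
sum_loss = F.nll_loss(lprobs, targets, reduction="sum")
sample_size = targets.ne(-100).sum()
fairseq_mean = sum_loss / sample_size

# Transformers-style: let the loss function average over non-ignored tokens.
mean_loss = F.nll_loss(lprobs, targets, reduction="mean")

assert torch.allclose(fairseq_mean, mean_loss)
```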

@Zhylkaaa
Contributor Author

Hi @qiweizhen, I want to verify: should I average the label smoothing loss instead of summing it, to be consistent with the change of reduction strategy, and should I also change non_pad_mask to a mask that excludes -100? (You can see these changes in the last commit, but I just want to be sure 😁) A sketch of what I mean is below.
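
A rough sketch of the idea with hypothetical names and shapes (`label_smoothed_loss`, a made-up `eps`), not the exact ProphetNet `_compute_loss`: mask the -100 positions out of the smoothing term and average both the NLL and the smoothing term before mixing them.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, targets, eps=0.1, ignore_index=-100):
    """Sketch: label-smoothed NLL loss averaged over non-ignored tokens."""
    lprobs = F.log_softmax(logits, dim=-1)
    # Mean NLL over positions whose target is not ignore_index.
    nll = F.nll_loss(lprobs, targets, reduction="mean", ignore_index=ignore_index)
    # Smoothing term: uniform penalty over the vocabulary, averaged over
    # the positions that are not masked with -100.
    non_masked = targets.ne(ignore_index)
    smooth = -lprobs.sum(dim=-1)[non_masked].mean()
    eps_i = eps / lprobs.size(-1)
    return (1.0 - eps) * nll + eps_i * smooth

# Toy usage: 5 positions, vocabulary of 10, one masked position.
loss = label_smoothed_loss(torch.randn(5, 10), torch.tensor([1, 3, 2, -100, 4]))
```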

Also @patrickvonplaten, I messed up a rebase and had to do a hard reset. Is that ok, or should I close this PR and open a new one that doesn't change the commit history when I finish?

Zhylkaaa added a commit to Zhylkaaa/simpletransformers that referenced this pull request Nov 17, 2020
@julien-c added the "model card" label Nov 19, 2020
@patrickvonplaten removed the "model card" label Nov 19, 2020
@patrickvonplaten
Contributor

Great PR @Zhylkaaa! I did a small refactor and fixed the test. Thanks for your help @qiweizhen

@patrickvonplaten merged commit ca0109b into huggingface:master Nov 19, 2020
moussaKam pushed a commit to moussaKam/transformers that referenced this pull request Nov 20, 2020
* `disable_ngram_loss` fix for prophetnet

* add changes documentation

* fix _compute_loss to use mean reduction and -100 to masked tokens & remove unnecessary arguments

* mean label smoothing loss

* small refactor

* fix test

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
Successfully merging this pull request may close these issues.

disable_ngram_loss doesn't work correctly in ProphetNetForConditionalGeneration