Make RobertaForMaskedLM implementation identical to fairseq #2928
Conversation
TODO: #2913 (comment)
Codecov Report

```
@@            Coverage Diff            @@
##           master    #2928      +/-  ##
=========================================
- Coverage    75.3%    75.3%   -0.01%
=========================================
  Files          94       94
  Lines       15424    15423       -1
=========================================
- Hits        11615    11614       -1
  Misses       3809     3809
```

Continue to review full report at Codecov.
Looks good. I tested it out and the outputs match exactly everywhere I can see. Requested review from @LysandreJik as well. Regarding the test mentioned by @sshleifer, you can just test that a slice of the outputs matches rather than the entire tensor. See here for an example.
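For illustration, here is a rough sketch of that kind of slice-based check. The model name, input sentence, and tolerance are placeholders; in the real test the reference slice would be hard-coded from the fairseq output.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Placeholder model/input; the actual test would hard-code a reference slice
# obtained by running the same input through the original fairseq RoBERTa.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

input_ids = tokenizer.encode("Hello world!", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)[0]

# The actual assertion would look something like:
#   assert torch.allclose(logits[:, :3, :3], expected_slice, atol=1e-4)
print(logits[:, :3, :3])
```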
Thanks, will add tests later. I am still a bit confused why the weights of the embeddings are tied to the LMHead in the original implementation, though. I don't quite get the intention there.
Hm, perhaps this warning message should not be there.
@BramVanroy Where are you getting that warning? I don't see it when I call
You can only see it if your logging level is set to INFO or lower. So you can put the following before loading the model:

```python
import logging

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
```
Oh I see. Looks like the problem is just that the weight param introduced has a different name format than before. Rather than using the functional API as you did here, I would just manually override:

```python
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
if weight is not None:
    self.decoder.weight = weight
```

As you mentioned, it's not a huge issue since the weights are correctly loaded from the embeddings anyway, but it's probably a bit cleaner if the names align.
Re-added decoder name to avoid getting warning messages. In practice, this does not change anything about the model.
…irseq test_lm_inference_identical_to_fairseq compares the output of HuggingFace RoBERTa to a slice of the output tensor of the (original) fairseq RoBERTa
For those interested, I found the answer to the why on Twitter thanks to a helpful comment. Apparently this is common practice and was introduced a while back in Using the Output Embedding to Improve Language Models.
Hi @BramVanroy! I can see there's an issue here but I don't think this is the way to solve it.
We actually do tie the weights together, so there's no need to do any additional tying; we tie the weights for every model that has an LM head (masked or causal).
The issue here is because of the bias I introduced a few weeks ago with #2521. The way I did it means that the bias was actually applied twice.
The correct way to fix it would be to change `x = self.decoder(x) + self.bias` to `x = self.decoder(x)` in the forward method. The bias is already part of the decoder, so no need to apply it once more.
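To make the double application concrete, here is a simplified sketch of the situation; the class and attribute names follow the discussion, not the exact transformers source.

```python
import torch
import torch.nn as nn

class LMHeadSketch(nn.Module):
    """Hypothetical reduction of the LM head being discussed."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)  # carries its own bias
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        self.decoder.bias = self.bias  # the bias parameter is tied into the decoder

    def forward(self, x):
        # Buggy: return self.decoder(x) + self.bias  -> the tied bias is added twice
        # Fixed: the decoder already applies the (tied) bias once
        return self.decoder(x)
```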
Do you want to update your PR, or should I do one to fix it?
Aha, my bad. I thought I finally contributed something useful! 😳 You can open a PR; I'll close this one. (Perhaps the updated test is still useful so that something like this doesn't happen in the future.) Can you link to the lines where the weight tying is happening, though? I must have completely missed it.
Your contributions are more than useful, @BramVanroy, and I'm glad you tried to fix an issue when you discovered one, thank you. To answer your question, the tying is done in the _tie_or_clone_weights method of PreTrainedModel. This method is not directly called by any model class, but it is called by the tie_weights method, which is itself called by init_weights. It is this last method that is called by every model during their instantiation, for example at the end of RobertaForMaskedLM's __init__. This is only the PyTorch way though, the TensorFlow way is different. In TensorFlow, we use a single layer that can be called as an embedding layer or as a linear layer depending on the mode it is given.
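For readers following along, a simplified sketch of that call chain; the method names follow the library, but the bodies are heavily reduced and should not be read as the actual implementation.

```python
import torch.nn as nn

class PreTrainedSketch(nn.Module):
    """Hypothetical reduction of the weight-tying call chain described above."""

    def get_input_embeddings(self):
        raise NotImplementedError  # each model returns its nn.Embedding here

    def get_output_embeddings(self):
        return None  # overridden by models that have an LM head

    def _tie_or_clone_weights(self, output_embeddings, input_embeddings):
        # Share the same Parameter object between the LM head and the embeddings.
        output_embeddings.weight = input_embeddings.weight

    def tie_weights(self):
        output_embeddings = self.get_output_embeddings()
        if output_embeddings is not None:
            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

    def init_weights(self):
        # ... parameter initialization elided ...
        self.tie_weights()  # models call init_weights() at the end of __init__
```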
closes #1874

The implementation of RoBERTa in transformers differs from the original implementation in fairseq, as results showed (cf. #1874). I have documented my findings in #1874 (comment) and made the corresponding changes in this PR. Someone should check, however, that removing get_output_embeddings() does not have any adverse side-effects. In addition, someone who is knowledgeable about TensorFlow should check the TF implementation of RoBERTa, too.