
MPNet: Masked and Permuted Pre-training for Language Understanding #8971

Closed
wants to merge 83 commits

Conversation

StillKeepTry
Contributor

Model addition

MPNet

Model description

MPNet introduces a novel self-supervised objective, masked and permuted language modeling, for language understanding. It inherits the advantages of both masked language modeling (MLM) and permuted language modeling (PLM), addressing the limitations of each, and further reduces the inconsistency between the pre-training and fine-tuning paradigms.
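
As a quick illustration of the model addition, here is a minimal usage sketch. It assumes the MPNetTokenizer and MPNetForMaskedLM classes introduced in this PR and the microsoft/mpnet-base checkpoint; names may differ from the final merged code.

```python
import torch
from transformers import MPNetTokenizer, MPNetForMaskedLM

# Load the pre-trained MPNet checkpoint (assumed to be published as microsoft/mpnet-base)
tokenizer = MPNetTokenizer.from_pretrained("microsoft/mpnet-base")
model = MPNetForMaskedLM.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet combines masked and <mask> language modeling.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Predict the token at the <mask> position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(-1)))
```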

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

patrickvonplaten and others added 30 commits November 28, 2020 19:50
* refactor

* further refactor

* fix the rest tomorrow

* save intermediate

* finish slow tokenizer

* make more tests pass

* finish refactor

* fix comment

* clean further

* fix name

* fix naming

* Update src/transformers/models/reformer/tokenization_reformer.py

* Apply suggestions from code review

* Apply suggestions from code review

* refactor

* fix init tokenizers

* refactor

* improve convert

* refactor

* correct convert slow tokenizer

* final fix for Pegasus Tok

* remove ipdb

* improve links
* Fix minor typos

* Additional typos

* Style fix

Co-authored-by: guyrosin <guyrosin@assist-561.cs.technion.ac.il>
* implement job skipping for doc-only PRs

* silent grep is crucial

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* let's add doc

* let's add code

* revert test commits

* restore

* Better name

* Better name

* Better name

* some more testing

* some more testing

* some more testing

* finish testing
* Migration guide from v3.x to v4.x

* Better wording

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Sylvain's comments

* Better wording.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Add T5 Encoder class for feature extraction

* fix T5 encoder add_start_docstrings indent

* update init with T5 encoder

* update init with TFT5ModelEncoder

* remove TFT5ModelEncoder

* change T5ModelEncoder order in init

* add T5ModelEncoder to transformers init

* clean T5ModelEncoder

* update init with TFT5ModelEncoder

* add TFModelEncoder for Tensorflow

* update init with TFT5ModelEncoder

* Update src/transformers/models/t5/modeling_t5.py

change output from Seq2SeqModelOutput to BaseModelOutput

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* remove encoder_outputs

1. remove encoder_outputs from the function call.
2. remove the encoder_outputs If statement.
3. remove isinstance from return_dict.

* Authorize missing decoder keys

* remove unnecessary input parameters

remove pask_key_values and use_cache

* remove use_cache

remove use_cache from the forward method

* add doctoring for T5 encoder

add doctoring for T5 encoder with T5_ENCODER_INPUTS_DOCSTRING

* change return_dict to dot access

* add T5_ENCODER_INPUTS_DOCSTRING for TF T5

* change TFT5Encoder output type to BaseModelOutput

* remove unnecessary parameters for TFT5Encoder

* remove unnecessary if statement

* add import BaseModelOutput

* fix BaseModelOutput typo to TFBaseModelOutput

* update T5 doc with T5ModelEncoder

* add T5ModelEncoder to tests

* finish pytorch

* finish docs and mt5

* add mtf to init

* fix init

* remove n_positions

* finish PR

* Update src/transformers/models/mt5/modeling_mt5.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/t5/modeling_t5.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/t5/modeling_tf_t5.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/mt5/modeling_tf_mt5.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* make style

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
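
For context, a hedged usage sketch of the encoder-only T5 class added in the commits above (referred to there as T5ModelEncoder; it is assumed here to be exported as T5EncoderModel, with t5-small used purely for illustration):

```python
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("Studies have shown that owning a dog is good for you.", return_tensors="pt")
# Per the commits above, the encoder-only model returns a BaseModelOutput
outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
features = outputs.last_hidden_state  # (batch, sequence_length, d_model)
```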
* Use model.from_pretrained for DataParallel also

When training on multiple GPUs, the code wraps a model with torch.nn.DataParallel. However if the model has custom from_pretrained logic, it does not get applied during load_best_model_at_end.

This commit uses the underlying model during load_best_model_at_end, and re-wraps the loaded model with DataParallel.

If you choose to reject this change, then could you please move this logic to a function, e.g. def load_best_model_checkpoint(best_model_checkpoint) or something similar, so that it can be overridden? (A sketch of such a helper follows below.)

* Fix silly bug

* Address review comments

Thanks for the feedback. I made the change that you proposed, but I also think we should update L811 to check whether `self.model` is an instance of `PreTrainedModel`, otherwise we would still not get into that `if` section, right?
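
A hypothetical sketch of the load_best_model_checkpoint helper suggested above, not the actual Trainer code; it only illustrates unwrapping DataParallel, reloading with from_pretrained when available, and re-wrapping:

```python
import torch
from transformers import PreTrainedModel

def load_best_model_checkpoint(trainer, best_model_checkpoint):
    # Unwrap DataParallel so any custom from_pretrained logic applies to the real model
    model = trainer.model
    wrapped = isinstance(model, torch.nn.DataParallel)
    if wrapped:
        model = model.module

    if isinstance(model, PreTrainedModel):
        model = model.from_pretrained(best_model_checkpoint)
    else:
        state_dict = torch.load(f"{best_model_checkpoint}/pytorch_model.bin", map_location="cpu")
        model.load_state_dict(state_dict)

    # Re-wrap with DataParallel so training/evaluation continues as before
    trainer.model = torch.nn.DataParallel(model) if wrapped else model
    return trainer.model
```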
* Remove deprecated `evalutate_during_training`

* Update src/transformers/training_args_tf.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
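
For reference, a small sketch of the replacement for the removed flag, assuming the evaluation_strategy argument available in this version of TrainingArguments:

```python
from transformers import TrainingArguments

# Instead of the deprecated evaluate_during_training=True, pick an evaluation strategy
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",  # evaluate every eval_steps training steps
    eval_steps=500,
)
```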
* Slightly increase tolerance between pytorch and flax output

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* test_multiple_sentences doesn't require torch

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify parameterization on "jit" to use boolean rather than str

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use `require_torch` on `test_multiple_sentences` because we pull the weight from the hub.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rename "jit" parameter to "use_jit" for (hopefully) making it self-documenting.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove pytest.mark.parametrize which seems to fail in some circumstances

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix unused imports.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Give default parameters values for traced model.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Review comment: Change sentences to sequences

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…ngface#8781)

* NerPipeline (TokenClassification) now outputs offsets of words

- It happens that the offsets are missing, forcing the user to pattern-match the "word" in the input, which is not always feasible. For instance, if a sentence contains the same word twice, there is no way to tell which occurrence is which.
- This PR proposes to fix that by adding 2 new keys to this pipeline's outputs, "start" and "end", which correspond to the string offsets of the word. That means that we should always have the invariant:

```python
input[entity["start"]: entity["end"]] == entity["word"]
```

* Fixing doc style
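
A usage sketch of the new offsets described above, assuming the default NER pipeline checkpoint and the grouped_entities flag available at the time:

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
text = "Hugging Face is based in New York City."
for entity in ner(text):
    # The new "start"/"end" keys recover the exact substring from the input
    print(entity["entity_group"], text[entity["start"]:entity["end"]])
```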
* fix DP case on multi-gpu

* make executable

* test all 3 modes

* use the correct check for distributed

* dp doesn't need a special case

* restore original name

* cleanup
* add CTRLForSequenceClassification

* pass local test

* merge with master

* fix modeling test for sequence classification

* fix deco

* fix assert
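
A minimal usage sketch for the new sequence classification head, assuming the ctrl checkpoint and two labels purely for illustration:

```python
from transformers import CTRLTokenizer, CTRLForSequenceClassification

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLForSequenceClassification.from_pretrained("ctrl", num_labels=2)

inputs = tokenizer("Reviews Rating: 4.0 A solid, well-paced movie.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch, num_labels)
```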
* 2 typos - from_question_encoder_generator_configs

fix 2 typos
from_encoder_generator_configs --> from_question_encoder_generator_configs

* apply make style
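
For clarity, a sketch of the corrected helper name in use, with checkpoint names chosen only for illustration:

```python
from transformers import AutoConfig, RagConfig

question_encoder_config = AutoConfig.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
generator_config = AutoConfig.from_pretrained("facebook/bart-large")

# The fixed name: from_question_encoder_generator_configs (was from_encoder_generator_configs)
rag_config = RagConfig.from_question_encoder_generator_configs(
    question_encoder_config, generator_config
)
```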
…it contains. Fixes huggingface#6582. (huggingface#8860)

Update src/transformers/tokenization_utils_base.py with review fix

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* restore skip

* Revert "Remove deprecated `evalutate_during_training` (huggingface#8852)"

This reverts commit 5530299.

* check that pipeline.git.base_revision is defined before proceeding

* Revert "Revert "Remove deprecated `evalutate_during_training` (huggingface#8852)""

This reverts commit dfec84d.

* check that pipeline.git.base_revision is defined before proceeding

* doc only

* doc + code

* restore

* restore

* typo
* Add a `distributed_env` property to TrainingArguments

* Change name

* Address comment
Collaborator

@sgugger sgugger left a comment


Impressive work, thanks a lot! There are a few last adjustments to make and then it will be good to merge!
Great work on this model!

Comment on lines +657 to +661
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))

# Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
self.decoder.bias = self.bias
Collaborator

We were just saying today with @LysandreJik that this is a bad design, as it causes multiple problems with the equivalent PT/TF models afterwards. The variable self.bias is neither needed nor used, so L658 to L661 should be deleted.

Contributor Author

@sgugger If I directly remove L658 to L661, it seems the model cannot pass test_pt_tf_model_equivalence, since self.decoder.bias would then be randomly initialized. Maybe I can merge L658 to L661 into:

self.decoder.bias = nn.Parameter(torch.zeros(config.vocab_size)). What is your opinion?
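
To make the trade-off discussed here concrete, a standalone sketch contrasting the two layouts (class names are illustrative only, not the PR's code):

```python
import torch
from torch import nn


class LMHeadWithTiedBias(nn.Module):
    """Layout kept in this PR: one tensor is exposed as both `bias` and
    `decoder.bias`, so the bias is correctly resized by `resize_token_embeddings`
    (see the code comment in the diff above)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        self.decoder.bias = self.bias  # tie the two attributes to one tensor


class LMHeadWithMergedBias(nn.Module):
    """Layout proposed in the comment above: a single zero-initialized bias
    registered directly on the decoder, so only `decoder.bias` appears in the
    module's state dict."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.decoder.bias = nn.Parameter(torch.zeros(vocab_size))
```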

Contributor Author

@sgugger It seems that only the original style (L658 to L661) passes the tests.

Collaborator

Ok, then let's forget about this for now, we'll try to fix this on all models in another PR :-)

Additional review comments on src/transformers/models/mpnet/tokenization_mpnet.py (marked outdated/resolved).
@@ -37,6 +37,7 @@
"TFDPRSpanPredictor", # Building part of bigger (tested) model.
"TFElectraMainLayer", # Building part of bigger (tested) model (should it be a TFPreTrainedModel ?)
"TFRobertaForMultipleChoice", # TODO: fix
"TFMPNetForMultipleChoice", # TODO: fix
Collaborator

Do we have someone working on this?

Contributor Author

@sgugger I added one line to address this problem, but I am not sure its style fits the Hugging Face implementation.

https://github.com/StillKeepTry/transformers/blob/78dcc71fd96ec6a04739a422f38bab0532e42a8d/src/transformers/models/mpnet/modeling_tf_mpnet.py#L175

The problem seems to be that the position ids in fairseq are different from other implementations (they start from 2). Since our model is also implemented in fairseq, it hits the same error as RoBERTa.
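
For context, a sketch of the fairseq/RoBERTa convention being referred to, where real tokens are numbered from padding_idx + 1 (i.e. position ids start at 2 when padding_idx is 1); this mirrors the RoBERTa helper rather than the exact MPNet line linked above:

```python
import torch

def create_position_ids_from_input_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # Padding tokens keep `padding_idx`; real tokens are numbered from `padding_idx + 1`,
    # so the first real position id is 2, as in fairseq/RoBERTa.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1) * mask
    return incremental_indices.long() + padding_idx
```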

Contributor Author

@sgugger Besides this, do you have any other comments?

StillKeepTry and others added 8 commits December 8, 2020 12:48
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Contributor

@jplu jplu left a comment

Great work!! I just left a few tiny comments to address before merging.

Additional review comments on src/transformers/models/mpnet/modeling_tf_mpnet.py (marked outdated/resolved).
StillKeepTry and others added 7 commits December 8, 2020 20:53
Co-authored-by: Julien Plu <plu.julien@gmail.com>
Co-authored-by: Julien Plu <plu.julien@gmail.com>
Co-authored-by: Julien Plu <plu.julien@gmail.com>
Co-authored-by: Julien Plu <plu.julien@gmail.com>
Co-authored-by: Julien Plu <plu.julien@gmail.com>
Co-authored-by: Julien Plu <plu.julien@gmail.com>
@StillKeepTry
Contributor Author

@jplu I have addressed your comments now.

@jplu
Contributor

jplu commented Dec 8, 2020

Thanks!!

@sgugger @patrickvonplaten @LysandreJik I see that TFMPNetForPreTraining and MPNetForPreTraining are missing from the TF and PT files. Should they be added? Otherwise it is fine for me :)

@StillKeepTry
Contributor Author

> Thanks!!
>
> @sgugger @patrickvonplaten @LysandreJik I see that TFMPNetForPreTraining and MPNetForPreTraining are missing from the TF and PT files. Should they be added? Otherwise it is fine for me :)

I notice that some other models also lack TFXXXForPreTraining and XXXForPreTraining. I am willing to add them in a follow-up.

@patrickvonplaten patrickvonplaten mentioned this pull request Dec 9, 2020
@patrickvonplaten
Contributor

Hey @StillKeepTry,

we are super sorry, we had a problem with git yesterday, which is why your git history got cluttered with wrong commits earlier. I cleaned your PR and pushed it to a new branch here: #9004 .
It should include all the commits you had earlier. I think we all gave our thumbs-up, so we could merge that other pull request to master (which would require the least amount of work from your side).

However, if you want to be the main author of the PR (which is 100% understandable and is what I would want!), can you follow these steps to open a new, clean PR exactly like the previous one:

In your repo (https://github.com/StillKeepTry/transformers), assuming that the remote pointing to the original Hugging Face repo (https://github.com/huggingface/transformers.git) is called upstream:

$ git fetch upstream
$ git checkout upstream/master
$ git checkout -b add_mp_net_new
# now we'll cherry-pick all of your commits
$ git cherry-pick 7361516^..78dcc71
$ git push
# => now you should be able to open a new PR with exactly the commits you had previously

Lemme know if you need help doing this (or if you don't mind merging #9004 - but it would be fairer to you if you're also officially the main author!).

Big sorry again!

@StillKeepTry
Contributor Author

@patrickvonplaten Never mind, just use your PR. I am happy as long as our work gets merged into master quickly.

@gaceladri

Hello! I cannot find the data collator for permuted and masked language modeling. Was it also added to Hugging Face? Is there already a proposed way to use this collator with the Trainer?

Thanks!

@LysandreJik
Member

@gaceladri we have an example for permutation language modeling, check it out here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_plm.py
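
For reference, a sketch of how the permutation language modeling collator used by run_plm.py is typically instantiated (xlnet-base-cased is assumed here only as an example checkpoint):

```python
from transformers import DataCollatorForPermutationLanguageModeling, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
data_collator = DataCollatorForPermutationLanguageModeling(
    tokenizer=tokenizer,
    plm_probability=1 / 6,  # ratio of masked-span length to surrounding context length
    max_span_length=5,      # maximum length of a span of masked tokens
)
```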

@gaceladri

Hi @LysandreJik, thank you for your kind response. The data collator you pointed me to is the permutation language modeling collator used for XLNet, right? I am not sure it can replicate MPNet, which masks tokens (not just indices) and also applies the permutation. Surely I am misunderstanding something...
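
To make the distinction concrete, here is a purely conceptual sketch (not part of this PR or the library) of the masking-plus-permutation step described in the MPNet paper: permute the sequence, mask the tail ~15% of the permuted order, and keep the original position ids:

```python
import torch

def mpnet_style_permute_and_mask(input_ids: torch.Tensor, mask_token_id: int, pred_ratio: float = 0.15):
    # Conceptual sketch only, not the official MPNet data collator.
    seq_len = input_ids.size(0)
    perm = torch.randperm(seq_len)                 # permuted order of positions
    num_pred = max(1, int(seq_len * pred_ratio))   # number of predicted tokens

    permuted_ids = input_ids[perm].clone()
    labels = torch.full_like(permuted_ids, -100)   # -100 is ignored by the loss
    labels[-num_pred:] = permuted_ids[-num_pred:]  # targets are the original tokens
    permuted_ids[-num_pred:] = mask_token_id       # mask the predicted part of the permutation

    position_ids = perm                            # original positions travel with the tokens
    return permuted_ids, position_ids, labels
```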
