Add barthez model #8393
Conversation
LGTM, you might want to add an integration test for summarization!
tests/test_modeling_barthez.py (outdated)
@require_tokenizers
class BarthezModelIntegrationTest(unittest.TestCase):
    @slow
    def test_output_embeds_base_model(self):
Do you want to add an integration test for summarization/generate?
You might also test that the config is identical to […].
Thank you @sshleifer for your review, I added some additional integration tests.
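A hedged sketch of what such a slow summarization/generate integration test could look like; the checkpoint name, input text, and assertion are illustrative assumptions rather than the actual values added in this PR:

```python
import unittest

from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers.testing_utils import require_torch, slow


@require_torch
class BarthezSummarizationIntegrationTest(unittest.TestCase):
    @slow
    def test_summarization_generate(self):
        # Checkpoint name is assumed for illustration only.
        checkpoint = "moussaKam/barthez-orangesum-abstract"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = BartForConditionalGeneration.from_pretrained(checkpoint)

        article = "Citant ses clients, le groupe annonce une forte hausse de ses résultats."
        inputs = tokenizer(article, return_tensors="pt")
        summary_ids = model.generate(**inputs, num_beams=4, max_length=40, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # A real integration test would compare against a fixed expected summary string.
        self.assertTrue(len(summary) > 0)
```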
Thanks for adding the model, it looks very cool! I mostly have nit-picking comments for the documentation.
I'm a bit surprised by the complexity of the tokenizer, however. Why is there both a sentencepiece model and a separate vocab file, which forces you to override private methods instead of using the tokenizer API (and probably prevents a fast tokenizer)? Can't we just change the spm file to have the proper vocabulary?
Hi @sgugger, thank you for your review, I added all the proposed changes. As for the tokenizer, the reason for having a vocab file is that mBARThez uses the same sentencepiece tokenizer as mBART while discarding tokens with non-Latin characters from the embedding layers, so the token-to-id mapping has changed. I am not sure it is possible to change the sentencepiece model itself, so I think we can keep it like that for the moment. Please let me know if you would like to recommend any other changes.
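To illustrate the remapping described above, here is a minimal, hypothetical sketch (not the actual BarthezTokenizer code) of how a separate vocab.json can sit on top of the original sentencepiece model and map pieces to the reduced id space; the file layout and unk token are assumptions:

```python
import json

import sentencepiece as spm


class RemappedSpmTokenizer:
    """Hypothetical sketch: tokenize with the original spm model, but look up ids in a reduced vocab.json."""

    def __init__(self, spm_file, vocab_file, unk_token="<unk>"):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(spm_file)
        # vocab.json maps piece -> new id; pieces with non-Latin characters were dropped upstream.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = json.load(f)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        self.unk_token = unk_token

    def tokenize(self, text):
        return self.sp.EncodeAsPieces(text)

    def convert_token_to_id(self, token):
        # The id comes from the remapped vocabulary, not from the spm model itself.
        return self.vocab.get(token, self.vocab[self.unk_token])

    def convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, self.unk_token)
```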
VOCAB_FILES_NAMES = {"sentence_piece_model": "sentencepiece.bpe.model", "vocab_file": "vocab.json"}

PRETRAINED_VOCAB_FILES_MAP = {
(This does not have to be changed in this PR.) @julien-c I think we don't actually need this PRETRAINED_VOCAB_FILES_MAP at all anymore, no? The better way would be to just move all tokenizer files to the respective folders, I think.
Both @LysandreJik and @thomwolf said they'll take a look :)
Awesome! Good to merge IMO.
logger = logging.get_logger(__name__)

BARTHEZ_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
Suggested change:
-    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
+    "moussaKam/barthez": "https://huggingface.co/moussaKam/barthez/resolve/main/config.json",
Update the URL to the current scheme (even though those URLs shouldn't be used anymore...)
Done in 0a77338
This looks good, thanks for your contribution! I don't think the models need to be defined, however. They're basically a complete copy of the BART model, with no changes to the configuration either; the main difference is the tokenizer.
Since #6995, tokenizers can be decoupled from their models, which would be ideal for this PR. Instead of redefining all the BARThez models, which essentially inherit from BART, you would simply load the model checkpoint into the BART architecture. There would still be a BarthezTokenizer, however, as this requires additional code.
You can take inspiration from the Phobert model: its configuration file mentions that it is based on a RoBERTa implementation, but it still leverages a PhobertTokenizer. You can see its implementation details here. You can also take inspiration from Herbert here, which leveraged #6995 as well.
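As a rough usage sketch of that decoupled setup (class and checkpoint names follow this PR but are assumptions, not the final API), the checkpoint is loaded into the generic BART architecture while only the tokenizer stays BARThez-specific:

```python
from transformers import AutoModelForSeq2SeqLM, BarthezTokenizer

# The checkpoint's config declares the BART architecture, so the auto class
# resolves to BartForConditionalGeneration; only the tokenizer is BARThez-specific.
tokenizer = BarthezTokenizer.from_pretrained("moussaKam/barthez")
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")

inputs = tokenizer("Paris est la capitale de la <mask>.", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```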
        Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
    activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
        The non-linear activation function (function or string) in the encoder and pooler. If string,
        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
Suggested change:
-        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
They're the same function, but SiLU is the older name :) see #8100 for more information.
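For reference, both names denote x * sigmoid(x); a quick sketch of the equivalence (assumes a PyTorch version that ships torch.nn.functional.silu):

```python
import torch


def swish(x):
    # "swish" and "SiLU" are the same activation: x * sigmoid(x).
    return x * torch.sigmoid(x)


x = torch.randn(3)
assert torch.allclose(swish(x), torch.nn.functional.silu(x))
```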
Hi @LysandreJik, thank you for the review. Yes, you're right; I was actually hesitating over whether to redefine the BARThez models or not. Anyway, I modified the code as requested. I hope it's OK now.
@LysandreJik @julien-c Please let me know if any other modifications are required. Thank you in advance :)
Hi @moussaKam! Sorry about getting back to you so late. The issue with this PR is with the implementation of the tokenizer and its fast tokenizer counterpart: right now there is no fast tokenizer for BARThez. It seems that the […]. Do you think you could take a look at the […]?
Hi @LysandreJik, I added the fast tokenizer, thank you for the tip! Please let me know if we're good now! :)
tests/test_tokenization_barthez.py (outdated)
class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    test_rust_tokenizer = False
Thanks for implementing the fast tokenizer! It would be nice to add it to the tests. Could you switch that to True and test the fast tokenizer as well? Thanks!
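A sketch of the requested change, assuming the fast class is named BarthezTokenizerFast and following the layout of the other tokenizer test files in the repo:

```python
import unittest

from transformers import BarthezTokenizer, BarthezTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_tokenizers

# Repo-relative import, as used by the other tokenizer tests.
from .test_tokenization_common import TokenizerTesterMixin


@require_sentencepiece
@require_tokenizers
class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    rust_tokenizer_class = BarthezTokenizerFast
    test_rust_tokenizer = True
```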
Thank you @LysandreJik, I switched to the fast tokenization test. However, I noticed that if the sentencepiece model uses BPE tokenization instead of unigram (model_type = 2, i.e. the elif model_type == 2: branch of the conversion code), the conversion to the fast tokenizer is slow, and the integration test is slow as well. In my case the BARThez model uses type 2. Do you think it is possible to save the tokenizer.json when calling tokenizer.save_pretrained(), so that the conversion is not performed again when loading the tokenizer from a saved directory?
Indeed, that takes a while. Let me have a look.
You can use legacy_format=False to save the tokenizer.json file directly, but I feel like this should be done automatically for the fast tokenizers. Looking into it.
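A small sketch of the workaround being discussed (checkpoint name assumed): saving with legacy_format=False writes the serialized tokenizer.json, so reloading from that directory skips the slow spm-to-fast conversion.

```python
from transformers import BarthezTokenizerFast

# The first load converts the sentencepiece BPE model to the fast backend (slow when model_type == 2).
tokenizer = BarthezTokenizerFast.from_pretrained("moussaKam/barthez")

# Saving with legacy_format=False writes tokenizer.json alongside the other tokenizer files.
tokenizer.save_pretrained("./barthez-tokenizer", legacy_format=False)

# Subsequent loads pick up tokenizer.json directly and skip the conversion step.
tokenizer = BarthezTokenizerFast.from_pretrained("./barthez-tokenizer")
```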
Hi @LysandreJik, do you think we still need any changes?
I think we can merge it as it is right now, and handle the legacy_format=False saving separately.
* Add init barthez
* Add barthez model, tokenizer and docs

  BARThez is a pre-trained french seq2seq model that uses BART objective.
* Apply suggestions from code review

  docs typos

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Add license
* Change URLs scheme
* Remove barthez model keep tokenizer
* Fix style
* Fix quality
* Update tokenizer
* Add fast tokenizer
* Add fast tokenizer test

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
Add the BARThez model, tokenizer, and docs. BARThez is a French seq2seq model that uses the BART objective and architecture (https://arxiv.org/abs/2010.12321).
@patrickvonplaten @sshleifer