
Add barthez model #8393

Merged: 11 commits from add-barthez-model into huggingface:master on Nov 27, 2020
Conversation

@moussaKam (Contributor) commented Nov 7, 2020:

What does this PR do?

Add BARThez models, tokenizer and docs. BARThez is a French seq2seq model that uses the BART objective and architecture (https://arxiv.org/abs/2010.12321).

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@patrickvonplaten @sshleifer
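For context, a minimal usage sketch of loading the kind of checkpoint this PR adds, assuming the `moussaKam/barthez` checkpoint name that appears later in this thread; the French prompt and generation settings are purely illustrative:

```python
# Hedged sketch, not part of the PR: load the BARThez checkpoint through the
# generic auto classes and run BART-style mask infilling.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")

# The sentence and the generation parameters are illustrative assumptions.
input_ids = tokenizer("Paris est la <mask> de la France.", return_tensors="pt").input_ids
generated = model.generate(input_ids, max_length=20, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```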

@sshleifer (Contributor) left a comment:

LGTM, you might want to add an integration test for summarization!

@require_tokenizers
class BarthezModelIntegrationTest(unittest.TestCase):
    @slow
    def test_output_embeds_base_model(self):
Inline review comment on the snippet above:
Do you want to add an integration test for summarization/generate?
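A hedged sketch of what such a generation integration test might look like, following the pattern of the class quoted above; the checkpoint name, prompt and assertion are assumptions rather than code from this PR:

```python
import unittest

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.testing_utils import require_tokenizers, require_torch, slow


@require_torch
@require_tokenizers
class BarthezGenerationIntegrationTest(unittest.TestCase):
    @slow
    def test_generate(self):
        # Hypothetical checkpoint name and prompt, used only for illustration.
        tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
        model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")

        input_ids = tokenizer(
            "Paris est la <mask> de la France.", return_tensors="pt"
        ).input_ids
        generated = model.generate(input_ids, num_beams=2, max_length=20)

        # Only check that generation runs and returns a single sequence of ids.
        self.assertEqual(generated.shape[0], 1)
```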

(Resolved, outdated review thread on tests/test_tokenization_barthez.py.)
@sshleifer (Contributor) commented:

You might also test that the config is identical to BartConfig.
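A rough sketch of what that check could look like at this stage of the PR, when a separate BarthezConfig still existed (it was later removed in favor of reusing BART's classes); the import and field handling are assumptions:

```python
import unittest

from transformers import BartConfig


class BarthezConfigEquivalenceTest(unittest.TestCase):
    def test_default_config_matches_bart(self):
        # BarthezConfig only existed in early revisions of this PR; this import
        # is hypothetical and would not work against the merged library.
        from transformers import BarthezConfig

        barthez = BarthezConfig().to_dict()
        bart = BartConfig().to_dict()
        for cfg in (barthez, bart):
            cfg.pop("model_type", None)  # expected to legitimately differ
        self.assertEqual(barthez, bart)
```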

@moussaKam (Contributor, Author) commented:

Thank you @sshleifer for your review, I added some additional integration tests.

@sgugger (Collaborator) left a comment:

Thanks for adding the model, it looks very cool! I mostly have nit-picking comments for the documentation.

I'm a bit surprised by the complexity of the tokenizer, however. Why is there a sentencepiece model plus a separate vocab file, which forces you to override private methods instead of using the tokenizer API (and probably prevents a fast tokenizer)? Can't we just change the spm file to have the proper vocabulary?

Resolved (outdated) review threads on:
  • docs/source/model_doc/barthez.rst (2)
  • docs/source/pretrained_models.rst
  • src/transformers/configuration_auto.py
  • src/transformers/configuration_barthez.py
  • src/transformers/tokenization_barthez.py (5)
@moussaKam (Contributor, Author) commented:

Hi @sgugger, thank you for your review, I added all the proposed changes.

As for the tokenizer, the reason for having a vocab file is that mBARThez uses the same sentencepiece tokenizer as mBART while discarding the tokens with non-Latin characters from the embedding layers, so the token-to-id mapping has changed. I am not sure whether it is possible to change the sentencepiece model itself; in any case, I think we can keep it like that for the moment.

Please let me know if you would like to recommend any other changes.
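To make that remapping concrete, here is a hedged sketch of the idea: the original sentencepiece model still splits text into pieces, but ids come from a reduced vocab.json. File names follow the VOCAB_FILES_NAMES snippet quoted just below; the presence of an `<unk>` entry in vocab.json is an assumption.

```python
import json

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("sentencepiece.bpe.model")   # original (mBART) sentencepiece model

with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)                   # token -> remapped id (reduced vocabulary)
unk_id = vocab["<unk>"]                    # assumes an <unk> entry exists


def encode(text):
    pieces = sp_model.EncodeAsPieces(text)
    # Ids come from the reduced vocab, not from the sentencepiece model itself.
    return [vocab.get(piece, unk_id) for piece in pieces]
```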


VOCAB_FILES_NAMES = {"sentence_piece_model": "sentencepiece.bpe.model", "vocab_file": "vocab.json"}

PRETRAINED_VOCAB_FILES_MAP = {
Inline comment (Contributor) on the snippet above:
(This does not have to be changed in this PR.) @julien-c I think we don't actually need this PRETRAINED_VOCAB_FILES_MAP at all anymore, no? The better way would be to just move all the tokenizer files to their respective folders, I think.

Reply (Member):
Both @LysandreJik and @thomwolf said they'll take a look :)

@patrickvonplaten (Contributor) left a comment:
Awesome! Good to merge IMO.

logger = logging.get_logger(__name__)

BARTHEZ_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
Suggested change (Member):
-    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
+    "moussaKam/barthez": "https://huggingface.co/moussaKam/barthez/resolve/main/config.json",

Follow-up (Member): Update the URL to the current scheme (even though those URLs shouldn't be used anymore...).

@moussaKam (Contributor, Author) replied:
Done in 0a77338

@LysandreJik (Member) left a comment:
This looks good, thanks for your contribution! I don't think the models need to be defined, however. They're basically a complete copy of the BART model, with no changes done to the configuration either. The main difference is the tokenizer.

Since #6995, tokenizers can be decoupled from their models, which would be ideal for this PR.

Instead of redefining all the BARThez models, which basically inherit from BART, you would simply load the model checkpoint into the BART architecture. However, there would still be a BarthezTokenizer, as this requires additional code.

You can take inspiration from the Phobert model. See how its configuration file mentions that it is based on a RoBERTa implementation, but still leverages a PhobertTokenizer. You can see its implementation details here.

You can also take inspiration from Herbert here, which leveraged #6995 as well.
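A hedged sketch of the decoupling being suggested, assuming the checkpoint name used elsewhere in this PR: the weights and config load into the plain BART classes, while tokenization stays with the dedicated BarthezTokenizer.

```python
# Sketch only; the checkpoint name is the one used elsewhere in this PR.
from transformers import BartForConditionalGeneration, BarthezTokenizer

tokenizer = BarthezTokenizer.from_pretrained("moussaKam/barthez")          # dedicated tokenizer
model = BartForConditionalGeneration.from_pretrained("moussaKam/barthez")  # plain BART model class
```

Usage is then identical to any other BART checkpoint; only the tokenizer class differs.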

Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
Suggested change (Member):
- :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+ :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.

They're the same function, but SiLU is the older name :) See #8100 for more information.

@moussaKam (Contributor, Author) commented:

Hi @LysandreJik, thank you for the review. Yes, you're right; I was actually hesitating whether to redefine the BARThez models or not. Anyway, I modified the code as requested. I hope it's OK now.

@moussaKam force-pushed the add-barthez-model branch 2 times, most recently from 946ef4d to daa2243, on November 18, 2020, 21:29.
@moussaKam (Contributor, Author) commented:

@LysandreJik @julien-c Please let me know if any other modifications are required. Thank you in advance :)

@LysandreJik (Member) commented:

Hi @moussaKam! Sorry about getting back to you so late. The issue with this PR is with the implementation of the tokenizer and its fast tokenizer counterpart. Right now there is no BarthezTokenizerFast, which we would really need as SentencePiece is not installed by default anymore.

It seems that the BarthezTokenizer is very similar to the XLMRobertaTokenizer, so I'm trying to see if we can't do a similar conversion between your tokenizer and the resulting tokenizer. It seems completely feasible.

Do you think you could take a look at the convert_slow_tokenizers_checkpoints_to_fast.py module, and at the XLMRobertaConverter object available in the convert_slow_tokenizer.py module, to see if such a conversion is possible? Thank you.
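For readers unfamiliar with that module, a hedged sketch of what such a converter could look like, loosely modeled on XLMRobertaConverter; the special-token layout, offsets and templates are assumptions, not necessarily the code that ended up in the PR.

```python
from tokenizers import processors

from transformers.convert_slow_tokenizer import SpmConverter


class BarthezConverter(SpmConverter):
    """Illustrative sketch only; special-token layout and offsets are assumptions."""

    def vocab(self, proto):
        # Special tokens first, then the sentencepiece pieces (assumed layout).
        vocab = [("<s>", 0.0), ("<pad>", 0.0), ("</s>", 0.0), ("<unk>", 0.0)]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        vocab += [("<mask>", 0.0)]
        return vocab

    def unk_id(self, proto):
        return 3  # position of <unk> in the assumed layout above

    def post_processor(self):
        return processors.TemplateProcessing(
            single="<s> $A </s>",
            pair="<s> $A </s> </s> $B </s>",
            special_tokens=[
                ("<s>", self.original_tokenizer.convert_tokens_to_ids("<s>")),
                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
            ],
        )
```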

@moussaKam force-pushed the add-barthez-model branch 2 times, most recently from 9a872cd to 3fa5d0e, on November 20, 2020, 17:17.
moussaKam and others added 4 commits on November 20, 2020, 18:31:
BARThez is a pre-trained French seq2seq model that uses the BART objective.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@moussaKam (Contributor, Author) commented:

Hi @LysandreJik, I added the fast tokenizer, thank you for the tip! Please let me know if we're good now! :)

class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    test_rust_tokenizer = False
@LysandreJik (Member) commented on this snippet:
Thanks for implementing the fast tokenizer! It would be nice to add it to the tests.

Could you switch that to True and test the fast tokenizer as well? Thanks!
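A hedged sketch of the requested change to the test class quoted above, following the usual TokenizerTesterMixin pattern; the exact decorators and attributes in the final PR may differ.

```python
import unittest

from transformers import BarthezTokenizer, BarthezTokenizerFast
from transformers.testing_utils import require_tokenizers

from .test_tokenization_common import TokenizerTesterMixin


@require_tokenizers
class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    rust_tokenizer_class = BarthezTokenizerFast  # fast counterpart added in this PR
    test_rust_tokenizer = True                   # now exercise the fast tokenizer as well
```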

@moussaKam (Contributor, Author) replied:

Thank you @LysandreJik, I switched to the fast tokenization test. However, I noticed that if the sentencepiece model uses BPE tokenization instead of unigram (model_type = 2), the conversion to the fast tokenizer is slow, and the integration test is slow as well. In my case the BARThez model uses type 2. Do you think it is possible to save the tokenizer.json when calling tokenizer.save_pretrained(), to avoid the conversion being performed again when loading the tokenizer from a saved directory?

Reply (Member):
Indeed, that takes a while. Let me have a look.

Follow-up (Member):
You can use legacy_format=False to save the tokenizer.json file directly, but I feel like this should be automatically done for the fast tokenizers. Looking into it.
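A minimal sketch of that workaround, assuming the fast tokenizer class from this PR and an arbitrary local directory name:

```python
from transformers import BarthezTokenizerFast

# First load still goes through the slow spm -> tokenizers conversion.
tokenizer = BarthezTokenizerFast.from_pretrained("moussaKam/barthez")

# legacy_format=False writes a tokenizer.json alongside the other files.
tokenizer.save_pretrained("barthez-tokenizer", legacy_format=False)

# Subsequent loads pick up tokenizer.json directly and skip the conversion.
tokenizer = BarthezTokenizerFast.from_pretrained("barthez-tokenizer")
```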

@moussaKam (Contributor, Author) commented:

Hi @LysandreJik, do you think we still need any changes?

@LysandreJik (Member) commented:

I think we can merge it as it is right now and use legacy_format=False when saving the slow tokenizer. We're thinking of a way to enable this by default for fast tokenizers, but that isn't blocking for this PR. Thanks!

@LysandreJik LysandreJik merged commit 81fe0bf into huggingface:master Nov 27, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request on Dec 5, 2020:
* Add init barthez

* Add barthez model, tokenizer and docs

BARThez is a pre-trained French seq2seq model that uses the BART objective.

* Apply suggestions from code review (docs typos)

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add license

* Change URLs scheme

* Remove barthez model keep tokenizer

* Fix style

* Fix quality

* Update tokenizer

* Add fast tokenizer

* Add fast tokenizer test

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>