Add barthez model #8393
Conversation
LGTM, you might want to add an integration test for summarization!
tests/test_modeling_barthez.py (outdated)
@require_tokenizers
class BarthezModelIntegrationTest(unittest.TestCase):
    @slow
    def test_output_embeds_base_model(self):
Do you want to add an integration test for summarization/generate?
You might also test that the config is identical to […].
Thank you @sshleifer for your review, I added some additional integration tests.
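A hedged sketch of what such a slow summarization/generate integration test could look like; the checkpoint name, input text, and assertion are illustrative assumptions rather than the actual values added in this PR:

```python
import unittest

from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers.testing_utils import require_torch, slow


@require_torch
class BarthezSummarizationIntegrationTest(unittest.TestCase):
    @slow
    def test_summarization_generate(self):
        # Checkpoint name is assumed for illustration only.
        checkpoint = "moussaKam/barthez-orangesum-abstract"
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = BartForConditionalGeneration.from_pretrained(checkpoint)

        article = "Citant ses clients, le groupe annonce une forte hausse de ses résultats."
        inputs = tokenizer(article, return_tensors="pt")
        summary_ids = model.generate(**inputs, num_beams=4, max_length=40, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        # A real integration test would compare against a fixed expected summary string.
        self.assertTrue(len(summary) > 0)
```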
Thanks for adding the model, it looks very cool! I mostly have nit-picking comments for the documentation.
I'm a bit surprised by the complexity of the tokenizer, however. Why is there both a sentencepiece model and a separate vocab file, which forces you to override private methods instead of using the tokenizer API (and probably prevents a fast tokenizer)? Can't we just change the spm file to have the proper vocabulary?
Hi @sgugger, thank you for your review, I added all the proposed changes. As for the tokenizer, the reason for having a vocab file is that mBARThez uses the same sentencepiece tokenizer as mBART while discarding tokens with non-Latin characters from the embedding layers, so the token-to-id mapping has changed. I am not sure it is possible to change the sentencepiece model itself, so I think we can keep it like that for the moment. Please let me know if you would like to recommend any other changes.
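To illustrate the remapping described above, here is a minimal, hypothetical sketch (not the actual BarthezTokenizer code) of how a separate vocab.json can sit on top of the original sentencepiece model and map pieces to the reduced id space; the file layout and unk token are assumptions:

```python
import json

import sentencepiece as spm


class RemappedSpmTokenizer:
    """Hypothetical sketch: tokenize with the original spm model, but look up ids in a reduced vocab.json."""

    def __init__(self, spm_file, vocab_file, unk_token="<unk>"):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(spm_file)
        # vocab.json maps piece -> new id; pieces with non-Latin characters were dropped upstream.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = json.load(f)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        self.unk_token = unk_token

    def tokenize(self, text):
        return self.sp.EncodeAsPieces(text)

    def convert_token_to_id(self, token):
        # The id comes from the remapped vocabulary, not from the spm model itself.
        return self.vocab.get(token, self.vocab[self.unk_token])

    def convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, self.unk_token)
```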
VOCAB_FILES_NAMES = {"sentence_piece_model": "sentencepiece.bpe.model", "vocab_file": "vocab.json"}

PRETRAINED_VOCAB_FILES_MAP = {
(This does not have to be changed in this PR.) @julien-c I think we don't actually need this PRETRAINED_VOCAB_FILES_MAP at all anymore, no? The better way would be to just move all tokenizer files to the respective folders, I think.
Both @LysandreJik and @thomwolf said they'll take a look :)
Awesome! Good to merge IMO.
logger = logging.get_logger(__name__)

BARTHEZ_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
Suggested change:
-    "moussaKam/barthez": "https://s3.amazonaws.com/models.huggingface.co/bert/moussaKam/barthez/config.json",
+    "moussaKam/barthez": "https://huggingface.co/moussaKam/barthez/resolve/main/config.json",
Update the URL to the current scheme (even though those URLs shouldn't be used anymore...)
Done in 0a77338
This looks good, thanks for your contribution! I don't think the models need to be defined, however. They're basically a complete copy of the BART model, with no changes to the configuration either; the main difference is the tokenizer.
Since #6995, tokenizers can be decoupled from their models, which would be ideal for this PR. Instead of redefining all the BARThez models, which essentially inherit from BART, you would simply load the model checkpoint into the BART architecture. There would still be a BarthezTokenizer, however, as this requires additional code.
You can take inspiration from the Phobert model: its configuration file mentions that it is based on a RoBERTa implementation, but it still leverages a PhobertTokenizer. You can see its implementation details here. You can also take inspiration from Herbert here, which leveraged #6995 as well.
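As a rough usage sketch of that decoupled setup (class and checkpoint names follow this PR but are assumptions, not the final API), the checkpoint is loaded into the generic BART architecture while only the tokenizer stays BARThez-specific:

```python
from transformers import AutoModelForSeq2SeqLM, BarthezTokenizer

# The checkpoint's config declares the BART architecture, so the auto class
# resolves to BartForConditionalGeneration; only the tokenizer is BARThez-specific.
tokenizer = BarthezTokenizer.from_pretrained("moussaKam/barthez")
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")

inputs = tokenizer("Paris est la capitale de la <mask>.", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```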
        Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
    activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
        The non-linear activation function (function or string) in the encoder and pooler. If string,
        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
Suggested change:
-        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
They're the same function, but SiLU is the older name :) see #8100 for more information.
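For reference, both names denote x * sigmoid(x); a quick sketch of the equivalence (assumes a PyTorch version that ships torch.nn.functional.silu):

```python
import torch


def swish(x):
    # "swish" and "SiLU" are the same activation: x * sigmoid(x).
    return x * torch.sigmoid(x)


x = torch.randn(3)
assert torch.allclose(swish(x), torch.nn.functional.silu(x))
```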
Hi @LysandreJik, thank you for the review. Yes, you're right; I was actually hesitating over whether to redefine the BARThez models or not. Anyway, I modified the code as requested. I hope it's OK now.
@LysandreJik @julien-c Please let me know if any other modifications are required. Thank you in advance :)
Hi @moussaKam! Sorry about getting back to you so late. The issue with this PR is with the implementation of the tokenizer and its fast tokenizer counterpart: right now there is no fast tokenizer for BARThez. It seems that the […]. Do you think you could take a look at the […]?
Hi @LysandreJik, I added the fast tokenizer, thank you for the tip! Please let me know if we're good now! :)
tests/test_tokenization_barthez.py (outdated)
class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    test_rust_tokenizer = False
Thanks for implementing the fast tokenizer! It would be nice to add it to the tests. Could you switch that to True and test the fast tokenizer as well? Thanks!
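A sketch of the requested change, assuming the fast class is named BarthezTokenizerFast and following the layout of the other tokenizer test files in the repo:

```python
import unittest

from transformers import BarthezTokenizer, BarthezTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_tokenizers

# Repo-relative import, as used by the other tokenizer tests.
from .test_tokenization_common import TokenizerTesterMixin


@require_sentencepiece
@require_tokenizers
class BarthezTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = BarthezTokenizer
    rust_tokenizer_class = BarthezTokenizerFast
    test_rust_tokenizer = True
```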
Thank you @LysandreJik, I switched to the fast tokenization test. However, I noticed that if the sentencepiece model uses BPE tokenization instead of unigram (model_type = 2, i.e. the elif model_type == 2: branch of the conversion code), the conversion to the fast tokenizer is slow, and the integration test is slow as well. In my case the BARThez model uses type 2. Do you think it is possible to save the tokenizer.json when calling tokenizer.save_pretrained(), so that the conversion is not performed again when loading the tokenizer from a saved directory?
Indeed, that takes a while. Let me have a look.
You can use legacy_format=False to save the tokenizer.json file directly, but I feel like this should be done automatically for the fast tokenizers. Looking into it.
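A small sketch of the workaround being discussed (checkpoint name assumed): saving with legacy_format=False writes the serialized tokenizer.json, so reloading from that directory skips the slow spm-to-fast conversion.

```python
from transformers import BarthezTokenizerFast

# The first load converts the sentencepiece BPE model to the fast backend (slow when model_type == 2).
tokenizer = BarthezTokenizerFast.from_pretrained("moussaKam/barthez")

# Saving with legacy_format=False writes tokenizer.json alongside the other tokenizer files.
tokenizer.save_pretrained("./barthez-tokenizer", legacy_format=False)

# Subsequent loads pick up tokenizer.json directly and skip the conversion step.
tokenizer = BarthezTokenizerFast.from_pretrained("./barthez-tokenizer")
```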
Hi @LysandreJik, do you think we still need any changes?
I think we can merge it as it is right now, and handle the legacy_format=False saving separately.
* Add init barthez
* Add barthez model, tokenizer and docs

  BARThez is a pre-trained french seq2seq model that uses BART objective.
* Apply suggestions from code review

  docs typos

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Add license
* Change URLs scheme
* Remove barthez model keep tokenizer
* Fix style
* Fix quality
* Update tokenizer
* Add fast tokenizer
* Add fast tokenizer test

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
Add the BARThez model, tokenizer, and docs. BARThez is a French seq2seq model that uses the BART objective and architecture (https://arxiv.org/abs/2010.12321).
@patrickvonplaten @sshleifer