[Marian] documentation and AutoModel support #4152
Conversation
tests/test_modeling_marian.py
Outdated
# bad_words_ids=[[self.tokenizer.pad_token_id]],
decoder_start_token_id=self.tokenizer.pad_token_id,  # mimics 0 embedding at first step
)
generated_words = self.tokenizer.decode_batch(generated_ids, skip_special_tokens=True)
Just noticed that there is a decode_batch function in the Marian tokenizer. I think we should remove this, or is it planned that this function will also be in tokenizer_utils.py? In general I think we should keep each model's API as small as possible and try to only expose the common _utils functions. IMO, the user will get used to the decode_batch function and wonder why it does not exist for other tokenizers, nor for the FastTokenizer. Alternatively, we could implement a general decode_batch function in tokenization_utils, but not on an individual tokenizer.
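The suggestion above can be sketched as follows. This is a hypothetical, simplified illustration of hoisting a shared batch-decode method onto a common tokenizer base class rather than defining it on one tokenizer; the class names (`TokenizerBase`, `ToyTokenizer`) and the toy vocabulary are illustrative assumptions, not the real transformers API.

```python
# Hypothetical sketch: a single shared batch_decode on the base class,
# so every tokenizer subclass gets it for free by implementing decode().

class TokenizerBase:
    def decode(self, token_ids, skip_special_tokens=False):
        raise NotImplementedError

    def batch_decode(self, sequences, skip_special_tokens=False):
        # Shared implementation: map decode() over the batch.
        return [
            self.decode(ids, skip_special_tokens=skip_special_tokens)
            for ids in sequences
        ]


class ToyTokenizer(TokenizerBase):
    """Minimal toy vocabulary, just to exercise the shared batch_decode."""
    vocab = {0: "<pad>", 1: "hello", 2: "world"}
    special_ids = {0}

    def decode(self, token_ids, skip_special_tokens=False):
        if skip_special_tokens:
            token_ids = [i for i in token_ids if i not in self.special_ids]
        return " ".join(self.vocab[i] for i in token_ids)


tok = ToyTokenizer()
print(tok.batch_decode([[1, 2, 0], [2, 1]], skip_special_tokens=True))
# → ['hello world', 'world hello']
```

The design point is that subclasses only implement the per-sequence `decode`; the batch variant lives once in the base class, which is what keeps each model's tokenizer API small.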
Would love to hoist it up to PretrainedTokenizer.
#4159 does that.
Fine for me.
Let's use the same naming convention and call it batch_decode (not that I'm a big fan of the name we currently have for batch_encode, but it's better to be consistent).
"Tom really admired Mary's courage.",
"Turn around and close your eyes.",
]
expected_text = [
Great translations :-)
This is cool! Is it ready to be added to the docs (with a model_doc/marian.rst), or not yet?
def parse_readmes(repo_path):
def make_registry(repo_path="Opus-MT-train/models"):
    if not Path(repo_path).exists():
        raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
Maybe specify that the repo_path was invalid?
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
LGTM!
Metrics:
For the fr-en test set:
MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-fr-en')
: 57.4817