
[Marian] documentation and AutoModel support #4152

Merged: 57 commits into huggingface:master on May 10, 2020

Conversation

@sshleifer sshleifer commented May 5, 2020

  • Adds integration tests for en-fr and fr-en.
  • Makes bulk conversion easier.
  • Removes the unused pretrained_model_archive_map constant.
  • Adds boilerplate to make AutoModelWithLMHead, AutoTokenizer, and AutoConfig work (see the sketch below).
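
For reference, a minimal sketch of what the AutoModel plumbing enables (the model name resolution is described in the PR; the input sentence is illustrative, and prepare_translation_batch is the Marian tokenizer helper as of this version of the library):

```python
from transformers import AutoModelWithLMHead, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)    # resolves to MarianTokenizer
model = AutoModelWithLMHead.from_pretrained(model_name)  # resolves to MarianMTModel

# Translate one French sentence to English.
batch = tokenizer.prepare_translation_batch(["Où est l'arrêt de bus ?"])
generated_ids = model.generate(**batch)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```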

Metrics:

For the fr-en test set:

  • BLEU score from the posted translations: 57.4979
  • BLEU score from MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-fr-en'): 57.4817
  • No performance change in fp16.
  • Fits batch_size=512 on a 16GB card in fp16.
  • Speed: 89s for 5k examples ≈ 56 examples/second.
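
(Not from the PR itself: a hedged sketch of how such a BLEU check could be run with sacrebleu. The two-sentence corpus here is a placeholder for the real 5k-example test set, and the batching and device handling are illustrative.)

```python
import sacrebleu
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
if torch.cuda.is_available():
    model = model.half().cuda()  # fp16: batch_size=512 fits on a 16GB card
device = next(model.parameters()).device

src = ["Où est l'arrêt de bus ?"]  # French source sentences (placeholder)
refs = ["Where is the bus stop?"]  # English references (placeholder)

hypotheses, batch_size = [], 512
for i in range(0, len(src), batch_size):
    batch = tokenizer.prepare_translation_batch(src[i : i + batch_size])
    batch = {k: v.to(device) for k, v in batch.items()}
    generated = model.generate(**batch)
    hypotheses += [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

print(sacrebleu.corpus_bleu(hypotheses, [refs]).score)
```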

@sshleifer changed the title from "[WIP] Marian cleanup and example" to "Marian cleanup and docstring example" on May 5, 2020
# bad_words_ids=[[self.tokenizer.pad_token_id]],
decoder_start_token_id=self.tokenizer.pad_token_id, # mimics 0 embedding at first step.
)
generated_words = self.tokenizer.decode_batch(generated_ids, skip_special_tokens=True)
Contributor:

Just noticed that there is a decode_batch function in the Marian tokenizer. I think we should remove this, or is it planned that this function will also land in tokenizer_utils.py? In general, I think we should keep each model's API as small as possible and only expose the common _utils functions.

IMO, the user will get used to the decode_batch function and wonder why it does not exist for other tokenizers, nor for the FastTokenizer. Alternatively, we could implement a general decode_batch function in tokenization_utils, but not for an individual tokenizer.

Contributor (author):

Would love to hoist it up to PreTrainedTokenizer.
#4159 does that.

Member:

Fine for me.

Maybe let's use the same naming convention as batch_encode and call it batch_decode (not that I'm a big fan of the one we have right now for batch_encode, but it's better to be consistent).
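
For context, a minimal sketch (an assumption on my part, not the actual code from #4159) of the generic helper being proposed for tokenization_utils:

```python
# Inside PreTrainedTokenizer: decode every sequence in a batch by
# delegating to the existing per-sequence decode().
def batch_decode(self, sequences, skip_special_tokens=False):
    return [self.decode(seq, skip_special_tokens=skip_special_tokens) for seq in sequences]
```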

"Tom really admired Mary's courage.",
"Turn around and close your eyes.",
]
expected_text = [
Contributor:

Great translations :-)

@LysandreJik (Member) left a comment:

This is cool! Is it ready to be added to the docs (with a model_doc/marian.rst) or not yet?

def parse_readmes(repo_path):
def make_registry(repo_path="Opus-MT-train/models"):
    if not Path(repo_path).exists():
        raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
Member:
Maybe specify that the repo_path was invalid?
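
(One possible phrasing of that suggestion, purely as an illustration of rewording the lines above:)

```python
if not Path(repo_path).exists():
    raise ValueError(
        f"repo_path={repo_path} does not exist. You must run: "
        "git clone git@github.com:Helsinki-NLP/Opus-MT-train.git"
    )
```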

src/transformers/modeling_marian.py (outdated; resolved)
sshleifer and others added 2 commits May 5, 2020 13:39
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@thomwolf (Member) left a comment:
LGTM!


@sshleifer changed the title from "Marian cleanup and docstring example" to "[Marian] documentation and AutoModel support" on May 10, 2020
@sshleifer sshleifer merged commit 3487be7 into huggingface:master May 10, 2020
@sshleifer sshleifer deleted the marian-cleanup-and-example branch May 10, 2020 17:55
Successfully merging this pull request may close these issues.

  • [Marian] Key-Error for some languages
  • [Marian] @-@ symbol causes strange generations