
Fix marian tokenizer save pretrained #5043

Merged
sshleifer merged 2 commits into huggingface:master from sshleifer:martok-fix on Jun 16, 2020

Conversation

sshleifer
Contributor

No description provided.

codecov bot commented Jun 16, 2020

Codecov Report

Merging #5043 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5043   +/-   ##
=======================================
  Coverage   77.36%   77.37%           
=======================================
  Files         130      130           
  Lines       21989    21990    +1     
=======================================
+ Hits        17012    17014    +2     
+ Misses       4977     4976    -1     
Impacted Files                              Coverage Δ
src/transformers/tokenization_marian.py    92.85% <100.00%> (+0.96%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3643422...5899f7a.

def test_tokenizer_equivalence_en_de(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    batch = en_de_tokenizer.prepare_translation_batch(["I am a small frog"], return_tensors=None)
    self.assertIsInstance(batch, BatchEncoding)
    expected = [38, 121, 14, 697, 38848, 0]
    self.assertListEqual(expected, batch.input_ids[0])

    save_dir = tempfile.mkdtemp()
Contributor

Nit, I guess, but I like the context manager approach better:

with tempfile.TemporaryDirectory() as tmp_dir:
    ....
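For illustration, a minimal sketch of how the save/reload portion of the test could look with the context manager; the test name and the exact assertions here are assumptions about intent, not the PR's actual diff:

import tempfile

from transformers import MarianTokenizer

def test_save_and_reload(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    text = "I am a small frog"
    # The directory (and everything saved into it) is removed automatically
    # when the block exits, unlike tempfile.mkdtemp(), which must be cleaned
    # up manually.
    with tempfile.TemporaryDirectory() as tmp_dir:
        en_de_tokenizer.save_pretrained(tmp_dir)
        reloaded = MarianTokenizer.from_pretrained(tmp_dir)
        self.assertListEqual(en_de_tokenizer.encode(text), reloaded.encode(text))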

@@ -60,10 +60,15 @@ def get_input_output_texts(self, tokenizer):
        "This is a test",
    )

@slow
def test_tokenizer_equivalence_en_de(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    batch = en_de_tokenizer.prepare_translation_batch(["I am a small frog"], return_tensors=None)
    self.assertIsInstance(batch, BatchEncoding)
    expected = [38, 121, 14, 697, 38848, 0]
Contributor

Would be nice to write the expected result as a comment for better readability
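For instance, something like the following; the gloss is an assumption about what the ids decode to (in Marian tokenizers, id 0 is the EOS token):

expected = [38, 121, 14, 697, 38848, 0]  # "I am a small frog" tokenized, with EOS (0) appended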

sshleifer merged commit 3d495c6 into huggingface:master on Jun 16, 2020
sshleifer deleted the martok-fix branch on June 16, 2020 at 13:48
Development

Successfully merging this pull request may close these issues.

"AutoTokenizer.from_pretrained" does not work when loading a pretrained MarianTokenizer from a local directory