
Fix marian tokenizer save pretrained #5043

Merged
sshleifer merged 2 commits into huggingface:master from sshleifer:martok-fix on Jun 16, 2020

Conversation

sshleifer
Contributor

No description provided.

codecov bot commented Jun 16, 2020

Codecov Report

Merging #5043 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5043   +/-   ##
=======================================
  Coverage   77.36%   77.37%           
=======================================
  Files         130      130           
  Lines       21989    21990    +1     
=======================================
+ Hits        17012    17014    +2     
+ Misses       4977     4976    -1     
Impacted Files                              Coverage Δ
src/transformers/tokenization_marian.py    92.85% <100.00%> (+0.96%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3643422...5899f7a.

def test_tokenizer_equivalence_en_de(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    batch = en_de_tokenizer.prepare_translation_batch(["I am a small frog"], return_tensors=None)
    self.assertIsInstance(batch, BatchEncoding)
    expected = [38, 121, 14, 697, 38848, 0]
    self.assertListEqual(expected, batch.input_ids[0])

    save_dir = tempfile.mkdtemp()
Contributor

Nit, I guess, but I like the context manager approach better:

with tempfile.TemporaryDirectory() as tmp_dir:
    ....
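For illustration, a minimal sketch of how the save/reload portion of the test could look with the context manager; the test name and the exact assertions here are assumptions about intent, not the PR's actual diff:

import tempfile

from transformers import MarianTokenizer

def test_save_and_reload(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    text = "I am a small frog"
    # The directory (and everything saved into it) is removed automatically
    # when the block exits, unlike tempfile.mkdtemp(), which must be cleaned
    # up manually.
    with tempfile.TemporaryDirectory() as tmp_dir:
        en_de_tokenizer.save_pretrained(tmp_dir)
        reloaded = MarianTokenizer.from_pretrained(tmp_dir)
        self.assertListEqual(en_de_tokenizer.encode(text), reloaded.encode(text))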

@@ -60,10 +60,15 @@ def get_input_output_texts(self, tokenizer):
        "This is a test",
    )

@slow
def test_tokenizer_equivalence_en_de(self):
    en_de_tokenizer = MarianTokenizer.from_pretrained(f"{ORG_NAME}opus-mt-en-de")
    batch = en_de_tokenizer.prepare_translation_batch(["I am a small frog"], return_tensors=None)
    self.assertIsInstance(batch, BatchEncoding)
    expected = [38, 121, 14, 697, 38848, 0]
Contributor

Would be nice to write the expected result as a comment for better readability
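For instance, something like the following; the gloss is an assumption about what the ids decode to (in Marian tokenizers, id 0 is the EOS token):

expected = [38, 121, 14, 697, 38848, 0]  # "I am a small frog" tokenized, with EOS (0) appended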

sshleifer merged commit 3d495c6 into huggingface:master on Jun 16, 2020
sshleifer deleted the martok-fix branch on June 16, 2020 at 13:48
Development

Successfully merging this pull request may close these issues.

"AutoTokenizer.from_pretrained" does not work when loading a pretrained MarianTokenizer from a local directory