
[Marian] documentation and AutoModel support #4152

Merged: 57 commits into huggingface:master on May 10, 2020

Conversation

@sshleifer sshleifer commented May 5, 2020

  • Adds integration tests for en-fr and fr-en.
  • Makes bulk conversion easier.
  • Removes the unused pretrained_model_archive_map constant.
  • Adds boilerplate to make AutoModelWithLMHead, AutoTokenizer, and AutoConfig work (see the sketch below).
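
For reference, a minimal sketch of what the AutoModel plumbing enables (the model name resolution is described in the PR; the input sentence is illustrative, and prepare_translation_batch is the Marian tokenizer helper as of this version of the library):

```python
from transformers import AutoModelWithLMHead, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)    # resolves to MarianTokenizer
model = AutoModelWithLMHead.from_pretrained(model_name)  # resolves to MarianMTModel

# Translate one French sentence to English.
batch = tokenizer.prepare_translation_batch(["Où est l'arrêt de bus ?"])
generated_ids = model.generate(**batch)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```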

Metrics:

For the fr-en test set:

  • BLEU score from the posted translations: 57.4979
  • BLEU score from MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-fr-en'): 57.4817
  • No performance change in fp16.
  • Fits batch_size=512 on a 16GB card in fp16.
  • Speed: 89s for 5k examples ≈ 56 examples/second.
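
(Not from the PR itself: a hedged sketch of how such a BLEU check could be run with sacrebleu. The two-sentence corpus here is a placeholder for the real 5k-example test set, and the batching and device handling are illustrative.)

```python
import sacrebleu
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
if torch.cuda.is_available():
    model = model.half().cuda()  # fp16: batch_size=512 fits on a 16GB card
device = next(model.parameters()).device

src = ["Où est l'arrêt de bus ?"]  # French source sentences (placeholder)
refs = ["Where is the bus stop?"]  # English references (placeholder)

hypotheses, batch_size = [], 512
for i in range(0, len(src), batch_size):
    batch = tokenizer.prepare_translation_batch(src[i : i + batch_size])
    batch = {k: v.to(device) for k, v in batch.items()}
    generated = model.generate(**batch)
    hypotheses += [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

print(sacrebleu.corpus_bleu(hypotheses, [refs]).score)
```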

@sshleifer changed the title from "[WIP] Marian cleanup and example" to "Marian cleanup and docstring example" on May 5, 2020
# bad_words_ids=[[self.tokenizer.pad_token_id]],
decoder_start_token_id=self.tokenizer.pad_token_id, # mimics 0 embedding at first step.
)
generated_words = self.tokenizer.decode_batch(generated_ids, skip_special_tokens=True)
Contributor:

Just noticed that there is a decode_batch function in the Marian tokenizer. I think we should remove this, or is it planned that this function will also land in tokenizer_utils.py? In general, I think we should keep each model's API as small as possible and only expose the common _utils functions.

IMO, the user will get used to the decode_batch function and wonder why it does not exist for other tokenizers, nor for the FastTokenizer. Alternatively, we could implement a general decode_batch function in tokenization_utils, but not for an individual tokenizer.

Contributor (author):

Would love to hoist it up to PreTrainedTokenizer.
#4159 does that.

Member:

Fine for me.

Maybe let's use the same naming convention as batch_encode and call it batch_decode (not that I'm a big fan of the one we have right now for batch_encode, but it's better to be consistent).
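
For context, a minimal sketch (an assumption on my part, not the actual code from #4159) of the generic helper being proposed for tokenization_utils:

```python
# Inside PreTrainedTokenizer: decode every sequence in a batch by
# delegating to the existing per-sequence decode().
def batch_decode(self, sequences, skip_special_tokens=False):
    return [self.decode(seq, skip_special_tokens=skip_special_tokens) for seq in sequences]
```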

"Tom really admired Mary's courage.",
"Turn around and close your eyes.",
]
expected_text = [
Contributor:

Great translations :-)

@LysandreJik (Member) left a comment:

This is cool! Is it ready to be added to the docs (with a model_doc/marian.rst) or not yet?

def parse_readmes(repo_path):
def make_registry(repo_path="Opus-MT-train/models"):
    if not Path(repo_path).exists():
        raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
Member:
Maybe specify that the repo_path was invalid?
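
(One possible phrasing of that suggestion, purely as an illustration of rewording the lines above:)

```python
if not Path(repo_path).exists():
    raise ValueError(
        f"repo_path={repo_path} does not exist. You must run: "
        "git clone git@github.com:Helsinki-NLP/Opus-MT-train.git"
    )
```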

src/transformers/modeling_marian.py (outdated; resolved)
sshleifer and others added 2 commits May 5, 2020 13:39
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@thomwolf (Member) left a comment:
LGTM!


@sshleifer changed the title from "Marian cleanup and docstring example" to "[Marian] documentation and AutoModel support" on May 10, 2020
@sshleifer sshleifer merged commit 3487be7 into huggingface:master May 10, 2020
@sshleifer sshleifer deleted the marian-cleanup-and-example branch May 10, 2020 17:55
Successfully merging this pull request may close these issues.

  • [Marian] Key-Error for some languages
  • [Marian] @-@ symbol causes strange generations