
Fix MarianTokenizer to remove metaspace character in decode #26091

Merged: 5 commits merged into huggingface:main on Sep 12, 2023

Conversation

tanaymeh (Contributor)

What does this PR do?

This PR fixes the MarianTokenizer so that it removes the metaspace character during decode().

Fixes #26018

Who can review?

@ArthurZucker, @xenova
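
(For context, not part of the original PR description: a minimal sketch of the behavior reported in #26018, assuming the Helsinki-NLP/opus-mt-en-es checkpoint used in the test added later in this PR. Before the fix, decoding could leave the sentencepiece metaspace character in the output.)

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Round-trip a simple string through the tokenizer.
ids = tokenizer("hello world")["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# Before this PR, the decoded text could retain the metaspace marker
# (e.g. "▁hello world"); with the fix it should equal "hello world".
print(decoded)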

Comment on lines 272 to 274
if tokens[0].startswith(SPIECE_UNDERLINE):
    tokens[0] = tokens[0][1:]

Contributor

Is this section necessary still?

@tanaymeh (Contributor Author) commented Sep 11, 2023

No, it's not; I accidentally left it there! Fixed it now.

@ArthurZucker (Collaborator) left a review comment

Thanks! We need to add a test and make sure the fast tokenizers also work! 🤗

@xenova (Contributor) commented Sep 11, 2023

> make sure the fast tokenizers also work!

Unfortunately, there's no MarianTokenizerFast 😅

@tanaymeh (Contributor Author)

Thanks for the review, @ArthurZucker!
I added the following test (it checks a string that starts and ends with a special character, like an underscore).

def test_tokenizer_decode(self):
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
    source_text = "_This is 1 text string that starts with an _ and ends with one too _"
    ids = tokenizer(source_text)["input_ids"]
    output_text = tokenizer.decode(ids, skip_special_tokens=True)
    self.assertEqual(source_text, output_text)

Does it look good to you?

@tanaymeh (Contributor Author)

> make sure the fast tokenizers also work!
>
> Unfortunately, there's no MarianTokenizerFast 😅

@xenova @ArthurZucker I would love to add the MarianTokenizerFast if that is considered a worthwhile addition to 🤗 transformers!

@ArthurZucker (Collaborator) left a review comment

LGTM. Not really a fan of relying on the strip for this kind of process, but it's a quick fix so it should be okay!
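
(Illustrative only, not the actual diff from this PR: the kind of strip-based approach being discussed maps the sentencepiece metaspace marker back to spaces when converting tokens to text, then strips the leading space the first token's marker leaves behind.)

SPIECE_UNDERLINE = "▁"  # sentencepiece metaspace marker (U+2581)

def convert_tokens_to_string(tokens):
    # Sketch: join the sentencepiece pieces, turn metaspace markers back
    # into regular spaces, then strip the leading space contributed by
    # the marker on the first token.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

print(convert_tokens_to_string(["▁hello", "▁world"]))  # "hello world"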

Comment on lines 155 to 156
source_text = "_This is 1 text string that starts with an _ and ends with one too _"
ids = tokenizer(source_text)["input_ids"]
Collaborator

Instead of manually adding the spiece underline we can just use the example from the issue: tokenizer.decode(tokenizer("hello world")['input_ids'], skip_special_tokens=True)
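
(A sketch, not taken from the PR diff, of what the suggested simplified round-trip test could look like with that example:)

def test_tokenizer_decode(self):
    # Simplified round trip: decode(encode(text)) should return the original
    # text with no leftover metaspace character.
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
    source_text = "hello world"
    ids = tokenizer(source_text)["input_ids"]
    self.assertEqual(tokenizer.decode(ids, skip_special_tokens=True), source_text)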

Contributor

+1

Also, I can't quite tell on mobile, but are those just regular underscores? The sentencepiece underscore is a slightly different character (but looks similar).
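
(For reference, an illustrative way to tell the two characters apart: the sentencepiece metaspace is U+2581, not the ASCII underscore U+005F.)

print(hex(ord("▁")))  # 0x2581, the sentencepiece metaspace ("lower one eighth block")
print(hex(ord("_")))  # 0x5f, the ASCII underscore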

@tanaymeh (Contributor Author) commented Sep 12, 2023

My fault. My VSCode theme really made them both look similar 🥲
Fixed it now!

@ArthurZucker (Collaborator) left a review comment

Thanks for fixing! This is a breaking change, but it's a bug fix, so good to go!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@xenova (Contributor) left a review comment

LGTM!

@ArthurZucker merged commit 12f043e into huggingface:main on Sep 12, 2023
18 checks passed
@tanaymeh (Contributor Author)

Thanks for the reviews and merging!
I was wondering @ArthurZucker if MarianTokenizerFast would be a worthwhile addition to HF transformers and if I can contribute to adding it.

@ArthurZucker (Collaborator)

It does not seem to have been requested a lot 😄 but feel free to add it if you want some good experience with tokenizers

@xenova (Contributor) commented Sep 13, 2023

I think this would be a good addition, simply because of the number of monthly downloads the >1400 models get 😇. The top 5 models alone total ~2.75M downloads in the past month.

https://huggingface.co/models?sort=downloads&search=Helsinki-NLP%2Fopus-mt

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

Successfully merging this pull request may close these issues.

Helsinki-NLP/opus-* models decode not removing metaspace character
4 participants