
Fix MarianTokenizer to remove metaspace character in decode #26091

Merged: 5 commits merged into huggingface:main on Sep 12, 2023

Conversation

tanaymeh (Contributor)

What does this PR do?

This PR fixes the MarianTokenizer so that it removes the metaspace character during decode().

Fixes #26018

Who can review?

@ArthurZucker, @xenova
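
(For context, not part of the original PR description: a minimal sketch of the behavior reported in #26018, assuming the Helsinki-NLP/opus-mt-en-es checkpoint used in the test added later in this PR. Before the fix, decoding could leave the sentencepiece metaspace character in the output.)

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Round-trip a simple string through the tokenizer.
ids = tokenizer("hello world")["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True)

# Before this PR, the decoded text could retain the metaspace marker
# (e.g. "▁hello world"); with the fix it should equal "hello world".
print(decoded)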

Comment on lines 272 to 274
if tokens[0].startswith(SPIECE_UNDERLINE):
    tokens[0] = tokens[0][1:]

Contributor

Is this section necessary still?

@tanaymeh (Contributor Author) commented Sep 11, 2023

No, it's not; I accidentally left it there! Fixed it now.

@ArthurZucker (Collaborator) left a review comment

Thanks! We need to add a test and make sure the fast tokenizers also work! 🤗

@xenova (Contributor) commented Sep 11, 2023

> make sure the fast tokenizers also work!

Unfortunately, there's no MarianTokenizerFast 😅

@tanaymeh (Contributor Author)

Thanks for the review, @ArthurZucker!
I added the following test (it checks a string that starts and ends with a special character, like an underscore).

def test_tokenizer_decode(self):
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
    source_text = "_This is 1 text string that starts with an _ and ends with one too _"
    ids = tokenizer(source_text)["input_ids"]
    output_text = tokenizer.decode(ids, skip_special_tokens=True)
    self.assertEqual(source_text, output_text)

Does it look good to you?

@tanaymeh (Contributor Author)

> make sure the fast tokenizers also work!
>
> Unfortunately, there's no MarianTokenizerFast 😅

@xenova @ArthurZucker I would love to add the MarianTokenizerFast if that is considered a worthwhile addition to 🤗 transformers!

@ArthurZucker (Collaborator) left a review comment

LGTM. Not really a fan of relying on the strip for this kind of process, but it's a quick fix so it should be okay!
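
(Illustrative only, not the actual diff from this PR: the kind of strip-based approach being discussed maps the sentencepiece metaspace marker back to spaces when converting tokens to text, then strips the leading space the first token's marker leaves behind.)

SPIECE_UNDERLINE = "▁"  # sentencepiece metaspace marker (U+2581)

def convert_tokens_to_string(tokens):
    # Sketch: join the sentencepiece pieces, turn metaspace markers back
    # into regular spaces, then strip the leading space contributed by
    # the marker on the first token.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

print(convert_tokens_to_string(["▁hello", "▁world"]))  # "hello world"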

Comment on lines 155 to 156
source_text = "_This is 1 text string that starts with an _ and ends with one too _"
ids = tokenizer(source_text)["input_ids"]
Collaborator

Instead of manually adding the spiece underline we can just use the example from the issue: tokenizer.decode(tokenizer("hello world")['input_ids'], skip_special_tokens=True)
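
(A sketch, not taken from the PR diff, of what the suggested simplified round-trip test could look like with that example:)

def test_tokenizer_decode(self):
    # Simplified round trip: decode(encode(text)) should return the original
    # text with no leftover metaspace character.
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
    source_text = "hello world"
    ids = tokenizer(source_text)["input_ids"]
    self.assertEqual(tokenizer.decode(ids, skip_special_tokens=True), source_text)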

Contributor

+1

Also, I can't quite tell on mobile, but are those just regular underscores? The sentencepiece underscore is a slightly different character (but looks similar).
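
(For reference, an illustrative way to tell the two characters apart: the sentencepiece metaspace is U+2581, not the ASCII underscore U+005F.)

print(hex(ord("▁")))  # 0x2581, the sentencepiece metaspace ("lower one eighth block")
print(hex(ord("_")))  # 0x5f, the ASCII underscore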

@tanaymeh (Contributor Author) commented Sep 12, 2023

My fault. My VSCode theme really made them both look similar 🥲
Fixed it now!

@ArthurZucker (Collaborator) left a review comment

Thanks for fixing! This is a breaking change, but it's a bug fix, so good to go!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@xenova (Contributor) left a review comment

LGTM!

@ArthurZucker merged commit 12f043e into huggingface:main on Sep 12, 2023
18 checks passed
@tanaymeh (Contributor Author)

Thanks for the reviews and merging!
I was wondering @ArthurZucker if MarianTokenizerFast would be a worthwhile addition to HF transformers and if I can contribute to adding it.

@ArthurZucker (Collaborator)

It does not seem to have been requested a lot 😄 but feel free to add it if you want some good experience with tokenizers

@xenova (Contributor) commented Sep 13, 2023

I think this would be a good addition, simply because of the number of monthly downloads the >1400 models get 😇. The top 5 models alone total ~2.75M downloads in the past month.

https://huggingface.co/models?sort=downloads&search=Helsinki-NLP%2Fopus-mt

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
Fix MarianTokenizer to remove metaspace character in decode (huggingface#26091)

* add: check to remove metaspace from marian tokenizer

* fix: metaspace character being removed from everywhere

* fix: remove redundant check at top

* add: test for marian tokenizer decode fix

* fix: simplified the test

Successfully merging this pull request may close these issues.

Helsinki-NLP/opus-* models decode not removing metaspace character
4 participants