Fix MarianTokenizer to remove metaspace character in decode #26091
if tokens[0].startswith(SPIECE_UNDERLINE):
    tokens[0] = tokens[0][1:]
Is this section necessary still?
No it's not, I accidentally left it there! Fixed it now.
Thanks! We need to add a test, and make sure the fast tokenizer also works! 🤗
Unfortunately, there's no |
Thanks for the review, @ArthurZucker!
def test_tokenizer_decode(self):
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
    source_text = "_This is 1 text string that starts with an _ and ends with one too _"
    ids = tokenizer(source_text)["input_ids"]
    output_text = tokenizer.decode(ids, skip_special_tokens=True)
    self.assertEqual(source_text, output_text)
Does it look good to you?
@xenova @ArthurZucker I would love to add the |
LGTM, not really a fan of relying on the strip for this kind of process, but it's a quick fix so it should be okay!
source_text = "_This is 1 text string that starts with an _ and ends with one too _"
ids = tokenizer(source_text)["input_ids"]
Instead of manually adding the spiece underline we can just use the example from the issue: tokenizer.decode(tokenizer("hello world")['input_ids'], skip_special_tokens=True)
+1
Also, I can't quite tell on mobile, but are those just regular underscores? The sentencepiece underscore is a slightly different character (but looks similar).
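For reference, the two characters can be told apart by code point rather than by eye. This is a minimal sketch using only the Python standard library; the constant names and the `describe` helper are chosen here for illustration.

```python
import unicodedata

# Assumed names for illustration: the sentencepiece metaspace and the ASCII underscore.
SPIECE_UNDERLINE = "\u2581"
ASCII_UNDERSCORE = "_"


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(SPIECE_UNDERLINE))  # U+2581 LOWER ONE EIGHTH BLOCK
print(describe(ASCII_UNDERSCORE))  # U+005F LOW LINE
```

Writing the metaspace as an escape (`"\u2581"`) in test strings avoids depending on how an editor font renders it.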
My fault. My VSCode theme really made them both look similar 🥲
Fixed it now!
Thanks for fixing! This is breaking but it's a bug fix so good to go!
LGTM!
Thanks for the reviews and merging!
It does not seem to have been requested a lot 😄 but feel free to add it if you want some good experience with |
I think this would be a good addition, simply because of the number of monthly downloads the >1400 models get 😇. The top 5 models alone total ~2.75M downloads in the past month. https://huggingface.co/models?sort=downloads&search=Helsinki-NLP%2Fopus-mt |
…ingface#26091)
* add: check to remove metaspace from marian tokenizer
* fix: metaspace character being removed from everywhere
* fix: remove redundant check at top
* add: test for marian tokenizer decode fix
* fix: simplified the test
What does this PR do?
This PR fixes the MarianTokenizer so that it removes the metaspace character (▁) during decode.
Fixes #26018
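The intended decode behavior can be sketched roughly as follows, assuming sentencepiece-style pieces in which ▁ marks the start of a word. `pieces_to_text` is a hypothetical helper written for illustration; it is not the actual MarianTokenizer code and it does not load a model.

```python
from typing import List

SPIECE_UNDERLINE = "\u2581"  # sentencepiece metaspace, not the ASCII underscore


def pieces_to_text(pieces: List[str]) -> str:
    """Join pieces, turn each metaspace into a space, and strip the ends.

    Hypothetical helper: a sketch of the idea, not the library implementation.
    """
    return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()


pieces = ["\u2581hello", "\u2581world"]
print(pieces_to_text(pieces))  # hello world
```

The final strip is what drops the leading space left by the first piece's metaspace, which is the step the review comments refer to.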
Who can review?
@ArthurZucker, @xenova