🚨🚨[Whisper Tok] Update integration test #29368
sanchit-gandhi merged 2 commits into huggingface:main
  self.assertListEqual(
      tokenizer.convert_tokens_to_ids(tokens),
-     [5723, 307, 257, 220, 31636],
+     [5723, 307, 257, 1500],
This now gives results equivalent to the original OpenAI implementation:
from whisper.tokenizer import get_tokenizer

# True selects the multilingual tokenizer
tokenizer = get_tokenizer(True)
tokens = tokenizer.encode("This is a test")
print(tokens)
Print Output:
[5723, 307, 257, 1500]
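The same comparison can be turned into a small reusable parity check. The sketch below is only an illustration: `find_mismatches` is a hypothetical helper, and the three lookup tables are stand-ins built from the token IDs quoted in this thread rather than real tokenizers.

```python
from typing import Callable, Iterable, List, Tuple

# Any callable that maps a string to a list of token IDs.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    encode_a: Encoder, encode_b: Encoder, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for every text the two encoders disagree on."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = encode_a(text), encode_b(text)
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


# Stand-in "encoders": lookup tables holding the IDs quoted in this thread.
# They are assumptions for illustration, not calls into either library.
OLD_EXPECTED = {"This is a test": [5723, 307, 257, 220, 31636]}
NEW_EXPECTED = {"This is a test": [5723, 307, 257, 1500]}
REFERENCE = {"This is a test": [5723, 307, 257, 1500]}

texts = ["This is a test"]
print(find_mismatches(NEW_EXPECTED.__getitem__, REFERENCE.__getitem__, texts))
print(find_mismatches(OLD_EXPECTED.__getitem__, REFERENCE.__getitem__, texts))
```

With both libraries installed, the stand-ins could be swapped for the real encode functions, for example the `tokenizer.encode` call shown above as the reference side.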
  self.assertEqual(output, [])

- @require_jinja
- def test_tokenization_for_chat(self):
A chat template doesn't make sense for Whisper (a speech recognition model), so I have removed the test to keep the CI lightweight (cc @Rocketknight1)
Also cc @ydshieh as this PR will prevent a red CI on
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment
Thanks for the prompt fix. It's breaking, so I'll probably update the PR title with
The GH PR itself is not strictly breaking (there's no change to the code), but rather it's the Hub PR which is breaking. Fine for me to leave the 🚨 in the title though to book-log this!
What does this PR do?
The merges for the Whisper tokenizers were updated on the Hub in a separate PR. While this is a breaking change, it is required to ensure parity with the original OpenAI repo.
This PR updates the integration tests for the Whisper tokenizer to reflect the merge changes.
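To see why a change to the merges alters the expected token IDs, here is a minimal toy sketch of byte-pair-encoding merging. The `bpe` function and both merge tables are invented for illustration and are not Whisper's actual merges or tokenizer code; the point is only that adding a single merge rule can collapse two tokens into one, which is the kind of effect that would shorten an expected ID list from five entries to four.

```python
from typing import List, Tuple


def bpe(symbols: List[str], merges: List[Tuple[str, str]]) -> List[str]:
    """Greedily apply merge rules, always picking the lowest-ranked adjacent pair."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge tables (assumptions). "Ġ" marks a leading space, as in GPT-2 style BPE.
TOY_MERGES_BEFORE = [("e", "s"), ("es", "t"), ("t", "est")]
TOY_MERGES_AFTER = TOY_MERGES_BEFORE + [("Ġ", "test")]

word = list("Ġtest")
print(bpe(word, TOY_MERGES_BEFORE))  # space marker stays a separate token
print(bpe(word, TOY_MERGES_AFTER))   # space marker is merged into the word
```

Because token IDs are looked up per token after merging, any test that hard-codes IDs has to be updated whenever the merge table changes, even if the tokenizer code itself is untouched.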