Convert Tokenizers By Default #580
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
It seems of limited use as long as we need to install openvino tokenizers as an extra explicitly; please consider including it in the openvino extra.
@apaniukov, please add openvino_tokenizers to the openvino extra in this PR.
Add a check for the installed openvino-nightly package as well. Improve compatibility messages. Check if OpenVINO Tokenizers is available only if the tokenizer is exported.
I created comments about the error messages as I was testing this, but there are quite a few things that can go wrong or cause confusion. IMO it is a better user experience to just show one line like "OpenVINO tokenizer model could not be exported. This does not impact model export. See [documentation url] for more info" and then explain there how to make sure all versions are in sync.
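For illustration, a minimal sketch of that single consolidated message (the documentation URL is a placeholder; `convert_tokenizer` and `is_openvino_tokenizers_available` are the names already used in this PR's diff):

```python
import logging

logger = logging.getLogger(__name__)

# Sketch only: replace the various failure messages with one short warning
# that points users to a troubleshooting page.
if convert_tokenizer and not is_openvino_tokenizers_available():
    logger.warning(
        "OpenVINO tokenizer model could not be exported. This does not impact model export. "
        "See <documentation url> for more info."
    )
```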
@echarlaix could you merge the PR? The failed IPEX test is not related to this PR.
@@ -122,7 +122,7 @@ def test_exporters_cli(self, task: str, model_type: str):
    def test_exporters_cli_tokenizers(self, task: str, model_type: str):
we should remove:
- @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
Removed it and added logic checks for the "Test openvino-nightly" stage, where the libraries are not compatible.
@@ -327,7 +320,7 @@ class StoreAttr(object):
        **kwargs_shapes,
    )

-    if convert_tokenizer:
+    if convert_tokenizer and is_openvino_tokenizers_available():
The tokenizer export results in additional files being included, which could be confusing for users. Can this be moved to an openvino_tokenizers directory?
Suggested change:
    if convert_tokenizer and is_openvino_tokenizers_available():
        output = Path(output) / "openvino_tokenizers"
This is similar to the original tokenizer, so we replicate an existing pattern here.
I'm thinking about something similar to what's done for SD models. Does that sound reasonable to you? I'm fine with adding it in a following PR if you prefer not to include it in this one.
Yes, I can do that in a separate PR.
thanks a lot @apaniukov!
        if not is_openvino_tokenizers_available():
            self.assertTrue(
                "OpenVINO Tokenizers is not available." in output
                or "OpenVINO and OpenVINO Tokenizers versions are not binary compatible." in output,
                msg=output,
            )
            return
would prefer to have it removed to make sure this is always tested (we should always have a compatible version of openvino / openvino-tokenizers when testing)
This is added because of this test:
- name: Test openvino-nightly
It deletes openvino and installs openvino-nightly, which should be incompatible with the installed tokenizer version, replicating the incompatibility scenario.
@@ -118,21 +118,23 @@ def test_exporters_cli(self, task: str, model_type: str):
        for arch in SUPPORTED_ARCHITECTURES
        if not arch[0].endswith("-with-past") and not arch[1].endswith("-refiner")
    )
    @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
    def test_exporters_cli_tokenizers(self, task: str, model_type: str):
        with TemporaryDirectory() as tmpdir:
            output = subprocess.check_output(
we should also check that no error was raised here
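For example, a hypothetical extra assertion on the captured CLI output (the exact error text is an assumption based on the message quoted below):

```python
# Sketch only: inside the existing test, fail if the export log reports a
# tokenizer conversion error or an unhandled traceback.
self.assertNotIn("OpenVINO Tokenizer export for", output, msg=output)
self.assertNotIn("Traceback", output, msg=output)
```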
With
optimum-cli export openvino --model hf-internal-testing/tiny-random-t5 --task text2text-generation ov_model
I'm getting the following error:
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'
I would like to have this fixed before making the tokenizer export the default / merging this PR.
It is not a bug, we don't support the Unigram model from tokenizer.json yet, only from the sentencepiece model file. If you use a repository that has such a file, the tokenizer will be converted.
I added one more check for the vocab file here: openvinotoolkit/openvino_tokenizers#116
This will transform
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'
into:
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: Cannot convert tokenizer of this type without `.model` file.
The issue is that the original tokenizer has info about the .model file in the tokenizer object, but the file does not exist in practice.
I would argue that this is an edge case of a test repo and most tokenizers with info about the .model file actually have it, so it should not block the merge.
I think it's reasonable that not all cases are supported, but in those cases we should disable the tokenizer export rather than throw an error, as it won't be expected by users and could be a bit confusing in my opinion.
Can we either disable this warning or disable the export for unsupported cases?
This warning is emitted for any exception that appears during conversion. After seeing this message, the user can create an issue requesting support for a particular model/tokenizer.
I also prefer to tell the user that the tokenizer is not converted, rather than silently omitting it without a warning, so the lack of a (de)tokenizer model won't be a surprise.
I'd agree if the export of the tokenizer were an explicit choice of the user (the current --convert-tokenizer), but as this PR makes it the default, I think having such a warning can be an issue as it makes it look like the export failed.
I changed the log level to debug, so the message won't be visible by default.
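For reference, a sketch of the change described above; the variable names are assumptions, while the message text matches the warning quoted earlier:

```python
# Sketch only: the conversion failure is now logged at debug level,
# so it is hidden unless debug logging is enabled.
logger.debug(
    f"OpenVINO Tokenizer export for {type(tokenizer).__name__} is not supported. "
    f"Exception: {exception}"
)
```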
thanks @apaniukov
A couple of tests are failing; will merge once fixed. Let me know if you need any help @apaniukov.
Now that #618 is merged, the tokenizer won't be exported by default when hybrid quantization is applied for SD models. I'm fine with merging it now and adding this in a following PR (along with #580 (comment)). What do you think @apaniukov?
No problem, let's do that and I will create a new PR with the fixes we discussed here.
Perfect, will merge it now, thanks @apaniukov!
What does this PR do?
- Make tokenizer conversion the default behaviour.
- Add the --disable-convert-tokenizer option to opt out; the previous opt-in --convert-tokenizer option is no longer needed (see the example commands below).
- Cover the different OpenVINO Tokenizers \ OpenVINO combinations (available, not installed, binary-incompatible) with dedicated checks and messages.
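For illustration, using the same CLI syntax as earlier in this thread and the opt-125m model from the benchmark below, the new default and the opt-out flag look like this:

optimum-cli export openvino --model facebook/opt-125m ov_model
optimum-cli export openvino --model facebook/opt-125m --disable-convert-tokenizer ov_model

The first command now also exports the OpenVINO tokenizer alongside the model; the second skips tokenizer conversion.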
The tokenizer conversion adds a couple of seconds and the resulting files are a couple of megabytes. For opt-125m (no pytorch weights download), the conversion without tokenizer took 13.340s compared to 13.969s with tokenizer conversion, which is less than 5%. If we take weights downloading time into account, the percentage is even lower.
The resulting folder size increases from 482 MB to 484 MB.