Convert Tokenizers By Default #580

Merged (26 commits, Apr 18, 2024)

Conversation

@apaniukov (Contributor) commented Feb 29, 2024

What does this PR do?

Make tokenizer conversion the default behaviour.

  • Add a --disable-convert-tokenizer option.
  • Add a warning for the old --convert-tokenizer option.
  • Add an advanced check for OpenVINO Tokenizers version compatibility, with detailed troubleshooting instructions.
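The flag flip described above can be sketched with argparse (a hypothetical minimal sketch, not the actual optimum-cli parser; only the two option names come from this PR):

```python
import argparse
import warnings

# Hypothetical sketch: tokenizer conversion becomes opt-out instead of opt-in.
parser = argparse.ArgumentParser(prog="optimum-cli-sketch")
parser.add_argument(
    "--disable-convert-tokenizer",
    action="store_true",
    help="Do not add converted tokenizer/detokenizer OpenVINO models to the export.",
)
parser.add_argument(
    "--convert-tokenizer",
    action="store_true",
    help="Deprecated: tokenizers are now converted by default.",
)

args = parser.parse_args([])  # plain invocation, no flags
if args.convert_tokenizer:
    warnings.warn("--convert-tokenizer is deprecated; tokenizers are converted by default.")
convert_tokenizer = not args.disable_convert_tokenizer
print(convert_tokenizer)  # True: conversion happens unless explicitly disabled
```

Passing `--disable-convert-tokenizer` would flip `convert_tokenizer` back to `False`, restoring the old behaviour.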

Covered scenarios for different OpenVINO Tokenizers \ OpenVINO combinations:

| Tokenizers \ OpenVINO | Release PyPI | Pre-release Simple PyPI | -nightly PyPI | Archive |
| --- | --- | --- | --- | --- |
| Release PyPI | + | + | + | + |
| Pre-release Simple PyPI | + | + | + | + |
| Build | | | | + |

The tokenizer conversion adds a couple of seconds and the resulting files are a couple of megabytes. For opt-125m (no pytorch weights download), the conversion without tokenizer took 13.340s compared to 13.969s with tokenizer conversion, which is less than 5%. If we take weights downloading time into account, the percentage is even lower.
The resulting folder size increases from 482 MB to 484 MB.
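A quick sanity check of the quoted overhead, using the timings from the paragraph above:

```python
# Export times for opt-125m quoted in the PR description (seconds).
without_tokenizer = 13.340
with_tokenizer = 13.969

overhead_pct = (with_tokenizer - without_tokenizer) / without_tokenizer * 100
print(f"{overhead_pct:.1f}%")  # ≈ 4.7%, consistent with the "less than 5%" claim
```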

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@eaidova (Collaborator) commented Mar 5, 2024

It seems of little use as long as we still need to install OpenVINO Tokenizers explicitly as an extra; please consider including it in the openvino extra.

@slyalin (Contributor) commented Mar 5, 2024

> it seems useless until we need to install openvino tokenizers as extra explicitly, please consider including it in openvino extra

@apaniukov, please add openvino_tokenizer in openvino extra in this PR.

Commits:
  • Add check for installed openvino-nightly package as well.
  • Improve compatibility messages.
  • Check if OpenVINO Tokenizers is available only when the tokenizer is exported.
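The availability and compatibility checks mentioned in these commits could look roughly like this (a hypothetical sketch using importlib.metadata; the real logic in optimum/intel/utils/import_utils.py is more involved and also verifies binary compatibility):

```python
from importlib.metadata import PackageNotFoundError, version


def openvino_tokenizers_available() -> bool:
    """Hypothetical sketch: is OpenVINO Tokenizers installed and version-matched?"""
    try:
        tokenizers_version = version("openvino-tokenizers")
    except PackageNotFoundError:
        return False  # package not installed at all

    openvino_version = None
    for pkg in ("openvino", "openvino-nightly"):  # nightly ships as a separate package
        try:
            openvino_version = version(pkg)
            break
        except PackageNotFoundError:
            continue
    if openvino_version is None:
        return False

    # Crude heuristic for illustration: require matching major.minor versions.
    return tokenizers_version.split(".")[:2] == openvino_version.split(".")[:2]
```

Checking `openvino-nightly` separately matters because it is published as a distinct package that replaces `openvino`, which is exactly the incompatibility scenario discussed later in this thread.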
@apaniukov apaniukov changed the title Convert Tokenizers By Default [WiP] Convert Tokenizers By Default Mar 15, 2024
@apaniukov apaniukov changed the title [WiP] Convert Tokenizers By Default Convert Tokenizers By Default Mar 25, 2024
@helena-intel (Collaborator) left a comment


I created comments about the error messages as I was testing this, but there are quite a few things that can go wrong or cause confusion. IMO it is a better user experience to just show one line like "OpenVINO tokenizer model could not be exported. This does not impact model export. See [documentation url] for more info" and then explain there how to make sure all versions are in sync.

@apaniukov (Contributor, Author):
@echarlaix could you merge the PR? The failing IPEX test is unrelated to this change.

```diff
@@ -122,7 +122,7 @@ def test_exporters_cli(self, task: str, model_type: str):
     def test_exporters_cli_tokenizers(self, task: str, model_type: str):
```
Collaborator:
We should remove:

```diff
- @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
```
Contributor (Author):
Removed, and added logic checks for the "Test openvino-nightly" stage, where the libraries are not compatible.

@apaniukov apaniukov requested a review from echarlaix April 4, 2024 18:03
```diff
@@ -327,7 +320,7 @@ class StoreAttr(object):
         **kwargs_shapes,
     )

-    if convert_tokenizer:
+    if convert_tokenizer and is_openvino_tokenizers_available():
```
Collaborator:

The tokenizer export results in additional files being included, which could be confusing for users. Can this be moved to an openvino_tokenizers directory?

Suggested change:

```diff
 if convert_tokenizer and is_openvino_tokenizers_available():
+    output = Path(output) / "openvino_tokenizers"
```

Contributor (Author):
This is similar to the original tokenizer, so we replicate an existing pattern here.

Collaborator:
I'm thinking about something similar to what's done for SD models. Is this something that sounds reasonable to you? I'm fine with adding it in a following PR if you prefer not to include it in this one.

Contributor (Author):
Yes, I can do that in a separate PR.

Collaborator:
thanks a lot @apaniukov !

Comment on lines +128 to +134:

```python
if not is_openvino_tokenizers_available():
    self.assertTrue(
        "OpenVINO Tokenizers is not available." in output
        or "OpenVINO and OpenVINO Tokenizers versions are not binary compatible." in output,
        msg=output,
    )
    return
```
Collaborator:
I would prefer to have it removed to make sure this is always tested (we should always have a compatible version of openvino / openvino-tokenizers when testing).

Suggested change:

```diff
-if not is_openvino_tokenizers_available():
-    self.assertTrue(
-        "OpenVINO Tokenizers is not available." in output
-        or "OpenVINO and OpenVINO Tokenizers versions are not binary compatible." in output,
-        msg=output,
-    )
-    return
```

@apaniukov (Contributor, Author) commented Apr 10, 2024:
This was added because of this CI step:

```yaml
- name: Test openvino-nightly
```

It deletes openvino and installs openvino-nightly, which should be incompatible with the installed tokenizers version, replicating the incompatibility scenario.

```
@@ -118,21 +118,23 @@ def test_exporters_cli(self, task: str, model_type: str):
        for arch in SUPPORTED_ARCHITECTURES
        if not arch[0].endswith("-with-past") and not arch[1].endswith("-refiner")
    )
    @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
    def test_exporters_cli_tokenizers(self, task: str, model_type: str):
        with TemporaryDirectory() as tmpdir:
            output = subprocess.check_output(
```
Collaborator:

We should also check that no error was raised here.

Collaborator:

With

```shell
optimum-cli export openvino --model hf-internal-testing/tiny-random-t5 --task text2text-generation ov_model
```

I'm getting the following error:

```
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'
```

Collaborator:
I would like to have this fixed before making the tokenizer export the default and merging this PR.

Contributor (Author):
It is not a bug: we don't support the Unigram model from tokenizer.json yet, only from a sentencepiece model file. If you use a repository that has such a file, the tokenizer will be converted.

I added one more check for the vocab file here: openvinotoolkit/openvino_tokenizers#116
This will transform

```
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'
```

into:

```
OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: Cannot convert tokenizer of this type without `.model` file.
```

The issue is that the original tokenizer object contains info about a `.model` file, but the file does not actually exist on disk.

I would argue that this is an edge case of a test repo: most tokenizers with info about the `.model` file actually have it, so it should not block the merge.
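The vocab-file guard described above could be sketched like this (hypothetical: `check_sentencepiece_model` and the `spiece.model` filename are illustrative, following the T5 example, and are not the actual openvino_tokenizers API):

```python
from pathlib import Path


def check_sentencepiece_model(tokenizer_dir: str) -> None:
    """Hypothetical guard: raise a clear error when the sentencepiece `.model`
    file referenced by the tokenizer config is missing on disk."""
    vocab_file = Path(tokenizer_dir) / "spiece.model"
    if not vocab_file.is_file():
        # Fail early with a message about the tokenizer type, not a raw ENOENT.
        raise OSError("Cannot convert tokenizer of this type without `.model` file.")
```

Running the guard before conversion turns the confusing `[Errno 2] No such file or directory` into the explicit message quoted above.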

Collaborator:
I think it's reasonable for not all cases to be supported, but for these cases we should disable the tokenizer export and not throw an error, as it won't be expected by users and could be a bit confusing in my opinion.

Collaborator:

Can we either disable this warning or disable the export for unsupported cases?

Contributor (Author):

This warning is shown for any exception that appeared during conversion. After seeing this message, the user can create an issue requesting support for a particular model/tokenizer.
I also prefer to tell the user that the tokenizer was not converted, rather than silently omitting it without warning, so the lack of a (de)tokenizer model won't be a surprise.

Collaborator:

I'd agree if the export of the tokenizer were an explicit choice of the user (the current --convert-tokenizer), but since this PR makes it the default, I think such a warning can be an issue: it makes it look like the export failed.

Contributor (Author):

I changed the log level to debug, so the message won't be visible by default.
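The effect of the change can be sketched with standard logging: at the default WARNING level, a `logger.debug` message is suppressed (a minimal sketch of stdlib behavior, not the actual optimum-intel code):

```python
import io
import logging

logger = logging.getLogger("tokenizer_export_sketch")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)  # typical default visibility


def report_unsupported(tokenizer_name: str, exc: Exception) -> None:
    # Previously logged as a warning (always visible); now debug (hidden by default).
    logger.debug(
        "OpenVINO Tokenizer export for %s is not supported. Exception: %s",
        tokenizer_name, exc,
    )


report_unsupported("T5TokenizerFast", FileNotFoundError("spiece.model"))
print(repr(stream.getvalue()))  # '' : nothing is emitted at the default level
```

Users who want the details back can raise verbosity (e.g. set the logger to DEBUG), and the message reappears.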

Collaborator:

thanks @apaniukov

@apaniukov apaniukov requested a review from helena-intel April 15, 2024 11:35
@echarlaix (Collaborator) commented Apr 17, 2024

A couple of tests are failing; I will merge once they are fixed. Let me know if you need any help, @apaniukov.

@echarlaix (Collaborator):

Now that #618 is merged, the tokenizer won't be exported by default when hybrid quantization is applied for SD models. I'm fine with merging it now and adding this in a following PR (along with #580 (comment)). What do you think, @apaniukov?

@apaniukov (Contributor, Author):

> Now that #618 is merged, the tokenizer won't be exported by default when hybrid quantization is applied for SD models. I'm fine with merging it now and adding this in a following PR (along with #580 (comment)). What do you think, @apaniukov?

No problem, let's do that and I will create a new PR with the fixes we discussed here.

@echarlaix (Collaborator):

> No problem, let's do that and I will create a new PR with the fixes we discussed here.

Perfect, will merge it now, thanks @apaniukov !

@echarlaix echarlaix merged commit 0d943f8 into huggingface:main Apr 18, 2024
10 checks passed