Add Tokenizers #687

Closed
wants to merge 142 commits into from

Conversation

@apaniukov apaniukov commented Jul 17, 2023

Details:

This PR extends the OpenVINO (OV) opset with tokenization-related operations.

  • Check SentencePiece MUSE model works
  • Legal Check

Ticket:

slyalin and others added 30 commits April 26, 2023 03:07
…h and without string support in OV core. Moved StringTensorUnpack and reworked it to be aligned with the new approach. Reworked sentence piece op and translation code to be compatible with several variants of string tensor representation and the plugin wrapping hack.
…ranch to contrib in form compatible with both master and the branch with string tensors support. Added CaseFoldUTF8 from that branch.
…pty constants, register StringTensorPack and StringTensorUnpack as OV operations to be able to read IRs with those operations
…den Const translator for TF to intercept string constants
…r conditional compilation based on available features in OpenVINO
…combination of WordpieceTokenizeWithOffsets and LookupTableFindV2 from TensorFlow
…ute initialization optional (needed for core.make_node)
…n and RegexSplit based on paddle fast_tokenizer lib. Limited implementation, not all of the features of ops and TF translated ops are implemented.
… necessary steps to complete HF bert preprocessing conversion (not validated)
…kenizer and main model is fixed partially (still produces topologically incorrect model)
…uts, now Bert and its tokenizer are connected together correctly

Azure Pipelines successfully started running 1 pipeline(s).

@apaniukov

/azp run openvino_contrib-mac

Azure Pipelines successfully started running 1 pipeline(s).

@apaniukov apaniukov force-pushed the tokenizer-fix-decode branch 2 times, most recently from 5c3b656 to d34d401 Compare November 20, 2023 17:30
@apaniukov

/azp run openvino_contrib-mac

Azure Pipelines successfully started running 1 pipeline(s).

I can see Windows compiles now. But ov_tokenizer.init_extension() fails for me:

import os
import sys
import ov_tokenizer

if hasattr(os, "add_dll_directory"):
    for path in os.environ.get("PATH", "").split(";"):
        if os.path.isdir(path):
            os.add_dll_directory(path)
ov_tokenizer.init_extension(sys.argv[1])

py llm/cpp/convert_tokenizers.py c:/Users/vzlobin/r/openvino.genai/build/thirdparty/openvino_contrib/modules/custom_operations/user_ie_extensions/Release/user_ov_extensions.dll C:\Users\vzlobin\r\tiny-llama-fast-tokenizer
Traceback (most recent call last):
  File "C:\Users\vzlobin\r\openvino.genai\llm\cpp\convert_tokenizers.py", line 9, in <module>
    ov_tokenizer.init_extension(sys.argv[1])
  File "C:\Users\vzlobin\r\openvino.genai\thirdparty\openvino_contrib\modules\custom_operations\user_ie_extensions\tokenizer\python\ov_tokenizer\node_factory.py", line 21, in init_extension
    factory.add_extension(extension_path)
  File "C:\Users\vzlobin\Downloads\w_openvino_toolkit_windows_2023.2.0.13089.cfd42bd2cb0_x86_64\python\openvino\runtime\utils\node_factory.py", line 118, in add_extension
    self.factory.add_extension(lib_path)
RuntimeError: Cannot load library 'c:/Users/vzlobin/r/openvino.genai/build/thirdparty/openvino_contrib/modules/custom_operations/user_ie_extensions/Release/user_ov_extensions.dll': 126 from cwd: C:\Users\vzlobin\r\openvino.genai
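
For context: Windows error code 126 is ERROR_MOD_NOT_FOUND, which is reported when either the DLL itself or one of the DLLs it depends on cannot be located. The sketch below is one possible way to narrow that down before calling `ov_tokenizer.init_extension()`. The helper names (`existing_dirs`, `register_dll_dirs`, `explain_load_failure`) are hypothetical and not part of `ov_tokenizer` or OpenVINO. It also converts PATH entries to absolute paths, on the assumption that `os.add_dll_directory` rejects relative ones.

```python
import os


def existing_dirs(path_value, sep=os.pathsep):
    """Return the existing directories from a PATH-style string.

    Entries are made absolute, empty entries are skipped, and duplicates
    are dropped while the original order is preserved.
    """
    seen = set()
    result = []
    for entry in path_value.split(sep):
        entry = entry.strip().strip('"')
        if not entry or not os.path.isdir(entry):
            continue
        absolute = os.path.abspath(entry)
        key = os.path.normcase(absolute)
        if key in seen:
            continue
        seen.add(key)
        result.append(absolute)
    return result


def register_dll_dirs(dirs):
    """Register directories for DLL lookup where the platform supports it.

    os.add_dll_directory only exists on Windows (Python 3.8+); elsewhere
    this is a no-op that returns an empty list.
    """
    if not hasattr(os, "add_dll_directory"):
        return []
    return [os.add_dll_directory(d) for d in dirs]


def explain_load_failure(lib_path):
    """Give a first hint about why loading lib_path may have failed."""
    if not os.path.isfile(lib_path):
        return f"library file not found: {lib_path}"
    return (
        f"library file exists: {lib_path}; error 126 then usually points "
        "at a dependency that could not be resolved"
    )


if __name__ == "__main__":
    import sys

    # Assumed usage: the extension path is the first command-line argument.
    register_dll_dirs(existing_dirs(os.environ.get("PATH", "")))
    if len(sys.argv) > 1:
        print(explain_load_failure(sys.argv[1]))
```

If the file exists, the next step would be to check that the directories holding the OpenVINO runtime DLLs are among the registered ones. Whether that is the actual cause here is not established in this thread.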

@ilya-lavrenov

/azp run openvino_contrib-mac

Azure Pipelines successfully started running 1 pipeline(s).

@ilya-lavrenov

Replaced by #767

Labels
  • category: build (OpenVINO cmake script / infra)
  • category: CI (OpenVINO public CI)
  • category: custom operations (OpenVINO Runtime Extension with custom operations)
Development

Successfully merging this pull request may close these issues.

Azure: Unsupported Python version for Windows
7 participants