HF Tokenizers #11
Conversation
Please use lintrunner to format the code
LGTM. Thanks for adding the hf tokenizer, and refactoring the code base!
Please address @larryliu0820 's comment, and fix the submodule issue in CI before merging it!
I've added this in a new include/detail directory and used tokenizers::detail as the namespace. This base class should not be part of the public interface, but it needs to be in the include tree since it will be used by multiple public interface classes.

* Programmatically discover unit tests to build
* Use target_compile_definitions to set RESOURCES_PATH and avoid env vars
* Add unit tests to ctest so that they can be run programmatically
* Run all unit tests in CI with ctest

This code is available under the MIT license, which is retained in the files.

This is a partial implementation of the pre tokenizers from the HF project. Others can be added by extending the set of derived classes and the factory.

This still doesn't implement the post_processor portion of the HF tokenizers library.
This is a cleaner designation as a third-party asset and avoids needing to carefully read the comment (plus ignore it for linting).
Comments addressed!
Thank you! Please make it “Ready for review” so that it can be merged. Also please update the submodule commit hash in torchchat after this PR is merged.
@larryliu0820 Ok, officially ready for review. The linux build was failing with undefined symbols for
Description
This PR introduces a collection of new capabilities in support of Huggingface tokenizer models. It is minimally complete at this point, so I'm opening up this PR as a point of discussion. The main goal is to add compatibility with the tokenizers library at the C++ level.
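To make the goal concrete, here is a minimal usage sketch of the kind of C++-level compatibility this is aiming at. Only the `HFTokenizer` class name comes from this PR; the header name and the `load`/`encode` signatures below are assumptions for illustration, not the actual public API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

#include "hf_tokenizer.h"  // assumed header name/location

int main() {
  tokenizers::HFTokenizer tok;   // class name from this PR
  tok.load("tokenizer.json");    // model file exported by HF tokenizers

  // Hypothetical encode call returning token ids.
  std::vector<uint64_t> ids = tok.encode("Hello, world!");
  for (uint64_t id : ids) {
    std::cout << id << " ";
  }
  std::cout << "\n";
  return 0;
}
```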
Additions
Dependencies
- `tokenizer.json` files from HF tokenizers
- `unicode.[h|cpp]` and `unicode-data.[h|cpp]` from `llama.cpp`. These implement the full set of transformations needed to go to/from the byte-encoded space needed by the `ByteLevel` functionality.
  - These functions take a `const std::string&` rather than a `string_view` (`re2::StringPiece`), so we need to copy from the view to a string before calling the `llama.cpp` functions (sketched below). This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
  - They live under `third_party` since this was a slim subset of the full `llama.cpp` project.
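As an illustration of the copy mentioned above, the pattern looks roughly like this. `byte_level_encode` is a hypothetical stand-in for whichever vendored helper is being called, not the real function name.

```cpp
#include <string>

#include <re2/stringpiece.h>

// Hypothetical stand-in for one of the vendored llama.cpp unicode helpers,
// which take const std::string& rather than a view type.
std::string byte_level_encode(const std::string& text);

std::string encode_piece(re2::StringPiece piece) {
  // Materialize the view into an owned std::string before calling the
  // vendored function; this is the extra copy described above.
  std::string owned(piece.data(), piece.size());
  return byte_level_encode(owned);
}
```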
Structure Changes
- `include/detail` directory to hold headers that are needed by the public interface, but should not be treated as part of the public interface (in particular `bpe_tokenizer_base.h`); illustrated below
- `tokenizers::detail` namespace for all code that lives below `include/detail` (and corresponding src files)
- `src/include` for implementation-only headers
- `tools` subdir to hold binary tool builds (`tools/tokenize_tool`)
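For reference, the shape of the shared base header described above would look roughly like this; the `include/detail` location, `tokenizers::detail` namespace, and `BPETokenizerBase` name come from this PR, while the member shown is purely illustrative.

```cpp
// include/detail/bpe_tokenizer_base.h (illustrative contents only)
#pragma once

#include <string>
#include <vector>

namespace tokenizers {
namespace detail {

// Shared BPE machinery used by multiple public tokenizer classes. It lives in
// the include tree so public headers can see it, but under detail/ so it is
// not treated as part of the public interface.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;

 protected:
  // Hook for model-specific behavior; the real hooks and signatures differ.
  virtual std::vector<std::string> pre_tokenize_hook(const std::string& text) const = 0;
};

} // namespace detail
} // namespace tokenizers
```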
Code Changes
- New `PreTokenizer` and `TokenDecoder` abstractions aligned with the corresponding concepts in the Rust codebase
- Refactored `Tiktoken` into a base implementation of BPE (`BPETokenizerBase`) with hooks for the specifics of the `Tiktoken` models
- `HFTokenizer` as a second derived `BPETokenizerBase` implementation which adds a `PreTokenizer`/`TokenDecoder` to the input/output logic (see the sketch below)
- `tokenize_tool` to sanity check tokenizers. This was mostly just for my own sanity, so it may not have much utility in the future, but I figured it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?
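A loose sketch of how these pieces relate; only the class names come from this PR, while the method names, signatures, and members are assumptions.

```cpp
#include <memory>
#include <string>
#include <vector>

namespace tokenizers {

// Mirrors the Rust PreTokenizer concept: split raw text into pieces before
// BPE merging.
class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  virtual std::vector<std::string> pre_tokenize(const std::string& text) const = 0;
};

// Mirrors the Rust decoder concept: map tokens back toward readable text
// (e.g. undoing the ByteLevel byte encoding).
class TokenDecoder {
 public:
  virtual ~TokenDecoder() = default;
  virtual std::string decode(const std::string& token) const = 0;
};

namespace detail {
// Shared BPE base (see the structure sketch above); hooks omitted here.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;
};
} // namespace detail

// HFTokenizer derives from the shared BPE base and plugs a PreTokenizer into
// the encode path and a TokenDecoder into the decode path.
class HFTokenizer : public detail::BPETokenizerBase {
 private:
  std::unique_ptr<PreTokenizer> pre_tokenizer_;
  std::unique_ptr<TokenDecoder> decoder_;
};

} // namespace tokenizers
```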
Build and Test
- Use `target_compile_definitions` to set `RESOURCES_PATH` to avoid relying on the environment (see the example below)
- Add unit tests to `ctest` so that they can be run programmatically
- Run all unit tests in CI with `ctest` instead of invoking test binaries directly
- Rename test binaries to match their source files (`test_tiktoken.cpp` -> `test_tiktoken` rather than `tiktoken_test`)
- Unit tests cover the `pre_tokenizer` stack, but nothing else (yet!)
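As an example of how the compile definition replaces the environment lookup, a test can build resource paths roughly like this; only the `RESOURCES_PATH` macro name comes from the PR, and the helper is illustrative.

```cpp
#include <string>

// RESOURCES_PATH is injected as a string literal at build time via
// target_compile_definitions, so tests no longer read an environment variable.
static std::string resource_path(const std::string& name) {
  return std::string(RESOURCES_PATH) + "/" + name;
}
```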
Testing
Build setup
Unit testing
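For reference, a pre-tokenizer unit test discovered and run through `ctest` would have roughly this shape; the whitespace splitter below is a self-contained stand-in so the example compiles, not the actual `PreTokenizer` API from this PR.

```cpp
#include <gtest/gtest.h>

#include <sstream>
#include <string>
#include <vector>

// Self-contained stand-in for a whitespace pre-tokenizer; real tests exercise
// the PreTokenizer classes added in this PR.
static std::vector<std::string> whitespace_pre_tokenize(const std::string& text) {
  std::vector<std::string> pieces;
  std::istringstream stream(text);
  for (std::string word; stream >> word;) {
    pieces.push_back(word);
  }
  return pieces;
}

TEST(PreTokenizerTest, SplitsOnWhitespace) {
  const std::vector<std::string> pieces = whitespace_pre_tokenize("hello world");
  ASSERT_EQ(pieces.size(), 2u);
  EXPECT_EQ(pieces[0], "hello");
  EXPECT_EQ(pieces[1], "world");
}
```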
Spot testing
I also did spot testing with several tokenizer models:
Still TODO
- Special token parsing (currently using the `tiktoken` hard-coded special tokens)
Future work
There are additional `PreTokenizers` and `TokenDecoders` that are not yet supported. There's also the whole `PostProcessors` suite as well. A good test case for this would be the `tokenizers` version of the `llama3.1` tokenizer (here).