Add native \p{M} (Unicode Mark) regex support for Qwen3.5 tokenizer #1063
Merged
Pull request overview
Adds native pre-tokenizer support for Qwen3.5 regex alternatives that use Unicode mark (\p{M}), avoiding fallback to std::regex for unsupported Unicode property escapes.
Changes:
- Added Qwen3.5-specific regex matcher functions and Unicode helper predicates.
- Registered the new Qwen3.5 matcher patterns in PreTokenizerWithRegEx::Compile.
- Added a regression test covering combining marks, punctuation, digits, whitespace, and newlines.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| operators/tokenizer/bpe_utils.hpp | Adds Qwen3.5 regex matcher implementations and pattern registration. |
| test/pp_api_test/test_tokenizer_impl.cc | Adds a tokenizer regex test for Qwen3.5-style \p{M} handling. |
sayanshaw24 (Collaborator) approved these changes on May 15, 2026 and left a comment:
looks great, thanks for adding this!
Summary
Adds hand-coded regex matchers for Qwen3.5's tokenizer pre-tokenization patterns that use `\p{M}` (Unicode Mark category). Without this change, these patterns fall through to `std::regex`, which does not support Unicode property escapes and crashes at runtime.
Problem
Qwen3.5 is the only model family whose tokenizer regex includes `\p{M}`. Two sub-patterns of its full pre-tokenizer regex contain `\p{M}`:
- `[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+`
- ` ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*`
The remaining 4 sub-patterns already have existing matchers (LLAMA3, GPT2).
Changes
operators/tokenizer/bpe_utils.hpp
- `IsLM()` helper: matches `[\p{L}\p{M}]`
- `NotLMNZ()` helper: matches `[^\s\p{L}\p{M}\p{N}]`
- `Match_Qwen35_Pattern_1()`: implements `[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+`
- `Match_Qwen35_Pattern_2()`: implements ` ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*`
- Registered both patterns in the `Compile()` lookup table (before shorter LLAMA3 patterns to avoid shadowing)

test/pp_api_test/test_tokenizer_impl.cc
- `Qwen35RegexTest`: compiles the full Qwen3.5 regex and verifies tokenization of text with combining marks (e.g., café using U+0301), punctuation, digits, and newlines
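
For readers who want to see what a hand-coded matcher for these two sub-patterns can look like, here is a minimal sketch. It is not the code from this PR. The helper names `IsLM` and `NotLMNZ` are taken from the description above, but their bodies are assumptions, and `SketchPattern1` and `SketchPattern2` are hypothetical names. The category predicates are deliberately simplified stand-ins (ASCII letters and digits, plus the Combining Diacritical Marks block U+0300 to U+036F); a real implementation would consult full Unicode tables. Each function returns how many code points match at the start of the input, with 0 meaning no match.

```cpp
#include <cstddef>
#include <string_view>

// ASSUMPTION: simplified stand-ins for Unicode general-category tests.
// They cover only a few ranges and are not the predicates used in the PR.
inline bool IsL(char32_t c) {  // \p{L}: ASCII letters only in this sketch
  return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}
inline bool IsM(char32_t c) {  // \p{M}: Combining Diacritical Marks block only
  return c >= 0x0300 && c <= 0x036F;
}
inline bool IsN(char32_t c) {  // \p{N}: ASCII digits only in this sketch
  return c >= U'0' && c <= U'9';
}
inline bool IsSpace(char32_t c) {  // \s: ASCII whitespace only in this sketch
  return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r' ||
         c == U'\v' || c == U'\f';
}

// [\p{L}\p{M}]
inline bool IsLM(char32_t c) { return IsL(c) || IsM(c); }

// [^\s\p{L}\p{M}\p{N}]
inline bool NotLMNZ(char32_t c) {
  return !IsSpace(c) && !IsL(c) && !IsM(c) && !IsN(c);
}

// Sketch of [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+
inline std::size_t SketchPattern1(std::u32string_view s) {
  // First attempt: consume the optional one-character prefix.
  if (!s.empty() && s[0] != U'\r' && s[0] != U'\n' && !IsL(s[0]) && !IsN(s[0])) {
    std::size_t j = 1;
    while (j < s.size() && IsLM(s[j])) ++j;
    if (j > 1) return j;  // prefix plus at least one letter or mark
  }
  // Fallback: no prefix. This matters because a mark is itself a valid
  // prefix character, so a lone mark must still match via [\p{L}\p{M}]+.
  std::size_t j = 0;
  while (j < s.size() && IsLM(s[j])) ++j;
  return j;
}

// Sketch of " ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*" (optional leading space)
inline std::size_t SketchPattern2(std::u32string_view s) {
  std::size_t i = 0;
  if (i < s.size() && s[i] == U' ') ++i;  // optional single space
  const std::size_t start = i;
  while (i < s.size() && NotLMNZ(s[i])) ++i;
  if (i == start) return 0;               // the "+" needs at least one character
  while (i < s.size() && (s[i] == U'\r' || s[i] == U'\n')) ++i;
  return i;
}
```

With these stand-ins, `SketchPattern1(U"cafe\u0301!")` consumes all five code points of the decomposed "café" (the combining acute accent stays attached to the word), while `SketchPattern2` rejects a lone U+0301 because marks are excluded from its character class.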