fix(tokenizer): correctness and robustness fixes for cache and streaming#474
Conversation
Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on enhancing the correctness and robustness of the tokenizer library by addressing several critical issues across its caching mechanisms, streaming decoding, and factory initialization. It ensures more reliable cache behavior, prevents potential panics during text decoding, and improves the flexibility of tokenizer creation, ultimately leading to a more stable and predictable system.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: feb312e17a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
self.current_memory
    .fetch_sub(old_size - new_size, Ordering::Relaxed);
```
Prevent memory counter underflow on concurrent replacements
The replacement path subtracts old_size - new_size with fetch_sub, but current_memory is updated outside the map insertion critical section. With concurrent inserts of the same hash, a replacing thread can run this subtraction before the original insert's fetch_add, and when new_size < old_size the u64 counter wraps to a huge value. That can make memory stats and eviction decisions permanently incorrect under concurrent mixed-size replacements.
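A wrap-proof variant of the decrement can be sketched as follows; `shrink_counter` is a hypothetical helper, assuming `current_memory` is an `AtomicU64` as in the snippet above, and is not the crate's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical helper: decrement the memory counter for a replacement
// without risking u64 wraparound when the racing insert's fetch_add has
// not landed yet.
fn shrink_counter(current_memory: &AtomicU64, old_size: u64, new_size: u64) {
    let delta = old_size.saturating_sub(new_size);
    // fetch_update retries on contention; saturating_sub clamps the
    // counter at zero instead of wrapping to u64::MAX.
    let _ = current_memory.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
        Some(cur.saturating_sub(delta))
    });
}
```

Clamping hides (rather than fixes) the ordering race, but it keeps memory stats and eviction decisions from being poisoned by a single wrapped subtraction.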
Code Review
This pull request introduces several important correctness and robustness fixes. The L0 cache poisoning issue is resolved by keying on the add_special_tokens flag, the L1 cache memory accounting is corrected, and a panic in DecodeStream due to UTF-8 boundaries is fixed. The factory functions are also improved for consistency. Overall, these are great improvements. I've found one critical issue in the new L0 cache eviction logic that could lead to the cache size exceeding its limit, and I've provided a suggested fix.
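To make the L0 keying fix concrete, here is a minimal, dependency-free sketch of the two-map layout, using `Mutex<HashMap>` in place of the crate's `DashMap`s; all names (`L0Cache`, `with_special`, `without_special`) are illustrative, not the PR's actual types:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Sketch of the per-flag L0 layout: one map per add_special_tokens value,
// so the hot-path lookup key stays a plain &str instead of an allocated
// (String, bool) tuple, and the two flag values can never collide.
struct L0Cache {
    with_special: Mutex<HashMap<String, Vec<u32>>>,
    without_special: Mutex<HashMap<String, Vec<u32>>>,
}

impl L0Cache {
    fn new() -> Self {
        Self {
            with_special: Mutex::new(HashMap::new()),
            without_special: Mutex::new(HashMap::new()),
        }
    }

    // Select the map for this flag value.
    fn map(&self, add_special_tokens: bool) -> &Mutex<HashMap<String, Vec<u32>>> {
        if add_special_tokens { &self.with_special } else { &self.without_special }
    }

    fn get(&self, input: &str, add_special_tokens: bool) -> Option<Vec<u32>> {
        self.map(add_special_tokens).lock().unwrap().get(input).cloned()
    }

    fn insert(&self, input: &str, add_special_tokens: bool, ids: Vec<u32>) {
        self.map(add_special_tokens).lock().unwrap().insert(input.to_string(), ids);
    }
}
```

The same input cached under both flag values now resolves to independent entries, which is the cross-flag poisoning fix described above.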
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Refactored L0 to maintain two per-flag maps keyed by add_special_tokens and updated all L0 call sites to pass the flag; adjusted L1 memory accounting on replacements; fixed UTF-8 slicing in streaming decode and added a regression test; extracted an OpenAI-model detection helper; removed the Tokenizer Deref; removed three workspace deps from Cargo.toml.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks: ✅ 3 passed
- L0 cache: key by (input, add_special_tokens) to prevent cross-flag cache poisoning. Uses two separate DashMaps to keep lookups zero-allocation on the hot path.
- L1 cache: fix memory accounting when insert replaces an existing entry — prevents current_memory from drifting upward and causing premature evictions.
- DecodeStream: fix panic on multi-byte UTF-8 char boundaries when prefix_text.len() falls mid-codepoint in new_text (byte-fallback tokenizers).
- Factory: extract is_likely_openai_model(), unify async/blocking detection logic, add tiktoken-to-HuggingFace fallback in blocking path.
- Remove Deref<Target=Arc<dyn Tokenizer>> anti-pattern.
- Drop unused deps: bytemuck, lru, parking_lot.

Signed-off-by: Simo Lin <simo.lin@oracle.com>
Force-pushed from d8676b8 to b268170.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is matching too
broadly on "ada"; tighten the detection to avoid false positives by requiring
OpenAI-style naming (e.g., no '/' character) or more specific substrings such as
"text-ada" or "ada-" before accepting an ada match. Update
is_likely_openai_model to replace the loose name.contains("ada") check with a
check that the name either contains "text-ada" or starts with "ada" (or has an
"ada-" token) and does not contain '/' so HuggingFace-style names like
"stabilityai/stablelm-ada-1b" won’t falsely match. Ensure you keep the other
token checks intact (gpt-4, gpt-3.5, turbo, davinci, curie, babbage, codex).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is too permissive for
"ada" and causes false positives; tighten the "ada" check by matching only
standalone or token-delimited occurrences (e.g. start/end or separated by -, _,
/) instead of any substring. Update is_likely_openai_model to use a
word-boundary or token-aware pattern for "ada" (for example via a regex like one
that matches (^|[-_/])ada($|[-_/]) or equivalent) so names like "openai/ada" or
"text-ada-001" match but "…-ada-…" arbitrary HF names do not. Ensure other
checks (gpt-4, turbo, davinci, etc.) remain unchanged.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b268170fd8
```rust
while !new_text.is_char_boundary(split_at) && split_at > 0 {
    split_at -= 1;
}

let new_text = new_text[split_at..].to_string();
```
Avoid emitting stale text after backing up split boundary
Backing split_at up to the previous UTF-8 boundary can silently corrupt streamed output when a later decode rewrites bytes that were already emitted in an earlier step. In that case split_at becomes smaller than prefix_text.len(), so the code emits a suffix from an earlier boundary without retracting the already-sent bytes (for example, a stream can emit abc on one step and then 🎉 on the next, yielding abc🎉 instead of ab🎉). This affects byte-fallback tokenizers where additional context merges prior bytes into a multibyte character.
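The panic and the walk-back are easy to reproduce in isolation; `split_new_suffix` below is a hypothetical stand-alone version of the logic in the diff, not the crate's `DecodeStream`:

```rust
// Sketch of the boundary walk-back from the diff above. Given the decoded
// text so far and the byte length of the previously emitted prefix, return
// the newly streamable suffix without panicking mid-codepoint.
fn split_new_suffix(new_text: &str, prefix_len: usize) -> &str {
    let mut split_at = prefix_len.min(new_text.len());
    // Slicing &new_text[split_at..] at a non-boundary would panic, so back
    // up to the previous UTF-8 char boundary first.
    while !new_text.is_char_boundary(split_at) && split_at > 0 {
        split_at -= 1;
    }
    // Note: when split_at ends up below prefix_len, this suffix re-includes
    // bytes the caller already emitted, which is the stale-text hazard
    // flagged in the comment above.
    &new_text[split_at..]
}
```

With `new_text = "abc🎉"` and a 5-byte emitted prefix, index 5 falls inside the 4-byte 🎉 (bytes 3..7), so the walk-back lands on byte 3 and the whole character is returned intact.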
- L0: maybe_evict now evicts from the larger map instead of only the insertion target map, preventing total entries from exceeding max_entries when one map is full and the other is empty
- L1: add comment documenting the benign race between shard insert and memory counter update on concurrent same-key replacements
- Add test_eviction_across_maps regression test for L0

Signed-off-by: Simo Lin <simo.lin@oracle.com>
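The cross-map eviction rule can be sketched with plain `HashMap`s; `maybe_evict` here is a hypothetical stand-in for the crate's method, called before an insert, and the arbitrary-victim choice stands in for the real eviction policy:

```rust
use std::collections::HashMap;

// Sketch: evict from whichever per-flag map is currently larger, so the
// combined entry count stays bounded even when all inserts target one map
// and the other map holds most of the entries.
fn maybe_evict(
    with_special: &mut HashMap<String, Vec<u32>>,
    without_special: &mut HashMap<String, Vec<u32>>,
    max_entries: usize,
) {
    while with_special.len() + without_special.len() >= max_entries {
        let target = if with_special.len() >= without_special.len() {
            &mut *with_special
        } else {
            &mut *without_special
        };
        // Remove an arbitrary victim; a real cache would use its policy here.
        match target.keys().next().cloned() {
            Some(key) => {
                target.remove(&key);
            }
            None => break,
        }
    }
}
```

Evicting only from the insertion-target map (the pre-fix behavior) lets the other map keep the total above max_entries indefinitely; sizing the victim map by comparison fixes that.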
Force-pushed from b268170 to 1973b6f.
Summary
- `add_special_tokens` cache poisoning — L0 was keyed only on the input string, ignoring the flag. The same input encoded with different flag values would return stale/wrong results. Now uses two separate `DashMap`s (one per flag value) to keep lookups zero-allocation on the hot path.
- `DashMap::insert` replacing an existing entry was unconditionally adding `size_bytes` to `current_memory`, causing it to drift upward over time and trigger premature evictions.
- `new_text[prefix_text.len()..]` panics when `prefix_text.len()` falls mid-codepoint. This happens with byte-fallback tokenizers where adding context merges partial bytes into multi-byte characters. Now walks backward to the nearest valid `is_char_boundary`.
- Extracted `is_likely_openai_model()` to deduplicate the async/blocking paths. The blocking path previously hard-errored on tiktoken failure instead of falling back to HuggingFace.
- Removed `Deref<Target=Arc<dyn Tokenizer>>` — well-known anti-pattern causing method resolution confusion.
- Dropped unused deps: `bytemuck`, `lru`, `parking_lot` (leftover from the old cache impl).
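For reference, one tightened shape of the detection helper that folds in the reviewers' "ada" concern; this is a hypothetical sketch, not the PR's actual `is_likely_openai_model`:

```rust
// Hypothetical OpenAI-model-name detector with the token-delimited "ada"
// check suggested in review: HuggingFace-style "org/model" paths never
// match, and "ada" must stand alone between '-'/'_' separators (or string
// ends) rather than match as an arbitrary substring.
fn is_likely_openai_model(name: &str) -> bool {
    let n = name.to_ascii_lowercase();
    if n.contains('/') {
        return false; // HuggingFace-style repo paths are never OpenAI names
    }
    const MARKERS: [&str; 7] =
        ["gpt-4", "gpt-3.5", "turbo", "davinci", "curie", "babbage", "codex"];
    if MARKERS.iter().any(|m| n.contains(m)) {
        return true;
    }
    // "ada" only as a standalone token, so "granada" does not match.
    n.split(|c: char| c == '-' || c == '_').any(|tok| tok == "ada")
}
```

Under this shape, `text-ada-001` and `gpt-4o-mini` match while `stabilityai/stablelm-ada-1b` and `granada` do not.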
Test plan

- `cargo test -p llm-tokenizer --lib` passes
- `test_add_special_tokens_flag_separates_entries` covers L0 flag separation
- `test_decode_stream_multibyte_char_boundary` reproduces the panic without the fix

Summary by CodeRabbit
- Bug Fixes
- Performance Improvements
- Behavior
- Tests