
fix(tokenizer): correctness and robustness fixes for cache and streaming#474

Merged
slin1237 merged 2 commits into main from slin/tokenizer-fixes
Feb 19, 2026

Conversation


@slin1237 slin1237 commented Feb 19, 2026

Summary

  • L0 cache: fix add_special_tokens cache poisoning — L0 was keyed only on input string, ignoring the flag. Same input encoded with different flag values would return stale/wrong results. Now uses two separate DashMaps (one per flag value) to keep lookups zero-allocation on the hot path.
  • L1 cache: fix memory accounting on entry replacement — DashMap::insert replacing an existing entry was unconditionally adding size_bytes to current_memory, causing it to drift upward over time and trigger premature evictions.
  • DecodeStream: fix UTF-8 char boundary panic — new_text[prefix_text.len()..] panics when prefix_text.len() falls mid-codepoint. This happens with byte-fallback tokenizers where adding context merges partial bytes into multi-byte characters. Now walks backward to the nearest valid is_char_boundary.
  • Factory: unify OpenAI model detection + add fallback in blocking path — Extracts is_likely_openai_model() to deduplicate the async/blocking paths. The blocking path previously hard-errored on tiktoken failure instead of falling back to HuggingFace.
  • Remove Deref<Target=Arc<dyn Tokenizer>> — well-known anti-pattern causing method resolution confusion.
  • Drop unused deps — bytemuck, lru, parking_lot (leftovers from the old cache impl).
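The dual-map idea can be sketched as follows. This is a minimal single-threaded illustration using the standard HashMap; the actual cache uses DashMap for concurrent access, and the names here (L0Cache, map_for) are illustrative, not the PR's API:

```rust
use std::collections::HashMap;

/// Minimal single-threaded sketch of the dual-map L0 cache;
/// the real implementation uses DashMap and different names.
struct L0Cache {
    map_plain: HashMap<String, Vec<u32>>,   // add_special_tokens == false
    map_special: HashMap<String, Vec<u32>>, // add_special_tokens == true
}

impl L0Cache {
    fn new() -> Self {
        Self {
            map_plain: HashMap::new(),
            map_special: HashMap::new(),
        }
    }

    // Selecting the map by flag keeps the lookup key a plain &str,
    // so no (String, bool) tuple is allocated on the hot path.
    fn map_for(&self, add_special_tokens: bool) -> &HashMap<String, Vec<u32>> {
        if add_special_tokens { &self.map_special } else { &self.map_plain }
    }

    fn get(&self, input: &str, add_special_tokens: bool) -> Option<&Vec<u32>> {
        self.map_for(add_special_tokens).get(input)
    }

    fn insert(&mut self, input: String, add_special_tokens: bool, ids: Vec<u32>) {
        let map = if add_special_tokens { &mut self.map_special } else { &mut self.map_plain };
        map.insert(input, ids);
    }
}
```

The same input string now lands in different maps depending on the flag, so neither call path can observe the other's cached result.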

Test plan

  • All 88 tokenizer lib tests pass (cargo test -p llm-tokenizer --lib)
  • New test test_add_special_tokens_flag_separates_entries covers L0 flag separation
  • New test test_decode_stream_multibyte_char_boundary reproduces the panic without the fix

Summary by CodeRabbit

  • Bug Fixes

    • Prevented crashes when streaming decoded text that crosses multi-byte character boundaries.
  • Performance Improvements

    • Improved token caching by separating special-token and non-special-token storage and reworking eviction for more stable hit rates.
    • More accurate memory accounting during cache insertions to improve eviction behavior.
  • Behavior

    • Improved model-name detection and clearer fallback handling when preferred tokenizers aren't available.
  • Tests

    • Added regression tests for multi-byte streaming and cache behavior.

@github-actions github-actions bot added tokenizer Tokenizer related changes dependencies Dependency updates labels Feb 19, 2026
@gemini-code-assist

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the correctness and robustness of the tokenizer library by addressing several critical issues across its caching mechanisms, streaming decoding, and factory initialization. It ensures more reliable cache behavior, prevents potential panics during text decoding, and improves the flexibility of tokenizer creation, ultimately leading to a more stable and predictable system.

Highlights

  • L0 Cache Correctness: The L0 cache now correctly distinguishes entries based on the add_special_tokens flag, preventing cache poisoning where the same input string with different flag values could return stale results. This is achieved by using two separate DashMaps.
  • L1 Cache Memory Accounting Fix: Memory accounting in the L1 cache was fixed to accurately reflect memory usage when existing entries are replaced, preventing current_memory from drifting upward and causing premature evictions.
  • DecodeStream Robustness: The DecodeStream was made more robust by preventing panics that occurred when prefix_text.len() fell mid-codepoint in a UTF-8 character. It now correctly walks backward to the nearest valid character boundary.
  • Tokenizer Factory Improvements: OpenAI model detection logic was unified into a dedicated function, and a fallback mechanism to HuggingFace was added for the blocking path if Tiktoken initialization fails for an OpenAI model.
  • Code Clean-up: The Deref<Target=Arc<dyn Tokenizer>> implementation was removed as it is considered an anti-pattern causing method resolution confusion. Additionally, unused dependencies (bytemuck, lru, parking_lot) were removed.
Changelog
  • tokenizer/Cargo.toml
    • Removed unused dependencies: bytemuck, lru, and parking_lot.
  • tokenizer/src/cache/l0.rs
    • Refactored L0 cache to use two DashMaps, one for each add_special_tokens flag value, to ensure correct caching.
    • Updated new, get, insert, insert_arc, len, is_empty, and clear methods to operate with the dual-map structure.
    • Added a new test test_add_special_tokens_flag_separates_entries to verify the separation of cache entries based on the flag.
  • tokenizer/src/cache/l1.rs
    • Corrected memory accounting logic in the insert method to properly adjust current_memory when an existing entry is replaced.
  • tokenizer/src/cache/mod.rs
    • Modified CachedTokenizer's encode method to pass the add_special_tokens flag to L0 cache get and insert calls.
  • tokenizer/src/factory.rs
    • Extracted common OpenAI model detection logic into a new is_likely_openai_model function.
    • Applied the is_likely_openai_model function in both asynchronous and blocking tokenizer creation paths.
    • Implemented a fallback to HuggingFace tokenizer creation in the blocking path if TiktokenTokenizer::from_model_name fails for an identified OpenAI model.
  • tokenizer/src/lib.rs
    • Removed the Deref implementation for the Tokenizer struct to resolve potential method resolution issues.
  • tokenizer/src/stream.rs
    • Implemented logic in DecodeStream::step to find the nearest valid UTF-8 character boundary when slicing text, preventing panics on partial multi-byte characters.
  • tokenizer/src/tests.rs
    • Added a new regression test test_decode_stream_multibyte_char_boundary to specifically test and confirm the fix for UTF-8 character boundary panics in DecodeStream.
Activity
  • All 88 existing tokenizer library tests passed, ensuring no regressions.
  • A new test, test_add_special_tokens_flag_separates_entries, was introduced to validate the L0 cache's new behavior regarding the add_special_tokens flag.
  • A new regression test, test_decode_stream_multibyte_char_boundary, was added to confirm the fix for UTF-8 character boundary panics in the DecodeStream.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: feb312e17a


Comment on lines +252 to +253
self.current_memory
    .fetch_sub(old_size - new_size, Ordering::Relaxed);


P1: Prevent memory counter underflow on concurrent replacements

The replacement path subtracts old_size - new_size with fetch_sub, but current_memory is updated outside the map insertion critical section. With concurrent inserts of the same hash, a replacing thread can run this subtraction before the original insert's fetch_add, and when new_size < old_size the u64 counter wraps to a huge value. That can make memory stats and eviction decisions permanently incorrect under concurrent mixed-size replacements.
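One way to harden the counter against this race is to saturate at zero instead of wrapping. A minimal sketch, assuming an AtomicU64 counter; release_memory is a hypothetical helper, not the PR's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical underflow-safe release: saturate at zero instead of
// wrapping when a racing fetch_add has not landed yet.
fn release_memory(current_memory: &AtomicU64, delta: u64) {
    let _ = current_memory.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
        Some(cur.saturating_sub(delta))
    });
}
```

This keeps the counter approximate under the race the review describes, but bounds the error instead of letting a wrap poison eviction decisions permanently.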



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several important correctness and robustness fixes. The L0 cache poisoning issue is resolved by keying on the add_special_tokens flag, the L1 cache memory accounting is corrected, and a panic in DecodeStream due to UTF-8 boundaries is fixed. The factory functions are also improved for consistency. Overall, these are great improvements. I've found one critical issue in the new L0 cache eviction logic that could lead to the cache size exceeding its limit, and I've provided a suggested fix.


coderabbitai bot commented Feb 19, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Refactored L0 to maintain two per-flag maps keyed by add_special_tokens and updated all L0 call sites to pass the flag; adjusted L1 memory accounting on replacements; fixed UTF‑8 slicing in streaming decode and added a regression test; extracted OpenAI-model detection helper; removed Tokenizer Deref; removed three workspace deps from Cargo.toml.

Changes

  • Dependency Management (tokenizer/Cargo.toml): Removed three workspace dependencies: bytemuck (with derive), lru, and parking_lot.
  • L0 Cache Refactoring (tokenizer/src/cache/l0.rs): Replaced the single map with map_plain and map_special; added map_for, per-map capacity, and maybe_evict eviction across maps; updated new, get, insert, insert_arc, len, is_empty, and clear; tests updated for dual-map semantics and concurrency.
  • Cache Integration (tokenizer/src/cache/mod.rs): Updated L0 call sites to include add_special_tokens in lookups and insertions (full tokenization, prefix merging, end-of-encode caching) and adjusted L1 boundary insert calls.
  • L1 Cache Memory Accounting (tokenizer/src/cache/l1.rs): insert_at_boundaries now adjusts memory accounting by the size delta when an insertion replaces an existing shard entry, instead of always adding the full size.
  • Factory Refactoring (tokenizer/src/factory.rs): Added a private is_likely_openai_model helper, replaced the inline OpenAI/GPT name checks, and added debug logging for the Tiktoken fallback in the blocking path.
  • API Simplification (tokenizer/src/lib.rs): Removed the Deref impl for Tokenizer while retaining From<Arc<dyn traits::Tokenizer>>.
  • Streaming & Tests (tokenizer/src/stream.rs, tokenizer/src/tests.rs): Adjusted DecodeStream::step to back up to the nearest UTF-8 char boundary before slicing; added test_decode_stream_multibyte_char_boundary with a mock tokenizer to prevent panics on multibyte splits.
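The boundary backoff in DecodeStream::step can be sketched as a small standalone function; safe_suffix is an illustrative name, not the actual DecodeStream API:

```rust
// Sketch of the boundary backoff: given how many bytes were already
// emitted, walk the split index back until it lands on a UTF-8 char
// boundary before slicing, so indexing can never panic mid-codepoint.
fn safe_suffix(new_text: &str, prefix_len: usize) -> &str {
    let mut split_at = prefix_len.min(new_text.len());
    while split_at > 0 && !new_text.is_char_boundary(split_at) {
        split_at -= 1; // back up toward the previous character start
    }
    &new_text[split_at..]
}
```

Because str::is_char_boundary is true at 0, at the string's length, and at every leading byte of a character, the loop terminates after at most three steps for valid UTF-8.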

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Suggested labels

tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • whybeyoung

Poem

🐰 Two little caches hop side-by-side,
Flags decide the burrows where tokens hide,
I back up bytes so characters stay whole,
Memory tallies each tiny toll,
A rabbit nods — the code's in stride!

🚥 Pre-merge checks (3 of 3 passed)
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and concisely summarizes the main changes: correctness and robustness fixes for cache and streaming in the tokenizer module.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


- L0 cache: key by (input, add_special_tokens) to prevent cross-flag
  cache poisoning. Uses two separate DashMaps to keep lookups
  zero-allocation on the hot path.
- L1 cache: fix memory accounting when insert replaces an existing
  entry — prevents current_memory from drifting upward and causing
  premature evictions.
- DecodeStream: fix panic on multi-byte UTF-8 char boundaries when
  prefix_text.len() falls mid-codepoint in new_text (byte-fallback
  tokenizers).
- Factory: extract is_likely_openai_model(), unify async/blocking
  detection logic, add tiktoken-to-HuggingFace fallback in blocking
  path.
- Remove Deref<Target=Arc<dyn Tokenizer>> anti-pattern.
- Drop unused deps: bytemuck, lru, parking_lot.

Signed-off-by: Simo Lin <simo.lin@oracle.com>
@slin1237 slin1237 force-pushed the slin/tokenizer-fixes branch from d8676b8 to b268170 on February 19, 2026 at 07:59
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is matching too
broadly on "ada"; tighten the detection to avoid false positives by requiring
OpenAI-style naming (e.g., no '/' character) or more specific substrings such as
"text-ada" or "ada-" before accepting an ada match. Update
is_likely_openai_model to replace the loose name.contains("ada") check with a
check that the name either contains "text-ada" or starts with "ada" (or has an
"ada-" token) and does not contain '/' so HuggingFace-style names like
"stabilityai/stablelm-ada-1b" won’t falsely match. Ensure you keep the other
token checks intact (gpt-4, gpt-3.5, turbo, davinci, curie, babbage, codex).
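A hedged sketch of what such a tightened check could look like; is_ada_like is a hypothetical helper, not the factory's actual code, and the exact policy (reject '/', token-delimited match) is one of the options the review suggests:

```rust
// Stricter "ada" detection sketch: reject HuggingFace-style "org/repo"
// ids outright, then require "ada" to appear as a whole '-'/'_'-
// delimited token rather than as an arbitrary substring.
fn is_ada_like(name: &str) -> bool {
    if name.contains('/') {
        return false; // HuggingFace-style repo id, not an OpenAI name
    }
    name.split(|c| c == '-' || c == '_').any(|tok| tok == "ada")
}
```

Under this sketch "text-ada-001" and "ada" match, while "stabilityai/stablelm-ada-1b" and names that merely contain "ada" inside a longer token do not.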


@coderabbitai coderabbitai bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is too permissive for
"ada" and causes false positives; tighten the "ada" check by matching only
standalone or token-delimited occurrences (e.g. start/end or separated by -, _,
/) instead of any substring. Update is_likely_openai_model to use a
word-boundary or token-aware pattern for "ada" (for example via a regex like one
that matches (^|[-_/])ada($|[-_/]) or equivalent) so names like "openai/ada" or
"text-ada-001" match but "…-ada-…" arbitrary HF names do not. Ensure other
checks (gpt-4, turbo, davinci, etc.) remain unchanged.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b268170fd8


Comment on lines +65 to +69
while !new_text.is_char_boundary(split_at) && split_at > 0 {
    split_at -= 1;
}

let new_text = new_text[split_at..].to_string();


P1: Avoid emitting stale text after backing up split boundary

Backing split_at up to the previous UTF-8 boundary can silently corrupt streamed output when a later decode rewrites bytes that were already emitted in an earlier step. In that case split_at becomes smaller than prefix_text.len(), so the code emits a suffix from an earlier boundary without retracting the already-sent bytes (for example, a stream can emit abc on one step and then 🎉 on the next, yielding abc🎉 instead of ab🎉). This affects byte-fallback tokenizers where additional context merges prior bytes into a multibyte character.


- L0: maybe_evict now evicts from the larger map instead of only the
  insertion target map, preventing total entries from exceeding
  max_entries when one map is full and the other is empty
- L1: add comment documenting the benign race between shard insert
  and memory counter update on concurrent same-key replacements
- Add test_eviction_across_maps regression test for L0

Signed-off-by: Simo Lin <simo.lin@oracle.com>
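The cross-map eviction this commit describes can be sketched as follows. This is a single-threaded illustration with the standard HashMap; the shape of the function and the victim-selection policy are assumptions, since the real L0 uses DashMap and its own internals:

```rust
use std::collections::HashMap;

// Sketch: when the combined size reaches the cap, evict from whichever
// map is currently larger, so one full map cannot push the total number
// of entries past max_entries while the other map sits empty.
fn maybe_evict(
    plain: &mut HashMap<String, Vec<u32>>,
    special: &mut HashMap<String, Vec<u32>>,
    max_entries: usize,
) {
    while plain.len() + special.len() >= max_entries {
        let victim = if plain.len() >= special.len() { &mut *plain } else { &mut *special };
        let key = match victim.keys().next().cloned() {
            Some(k) => k,
            None => break, // both maps empty; nothing left to evict
        };
        victim.remove(&key);
    }
}
```

Evicting from the insertion-target map alone, as the pre-fix code did, lets the untouched map hold its entries indefinitely, which is exactly the overflow the regression test guards against.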