
fix(tokenizer): correctness and robustness fixes for cache and streaming#474

Merged
slin1237 merged 2 commits into main from slin/tokenizer-fixes
Feb 19, 2026

Conversation


@slin1237 slin1237 commented Feb 19, 2026

Summary

  • L0 cache: fix add_special_tokens cache poisoning — L0 was keyed only on input string, ignoring the flag. Same input encoded with different flag values would return stale/wrong results. Now uses two separate DashMaps (one per flag value) to keep lookups zero-allocation on the hot path.
  • L1 cache: fix memory accounting on entry replacement — DashMap::insert replacing an existing entry was unconditionally adding size_bytes to current_memory, causing it to drift upward over time and trigger premature evictions.
  • DecodeStream: fix UTF-8 char boundary panic — new_text[prefix_text.len()..] panics when prefix_text.len() falls mid-codepoint. This happens with byte-fallback tokenizers where adding context merges partial bytes into multi-byte characters. Now walks backward to the nearest valid is_char_boundary.
  • Factory: unify OpenAI model detection + add fallback in blocking path — Extracts is_likely_openai_model() to deduplicate the async/blocking paths. The blocking path previously hard-errored on tiktoken failure instead of falling back to HuggingFace.
  • Remove Deref<Target=Arc<dyn Tokenizer>> — well-known anti-pattern causing method resolution confusion.
  • Drop unused deps — bytemuck, lru, parking_lot (leftovers from the old cache impl).
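The dual-map idea can be sketched as follows. This is a minimal single-threaded illustration using the standard HashMap; the actual cache uses DashMap for concurrent access, and the names here (L0Cache, map_for) are illustrative, not the PR's API:

```rust
use std::collections::HashMap;

/// Minimal single-threaded sketch of the dual-map L0 cache;
/// the real implementation uses DashMap and different names.
struct L0Cache {
    map_plain: HashMap<String, Vec<u32>>,   // add_special_tokens == false
    map_special: HashMap<String, Vec<u32>>, // add_special_tokens == true
}

impl L0Cache {
    fn new() -> Self {
        Self {
            map_plain: HashMap::new(),
            map_special: HashMap::new(),
        }
    }

    // Selecting the map by flag keeps the lookup key a plain &str,
    // so no (String, bool) tuple is allocated on the hot path.
    fn map_for(&self, add_special_tokens: bool) -> &HashMap<String, Vec<u32>> {
        if add_special_tokens { &self.map_special } else { &self.map_plain }
    }

    fn get(&self, input: &str, add_special_tokens: bool) -> Option<&Vec<u32>> {
        self.map_for(add_special_tokens).get(input)
    }

    fn insert(&mut self, input: String, add_special_tokens: bool, ids: Vec<u32>) {
        let map = if add_special_tokens { &mut self.map_special } else { &mut self.map_plain };
        map.insert(input, ids);
    }
}
```

The same input string now lands in different maps depending on the flag, so neither call path can observe the other's cached result.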

Test plan

  • All 88 tokenizer lib tests pass (cargo test -p llm-tokenizer --lib)
  • New test test_add_special_tokens_flag_separates_entries covers L0 flag separation
  • New test test_decode_stream_multibyte_char_boundary reproduces the panic without the fix

Summary by CodeRabbit

  • Bug Fixes

    • Prevented crashes when streaming decoded text that crosses multi-byte character boundaries.
  • Performance Improvements

    • Improved token caching by separating special-token and non-special-token storage and reworking eviction for more stable hit rates.
    • More accurate memory accounting during cache insertions to improve eviction behavior.
  • Behavior

    • Improved model-name detection and clearer fallback handling when preferred tokenizers aren't available.
  • Tests

    • Added regression tests for multi-byte streaming and cache behavior.

@github-actions github-actions bot added tokenizer Tokenizer related changes dependencies Dependency updates labels Feb 19, 2026
@gemini-code-assist

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the correctness and robustness of the tokenizer library by addressing several critical issues across its caching mechanisms, streaming decoding, and factory initialization. It ensures more reliable cache behavior, prevents potential panics during text decoding, and improves the flexibility of tokenizer creation, ultimately leading to a more stable and predictable system.

Highlights

  • L0 Cache Correctness: The L0 cache now correctly distinguishes entries based on the add_special_tokens flag, preventing cache poisoning where the same input string with different flag values could return stale results. This is achieved by using two separate DashMaps.
  • L1 Cache Memory Accounting Fix: Memory accounting in the L1 cache was fixed to accurately reflect memory usage when existing entries are replaced, preventing current_memory from drifting upward and causing premature evictions.
  • DecodeStream Robustness: The DecodeStream was made more robust by preventing panics that occurred when prefix_text.len() fell mid-codepoint in a UTF-8 character. It now correctly walks backward to the nearest valid character boundary.
  • Tokenizer Factory Improvements: OpenAI model detection logic was unified into a dedicated function, and a fallback mechanism to HuggingFace was added for the blocking path if Tiktoken initialization fails for an OpenAI model.
  • Code Clean-up: The Deref<Target=Arc<dyn Tokenizer>> implementation was removed as it is considered an anti-pattern causing method resolution confusion. Additionally, unused dependencies (bytemuck, lru, parking_lot) were removed.
Changelog
  • tokenizer/Cargo.toml
    • Removed unused dependencies: bytemuck, lru, and parking_lot.
  • tokenizer/src/cache/l0.rs
    • Refactored L0 cache to use two DashMaps, one for each add_special_tokens flag value, to ensure correct caching.
    • Updated new, get, insert, insert_arc, len, is_empty, and clear methods to operate with the dual-map structure.
    • Added a new test test_add_special_tokens_flag_separates_entries to verify the separation of cache entries based on the flag.
  • tokenizer/src/cache/l1.rs
    • Corrected memory accounting logic in the insert method to properly adjust current_memory when an existing entry is replaced.
  • tokenizer/src/cache/mod.rs
    • Modified CachedTokenizer's encode method to pass the add_special_tokens flag to L0 cache get and insert calls.
  • tokenizer/src/factory.rs
    • Extracted common OpenAI model detection logic into a new is_likely_openai_model function.
    • Applied the is_likely_openai_model function in both asynchronous and blocking tokenizer creation paths.
    • Implemented a fallback to HuggingFace tokenizer creation in the blocking path if TiktokenTokenizer::from_model_name fails for an identified OpenAI model.
  • tokenizer/src/lib.rs
    • Removed the Deref implementation for the Tokenizer struct to resolve potential method resolution issues.
  • tokenizer/src/stream.rs
    • Implemented logic in DecodeStream::step to find the nearest valid UTF-8 character boundary when slicing text, preventing panics on partial multi-byte characters.
  • tokenizer/src/tests.rs
    • Added a new regression test test_decode_stream_multibyte_char_boundary to specifically test and confirm the fix for UTF-8 character boundary panics in DecodeStream.
Activity
  • All 88 existing tokenizer library tests passed, ensuring no regressions.
  • A new test, test_add_special_tokens_flag_separates_entries, was introduced to validate the L0 cache's new behavior regarding the add_special_tokens flag.
  • A new regression test, test_decode_stream_multibyte_char_boundary, was added to confirm the fix for UTF-8 character boundary panics in the DecodeStream.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: feb312e17a


Comment on lines +252 to +253
self.current_memory
    .fetch_sub(old_size - new_size, Ordering::Relaxed);


P1: Prevent memory counter underflow on concurrent replacements

The replacement path subtracts old_size - new_size with fetch_sub, but current_memory is updated outside the map insertion critical section. With concurrent inserts of the same hash, a replacing thread can run this subtraction before the original insert's fetch_add, and when new_size < old_size the u64 counter wraps to a huge value. That can make memory stats and eviction decisions permanently incorrect under concurrent mixed-size replacements.
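One way to harden the counter against this race is to saturate at zero instead of wrapping. A minimal sketch, assuming an AtomicU64 counter; release_memory is a hypothetical helper, not the PR's actual code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical underflow-safe release: saturate at zero instead of
// wrapping when a racing fetch_add has not landed yet.
fn release_memory(current_memory: &AtomicU64, delta: u64) {
    let _ = current_memory.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
        Some(cur.saturating_sub(delta))
    });
}
```

This keeps the counter approximate under the race the review describes, but bounds the error instead of letting a wrap poison eviction decisions permanently.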



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several important correctness and robustness fixes. The L0 cache poisoning issue is resolved by keying on the add_special_tokens flag, the L1 cache memory accounting is corrected, and a panic in DecodeStream due to UTF-8 boundaries is fixed. The factory functions are also improved for consistency. Overall, these are great improvements. I've found one critical issue in the new L0 cache eviction logic that could lead to the cache size exceeding its limit, and I've provided a suggested fix.


coderabbitai bot commented Feb 19, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Refactored L0 to maintain two per-flag maps keyed by add_special_tokens and updated all L0 call sites to pass the flag; adjusted L1 memory accounting on replacements; fixed UTF‑8 slicing in streaming decode and added a regression test; extracted OpenAI-model detection helper; removed Tokenizer Deref; removed three workspace deps from Cargo.toml.

Changes

  • Dependency Management (tokenizer/Cargo.toml): Removed three workspace dependencies: bytemuck (with derive), lru, and parking_lot.
  • L0 Cache Refactoring (tokenizer/src/cache/l0.rs): Replaced the single map with map_plain and map_special; added map_for, per-map capacity, and maybe_evict eviction across maps; updated new, get, insert, insert_arc, len, is_empty, and clear; tests updated for dual-map semantics and concurrency.
  • Cache Integration (tokenizer/src/cache/mod.rs): Updated L0 call sites to include add_special_tokens in lookups and insertions (full tokenization, prefix merging, end-of-encode caching) and adjusted L1 boundary insert calls.
  • L1 Cache Memory Accounting (tokenizer/src/cache/l1.rs): insert_at_boundaries now adjusts memory accounting by the size delta when an insertion replaces an existing shard entry, instead of always adding the full size.
  • Factory Refactoring (tokenizer/src/factory.rs): Added a private is_likely_openai_model helper, replaced the inline OpenAI/GPT name checks, and added debug logging for the Tiktoken fallback in the blocking path.
  • API Simplification (tokenizer/src/lib.rs): Removed the Deref impl for Tokenizer while retaining From<Arc<dyn traits::Tokenizer>>.
  • Streaming & Tests (tokenizer/src/stream.rs, tokenizer/src/tests.rs): Adjusted DecodeStream::step to back up to the nearest UTF-8 char boundary before slicing; added test_decode_stream_multibyte_char_boundary with a mock tokenizer to prevent panics on multibyte splits.
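The boundary backoff in DecodeStream::step can be sketched as a small standalone function; safe_suffix is an illustrative name, not the actual DecodeStream API:

```rust
// Sketch of the boundary backoff: given how many bytes were already
// emitted, walk the split index back until it lands on a UTF-8 char
// boundary before slicing, so indexing can never panic mid-codepoint.
fn safe_suffix(new_text: &str, prefix_len: usize) -> &str {
    let mut split_at = prefix_len.min(new_text.len());
    while split_at > 0 && !new_text.is_char_boundary(split_at) {
        split_at -= 1; // back up toward the previous character start
    }
    &new_text[split_at..]
}
```

Because str::is_char_boundary is true at 0, at the string's length, and at every leading byte of a character, the loop terminates after at most three steps for valid UTF-8.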

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Suggested labels

tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • whybeyoung

Poem

🐰 Two little caches hop side-by-side,
Flags decide the burrows where tokens hide,
I back up bytes so characters stay whole,
Memory tallies each tiny toll,
A rabbit nods — the code's in stride!

🚥 Pre-merge checks (3 of 3 passed)
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and concisely summarizes the main changes: correctness and robustness fixes for cache and streaming in the tokenizer module.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


- L0 cache: key by (input, add_special_tokens) to prevent cross-flag
  cache poisoning. Uses two separate DashMaps to keep lookups
  zero-allocation on the hot path.
- L1 cache: fix memory accounting when insert replaces an existing
  entry — prevents current_memory from drifting upward and causing
  premature evictions.
- DecodeStream: fix panic on multi-byte UTF-8 char boundaries when
  prefix_text.len() falls mid-codepoint in new_text (byte-fallback
  tokenizers).
- Factory: extract is_likely_openai_model(), unify async/blocking
  detection logic, add tiktoken-to-HuggingFace fallback in blocking
  path.
- Remove Deref<Target=Arc<dyn Tokenizer>> anti-pattern.
- Drop unused deps: bytemuck, lru, parking_lot.

Signed-off-by: Simo Lin <simo.lin@oracle.com>
@slin1237 slin1237 force-pushed the slin/tokenizer-fixes branch from d8676b8 to b268170 on February 19, 2026 at 07:59
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is matching too
broadly on "ada"; tighten the detection to avoid false positives by requiring
OpenAI-style naming (e.g., no '/' character) or more specific substrings such as
"text-ada" or "ada-" before accepting an ada match. Update
is_likely_openai_model to replace the loose name.contains("ada") check with a
check that the name either contains "text-ada" or starts with "ada" (or has an
"ada-" token) and does not contain '/' so HuggingFace-style names like
"stabilityai/stablelm-ada-1b" won’t falsely match. Ensure you keep the other
token checks intact (gpt-4, gpt-3.5, turbo, davinci, curie, babbage, codex).
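A hedged sketch of what such a tightened check could look like; is_ada_like is a hypothetical helper, not the factory's actual code, and the exact policy (reject '/', token-delimited match) is one of the options the review suggests:

```rust
// Stricter "ada" detection sketch: reject HuggingFace-style "org/repo"
// ids outright, then require "ada" to appear as a whole '-'/'_'-
// delimited token rather than as an arbitrary substring.
fn is_ada_like(name: &str) -> bool {
    if name.contains('/') {
        return false; // HuggingFace-style repo id, not an OpenAI name
    }
    name.split(|c| c == '-' || c == '_').any(|tok| tok == "ada")
}
```

Under this sketch "text-ada-001" and "ada" match, while "stabilityai/stablelm-ada-1b" and names that merely contain "ada" inside a longer token do not.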


@coderabbitai coderabbitai bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tokenizer/src/factory.rs`:
- Around line 266-280: The is_likely_openai_model function is too permissive for
"ada" and causes false positives; tighten the "ada" check by matching only
standalone or token-delimited occurrences (e.g. start/end or separated by -, _,
/) instead of any substring. Update is_likely_openai_model to use a
word-boundary or token-aware pattern for "ada" (for example via a regex like one
that matches (^|[-_/])ada($|[-_/]) or equivalent) so names like "openai/ada" or
"text-ada-001" match but "…-ada-…" arbitrary HF names do not. Ensure other
checks (gpt-4, turbo, davinci, etc.) remain unchanged.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b268170fd8


Comment on lines +65 to +69
while !new_text.is_char_boundary(split_at) && split_at > 0 {
    split_at -= 1;
}

let new_text = new_text[split_at..].to_string();


P1: Avoid emitting stale text after backing up split boundary

Backing split_at up to the previous UTF-8 boundary can silently corrupt streamed output when a later decode rewrites bytes that were already emitted in an earlier step. In that case split_at becomes smaller than prefix_text.len(), so the code emits a suffix from an earlier boundary without retracting the already-sent bytes (for example, a stream can emit abc on one step and then 🎉 on the next, yielding abc🎉 instead of ab🎉). This affects byte-fallback tokenizers where additional context merges prior bytes into a multibyte character.


- L0: maybe_evict now evicts from the larger map instead of only the
  insertion target map, preventing total entries from exceeding
  max_entries when one map is full and the other is empty
- L1: add comment documenting the benign race between shard insert
  and memory counter update on concurrent same-key replacements
- Add test_eviction_across_maps regression test for L0

Signed-off-by: Simo Lin <simo.lin@oracle.com>
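The cross-map eviction this commit describes can be sketched as follows. This is a single-threaded illustration with the standard HashMap; the shape of the function and the victim-selection policy are assumptions, since the real L0 uses DashMap and its own internals:

```rust
use std::collections::HashMap;

// Sketch: when the combined size reaches the cap, evict from whichever
// map is currently larger, so one full map cannot push the total number
// of entries past max_entries while the other map sits empty.
fn maybe_evict(
    plain: &mut HashMap<String, Vec<u32>>,
    special: &mut HashMap<String, Vec<u32>>,
    max_entries: usize,
) {
    while plain.len() + special.len() >= max_entries {
        let victim = if plain.len() >= special.len() { &mut *plain } else { &mut *special };
        let key = match victim.keys().next().cloned() {
            Some(k) => k,
            None => break, // both maps empty; nothing left to evict
        };
        victim.remove(&key);
    }
}
```

Evicting from the insertion-target map alone, as the pre-fix code did, lets the untouched map hold its entries indefinitely, which is exactly the overflow the regression test guards against.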