fix: ACPF profile build for duplicate-byte tokenizers (Gemma)#9
Conversation
Selecting a Gemma model aborted in-app profile generation with: Profile self-check failed: - [triePresence] token 239 reached state 2 but terminal=Optional(249732) Gemma's vocabulary contains duplicate tokens: distinct ids whose raw bytes are byte-for-byte identical. The ACPF prefix trie is keyed purely on bytes and stores one terminal_token_id per node, so duplicates collide on a single node and only one id can be its terminal. ProfileSelfCheck.checkTriePresence asserted exact identity (terminal == id) for every non-excluded token, which is unsatisfiable when two non-excluded tokens share bytes; the same wrong assumption sat in the trie-state MmapAutocompleteProfile.tokenAllowed(_:in:). Treat the trie as a byte oracle: walking a non-excluded token's bytes must reach a terminal node, and a different stored terminal is accepted only when its bytes are byte-for-byte identical. tokenAllowed(_:in:) applies the same byte-equality rule but first rejects ids the trie builder excludes (base .excluded flag), so a non-excluded duplicate cannot make an excluded token admissible. No schema or writer change; the byte-based runtime admissibility path (tokenAllowed(_:afterRequiredPrefix:)) and existing profiles are unaffected. Tests: DuplicateTokenTrieTests covers accepted duplicates, the shared terminal node, and a rejected excluded duplicate. ProfileSelfCheck and the strict TriePresenceTests remain green (TokenProfiles: 69 tests).
|
Review this pull request. |
|
Codex Review: Didn't find any major issues. Nice work! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
To use Codex here, create an environment for this repo. |
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Yes @johnbean393 this fixes #1. 😉 |
|
Of course; will probably release in the next 24 hours |
Problem
Selecting a Gemma model aborts in-app ACPF profile generation:
A failed build intentionally leaves no usable artifact (ADR-052), so the model can't be used after selection.
Root cause
Gemma's vocabulary contains duplicate tokens — distinct ids whose raw bytes are byte-for-byte identical (here
239and249732). The ACPF prefix trie (ADR-009) is keyed purely on bytes and stores a singleterminal_token_idper node, so duplicates collide on one node and only one id can be its terminal.ProfileSelfCheck.checkTriePresenceasserted exact identity (terminal == id) for every non-excluded token, which is unsatisfiable when two non-excluded tokens share bytes. The same wrong assumption sat latent in the trie-stateMmapAutocompleteProfile.tokenAllowed(_:in:).Fix
Treat the trie as a byte oracle, not a token-id map:
checkTriePresence: walking a non-excluded token's bytes must reach a terminal node; a different stored terminal is accepted only when its bytes are byte-for-byte identical (a genuine duplicate). A non-terminal node, or a terminal whose bytes differ, is still a hard failure.tokenAllowed(_:in:): the same byte-equality rule, but it first rejects ids the trie builder would exclude (base.excludedflag, matchingACPFWriter.buildAndCompactTrie) so a non-excluded duplicate can't make a deliberately excluded token admissible.No schema or writer change. The on-disk format, the byte-based runtime admissibility path (
tokenAllowed(_:afterRequiredPrefix:), used by the decoder), and the per-recordtrieTerminalfield are untouched, so existing profiles keep loading.Tests
DuplicateTokenTrieTestscovers:tokenAllowed(_:in:),swift test --package-path Packages/TokenProfiles→ 69 tests, 0 failures (the strictTriePresenceTestsstill guards the no-duplicate case).Decision recorded in
docs/05-decisions.md(ADR-059).