fix: ACPF profile build for duplicate-byte tokenizers (Gemma) #9

Merged
johnbean393 merged 1 commit into johnbean393:main from raphaelbarreiros:fix/duplicate-byte-token-self-check on Jun 1, 2026

@raphaelbarreiros
Contributor

Problem

Selecting a Gemma model aborts in-app ACPF profile generation:

Profile self-check failed:
- [triePresence] token 239 reached state 2 but terminal=Optional(249732)

A failed build intentionally leaves no usable artifact (ADR-052), so the model can't be used after selection.

Root cause

Gemma's vocabulary contains duplicate tokens — distinct ids whose raw bytes are byte-for-byte identical (here 239 and 249732). The ACPF prefix trie (ADR-009) is keyed purely on bytes and stores a single terminal_token_id per node, so duplicates collide on one node and only one id can be its terminal.
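The collision can be illustrated with a minimal sketch of a byte-keyed trie that holds one terminal id per node. `ByteTrieNode`, `ByteTrie`, and the example bytes are hypothetical stand-ins for illustration, not the ACPF types or Gemma's real token bytes; the "last insert wins" behaviour is also an assumption of this sketch.

```swift
// Hypothetical minimal byte trie; not the ACPF implementation.
final class ByteTrieNode {
    var children: [UInt8: ByteTrieNode] = [:]
    var terminalTokenID: Int? = nil   // a single slot per node
}

struct ByteTrie {
    let root = ByteTrieNode()

    // Inserts a token's bytes. A later token with identical bytes lands on
    // the same node and overwrites the terminal slot (assumption of this sketch).
    func insert(tokenID: Int, bytes: [UInt8]) {
        var node = root
        for byte in bytes {
            if let next = node.children[byte] {
                node = next
            } else {
                let next = ByteTrieNode()
                node.children[byte] = next
                node = next
            }
        }
        node.terminalTokenID = tokenID
    }

    // Walks the bytes and returns the terminal id stored at the final node, if any.
    func terminal(for bytes: [UInt8]) -> Int? {
        var node = root
        for byte in bytes {
            guard let next = node.children[byte] else { return nil }
            node = next
        }
        return node.terminalTokenID
    }
}

// Two distinct ids with identical (illustrative) bytes.
let duplicateBytes: [UInt8] = [0xE2, 0x96, 0x81]
let trie = ByteTrie()
trie.insert(tokenID: 239, bytes: duplicateBytes)
trie.insert(tokenID: 249732, bytes: duplicateBytes)

// Only one id survives as the terminal, so `terminal == 239` cannot hold.
let stored = trie.terminal(for: duplicateBytes)
```

In this sketch `stored` is 249732, which mirrors the `terminal=Optional(249732)` reported for token 239 in the error above.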

ProfileSelfCheck.checkTriePresence asserted exact identity (terminal == id) for every non-excluded token, which is unsatisfiable when two non-excluded tokens share bytes. The same wrong assumption sat latent in the trie-state check, MmapAutocompleteProfile.tokenAllowed(_:in:).

Fix

Treat the trie as a byte oracle, not a token-id map:

  • checkTriePresence: walking a non-excluded token's bytes must reach a terminal node; a different stored terminal is accepted only when its bytes are byte-for-byte identical (a genuine duplicate). A non-terminal node, or a terminal whose bytes differ, is still a hard failure.
  • tokenAllowed(_:in:): the same byte-equality rule, but it first rejects ids the trie builder would exclude (base .excluded flag, matching ACPFWriter.buildAndCompactTrie) so a non-excluded duplicate can't make a deliberately excluded token admissible.
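The two rules above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `ProfileSketch` and its members are hypothetical, the trie walk is replaced by a bytes-to-terminal dictionary, and the trie-state parameter of the real `tokenAllowed(_:in:)` is omitted.

```swift
// Hypothetical stand-in for the profile; the real ACPF types are not shown here.
struct ProfileSketch {
    var tokenBytes: [Int: [UInt8]]        // token id -> raw bytes
    var excluded: Set<Int>                // ids the trie builder would skip
    var terminalByBytes: [[UInt8]: Int]   // stands in for walking the trie to a terminal node

    // Old rule: the stored terminal must be this exact id.
    func strictPresence(_ id: Int) -> Bool {
        guard let bytes = tokenBytes[id] else { return false }
        return terminalByBytes[bytes] == id
    }

    // New rule: a terminal must exist; a different id is accepted only
    // when its bytes are byte-for-byte identical (a genuine duplicate).
    func relaxedPresence(_ id: Int) -> Bool {
        guard let bytes = tokenBytes[id],
              let terminal = terminalByBytes[bytes] else { return false }
        if terminal == id { return true }
        return tokenBytes[terminal] == bytes
    }

    // Admissibility: reject excluded ids first, then apply the byte-equality rule.
    func tokenAllowed(_ id: Int) -> Bool {
        if excluded.contains(id) { return false }
        return relaxedPresence(id)
    }
}

let shared: [UInt8] = [0x20]   // illustrative bytes, not Gemma's real token bytes
let profile = ProfileSketch(
    tokenBytes: [239: shared, 249732: shared, 7: [0x41]],
    excluded: [],
    terminalByBytes: [shared: 249732, [0x41]: 7]
)

let strictResult = profile.strictPresence(239)     // false: the old check fails
let relaxedResult = profile.relaxedPresence(239)   // true: a genuine duplicate is accepted

// Excluding one duplicate keeps it suppressed even though its twin is admissible.
var suppressed = profile
suppressed.excluded = [239]
let excludedResult = suppressed.tokenAllowed(239)      // false
let twinResult = suppressed.tokenAllowed(249732)       // true
```

Checking the exclusion flag before the byte comparison is what keeps a non-excluded duplicate from vouching for an excluded token that happens to share its bytes.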

No schema or writer change. The on-disk format, the byte-based runtime admissibility path (tokenAllowed(_:afterRequiredPrefix:), used by the decoder), and the per-record trieTerminal field are untouched, so existing profiles keep loading.

Tests

DuplicateTokenTrieTests covers:

  • the self-check accepting genuine duplicate-byte tokens,
  • both duplicates being admissible via tokenAllowed(_:in:),
  • the single shared terminal node,
  • an excluded duplicate being rejected (suppression preserved).

swift test --package-path Packages/TokenProfiles → 69 tests, 0 failures (the strict TriePresenceTests still guards the no-duplicate case).

Decision recorded in docs/05-decisions.md (ADR-059).

Selecting a Gemma model aborted in-app profile generation with:

  Profile self-check failed:
  - [triePresence] token 239 reached state 2 but terminal=Optional(249732)

Gemma's vocabulary contains duplicate tokens: distinct ids whose raw bytes
are byte-for-byte identical. The ACPF prefix trie is keyed purely on bytes
and stores one terminal_token_id per node, so duplicates collide on a single
node and only one id can be its terminal. ProfileSelfCheck.checkTriePresence
asserted exact identity (terminal == id) for every non-excluded token, which
is unsatisfiable when two non-excluded tokens share bytes; the same wrong
assumption sat in the trie-state MmapAutocompleteProfile.tokenAllowed(_:in:).

Treat the trie as a byte oracle: walking a non-excluded token's bytes must
reach a terminal node, and a different stored terminal is accepted only when
its bytes are byte-for-byte identical. tokenAllowed(_:in:) applies the same
byte-equality rule but first rejects ids the trie builder excludes (base
.excluded flag), so a non-excluded duplicate cannot make an excluded token
admissible. No schema or writer change; the byte-based runtime admissibility
path (tokenAllowed(_:afterRequiredPrefix:)) and existing profiles are
unaffected.

Tests: DuplicateTokenTrieTests covers accepted duplicates, the shared
terminal node, and a rejected excluded duplicate. ProfileSelfCheck and the
strict TriePresenceTests remain green (TokenProfiles: 69 tests).
@johnbean393
Owner

@codex

Review this pull request.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@johnbean393
Owner

@codex

Does this fix issue #1?

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@johnbean393
Owner

@codex

Does this fix issue #1?

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉

@johnbean393 merged commit fd2d480 into johnbean393:main on Jun 1, 2026
@raphaelbarreiros
Contributor Author

Yes @johnbean393 this fixes #1. 😉
Do you mind releasing this fix? Nobody's able to use Gemma 4 models without it. Thanks!

@johnbean393
Owner

@raphaelbarreiros

Of course; will probably release in the next 24 hours

2 participants