Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -327,20 +327,31 @@ public final class MmapAutocompleteProfile: AutocompleteProfile {
return TrieState(nodeIndex: node)
}

/// True iff token `id`'s bytes can be walked from `state.nodeIndex` and the walk
/// ends at a node whose `terminal_token_id == id`. The "token is a prefix of
/// required" case is handled by the protocol's `tokenAllowed(_:afterRequiredPrefix:)`.
/// True iff token `id`'s bytes can be walked from `state.nodeIndex` and the walk ends
/// at a terminal node for those bytes. Usually that node's `terminal_token_id == id`;
/// for duplicate-byte tokenizers (e.g. Gemma) the stored terminal may be a different
/// id whose bytes are identical, which still validly completes `id`'s bytes — so we
/// compare bytes, not ids. Excluded ids are never trie members and are rejected up
/// front, so a non-excluded duplicate can't smuggle one back in. The "token is a prefix
/// of required" case is handled by the protocol's `tokenAllowed(_:afterRequiredPrefix:)`.
public func tokenAllowed(_ id: TokenID, in state: TrieState) -> Bool {
guard let raw = rawRecord(for: id) else { return false }
let tokenBytes = bytes(forRaw: raw)
guard !tokenBytes.isEmpty else { return false }
// The trie holds only non-excluded tokens (ACPFWriter.buildAndCompactTrie gates on
// the base `.excluded` flag), so an excluded id is never a member: reject it before
// the byte-equality fallback can match it against a non-excluded duplicate's bytes.
guard !TokenProfileFlags(rawValue: raw.flags).contains(.excluded) else { return false }
var node = state.nodeIndex
for b in tokenBytes {
guard let next = followEdge(from: node, byte: b) else { return false }
node = next
}
let terminal = trieNode(at: node).terminalTokenID
return terminal == Int32(id)
guard terminal >= 0 else { return false }
if terminal == Int32(id) { return true }
// Duplicate token: accept when the stored terminal's bytes match `id`'s bytes.
return withRawBytes(for: TokenID(terminal)) { $0.elementsEqual(tokenBytes) } ?? false
}

/// Advance the trie state by all bytes of token `id`. Returns `nil` if any byte
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -99,10 +99,14 @@ public enum ProfileSelfCheck {
}

public static func checkTriePresence(profile: MmapAutocompleteProfile) throws {
// For every non-excluded token, walking its bytes from the trie root must reach
// a node whose `terminal_token_id == id`. This is the writer-correctness
// invariant the brief calls out ("Every non-excluded token appears in the
// prefix trie").
// For every non-excluded token, walking its bytes from the trie root must reach a
// terminal node. Normally that node's terminal id == this token. But some
// tokenizers (e.g. Gemma) contain *duplicate* tokens — distinct ids whose raw
// bytes are byte-for-byte identical. A byte-keyed trie cannot tell them apart:
// they share one node, so only one id can be that node's terminal. That is
// correct (the trie is a byte oracle and both ids decode to the same text), so a
// different terminal is accepted only when its bytes are identical. A non-terminal
// node, or a terminal whose bytes differ, is a genuine writer bug.
for id in 0..<profile.vocabularySize {
let tokenID = TokenID(id)
// `.code` mode is the least restrictive baseline (no newline ban, no emoji
Expand All @@ -113,10 +117,14 @@ public enum ProfileSelfCheck {
guard let state = profile.prefixStart(requiredBytes: bytes) else {
throw Failure(check: "triePresence", detail: "token \(id) bytes not walkable from root")
}
let terminal = profile.terminalTokenID(at: state)
if terminal != tokenID {
guard let terminal = profile.terminalTokenID(at: state) else {
throw Failure(check: "triePresence",
detail: "token \(id) reached state \(state.nodeIndex) but terminal=\(terminal as Any)")
detail: "token \(id) reached non-terminal state \(state.nodeIndex)")
}
// Accept a different terminal only for a true duplicate (identical bytes).
if terminal != tokenID && profile.bytes(for: terminal) != bytes {
throw Failure(check: "triePresence",
detail: "token \(id) reached state \(state.nodeIndex) but terminal=\(terminal) has different bytes")
}
}
}
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
import AutocompleteCore
import XCTest
@testable import TokenProfiles

/// Regression tests for **duplicate-byte tokenizers** (e.g. Gemma): two distinct token
/// ids whose raw bytes are byte-for-byte identical. A byte-keyed prefix trie cannot give
/// each id its own terminal node — they collide on one node that can only store a single
/// terminal id. The post-build self-check and the trie-state admissibility query must
/// therefore reason about *bytes*, not token-id identity.
///
/// Before the fix, generating a Gemma profile aborted in `BuildProfile.run` with
/// `[triePresence] token 239 reached state 2 but terminal=Optional(249732)`, so the model
/// could not be used after selection.
final class DuplicateTokenTrieTests: XCTestCase {

private static let duplicateBytes = Array("ab".utf8)

/// Builds a minimal profile from `(id, bytes, flags)` specs. Tokens whose `flags`
/// contain `.excluded` are kept out of the prefix trie by the writer.
private func makeProfile(
_ specs: [(TokenID, [UInt8], TokenProfileFlags)]
) throws -> MmapAutocompleteProfile {
let entries = specs.map { id, bytes, flags in
ACPFTokenEntry(
tokenID: id,
bytes: bytes,
flags: flags,
staticBias: 0,
displayWidth: bytes.count,
tokenType: 0
)
}
let digest = ACPFTokenizerDigest.digest(vocabSize: specs.count) { id in
specs[Int(id)].1
}
let input = ACPFProfileInput(
modelFamily: "dup-v\(specs.count)",
vocabSize: specs.count,
tokenizerDigest: digest,
entries: entries,
ggufMetadataDigest: "dup-gguf",
builderHost: "dup-host",
buildTimestamp: Date(timeIntervalSince1970: 1_716_000_000)
)
return try MmapAutocompleteProfile(data: ACPFWriter.encode(input))
}

/// id 0 = "a", id 1 = "b", id 2 = "ab", id 3 = "ab" (a byte-for-byte duplicate of id
/// 2). All non-excluded, so all four are inserted into the trie; ids 2 and 3 collide
/// on the single "ab" node.
private func makeDuplicateProfile() throws -> MmapAutocompleteProfile {
try makeProfile([
(0, Array("a".utf8), TokenProfileFlags()),
(1, Array("b".utf8), TokenProfileFlags()),
(2, Self.duplicateBytes, TokenProfileFlags()),
(3, Self.duplicateBytes, TokenProfileFlags()),
])
}

func testSelfCheckAcceptsDuplicateByteTokens() throws {
let profile = try makeDuplicateProfile()
let report = ProfileSelfCheck.runAll(on: profile)
XCTAssertTrue(
report.isSuccess,
"self-check should accept duplicate-byte tokens; failures: \(report.failures)"
)
XCTAssertFalse(
report.failures.contains { $0.check == "triePresence" },
"triePresence must not flag a genuine duplicate-byte token"
)
}

func testDuplicateBytesResolveToOneTerminalNode() throws {
let profile = try makeDuplicateProfile()
guard let state = profile.prefixStart(requiredBytes: Self.duplicateBytes) else {
return XCTFail("duplicate bytes not walkable from root")
}
// The byte-keyed trie stores exactly one terminal id for the shared "ab" node; it
// is one of the two duplicates (which one depends only on writer insertion order,
// an implementation detail we deliberately don't pin).
let terminal = profile.terminalTokenID(at: state)
XCTAssertTrue(
terminal == 2 || terminal == 3,
"terminal at the shared 'ab' node should be a duplicate id, got \(terminal as Any)"
)
}

func testTokenAllowedAcceptsBothDuplicates() throws {
let profile = try makeDuplicateProfile()
let root = TrieState(nodeIndex: 0)
// Both ids walk to the shared terminal node, so both must be admissible even
// though only one is stored as that node's terminal id.
XCTAssertTrue(profile.tokenAllowed(2, in: root), "duplicate id 2 must be allowed")
XCTAssertTrue(profile.tokenAllowed(3, in: root), "duplicate id 3 must be allowed")
}

func testExcludedDuplicateIsNotAdmittedByTrieState() throws {
// id 2 "ab" is allowed; id 3 "ab" is excluded. The writer inserts only id 2, so
// the shared "ab" node's terminal is 2. Querying the excluded id 3 must NOT be
// admitted just because its bytes match a non-excluded duplicate — excluded tokens
// are never trie members, and suppression must hold.
let profile = try makeProfile([
(0, Array("a".utf8), TokenProfileFlags()),
(1, Array("b".utf8), TokenProfileFlags()),
(2, Self.duplicateBytes, TokenProfileFlags()),
(3, Self.duplicateBytes, [.excluded]),
])
let root = TrieState(nodeIndex: 0)
XCTAssertTrue(profile.tokenAllowed(2, in: root), "allowed duplicate id 2 must be admitted")
XCTAssertFalse(profile.tokenAllowed(3, in: root), "excluded duplicate id 3 must NOT be admitted")
// The self-check skips excluded tokens and still validates the allowed one.
let report = ProfileSelfCheck.runAll(on: profile)
XCTAssertTrue(report.isSuccess, "self-check failures: \(report.failures)")
}
}
35 changes: 35 additions & 0 deletions docs/05-decisions.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ row here.**
| 055 | Drop a word break the model emits after a healed stem | generation |
| 056 | Mid-word quality: accurate OCR, dead-end + charset guards | generation |
| 057 | Mid-line FIM quality: truncate-at-overlap, suffix rerank, windowing | generation |
| 059 | Trie self-check tolerates duplicate-byte tokens (Gemma) | token-profiles |

---

Expand Down Expand Up @@ -2379,3 +2380,37 @@ text. Both are now closed:
completion/AX pipeline is untouched — this is purely presentation. Trade-off: SwiftUI `onDisappear`
semantics for `Window` scenes are the close signal; if a future macOS regression stops firing it on a
red-button close, the revert would need an `NSWindow` close observer instead.

## ADR-059 — Prefix-trie self-check tolerates duplicate-byte tokens (Gemma)

- Date: 2026-06-01
- Status: accepted
- Context: Selecting a Gemma model (`gemma-v262144`, ADR-035) aborted in-app profile
generation with `Profile self-check failed: - [triePresence] token 239 reached state 2
but terminal=Optional(249732)`, and ADR-052 then deleted the half-built artifact, so the
model was unusable. Root cause: Gemma's vocabulary contains *duplicate* tokens — distinct
ids whose raw bytes are byte-for-byte identical (here 239 and 249732). The ACPF prefix
trie (ADR-009) is keyed purely on bytes and stores a single `terminal_token_id` per node,
so duplicates collide on one node and only the last writer's id survives as the terminal.
`ProfileSelfCheck.checkTriePresence` asserted exact identity (`terminal == id`) for every
non-excluded token, which is unsatisfiable when two non-excluded tokens share bytes. The
same wrong assumption sat in the trie-state `MmapAutocompleteProfile.tokenAllowed(_:in:)`
(`terminal == id`), a latent admissibility bug for any duplicate-byte tokenizer.
- Decision: treat the trie as a *byte oracle*, not a token-id map. Walking a non-excluded
token's bytes from the root must reach a **terminal** node; a different stored terminal id
is accepted **only** when its bytes are byte-for-byte identical (a genuine duplicate). A
non-terminal node, or a terminal whose bytes differ, is still a hard failure. The same
byte-equality rule replaces the id-identity check in `tokenAllowed(_:in:)`, which also
rejects ids the trie builder would exclude (base `.excluded` flag) so a non-excluded
duplicate cannot make an excluded token admissible. No schema or
writer change: the on-disk format, the byte-based runtime admissibility path
(`tokenAllowed(_:afterRequiredPrefix:)`, used by the decoder) and the per-record
`trieTerminal` jump field are untouched, so existing profiles keep loading.
- Consequences: Gemma (and any duplicate-byte tokenizer) builds and validates again; the
self-check still catches real corruption (missing path / non-terminal / bytes mismatch).
The trie deliberately cannot tell duplicates apart — which id wins a shared node is an
insertion-order detail and intentionally not asserted. `terminalTokenID(at:)` stays a raw
accessor (returns whatever id is stored). Tests: `DuplicateTokenTrieTests` builds profiles
with identical-byte ids and asserts the self-check passes, both non-excluded duplicates are
admissible, and an excluded duplicate is rejected; the strict `TriePresenceTests` still
guards the no-duplicate fixture.