fix(normalize): strip placeholder-escape PUA sentinels before detection#43

Merged
lBroth merged 2 commits into main from fix/template-escape-pua
May 21, 2026
Conversation

@lBroth
Owner

@lBroth lBroth commented May 21, 2026

Summary

  • GLiNER was tagging the bare PUA pair (U+E000 U+E001) used to escape user-authored {{ as private_person. Vault substitution then replaced the leading sentinel with {{PII_PRIVATE_PERSON_*}}, corrupting templates like {{short-kebab-case-slug}} into {{PII_..._}}short-kebab-case-slug}}.
  • Root cause: any-ascii returns "" for PUA chars; transcodeChar falls through to preserve them verbatim, so the tokenizer sees them as UNK tokens that the model labels as a person.
  • Fix: add the sentinels to ZERO_WIDTH_CHARS so normalizeForDetection strips them before detection. Model never sees the sentinels; offset remap covers the original positions, so vault substitution skips them.
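A minimal sketch of the strip-and-remap idea described above. The names `ZERO_WIDTH_CHARS` and `normalizeForDetection` come from this PR, but their bodies here are assumptions, as are the `NormalizedText` shape, the `remapSpan` helper, and the placement of the sentinels relative to the braces in the sample string. It is meant to illustrate the mechanism, not to reproduce the repository's implementation.

```typescript
// Hypothetical sketch: the names mirror the PR, the bodies are assumptions.
const ZERO_WIDTH_CHARS: ReadonlySet<string> = new Set([
  "\u200B", // zero width space
  "\u200C", // zero width non-joiner
  "\u200D", // zero width joiner
  "\uFEFF", // zero width no-break space (BOM)
  "\uE000", // placeholder-escape sentinel, first code point
  "\uE001", // placeholder-escape sentinel, second code point
]);

interface NormalizedText {
  text: string;
  // offsets[i] is the index in the original string of normalized char i.
  offsets: number[];
}

function normalizeForDetection(original: string): NormalizedText {
  let text = "";
  const offsets: number[] = [];
  for (let i = 0; i < original.length; i++) {
    const ch = original.charAt(i);
    if (ZERO_WIDTH_CHARS.has(ch)) continue; // stripped: the detector never sees it
    text += ch;
    offsets.push(i);
  }
  return { text, offsets };
}

// Map a non-empty [start, end) span in normalized text back to the original.
function remapSpan(n: NormalizedText, start: number, end: number): [number, number] {
  const first = n.offsets[start];
  const last = n.offsets[end - 1];
  if (first === undefined || last === undefined || end <= start) {
    throw new RangeError("span outside normalized text");
  }
  return [first, last + 1];
}

// Assumed input layout: sentinel pair, then a template, then real PII.
const original = "\uE000\uE001{{short-kebab-case-slug}} from bob@example.com";
const normalized = normalizeForDetection(original);

// Stand-in for a detector hit on the email in the normalized text.
const email = "bob@example.com";
const hitStart = normalized.text.indexOf(email);
const [origStart, origEnd] = remapSpan(normalized, hitStart, hitStart + email.length);
```

Because the sentinels never reach the detector, no detected span can start on them, and remapped spans land on the real PII in the original string rather than on the escape characters.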

Test plan

  • npm test — 271 passing (new: PUA strip + remap in normalize.test.ts, round-trip integration on {{template}} + email in nullpii.test.ts)
  • npm run typecheck — clean
  • npm run lint — clean
  • npm run build — clean
  • Repro on gateway with real Anthropic Messages API call carrying {{short-kebab-case-slug}} in system prompt — verify template survives sanitize round-trip

Out of scope

  • Model still false-positives on template content itself (e.g. short-kebab-case-slug tagged as person → produces {{{{PII_..._}}}} nested-braces syntax). Separate harder problem — needs template-aware skip ranges or grammar-aware detector pass.
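For illustration only, one way "template-aware skip ranges" could look: collect the `{{...}}` ranges and drop any detected span that overlaps one. Nothing here is part of this PR; the `Span` shape, `templateRanges`, and `dropSpansInsideTemplates` are hypothetical names, and the regex assumes templates contain no nested braces.

```typescript
// Hypothetical sketch of a skip-range filter; not code from this repository.
interface Span {
  start: number; // inclusive
  end: number;   // exclusive
  label: string;
}

// Collect [start, end) ranges of {{...}} templates (no nested braces assumed).
function templateRanges(text: string): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  const re = /\{\{[^{}]*\}\}/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    ranges.push([m.index, m.index + m[0].length]);
  }
  return ranges;
}

// Keep only detections that do not overlap any template range.
function dropSpansInsideTemplates(text: string, spans: Span[]): Span[] {
  const skip = templateRanges(text);
  return spans.filter((s) => !skip.some(([a, b]) => s.start < b && s.end > a));
}

const sample = "{{short-kebab-case-slug}} by Alice";
const detected: Span[] = [
  { start: 2, end: 23, label: "private_person" },  // false positive inside the template
  { start: 29, end: 34, label: "private_person" }, // "Alice"
];
const kept = dropSpansInsideTemplates(sample, detected);
```

A filter like this would discard real PII that a user places inside a template, which is one reason the problem is harder than the sentinel fix.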

🤖 Generated with Claude Code

lBroth and others added 2 commits May 21, 2026 08:31
GLiNER tags the bare PUA pair (U+E000 U+E001) used to escape user-authored
`{{` as `private_person`. Vault substitution then replaces the leading
sentinel with `{{PII_PRIVATE_PERSON_*}}`, corrupting templates like
`{{short-kebab-case-slug}}` into `{{PII_..._}}short-kebab-case-slug}}`.

Root cause: any-ascii returns "" for PUA chars; transcodeChar falls
through to preserve them verbatim in the normalized text, so the
tokenizer sees them as UNK tokens that the model labels as person.

Fix: add the sentinels to ZERO_WIDTH_CHARS so normalizeForDetection
strips them before detection. Model never sees the sentinels; offset
remap covers the original positions, so vault substitution skips them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same runtime bytes, but raw PUA characters in source render as random emoji on
GitHub mobile (iOS substitutes PUA with Apple's private emoji font) and
are invisible / unsearchable in most editors. Escape syntax is grep-friendly
and self-documents the codepoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
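A small sketch of why the escape form is behavior-neutral: a `\uXXXX` escape in a string literal produces the same code units as the raw character would. The variable names are illustrative, and `String.fromCharCode` stands in here for a raw PUA literal, which would not display reliably.

```typescript
// Escaped literal: visible, greppable, and names the code points.
const viaEscape = "\uE000\uE001";

// Same two code units built from their numeric values, standing in for raw PUA characters.
const viaCharCodes = String.fromCharCode(0xe000, 0xe001);

const sameString = viaEscape === viaCharCodes;
const firstUnit = viaEscape.charCodeAt(0);
const secondUnit = viaEscape.charCodeAt(1);
```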
@lBroth lBroth merged commit 425f450 into main May 21, 2026
7 checks passed
@lBroth lBroth deleted the fix/template-escape-pua branch May 21, 2026 18:09