fix(normalize): strip placeholder-escape PUA sentinels before detection #43
Merged
GLiNER tags the bare PUA pair (U+E000 U+E001) used to escape user-authored
`{{` as `private_person`. Vault substitution then replaces the leading
sentinel with `{{PII_PRIVATE_PERSON_*}}`, corrupting templates like
`{{short-kebab-case-slug}}` into `{{PII_..._}}short-kebab-case-slug}}`.
Root cause: any-ascii returns "" for PUA chars; transcodeChar falls
through to preserve them verbatim in the normalized text, so the
tokenizer sees them as UNK tokens that the model labels as person.
Fix: add the sentinels to ZERO_WIDTH_CHARS so normalizeForDetection
strips them before detection. Model never sees the sentinels; offset
remap covers the original positions, so vault substitution skips them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
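
The fix described in this commit can be sketched roughly as below. This is a minimal illustration, not the project's code: only `ZERO_WIDTH_CHARS` and `normalizeForDetection` are named in the PR, while the set contents other than the two sentinels, the `Normalized` shape, and the `toOriginalSpan` helper are assumptions made for the example.

```typescript
// Hypothetical sketch: the real ZERO_WIDTH_CHARS and normalizeForDetection
// in the repository may be shaped differently.
const ZERO_WIDTH_CHARS: ReadonlySet<string> = new Set([
  "\u200B", // zero width space
  "\u200C", // zero width non-joiner
  "\u200D", // zero width joiner
  "\uFEFF", // zero width no-break space
  "\uE000", // placeholder-escape sentinel, first code point
  "\uE001", // placeholder-escape sentinel, second code point
]);

interface Normalized {
  text: string;
  // offsets[i] is the index in the original string of normalized char i.
  offsets: number[];
}

function normalizeForDetection(input: string): Normalized {
  let text = "";
  const offsets: number[] = [];
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ZERO_WIDTH_CHARS.has(ch)) continue; // stripped: the model never sees it
    text += ch;
    offsets.push(i);
  }
  return { text, offsets };
}

// Map a detected span [start, end) in normalized text back to the original.
function toOriginalSpan(n: Normalized, start: number, end: number): [number, number] {
  const first = n.offsets[start];
  const last = n.offsets[end - 1];
  if (first === undefined || last === undefined || end <= start) {
    throw new RangeError("span is outside the normalized text");
  }
  return [first, last + 1];
}
```

Because stripped characters never appear in the normalized text, a span mapped back this way cannot start or end on a sentinel, which is why substitution leaves the escape pair alone.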
Same runtime bytes, but raw PUA in source renders as random emojis on GitHub mobile (iOS substitutes PUA with Apple's private emoji font) and is invisible / unsearchable in most editors. Escape syntax is grep-friendly and self-documents the codepoint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
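
To illustrate the escape-syntax point, here is a small sketch. The constant and function names are hypothetical, since the PR text does not show the actual declarations; the only facts taken from the PR are the two code points U+E000 and U+E001.

```typescript
// Hypothetical names; only the code points come from the PR description.
// Written as escapes, the sentinels are visible and greppable in source,
// while the resulting string holds the same two code units as raw characters.
const ESCAPE_FIRST = "\uE000";
const ESCAPE_SECOND = "\uE001";
const ESCAPED_OPEN_BRACES = ESCAPE_FIRST + ESCAPE_SECOND;

// List the code points of a string in U+XXXX notation, e.g. for debug logs.
function describeCodePoints(s: string): string[] {
  return Array.from(s, (c) => {
    const hex = (c.codePointAt(0) ?? 0).toString(16).toUpperCase();
    return "U+" + hex.padStart(4, "0");
  });
}
```

A helper like `describeCodePoints` is one way to make the sentinels readable in test failures, where the raw characters would otherwise print as blanks or fallback glyphs.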
Summary
- GLiNER tagged the bare PUA pair (`U+E000 U+E001`) used to escape user-authored `{{` as `private_person`. Vault substitution then replaced the leading sentinel with `{{PII_PRIVATE_PERSON_*}}`, corrupting templates like `{{short-kebab-case-slug}}` into `{{PII_..._}}short-kebab-case-slug}}`.
- Root cause: `any-ascii` returns `""` for PUA chars; `transcodeChar` falls through to preserve them verbatim, so the tokenizer sees them as UNK tokens that the model labels as a person.
- Fix: add the sentinels to `ZERO_WIDTH_CHARS` so `normalizeForDetection` strips them before detection. The model never sees the sentinels; offset remap covers the original positions, so vault substitution skips them.

Test plan
- `npm test`: 271 passing (new: PUA strip + remap in `normalize.test.ts`, round-trip integration on `{{template}} + email` in `nullpii.test.ts`)
- `npm run typecheck`: clean
- `npm run lint`: clean
- `npm run build`: clean
- `{{short-kebab-case-slug}}` in system prompt: verify template survives sanitize round-trip

Out of scope
- Template contents themselves being detected (`short-kebab-case-slug` tagged as person → produces `{{{{PII_..._}}}}` nested-braces syntax). Separate, harder problem: needs template-aware skip ranges or a grammar-aware detector pass.

🤖 Generated with Claude Code
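
For the out-of-scope item, one possible shape of "template-aware skip ranges" is sketched below. This is purely illustrative and not part of this PR: the `Span` type, both function names, and the assumption that templates are still literal `{{...}}` at this stage are all hypothetical.

```typescript
// Hypothetical sketch of template-aware skip ranges; not the project's code.
interface Span {
  start: number; // inclusive
  end: number;   // exclusive
  label: string;
}

// Find [start, end) ranges of every non-nested {{...}} template in the text.
function templateRanges(text: string): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  const re = /\{\{[^{}]*\}\}/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    ranges.push([m.index, m.index + m[0].length]);
  }
  return ranges;
}

// Drop any detected span that overlaps a template range.
function dropSpansInsideTemplates(text: string, spans: Span[]): Span[] {
  const ranges = templateRanges(text);
  return spans.filter(
    (s) => !ranges.some(([a, b]) => s.start < b && s.end > a),
  );
}
```

Dropping on any overlap is the conservative choice here; a grammar-aware detector pass, the other option named above, would instead avoid producing such spans in the first place.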