Fix: detect PII / encoded payloads split across sibling text nodes (#203)#210
Conversation
) The inline-text-redact factory (pii-redact / secrets-redact / encoded-payload-redact) used to scan one text node at a time, so a card number rendered as `<span>4111</span> <span>1111</span> ...` slipped past every regex — the bypass tracked as Critical-3 in the #203 audit. Group text nodes by inline-formatting context during the walk (block elements, `<br>`, and shadow roots start a new group) and run collectMatches against the concatenated group value. Matches that straddle a node boundary land via a new `replaceMatchAcrossTextNodes` helper that scrubs the carrier text but leaves wrapping inline elements in place — same "scrub, don't detach framework-owned nodes" pattern the rest of the codebase already follows. Closes #203 item 3 (PII / Critical) and item 7 (encoded-payload split-across-whitespace / High) with one shared change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
1 issue found.
About Unblocked
Unblocked has been set up to automatically review your team's pull requests to identify genuine bugs and issues.
📖 Documentation — Learn more in our docs.
💬 Ask questions — Mention @unblocked to request a review or summary, or ask follow-up questions.
👍 Give feedback — React to comments with 👍 or 👎 to help us improve.
⚙️ Customize — Adjust settings in your preferences.
| for (const match of matches.toReversed()) { | ||
| const start = locateStart(layout, match.start); | ||
| const end = locateEnd(layout, match.end); | ||
| if (start.index === end.index) { | ||
| const node = layout.nodes[start.index]; | ||
| if (!node) { | ||
| continue; | ||
| } | ||
| replaceMatchesInTextNode( | ||
| node, | ||
| [{ start: start.offset, end: end.offset, label: match.label }], | ||
| ruleId, | ||
| ); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
When two or more non-overlapping matches both resolve to the same start.index === end.index (i.e., both sit entirely within the same text node in a multi-node bucket), replaceMatchesInTextNode is called once per match with a single-element array. The first call (rightmost, due to toReversed()) detaches the original text node from the DOM via parentNode.replaceChild(fragment, textNode). Every subsequent call for the same node finds textNode.parentNode === null, so the ?.replaceChild is a no-op and the match is silently dropped — leaking the sensitive data it was supposed to redact.
Example: <p><span>SSN: 123-45-6789 Card: 4111 1111 1111 1111</span><span> end</span></p> — the card match is processed first and detaches node 0; the SSN match then fails silently.
Fix: group all single-node matches by their node index and pass each group as a batch to replaceMatchesInTextNode, which already handles multiple matches correctly in a single call. Something like:
const singleNodeMatches = new Map<number, InlineMatch[]>();
const crossNodeMatches: typeof matches = [];
for (const match of matches) {
const start = locateStart(layout, match.start);
const end = locateEnd(layout, match.end);
if (start.index === end.index) {
let arr = singleNodeMatches.get(start.index);
if (!arr) { arr = []; singleNodeMatches.set(start.index, arr); }
arr.push({ start: start.offset, end: end.offset, label: match.label });
} else {
crossNodeMatches.push(match);
}
}
// Apply cross-node matches last-to-first (existing logic)
for (const match of crossNodeMatches.toReversed()) { ... }
// Apply batched single-node matches (one replaceMatchesInTextNode call per node)
for (const [index, nodeMatches] of singleNodeMatches) {
const node = layout.nodes[index];
if (node) {
replaceMatchesInTextNode(node, nodeMatches, ruleId);
}
}There was a problem hiding this comment.
Fixed in 201f44c — confirmed the bug reproduces (SSN + card in one span of a two-span bucket: only the rightward match landed; the SSN leaked) and that the same root cause covered a related shape (cross-node phone + single-node card sharing a node, where the phone's lastNode truncation silently no-op'd because card had already detached it, leaving 4567 visible).
The sequential-application model is fundamentally racy here — any operation that detaches a text node breaks every later operation targeting it, and any operation that truncates a boundary node shifts every later offset in that node. Rather than try to order around it, I batched: new replaceMatchesAcrossTextNodes (plural) snapshots every node's original nodeValue up front, plans each affected node's content from that snapshot, and applies one replaceChild per node. Each text node is mutated exactly once, so no offsets ever go stale.
Regression tests cover both shapes (your example + the cross-node variant). Full suite is green (1718 tests).
— Claude Code, on behalf of @twschiller
…tion unblocked[bot] flagged that sequential per-match application in processBucket detached text nodes mid-loop, so a second single-node match in the same node silently dropped. Same root cause covers a related case where a cross-node match shares its lastNode with a single-node match — sequential application either dropped the match or shifted offsets so the wrong characters were redacted. Replace the per-match loop with a plural `replaceMatchesAcrossTextNodes` that snapshots every original `nodeValue` up front, plans each affected node's new content from that snapshot, then applies one replaceChild (or value-blank for fully-consumed interior nodes) per node. Each text node is mutated exactly once, so no offsets ever go stale. Regression tests cover both shapes: - two single-node matches sharing a text node in a multi-node bucket - cross-node match + single-node match sharing a node Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
<span>-per-digit-group cards) used to slip past the per-text-node regex.Approach
collectTextNodesWithInlineGroups(new) tags each text node with an inline-formatting-context id. Group id bumps on entering/leaving a block-level element, on<br>, and on shadow-root boundaries. The existingcollectTextNodesShadowPiercingbecomes a thin wrapper that drops the group tag — no callers of the non-group walker change.walkTextNodeGroupsChunked(new sibling ofwalkTextNodesChunked) threads the group ids through the same chunked-yield driver.defineInlineTextRedactRuleswitches to the new walker. Each chunk is bucketed by group; single-node buckets keep the existing single-node fast path; multi-node buckets concatenate values, runcollectMatchesagainst the concatenation, and route each match to eitherreplaceMatchesInTextNode(within-node) orreplaceMatchAcrossTextNodes(new).replaceMatchAcrossTextNodesscrubs the carrier — first node keeps its prefix, last node keeps its suffix, interior nodes have theirnodeValueblanked. Wrapping inline elements stay in the DOM (per the "scrub, don't detach framework-owned nodes" rule). Reveal restores the matched characters as a single text node at the placeholder's position.minLengthno longer filters at the walker level (it would drop legitimate cross-node candidates whose individual nodes sit below the floor); the group's concatenated length is checked instead, preserving the cheap early-out for prose nodes.Demo site
Checkout.tsxexercises five span-split shapes (card, SSN, phone, JWT, base64 payload) so the before/after diff visualizes the closed gap.Test plan
bun run check(extension)bun run typecheck(extension)bun run test(extension — 1716 tests, all green; 16 new cross-node tests across pii / secrets / encoded-payload / placeholder helper + 4 new property tests)bun run check(demo-site)pre-commit run --all-files[…hidden]chips, and reveal-on-click should restore the original characters.Notes / accepted trade-offs
chunkSize=100.🤖 Generated with Claude Code