fix(walker): preserve <u>/<sub>/<sup>/<mark> on storage ↔ markdown round-trip#167
Merged
Conversation
…und-trip Markdown has no native syntax for these inline tags, so the walker dropped them on storage→markdown and MarkdownIt (html: false) escaped them back to literal <...> on markdown→storage. Both directions now round-trip: - storage-walker: emit raw HTML for the four tags, mirroring <details>/<summary>. - macro-converter: stash whitelisted tags around MarkdownIt's render so they survive the html: false escape, while code fences and inline code still treat them as literal text. Other HTML (e.g. <script>) remains escaped. Closes #155
Self-review follow-up for #167: - Remove `String(markdown)` wrap in `_renderMarkdownToHtml`. The original pre-PR code passed `markdown` straight to `markdown-it.render`, which throws on non-string input. Coercing turns `undefined` / `null` / `123` into `'<p>undefined</p>'` etc. — silent corruption that is worse than the prior fail-fast behavior. - Add round-trip parity tests for `<u>` and `<sup>`; the original commit only exercised `<sub>` and `<mark>`. - Add an explicit test that the walker strips attributes on whitelisted tags (mirrors the existing `<details>` / `<summary>` precedent and is asymmetric with the markdown→storage path that preserves them).
Self-review #2 caught a data-loss regression introduced by the original PR: a 4-space-indented `<u>x</u>` ended up emitted as `<![CDATA[]]>` — the entire code body silently dropped. Root cause: `CODE_BLOCK_RE` only matched fenced (```/~~~) and inline backtick code, so the passthrough stash replaced `<u>` and `</u>` inside indented code with placeholders. MarkdownIt then rendered them as code- body text, but the post-render restore re-injected raw `<u>` into the `<pre><code>` body. htmlparser2 re-parsed that as a real tag, and convertCodeBlock — which only collects direct text children — discarded everything inside the `<u>` wrapper. Fix: detect code regions via MarkdownIt's tokenizer (`code_block` and `fence` tokens, plus a regex for `code_inline` since MarkdownIt does not expose source positions for that one). The tokenizer correctly distinguishes indented code from list-item continuations that happen to align to four spaces, which a regex-only approach can't do. - New `_findCodeRanges` returns merged char ranges of all code regions. - `_renderMarkdownToHtml` slices around those ranges and only stashes HTML in non-code prose. - Drops the now-unused `CODE_BLOCK_RE` constant; inline-code detection moved into `INLINE_CODE_RE`. - Adds two regression tests: indented code preservation, and list-item continuation NOT being treated as code.
Two regressions reported in external review of #167: 1. `<mark title="1>0">x</mark>` collapsed to `<p></p>` — the body part of `PASSTHROUGH_TAG_RE` was `[^>]*`, so the first `>` inside a quoted attribute value terminated the match prematurely. The resulting half-tag was stashed, the rest of the input went to markdown-it as plain text, and the final HTML had nothing left to render. 2. Markdown autolinks like `<u@example.com>` and `<sub:foo>` were stashed as if they were `<u>` / `<sub>` tags. The `\b` word boundary after the tag name happily matched `u` followed by `@` or `sub` followed by `:`, so linkify never got a chance to convert them and the output ended up with bogus `<u@example.com></u@example.com>` pairs in place of the expected mailto link. Fix the regex on two axes: - Replace `\b` with `(?=[\s/>])` — only HTML tag delimiters (whitespace, `/`, `>`) end the tag name, so autolink shapes are left for markdown-it. This also tightens the harmless-but-loose match of custom-element-shaped names like `<u-foo>`. - Replace `[^>]*` with `(?:"[^"]*"|'[^']*'|[^>])*` so a `>` inside a quoted attribute value doesn't close the tag. Tests cover: quoted `>` round-trip, single-quoted attribute (normalized on output), email and URI autolinks, and hyphenated custom-element names being escaped rather than passed through.
|
🎉 This PR is included in version 2.3.1 🎉 The release is available on: Your semantic-release bot 📦🚀 |
15 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Picks option 2 from #155 (selective tag whitelist) for
<u>,<sub>,<sup>,<mark>. These four inline tags now survive storage ↔ markdown ↔ storage end-to-end. Closes #155.What changed
lib/storage-walker.js— adds the four tags to the existing<details>/<summary>raw-HTML pass-through case, so storage→markdown emits e.g.H<sub>2</sub>Oinstead of dropping the wrapper toH2O.lib/macro-converter.js— wrapsmarkdownToStorage/markdownToNativeStoragewith a stash/restore that hides whitelisted tags from MarkdownIt (which is constructed withhtml: false) and re-injects them after rendering. Code fences and inline code are stashed first, so a literal<u>inside a backtick block stays escaped as text rather than being passed through.tests/macro-converter.test.js— 11 new tests covering walker output for each tag, full markdown→storage→markdown round-trip, attribute preservation, the<script>escape (proves the whitelist boundary holds), and the inline-code / fenced-code carve-outs.Why this approach over the alternatives in the issue
html: trueon MarkdownIt) — too broad. It would let arbitrary user HTML reach Confluence storage XHTML and rely on Confluence's sanitizer as the safety boundary. The four tags above are the only documented information loss; opening the whole HTML grammar to address it is disproportionate.<sub>/<sup>) and any<u>/<mark>use, including content created via the Confluence editor.STASH_DELIMpattern already insetupConfluenceMarkdownExtensions. No new dependencies.<details>/<summary>were mentioned parenthetically in the issue but deferred — they're block-level and need different placeholder shaping to avoid<p>-wrap interactions. Out of scope here.Verification
Reproduction from the issue, after the fix:
Test plan
<sub>and<mark><script>is still escaped<u>inside inline code and inside fenced code is preserved as literal textmarkdownToNativeStorageshares the same passthrough policynpx jestsuite (612 passing, no regressions)