Skip to content

fix(walker): preserve <u>/<sub>/<sup>/<mark> on storage ↔ markdown round-trip#167

Merged
pchuri merged 4 commits into
mainfrom
fix/preserve-u-sub-sup-mark-roundtrip
May 4, 2026
Merged

fix(walker): preserve <u>/<sub>/<sup>/<mark> on storage ↔ markdown round-trip#167
pchuri merged 4 commits into
mainfrom
fix/preserve-u-sub-sup-mark-roundtrip

Conversation

@pchuri
Copy link
Copy Markdown
Owner

@pchuri pchuri commented May 3, 2026

Summary

Picks option 2 from #155 (selective tag whitelist) for <u>, <sub>, <sup>, <mark>. These four inline tags now survive storage ↔ markdown ↔ storage end-to-end. Closes #155.

What changed

  • lib/storage-walker.js — adds the four tags to the existing <details>/<summary> raw-HTML pass-through case, so storage→markdown emits e.g. H<sub>2</sub>O instead of dropping the wrapper to H2O.
  • lib/macro-converter.js — wraps markdownToStorage / markdownToNativeStorage with a stash/restore that hides whitelisted tags from MarkdownIt (which is constructed with html: false) and re-injects them after rendering. Code fences and inline code are stashed first, so a literal <u> inside a backtick block stays escaped as text rather than being passed through.
  • tests/macro-converter.test.js — 11 new tests covering walker output for each tag, full markdown→storage→markdown round-trip, attribute preservation, the <script> escape (proves the whitelist boundary holds), and the inline-code / fenced-code carve-outs.

Why this approach over the alternatives in the issue

  • Option 1 (html: true on MarkdownIt) — too broad. It would let arbitrary user HTML reach Confluence storage XHTML and rely on Confluence's sanitizer as the safety boundary. The four tags above are the only documented information loss; opening the whole HTML grammar to address it is disproportionate.
  • Option 3 (status quo) — leaves real information loss for chemistry/math docs (<sub>/<sup>) and any <u>/<mark> use, including content created via the Confluence editor.
  • Option 2 (this PR) — fixes the four tags the issue identifies, keeps the safety boundary surgical, and re-uses the STASH_DELIM pattern already in setupConfluenceMarkdownExtensions. No new dependencies.

<details>/<summary> were mentioned parenthetically in the issue but deferred — they're block-level and need different placeholder shaping to avoid <p>-wrap interactions. Out of scope here.

Verification

npx jest          # 612 passing (was 601; +11 new)
npx eslint lib/macro-converter.js lib/storage-walker.js tests/macro-converter.test.js   # clean

Reproduction from the issue, after the fix:

const c = new MacroConverter({ isCloud: true });
c.storageToMarkdown('<p>H<sub>2</sub>O</p>');                  // 'H<sub>2</sub>O'
c.markdownToStorage(c.storageToMarkdown('<p>H<sub>2</sub>O</p>'));  // '<p>H<sub>2</sub>O</p>\n'
c.markdownToStorage('<script>alert(1)</script>');              // '<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>\n'
c.markdownToStorage('`<u>x</u>`');                             // '<p><code>&lt;u&gt;x&lt;/u&gt;</code></p>\n'

Test plan

  • New unit tests for each of the four tags (storage→markdown direction)
  • Round-trip tests for <sub> and <mark>
  • Whitelist boundary test: <script> is still escaped
  • Code carve-out tests: <u> inside inline code and inside fenced code is preserved as literal text
  • markdownToNativeStorage shares the same passthrough policy
  • Full npx jest suite (612 passing, no regressions)

pchuri added 4 commits May 3, 2026 18:18
…und-trip

Markdown has no native syntax for these inline tags, so the walker dropped
them on storage→markdown and MarkdownIt (html: false) escaped them back to
literal &lt;...&gt; on markdown→storage. Both directions now round-trip:

- storage-walker: emit raw HTML for the four tags, mirroring <details>/<summary>.
- macro-converter: stash whitelisted tags around MarkdownIt's render so they
  survive the html: false escape, while code fences and inline code still
  treat them as literal text. Other HTML (e.g. <script>) remains escaped.

Closes #155
Self-review follow-up for #167:

- Remove `String(markdown)` wrap in `_renderMarkdownToHtml`. The original
  pre-PR code passed `markdown` straight to `markdown-it.render`, which
  throws on non-string input. Coercing turns `undefined` / `null` / `123`
  into `'<p>undefined</p>'` etc. — silent corruption that is worse than
  the prior fail-fast behavior.
- Add round-trip parity tests for `<u>` and `<sup>`; the original commit
  only exercised `<sub>` and `<mark>`.
- Add an explicit test that the walker strips attributes on whitelisted
  tags (mirrors the existing `<details>` / `<summary>` precedent and is
  asymmetric with the markdown→storage path that preserves them).
Self-review #2 caught a data-loss regression introduced by the original
PR: a 4-space-indented `<u>x</u>` ended up emitted as `<![CDATA[]]>` —
the entire code body silently dropped.

Root cause: `CODE_BLOCK_RE` only matched fenced (```/~~~) and inline
backtick code, so the passthrough stash replaced `<u>` and `</u>` inside
indented code with placeholders. MarkdownIt then rendered them as code-
body text, but the post-render restore re-injected raw `<u>` into the
`<pre><code>` body. htmlparser2 re-parsed that as a real tag, and
convertCodeBlock — which only collects direct text children — discarded
everything inside the `<u>` wrapper.

Fix: detect code regions via MarkdownIt's tokenizer (`code_block` and
`fence` tokens, plus a regex for `code_inline` since MarkdownIt does not
expose source positions for that one). The tokenizer correctly
distinguishes indented code from list-item continuations that happen to
align to four spaces, which a regex-only approach can't do.

- New `_findCodeRanges` returns merged char ranges of all code regions.
- `_renderMarkdownToHtml` slices around those ranges and only stashes
  HTML in non-code prose.
- Drops the now-unused `CODE_BLOCK_RE` constant; inline-code detection
  moved into `INLINE_CODE_RE`.
- Adds two regression tests: indented code preservation, and list-item
  continuation NOT being treated as code.
Two regressions reported in external review of #167:

1. `<mark title="1>0">x</mark>` collapsed to `<p></p>` — the body part
   of `PASSTHROUGH_TAG_RE` was `[^>]*`, so the first `>` inside a
   quoted attribute value terminated the match prematurely. The
   resulting half-tag was stashed, the rest of the input went to
   markdown-it as plain text, and the final HTML had nothing left to
   render.

2. Markdown autolinks like `<u@example.com>` and `<sub:foo>` were
   stashed as if they were `<u>` / `<sub>` tags. The `\b` word boundary
   after the tag name happily matched `u` followed by `@` or `sub`
   followed by `:`, so linkify never got a chance to convert them and
   the output ended up with bogus `<u@example.com></u@example.com>`
   pairs in place of the expected mailto link.

Fix the regex on two axes:

- Replace `\b` with `(?=[\s/>])` — only HTML tag delimiters (whitespace,
  `/`, `>`) end the tag name, so autolink shapes are left for
  markdown-it. This also tightens the harmless-but-loose match of
  custom-element-shaped names like `<u-foo>`.
- Replace `[^>]*` with `(?:"[^"]*"|'[^']*'|[^>])*` so a `>` inside a
  quoted attribute value doesn't close the tag.

Tests cover: quoted `>` round-trip, single-quoted attribute (normalized
on output), email and URI autolinks, and hyphenated custom-element
names being escaped rather than passed through.
@pchuri pchuri merged commit a6857db into main May 4, 2026
6 checks passed
github-actions Bot pushed a commit that referenced this pull request May 4, 2026
## [2.3.1](v2.3.0...v2.3.1) (2026-05-04)

### Bug Fixes

* **walker:** preserve <u>/<sub>/<sup>/<mark> on storage ↔ markdown round-trip ([#167](#167)) ([a6857db](a6857db)), closes [#155](#155) [#2](#2)
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

🎉 This PR is included in version 2.3.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decide policy for <u>/<sub>/<sup>/<mark> on storage → markdown export

1 participant