refactor(converter): replace regex-based markdown→storage with htmlparser2 walker #153
Merged
…rser2 walker

Markdown→storage conversion was driven by a 173-line regex pipeline over the rendered HTML. Nested blockquotes made the lazy `<blockquote>(.*?)</blockquote>` matcher attach to an inner closing tag, code-block CDATA literals containing `<blockquote>` confused the same matcher, and recent fix commits in this area (the admonition / blockquote / marker patches) were workarounds. Mirrors the parallel storage-walker refactor for the opposite direction.

Introduce `lib/html-to-storage.js`: parses with htmlparser2 (`decodeEntities: false` for byte-identical entity round-trip) and walks the DOM dispatching by tag. Each markdown-it block / inline element plus the existing macro markers (TOC / ANCHOR / EXPAND / INFO|WARNING|NOTE callouts) and linkStyle branches (smart / plain / wiki) has a dedicated case. Recursion is capped at 256 levels via a typed `HtmlDepthExceededError`.

`htmlToConfluenceStorage` shrinks 173 → 2 lines, delegating to the walker. `markdownToStorage` and `markdownToNativeStorage` route through it, so all three methods share the backend.

- The markdown-it `[!info]` shorthand preprocessor stays in place: the markdown-source line-break info it needs for the single-line `[!info] body` form is not recoverable at tree level.
- 548 tests pass; lint clean. Coverage adds 5 nested-blockquote regression tests, 4 integration smoke tests, and a 56-test walker suite with depth-guard guarantees.
38fc12c to
237a023
Compare
Owner
Thanks for putting this together — LGTM 🚀 As the mirror of #137, this applies the same htmlparser2-based walker approach consistently to the storage direction. Beyond the refactor, it also fixes some real V1 bugs along the way (nested-blockquote unbalance, malformed XML on multi-attribute links, etc.). What I particularly liked:
A few small follow-ups (non-blocking):
Happy to handle these in a follow-up PR or separate issue. Great work, appreciate the careful approach!
🎉 This PR is included in version 2.1.9 🎉 The release is available on: Your semantic-release bot 📦🚀
Contributor
Author
Thanks for the careful review! Items 2 & 3: Verified the actual behavior on Confluence and added Behavioral Notes accordingly.
pchuri pushed a commit that referenced this pull request May 5, 2026
#170) Addresses follow-up 1 from #153: <ul>/<ol>/<li> were the only block-level elements in lib/html-to-storage.js dropping their attributes. Threads renderAttrs(node.attribs) through the same way the <table> family already does. <li> uses the open-variable hoist pattern from <th>/<td> so wrap and unwrap branches share it. Adds 4 regression tests covering wrap path, unwrap path, <ul>/<ol> attributes, and <ol start>.
github-actions Bot pushed a commit that referenced this pull request May 6, 2026
# [2.5.0](v2.4.0...v2.5.0) (2026-05-06)

### Bug Fixes

* **deps:** bump axios to ~1.15.2 to address security advisories ([#174](#174)) ([0a1492b](0a1492b)), closes [GHSA-w9j2-pv#6h63](https://github.com/GHSA-w9j2-pv/issues/6h63) [#173](#173)
* **walker:** preserve attributes on `<ul>`/`<ol>`/`<li>` in markdown→storage ([#170](#170)) ([b5c172a](b5c172a)), closes [#153](#153)

### Features

* add page version listing and purge commands ([#171](#171)) ([2bd5c37](2bd5c37))
Background
Companion PR to #137 in the opposite direction. This work was paused as the refactor is fairly large in scope and I wanted to align on the architectural direction. After #137 landed and confirmed the choice (htmlparser2 + DOM walker), this PR resumes that effort, applying the same approach to the reverse path.
Description
The mirror of #137 in the opposite direction. Replaces the 173-line regex pipeline in `htmlToConfluenceStorage` (and through it, `markdownToStorage` / `markdownToNativeStorage`) with an htmlparser2 DOM walker.
The previous regex pipeline had the same structural fragility #137 documented for `storageToMarkdown`:

- `<blockquote>(.*?)</blockquote>` could not track depth across nested macros, so `> > **INFO**` lost its outer closing tag.
- A literal `<blockquote>` confused the blockquote regex from inside code blocks.
- The `[!info]` admonition rewrite had to anchor itself defensively because it ran on raw markdown text (fix: scope `[!info]` admonition rewrite to block context #134).
- … paragraph serialization shape.
Recent fix commits in this area (#132, #134, #136) were all workarounds
for that structural fragility.
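The nested-blockquote failure can be reproduced in isolation. The snippet below is a minimal sketch, not the removed pipeline: `LAZY_BLOCKQUOTE` is a simplified stand-in for the old pattern, and the input string is an invented example.

```javascript
// Simplified stand-in for the old regex step (assumed shape, for illustration only).
const LAZY_BLOCKQUOTE = /<blockquote>(.*?)<\/blockquote>/s;

// Invented input: one blockquote nested inside another.
const nested =
  '<blockquote><blockquote><p>inner</p></blockquote><p>outer</p></blockquote>';

// The lazy quantifier stops at the FIRST closing tag, which belongs to the inner quote.
const match = nested.match(LAZY_BLOCKQUOTE);
const captured = match[1];
const leftover = nested.slice(match.index + match[0].length);

console.log(captured); // <blockquote><p>inner</p>
console.log(leftover); // <p>outer</p></blockquote>
```

The capture holds an unclosed inner `<blockquote>`, and the outer closing tag is stranded in the leftover text. A regex has no counter for nesting depth, which is why a tree walker avoids this class of bug.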
Type of Change
passes over the same string)
What changed
- `lib/html-to-storage.js` (new): … `<p>`-wrap quirk on tight items), code blocks (CDATA + language + `]]>` split) and inline code, links (smart/plain/wiki + anchor `#id` branch), tables, blockquote with INFO/WARNING/NOTE marker detection, TOC/ANCHOR/EXPAND paragraph markers, and void tags. Parser uses `decodeEntities: false` so text and attribute entities round-trip byte-identical. Recursion is capped at 256 levels via a typed `HtmlDepthExceededError`.
- `lib/macro-converter.js`: `htmlToConfluenceStorage` shrinks 173 → 2 lines, delegating to `htmlToStorage`. `markdownToStorage` and `markdownToNativeStorage` route through it, so all three methods share the walker backend.
- `tests/html-to-storage.test.js` (new)
- `tests/macro-converter.test.js`: … `> > **INFO**` unbalance bug plus three-level nesting and sibling blockquotes), +4 integration smoke tests (CDATA-adjacent-marker, public `htmlToConfluenceStorage` path, task-list literal-text behavior, table-cell `<p>` wrap).

Testing
mirroring refactor(converter): replace regex-based storageToMarkdown with htmlparser2 walker #137's testing approach.
markdown-it's serialization shape, including the depth guard.
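For readers unfamiliar with the depth guard, here is a rough sketch of how a depth-capped recursive walker could look. Only the error name and the 256 limit come from this PR; the node shape imitates htmlparser2's DOM (`type`, `name`, `children`, `data`) using plain objects, and `walk`, `nest`, the handled tag list, and the exact boundary check are assumptions for illustration.

```javascript
// Hypothetical sketch: not the code in lib/html-to-storage.js.
const MAX_DEPTH = 256;

class HtmlDepthExceededError extends Error {
  constructor(depth) {
    super(`HTML nesting exceeds ${MAX_DEPTH} levels (reached ${depth})`);
    this.name = 'HtmlDepthExceededError';
    this.depth = depth;
  }
}

// Serializes a node tree, dispatching by tag name and refusing to recurse too deep.
function walk(node, depth = 0) {
  if (depth > MAX_DEPTH) throw new HtmlDepthExceededError(depth);
  if (node.type === 'text') return node.data;
  const inner = (node.children || []).map((c) => walk(c, depth + 1)).join('');
  switch (node.name) {
    case 'p':
    case 'em':
    case 'strong':
    case 'blockquote':
      return `<${node.name}>${inner}</${node.name}>`;
    default:
      // Assumed fallback for this sketch: keep the children, drop the wrapper.
      return inner;
  }
}

// Test helper: wraps a text node in `levels` nested blockquotes.
function nest(levels) {
  let node = { type: 'text', data: 'x' };
  for (let i = 0; i < levels; i++) {
    node = { type: 'tag', name: 'blockquote', children: [node] };
  }
  return node;
}

console.log(walk(nest(2))); // <blockquote><blockquote>x</blockquote></blockquote>
```

A typed error lets callers distinguish hostile or malformed input from ordinary conversion failures with an `instanceof` check instead of parsing a message string.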
Local validation during development
Beyond the committed test suite, walker development used a 19-case golden-fixture A/B diff harness (16 markdown + 3 HTML, covering all handler categories: headings, blockquote markers and nesting, admonitions, CDATA-with-HTML, TOC / ANCHOR / EXPAND, tables, link styles ×3, task lists, deeply nested lists, mixed content, inline edges) and reached byte parity, with 0 of 19 cases differing between the V1 entry point and a direct walker call.
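The harness itself is not part of this PR, so the following is only a sketch of the A/B idea under stated assumptions: `diffFixtures`, the fixture shape, and the two toy converters are invented names, and string equality stands in for a byte-level comparison.

```javascript
// Hypothetical A/B harness: runs two converters over the same fixtures
// and collects every case where their outputs differ.
function diffFixtures(fixtures, legacy, walker) {
  const mismatches = [];
  for (const { name, input } of fixtures) {
    const a = legacy(input);
    const b = walker(input);
    if (a !== b) mismatches.push({ name, legacy: a, walker: b });
  }
  return mismatches;
}

// Toy stand-ins for the V1 entry point and the direct walker call.
const toyLegacy = (s) => s.trim();
const toyWalker = (s) => s.trim();

const toyFixtures = [
  { name: 'heading', input: '<h1>Title</h1>\n' },
  { name: 'paragraph', input: '<p>text</p>\n' },
];

const result = diffFixtures(toyFixtures, toyLegacy, toyWalker);
console.log(`${result.length}/${toyFixtures.length} cases differ`);
```

Returning the mismatching pairs rather than a boolean makes it easy to inspect exactly where two pipelines diverge.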
The fixture infrastructure was removed from this PR following #137's "existing tests act as regression net" approach; the four new integration smoke tests in `tests/macro-converter.test.js` cover the corners not transitively covered by existing tests.
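One of those corners involves CDATA handling in code blocks. The `]]>` split listed under What changed is a standard XML technique, sketched below; `toCdata` is a hypothetical helper name, and the walker's actual implementation may differ.

```javascript
// Hypothetical helper: wraps text in a CDATA section.
// A literal `]]>` would end the section early, so the section is closed
// after `]]` and a new one is opened before `>`.
function toCdata(text) {
  return `<![CDATA[${text.split(']]>').join(']]]]><![CDATA[>')}]]>`;
}

console.log(toCdata('plain code'));
// <![CDATA[plain code]]>
console.log(toCdata('a]]>b'));
// <![CDATA[a]]]]><![CDATA[>b]]>
```

An XML parser reads the second output as two adjacent sections, `a]]` and `>b`, which concatenate back to the original `a]]>b`.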
What is intentionally not changed
The markdown-it preprocessor that rewrites the `[!info]` shorthand into the canonical `> **INFO**` blockquote form (added in #134) stays in place. It operates on markdown source where line-break information is needed to support both the `[!info]\nbody` newline form and the single-line `[!info] body` form. Migrating it to the tree level would lose the single-line form's body delimiter and silently break backwards compatibility.
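To make the source-level argument concrete, here is a loose sketch of a line-based rewrite that handles both forms. It is not the preprocessor from #134: the regex, the marker set, the function name `rewriteShorthand`, and the exact output layout are all assumptions for illustration.

```javascript
// Hypothetical pattern: marker at line start, optional body on the same line.
const SHORTHAND = /^\[!(info|warning|note)\](?:[ \t]+(.*))?$/i;

function rewriteShorthand(markdown) {
  const out = [];
  let inCallout = false;
  for (const line of markdown.split('\n')) {
    const m = SHORTHAND.exec(line);
    if (m) {
      out.push(`> **${m[1].toUpperCase()}**`);
      if (m[2]) out.push(`> ${m[2]}`); // single-line form: body follows the marker
      inCallout = true;
    } else if (inCallout && line.trim() !== '') {
      out.push(`> ${line}`); // newline form: body continues on the next lines
    } else {
      inCallout = false; // a blank line ends the callout
      out.push(line);
    }
  }
  return out.join('\n');
}

console.log(rewriteShorthand('[!info] body'));
console.log(rewriteShorthand('[!info]\nbody'));
```

Both calls produce the same two-line blockquote in this sketch. The rewrite depends on seeing where each source line starts and ends, which is the information the section above says is needed at the markdown-source level.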
Behavioral Notes
For raw-HTML input via `--input-format html`, the walker's `<p>`-wrap behavior on `<li>` / `<th>` / `<td>` differs from the previous regex pipeline in a few edge cases. The walker decides via inline-tag membership + newline absence rather than the previous "single-line regex match" rule:

- `<li><p>existing</p></li>` (loose-list canonical form, common in Confluence storage round-trips) no longer gets double-wrapped to `<li><p><p>existing</p></p></li>`. The previous regex captured the existing `<p>` and re-wrapped it; the walker recognizes `<p>` as a block child and skips wrap.
- `<li>` containing only HTML5 phrasing elements not covered by the walker's `INLINE_TAGS` set (e.g. `<meter>`, `<output>`) won't get the `<p>` wrap the regex would have applied. markdown-it doesn't emit these so the markdown→storage path is unaffected; only hand-authored raw HTML triggers it.
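The wrap rule described above can be approximated as follows. `INLINE_TAGS` is named in this PR, but its members here are guesses, and `shouldWrapInP` plus the plain-object node shape are hypothetical.

```javascript
// Assumed subset of inline tags; the real set in the walker may differ.
const INLINE_TAGS = new Set(['a', 'b', 'code', 'em', 'i', 'span', 'strong']);

// Wrap only when every child is newline-free text or a known inline tag.
function shouldWrapInP(children) {
  return children.every((child) => {
    if (child.type === 'text') return !child.data.includes('\n');
    return child.type === 'tag' && INLINE_TAGS.has(child.name);
  });
}

const text = (data) => ({ type: 'text', data });
const tag = (name) => ({ type: 'tag', name, children: [] });

console.log(shouldWrapInP([text('item '), tag('strong')])); // true
console.log(shouldWrapInP([tag('p')]));                     // false: block child, no double wrap
console.log(shouldWrapInP([tag('meter')]));                 // false: not in INLINE_TAGS
console.log(shouldWrapInP([text('multi\nline')]));          // false: newline present
```

Under this rule an existing `<p>` child is treated as block content, so the double-wrap case cannot occur, while elements missing from the set fall through unwrapped.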
HTML comments (`<!-- ... -->`) in raw HTML input are dropped during conversion rather than preserved in the output payload. This diverges from the previous regex pipeline (which left them untouched), but Confluence's storage layer strips HTML comments server-side regardless, so the rendered page is unchanged.
Extra `<a>` attributes (`title`, `class`, etc.) on raw-HTML input are dropped in `wiki` linkStyle output (only `href` is preserved via `<ri:url ri:value>`). The previous regex pipeline matched only the exact `<a href="...">` shape and passed any decorated `<a>` through unchanged. Affects only Server/DC users explicitly choosing `wiki`; Cloud's storage format does not render the wiki form at all (shown as "unsupported macro" in the editor), so Cloud defaults to `smart`.

The dominant markdown-it path is byte-identical to the previous regex pipeline. The CLI `--input-format html` and `--format html` paths continue to work, now backed by the same walker. The `[!info]` shorthand path is unchanged from the user's perspective.
Checklist