refactor(converter): replace regex-based markdown→storage with htmlparser2 walker by devlimits · Pull Request #153 · pchuri/confluence-cli

devlimits · 2026-04-30T04:07:08Z

Background

Companion PR to #137 in the opposite direction. This work was paused as the refactor is fairly large in scope and I wanted to align on the architectural direction. After #137 landed and confirmed the choice (htmlparser2 + DOM walker), this PR resumes that effort, applying the same approach to the reverse path.

Description

The mirror of #137 in the opposite direction. Replaces the 173-line
regex pipeline in htmlToConfluenceStorage (and through it,
markdownToStorage / markdownToNativeStorage) with an htmlparser2
DOM walker.

The previous regex pipeline had the same structural fragility #137
documented for storageToMarkdown:

Nested <blockquote>(.*?)</blockquote> could not track depth across
nested macros, so > > **INFO** lost its outer closing tag.
CDATA literals containing <blockquote> confused the blockquote
regex from inside code blocks.
The [!info] admonition rewrite had to anchor itself defensively
because it ran on raw markdown text (fix: scope [!info] admonition rewrite to block context #134).
Marker detection was tightly coupled to markdown-it's exact
paragraph serialization shape.

Recent fix commits in this area (#132, #134, #136) were all workarounds
for that structural fragility.

Type of Change

Code refactoring
Performance improvement (single parse pass replaces 10+ regex
passes over the same string)

What changed

File	Change
`lib/html-to-storage.js` (new)	DOM walker with handlers for headings, paragraphs, inline formatting, lists (with `<p>`-wrap quirk on tight items), code blocks (CDATA + language + `]]>` split) and inline code, links (smart/plain/wiki + anchor `#id` branch), tables, blockquote with INFO/WARNING/NOTE marker detection, TOC/ANCHOR/EXPAND paragraph markers, and void tags. Parser uses `decodeEntities: false` so text and attribute entities round-trip byte-identical. Recursion is capped at 256 levels via a typed `HtmlDepthExceededError`.
`lib/macro-converter.js`	`htmlToConfluenceStorage` shrinks 173 → 2 lines, delegating to `htmlToStorage`. `markdownToStorage` and `markdownToNativeStorage` route through it, so all three methods share the walker backend.
`tests/html-to-storage.test.js` (new)	55 unit tests covering each handler in isolation, including depth-guard guarantees (1000-deep throws, 50-deep succeeds).
`tests/macro-converter.test.js`	+5 nested-blockquote regression tests (the original `> > INFO` unbalance bug plus three-level nesting and sibling blockquotes), +4 integration smoke tests (CDATA-adjacent-marker, public `htmlToConfluenceStorage` path, task-list literal-text behavior, table-cell `<p>` wrap).

Testing

548/548 tests pass locally; lint clean.
Existing macro-converter tests act as the public-API regression net,
mirroring refactor(converter): replace regex-based storageToMarkdown with htmlparser2 walker #137's testing approach.
New walker unit tests cover handler-level behavior independently of
markdown-it's serialization shape, including the depth guard.

Test Suites: 13 passed, 13 total
Tests:       548 passed, 548 total

Local validation during development

Beyond the committed test suite, walker development used a 19-case
golden fixture A/B diff harness (16 markdown + 3 HTML, covering all
handler categories — headings, blockquote markers and nesting,
admonitions, CDATA-with-HTML, TOC / ANCHOR / EXPAND, tables, link
styles ×3, task lists, deeply nested lists, mixed content, inline
edges) and reached byte parity 0/19 between the V1 entry point and a
direct walker call.

The fixture infrastructure was removed from this PR following #137's
"existing tests act as regression net" approach; the four new
integration smoke tests in tests/macro-converter.test.js cover the
corners not transitively covered by existing tests.

What is intentionally not changed

The markdown-it preprocessor that rewrites the [!info] shorthand into
the canonical > **INFO** blockquote form (added in #134) stays in
place. It operates on markdown source where line-break information is
needed to support both the [!info]\nbody newline form and the
single-line [!info] body form. Migrating it to the tree level would
lose the single-line form's body delimiter and silently break
backwards compatibility.

Behavioral Notes

For raw-HTML input via --input-format html, the walker's -wrap
behavior on <li> / <th> / <td> differs from the previous regex
pipeline in a few edge cases. The walker decides via inline-tag
membership + newline absence rather than the previous "single-line
regex match" rule:

Improvement (most cases): A single-line <li>existing</li>
(loose-list canonical form, common in Confluence storage round-trips)
no longer gets double-wrapped to <li>existing</li>.
The previous regex captured the existing  and re-wrapped it; the
walker recognizes  as a block child and skips wrap.
Silent divergence (rare): An <li> containing only HTML5 phrasing
elements not covered by the walker's INLINE_TAGS set (e.g.
<meter>, <output>) won't get the  wrap the regex would have
applied. markdown-it doesn't emit these so the markdown→storage path
is unaffected; only hand-authored raw HTML triggers it.

HTML comments () in raw HTML input are dropped during
conversion rather than preserved in the output payload. This diverges
from the previous regex pipeline (which left them untouched), but
Confluence's storage layer strips HTML comments server-side regardless,
so the rendered page is unchanged.

Extra <a> attributes (title, class, etc.) on raw-HTML input are
dropped in wiki linkStyle output (only href is preserved via
<ri:url ri:value>). The previous regex pipeline matched only the
exact <a href="..."> shape and passed any decorated <a> through
unchanged. Affects only Server/DC users explicitly choosing wiki —
Cloud's storage format does not render the wiki form at all (shown as
"unsupported macro" in the editor), so Cloud defaults to smart.

The dominant markdown-it path is byte-identical to the previous regex
pipeline. The CLI --input-format html and --format html paths
continue to work, now backed by the same walker. The [!info]
shorthand path is unchanged from the user's perspective.

Checklist

My code follows the style guidelines of this project (lint clean)
I have performed a self-review of my own code
My changes generate no new warnings

…rser2 walker Markdown→storage conversion was driven by a 173-line regex pipeline over the rendered HTML. Nested blockquotes made the lazy `<blockquote>(.*?)</blockquote>` matcher attach to an inner closing tag, code-block CDATA literals containing `<blockquote>` confused the same matcher, and recent fix commits in this area (the admonition / blockquote / marker patches) were workarounds. Mirrors the parallel storage-walker refactor for the opposite direction. Introduce `lib/html-to-storage.js`: parses with htmlparser2 (`decodeEntities: false` for byte-identical entity round-trip) and walks the DOM dispatching by tag. Each markdown-it block / inline element plus the existing macro markers (TOC / ANCHOR / EXPAND / INFO|WARNING|NOTE callouts) and linkStyle branches (smart / plain / wiki) has a dedicated case. Recursion is capped at 256 levels via a typed `HtmlDepthExceededError`. `htmlToConfluenceStorage` shrinks 173 → 2 lines, delegating to the walker. `markdownToStorage` and `markdownToNativeStorage` route through it, so all three methods share the backend. - The markdown-it `[!info]` shorthand preprocessor stays in place — the markdown-source line-break info it needs for the single-line `[!info] body` form is not recoverable at tree level. - 548 tests pass; lint clean. Coverage adds 5 nested-blockquote regression tests, 4 integration smoke tests, and a 56-test walker suite with depth-guard guarantees.

pchuri · 2026-04-30T11:40:52Z

Thanks for putting this together — LGTM 🚀

As the mirror of #137, this applies the same htmlparser2-based walker approach consistently to the storage direction. Beyond the refactor, it also fixes some real V1 bugs along the way (nested-blockquote unbalance, malformed XML on multi-attribute links, etc.).

What I particularly liked:

55 isolated handler-level unit tests + 9 integration regression tests, with all 548 tests green
HtmlDepthExceededError guards against stack overflow on pathological input
The trust boundary in escapeAttrValue is clearly documented in comments
Intentional V1-parity decisions (e.g. non-normalized  /<img>, EXPAND non-greedy pairing) have their reasoning written down inline, which makes future spelunking much easier

A few small follow-ups (non-blocking):

<ul> / <ol> are missing renderAttrs, so attributes like class are dropped. For consistency with the <table> family, a one-line addition would be worth considering.
HTML comments are silently dropped in raw-HTML input. If intentional, a line in the PR's "Behavioral Notes" section would be helpful.
On wiki linkStyle conversion, extra <a> attributes (title, etc.) are lost. Confluence wouldn't accept them anyway, but explicit documentation would still be nice.

Happy to handle these in a follow-up PR or separate issue. Great work — appreciate the careful approach!

github-actions · 2026-05-01T02:16:29Z

🎉 This PR is included in version 2.1.9 🎉

The release is available on:

Your semantic-release bot 📦🚀

devlimits · 2026-05-04T04:39:32Z

Thanks for the careful review!

Items 2 & 3: Verified the actual behavior on Confluence and added Behavioral Notes accordingly.
Item 1: Fixing in a follow-up PR.

#170) Addresses follow-up 1 from #153: <ul>/<ol>/<li> were the only block-level elements in lib/html-to-storage.js dropping their attributes. Threads renderAttrs(node.attribs) through the same way the <table> family already does. <li> uses the open-variable hoist pattern from <th>/<td> so wrap and unwrap branches share it. Adds 4 regression tests covering wrap path, unwrap path, <ul>/<ol> attributes, and <ol start>.

# [2.5.0](v2.4.0...v2.5.0) (2026-05-06) ### Bug Fixes * **deps:** bump axios to ~1.15.2 to address security advisories ([#174](#174)) ([0a1492b](0a1492b)), closes [GHSA-w9j2-pv#6h63](https://github.com/GHSA-w9j2-pv/issues/6h63) [#173](#173) * **walker:** preserve attributes on <ul>/<ol>/<li> in markdown→storage ([#170](#170)) ([b5c172a](b5c172a)), closes [#153](#153) ### Features * add page version listing and purge commands ([#171](#171)) ([2bd5c37](2bd5c37))

devlimits force-pushed the refactor/tree-based-storage branch from 38fc12c to 237a023 Compare April 30, 2026 04:17

pchuri self-requested a review April 30, 2026 11:36

This comment was marked as resolved.

Sign in to view

pchuri merged commit 73b0524 into pchuri:main Apr 30, 2026
6 checks passed

github-actions Bot added the released label May 1, 2026

devlimits mentioned this pull request May 4, 2026

fix(walker): preserve attributes on <ul>/<ol>/<li> in markdown→storage #170

Merged

7 tasks

pchuri mentioned this pull request May 6, 2026

Add markdown-it plugin for Confluence storage format conversion #172

Closed

15 tasks

pchuri mentioned this pull request May 7, 2026

Add details/summary and HTML block support to accepted markdown. #177

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(converter): replace regex-based markdown→storage with htmlparser2 walker#153

refactor(converter): replace regex-based markdown→storage with htmlparser2 walker#153
pchuri merged 1 commit into
pchuri:mainfrom
devlimits:refactor/tree-based-storage

devlimits commented Apr 30, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

pchuri commented Apr 30, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 1, 2026

Uh oh!

devlimits commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

devlimits commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Description

Type of Change

What changed

Testing

Local validation during development

What is intentionally not changed

Behavioral Notes

Checklist

Uh oh!

This comment was marked as resolved.

Uh oh!

pchuri commented Apr 30, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 1, 2026

Uh oh!

devlimits commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

devlimits commented Apr 30, 2026 •

edited

Loading