Skip to content

refactor(converter): replace regex-based markdown→storage with htmlparser2 walker#153

Merged
pchuri merged 1 commit into
pchuri:mainfrom
devlimits:refactor/tree-based-storage
Apr 30, 2026
Merged

refactor(converter): replace regex-based markdown→storage with htmlparser2 walker#153
pchuri merged 1 commit into
pchuri:mainfrom
devlimits:refactor/tree-based-storage

Conversation

@devlimits
Copy link
Copy Markdown
Contributor

@devlimits devlimits commented Apr 30, 2026

Background

Companion PR to #137 in the opposite direction. This work was paused as the refactor is fairly large in scope and I wanted to align on the architectural direction. After #137 landed and confirmed the choice (htmlparser2 + DOM walker), this PR resumes that effort, applying the same approach to the reverse path.

Description

The mirror of #137 in the opposite direction. Replaces the 173-line
regex pipeline in htmlToConfluenceStorage (and through it,
markdownToStorage / markdownToNativeStorage) with an htmlparser2
DOM walker.

The previous regex pipeline had the same structural fragility #137
documented for storageToMarkdown:

  • Nested <blockquote>(.*?)</blockquote> could not track depth across
    nested macros, so > > **INFO** lost its outer closing tag.
  • CDATA literals containing <blockquote> confused the blockquote
    regex from inside code blocks.
  • The [!info] admonition rewrite had to anchor itself defensively
    because it ran on raw markdown text (fix: scope [!info] admonition rewrite to block context #134).
  • Marker detection was tightly coupled to markdown-it's exact
    paragraph serialization shape.

Recent fix commits in this area (#132, #134, #136) were all workarounds
for that structural fragility.

Type of Change

  • Code refactoring
  • Performance improvement (single parse pass replaces 10+ regex
    passes over the same string)

What changed

File Change
lib/html-to-storage.js (new) DOM walker with handlers for headings, paragraphs, inline formatting, lists (with <p>-wrap quirk on tight items), code blocks (CDATA + language + ]]> split) and inline code, links (smart/plain/wiki + anchor #id branch), tables, blockquote with INFO/WARNING/NOTE marker detection, TOC/ANCHOR/EXPAND paragraph markers, and void tags. Parser uses decodeEntities: false so text and attribute entities round-trip byte-identical. Recursion is capped at 256 levels via a typed HtmlDepthExceededError.
lib/macro-converter.js htmlToConfluenceStorage shrinks 173 → 2 lines, delegating to htmlToStorage. markdownToStorage and markdownToNativeStorage route through it, so all three methods share the walker backend.
tests/html-to-storage.test.js (new) 55 unit tests covering each handler in isolation, including depth-guard guarantees (1000-deep throws, 50-deep succeeds).
tests/macro-converter.test.js +5 nested-blockquote regression tests (the original > > **INFO** unbalance bug plus three-level nesting and sibling blockquotes), +4 integration smoke tests (CDATA-adjacent-marker, public htmlToConfluenceStorage path, task-list literal-text behavior, table-cell <p> wrap).

Testing

Test Suites: 13 passed, 13 total
Tests:       548 passed, 548 total

Local validation during development

Beyond the committed test suite, walker development used a 19-case
golden fixture A/B diff harness (16 markdown + 3 HTML, covering all
handler categories — headings, blockquote markers and nesting,
admonitions, CDATA-with-HTML, TOC / ANCHOR / EXPAND, tables, link
styles ×3, task lists, deeply nested lists, mixed content, inline
edges) and reached byte parity 0/19 between the V1 entry point and a
direct walker call.

The fixture infrastructure was removed from this PR following #137's
"existing tests act as regression net" approach; the four new
integration smoke tests in tests/macro-converter.test.js cover the
corners not transitively covered by existing tests.

What is intentionally not changed

The markdown-it preprocessor that rewrites the [!info] shorthand into
the canonical > **INFO** blockquote form (added in #134) stays in
place. It operates on markdown source where line-break information is
needed to support both the [!info]\nbody newline form and the
single-line [!info] body form. Migrating it to the tree level would
lose the single-line form's body delimiter and silently break
backwards compatibility.

Behavioral Notes

For raw-HTML input via --input-format html, the walker's <p>-wrap
behavior on <li> / <th> / <td> differs from the previous regex
pipeline in a few edge cases. The walker decides via inline-tag
membership + newline absence rather than the previous "single-line
regex match" rule:

  • Improvement (most cases): A single-line <li><p>existing</p></li>
    (loose-list canonical form, common in Confluence storage round-trips)
    no longer gets double-wrapped to <li><p><p>existing</p></p></li>.
    The previous regex captured the existing <p> and re-wrapped it; the
    walker recognizes <p> as a block child and skips wrap.
  • Silent divergence (rare): An <li> containing only HTML5 phrasing
    elements not covered by the walker's INLINE_TAGS set (e.g.
    <meter>, <output>) won't get the <p> wrap the regex would have
    applied. markdown-it doesn't emit these so the markdown→storage path
    is unaffected; only hand-authored raw HTML triggers it.

HTML comments (<!-- ... -->) in raw HTML input are dropped during
conversion rather than preserved in the output payload. This diverges
from the previous regex pipeline (which left them untouched), but
Confluence's storage layer strips HTML comments server-side regardless,
so the rendered page is unchanged.

Extra <a> attributes (title, class, etc.) on raw-HTML input are
dropped in wiki linkStyle output (only href is preserved via
<ri:url ri:value>). The previous regex pipeline matched only the
exact <a href="..."> shape and passed any decorated <a> through
unchanged. Affects only Server/DC users explicitly choosing wiki
Cloud's storage format does not render the wiki form at all (shown as
"unsupported macro" in the editor), so Cloud defaults to smart.

The dominant markdown-it path is byte-identical to the previous regex
pipeline. The CLI --input-format html and --format html paths
continue to work, now backed by the same walker. The [!info]
shorthand path is unchanged from the user's perspective.

Checklist

  • My code follows the style guidelines of this project (lint clean)
  • I have performed a self-review of my own code
  • My changes generate no new warnings

…rser2 walker

Markdown→storage conversion was driven by a 173-line regex pipeline
over the rendered HTML. Nested blockquotes made the lazy
`<blockquote>(.*?)</blockquote>` matcher attach to an inner closing
tag, code-block CDATA literals containing `<blockquote>` confused the
same matcher, and recent fix commits in this area (the admonition /
blockquote / marker patches) were workarounds. Mirrors the parallel
storage-walker refactor for the opposite direction.

Introduce `lib/html-to-storage.js`: parses with htmlparser2
(`decodeEntities: false` for byte-identical entity round-trip) and
walks the DOM dispatching by tag. Each markdown-it block / inline
element plus the existing macro markers (TOC / ANCHOR / EXPAND /
INFO|WARNING|NOTE callouts) and linkStyle branches (smart / plain /
wiki) has a dedicated case. Recursion is capped at 256 levels via a
typed `HtmlDepthExceededError`.

`htmlToConfluenceStorage` shrinks 173 → 2 lines, delegating to the
walker. `markdownToStorage` and `markdownToNativeStorage` route
through it, so all three methods share the backend.

- The markdown-it `[!info]` shorthand preprocessor stays in place — the
  markdown-source line-break info it needs for the single-line
  `[!info] body` form is not recoverable at tree level.
- 548 tests pass; lint clean. Coverage adds 5 nested-blockquote
  regression tests, 4 integration smoke tests, and a 56-test walker
  suite with depth-guard guarantees.
@devlimits devlimits force-pushed the refactor/tree-based-storage branch from 38fc12c to 237a023 Compare April 30, 2026 04:17
@pchuri pchuri self-requested a review April 30, 2026 11:36
pchuri

This comment was marked as resolved.

@pchuri
Copy link
Copy Markdown
Owner

pchuri commented Apr 30, 2026

Thanks for putting this together — LGTM 🚀

As the mirror of #137, this applies the same htmlparser2-based walker approach consistently to the storage direction. Beyond the refactor, it also fixes some real V1 bugs along the way (nested-blockquote unbalance, malformed XML on multi-attribute links, etc.).

What I particularly liked:

  • 55 isolated handler-level unit tests + 9 integration regression tests, with all 548 tests green
  • HtmlDepthExceededError guards against stack overflow on pathological input
  • The trust boundary in escapeAttrValue is clearly documented in comments
  • Intentional V1-parity decisions (e.g. non-normalized <br>/<img>, EXPAND non-greedy pairing) have their reasoning written down inline, which makes future spelunking much easier

A few small follow-ups (non-blocking):

  1. <ul> / <ol> are missing renderAttrs, so attributes like class are dropped. For consistency with the <table> family, a one-line addition would be worth considering.
  2. HTML comments are silently dropped in raw-HTML input. If intentional, a line in the PR's "Behavioral Notes" section would be helpful.
  3. On wiki linkStyle conversion, extra <a> attributes (title, etc.) are lost. Confluence wouldn't accept them anyway, but explicit documentation would still be nice.

Happy to handle these in a follow-up PR or separate issue. Great work — appreciate the careful approach!

@pchuri pchuri merged commit 73b0524 into pchuri:main Apr 30, 2026
6 checks passed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

🎉 This PR is included in version 2.1.9 🎉

The release is available on:

Your semantic-release bot 📦🚀

@devlimits
Copy link
Copy Markdown
Contributor Author

Thanks for the careful review!

Items 2 & 3: Verified the actual behavior on Confluence and added Behavioral Notes accordingly.
Item 1: Fixing in a follow-up PR.

pchuri pushed a commit that referenced this pull request May 5, 2026
#170)

Addresses follow-up 1 from #153: <ul>/<ol>/<li> were the only block-level elements in lib/html-to-storage.js dropping their attributes. Threads renderAttrs(node.attribs) through the same way the <table> family already does. <li> uses the open-variable hoist pattern from <th>/<td> so wrap and unwrap branches share it.

Adds 4 regression tests covering wrap path, unwrap path, <ul>/<ol> attributes, and <ol start>.
github-actions Bot pushed a commit that referenced this pull request May 6, 2026
# [2.5.0](v2.4.0...v2.5.0) (2026-05-06)

### Bug Fixes

* **deps:** bump axios to ~1.15.2 to address security advisories ([#174](#174)) ([0a1492b](0a1492b)), closes [GHSA-w9j2-pv#6h63](https://github.com/GHSA-w9j2-pv/issues/6h63) [#173](#173)
* **walker:** preserve attributes on <ul>/<ol>/<li> in markdown→storage ([#170](#170)) ([b5c172a](b5c172a)), closes [#153](#153)

### Features

* add page version listing and purge commands ([#171](#171)) ([2bd5c37](2bd5c37))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants