fix(html-to-markdown): preserve indentation inside fenced code blocks #147
Merged
The legacy htmlToMarkdown regex pipeline applied indent-stripping and multi-space-collapsing rules to the entire output, corrupting code blocks exposed via confluenceClient.htmlToMarkdown().

Two changes mirror the walker fix from #139:

1. Convert <pre><code class="language-X">...</code></pre> to triple-backtick fences before the catch-all tag strip, so multi-line code reaches the cleanup chain wrapped rather than flattened to raw text.
2. Split on fence boundaries in cleanup (extracted as cleanupOutsideFence) so indent-stripping / multi-space-collapsing rules only apply to non-fenced segments.

(1) is required because (2) alone is a no-op on the legacy pipeline, which never emitted fences before.

Closes #146.
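Change (1) can be pictured with a minimal sketch. This is not the code from lib/html-to-markdown.js: the function name preCodeToFences and the exact regex are assumptions made for illustration, and the sketch always emits a 3-backtick fence.

```javascript
// Hypothetical sketch of change (1); names and regex are illustrative only.
// Wrap <pre><code [class="language-X"]>...</code></pre> in a fence so later
// cleanup steps can recognise the code body and leave it alone.
function preCodeToFences(html) {
  const preCode =
    /<pre[^>]*>\s*<code(?:\s+class="language-([\w+-]+)")?[^>]*>([\s\S]*?)<\/code>\s*<\/pre>/gi;
  return html.replace(preCode, (_match, lang, body) => {
    const fence = '`'.repeat(3); // fixed length in this sketch
    // Blank lines around the fence keep it separate from neighbouring text.
    return `\n\n${fence}${lang || ''}\n${body}\n${fence}\n\n`;
  });
}
```

Running this before any catch-all tag strip is what lets the body arrive at cleanup with its newlines and indentation still delimited by a fence.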
Two coverage gaps surfaced by /review:

- empty <pre><code></code></pre> body emits an empty fence
- two back-to-back <pre><code> blocks split cleanly with one blank line

Both behaviors were already correct; tests pin them as regression guards for future cleanup-chain or fence-split tweaks.
This was referenced Apr 29, 2026
…ticks
The new <pre><code> rule always emitted a 3-backtick fence, so a body
containing ``` (e.g. a Confluence page documenting Markdown syntax)
prematurely closed its own fence and produced malformed CommonMark
output.
Use the smallest fence length N≥3 such that the body contains no run
of N backticks (max run + 1, floor 3). The cleanup-split now uses a
backreference (`{3,})...\1 so it pairs the same N-length open/close,
matching the dynamically-sized fences emitted above.
Walker (#145) carries the same shape as a documented Known Limitation;
fixing it there is tracked separately and will land via the shared
cleanup helper in #149.
Repro: htmlToMarkdown('<pre><code class="language-md">before\n```\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```" (3 broken blocks)
- After: "````md\nbefore\n```\nafter\n````" (one valid fence)
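The sizing rule described above (max run + 1, floor 3) can be sketched as follows. fenceLength is the name used in this PR, but the body here is an assumption written for illustration, and fenceBlock is a hypothetical wrapper.

```javascript
// Illustrative sketch of the sizing rule; not the PR's actual source.
// Smallest N >= 3 such that the body contains no run of N backticks.
function fenceLength(body) {
  const runs = body.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return Math.max(3, longest + 1);
}

// Hypothetical helper: wrap a body in a fence sized for that body.
function fenceBlock(body, lang = '') {
  const fence = '`'.repeat(fenceLength(body));
  return `${fence}${lang}\n${body}\n${fence}`;
}
```

With the repro body, the longest run is three backticks, so the sketch picks a 4-backtick fence and the inner run can no longer close it.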
…sizing

fenceLength() previously ran on the raw body before the entity-decode pass, so payloads with backticks expressed as &#96;&#96;&#96; / &#x60;&#x60;&#x60; were sized as if the body had no backticks. The chosen 3-backtick fence then got broken by the decode pass that runs later in the pipeline.

Pre-decode numeric entities for sizing purposes only; the body is still emitted with entities preserved, since the pipeline's own decode pass produces the final output.

Repro: htmlToMarkdown('<pre><code class="language-md">before\n&#96;&#96;&#96;\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```" (3 broken blocks post-decode)
- After: "````md\nbefore\n```\nafter\n````" (one valid 4-backtick fence)
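A rough sketch of sizing against a decoded copy, assuming backticks may arrive as decimal or hexadecimal numeric character references. Both function names are hypothetical and the PR's real decode pass may cover more cases.

```javascript
// Hypothetical sketch: decode numeric character references (decimal and hex)
// so that entity-encoded backticks are visible to the sizing step.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

// Size the fence from the decoded copy; the caller still emits the raw body.
function fenceLengthForEncodedBody(rawBody) {
  const decoded = decodeNumericEntities(rawBody);
  const runs = decoded.match(/`+/g) || [];
  const longest = Math.max(0, ...runs.map((run) => run.length));
  return Math.max(3, longest + 1);
}
```

The decoded string is used only to measure backtick runs, which matches the "for sizing purposes only" constraint: the emitted body is unchanged.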
splitOnFences() matched any same-length backtick pair anywhere in the text, so prose containing mid-line ``` (e.g. a paragraph documenting Markdown syntax) was paired with the next code block's opening fence. That mis-classification dragged the code body into the "outside" segment where cleanup ran on it, dropping the separating space and, worse, re-stripping leading indentation inside the actual code block, fully regressing the original #146 fix for any page with prose ``` before a code block.

Anchor opening and closing to line boundaries (^/$ with m flag) per CommonMark: opens with up to 3 spaces of indent + 3+ backticks at line start, closes with equal-length backticks followed only by whitespace to line end.

Repro:
- Before: htmlToMarkdown('<p>before ``` after</p><pre><code class="language-py">def foo():\n return 1</code></pre>') → "before``` after\n\n```py\ndef foo():\nreturn 1\n```" (space gone, indent gone)
- After: same input → "before ``` after\n\n```py\ndef foo():\n return 1\n```"
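The line-anchored pairing can be sketched like this. splitOnFences is the name used in this PR, but the regex and the segment shape ({ fenced, text }) are assumptions for illustration, not the merged implementation.

```javascript
// Illustrative sketch of line-anchored fence splitting.
// Open:  start of line, up to 3 spaces, 3+ backticks, info string without backticks.
// Close: start of line, up to 3 spaces, the same backtick run, only whitespace to EOL.
function splitOnFences(markdown) {
  const fence = /^ {0,3}(`{3,})[^`\n]*\n(?:[\s\S]*?\n)? {0,3}\1[ \t]*$/gm;
  const segments = [];
  let cursor = 0;
  for (const match of markdown.matchAll(fence)) {
    if (match.index > cursor) {
      segments.push({ fenced: false, text: markdown.slice(cursor, match.index) });
    }
    segments.push({ fenced: true, text: match[0] });
    cursor = match.index + match[0].length;
  }
  if (cursor < markdown.length) {
    segments.push({ fenced: false, text: markdown.slice(cursor) });
  }
  return segments;
}
```

Because the info string may not contain backticks, the opening run cannot backtrack to a shorter length, so a 4-backtick fence is only closed by a 4-backtick line. Mid-line backticks in prose never match because the opening is anchored to the start of a line.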
github-actions bot pushed a commit that referenced this pull request on Apr 29, 2026
## [2.1.5](v2.1.4...v2.1.5) (2026-04-29)

### Bug Fixes

* **html-to-markdown:** preserve indentation inside fenced code blocks ([#147](#147)) ([044e46b](044e46b)), closes [#139](#139) [#146](#146) [#145](#145) [#149](#149) [#96](#96) [#x60](https://github.com/pchuri/confluence-cli/issues/x60) [#96](#96) [#96](#96) [#96](#96)
🎉 This PR is included in version 2.1.5 🎉
Your semantic-release bot 📦🚀
pchuri added a commit that referenced this pull request on Apr 29, 2026
…#151)

Add lib/markdown-cleanup.js exposing fenceLength, splitOnFences, cleanupOutsideFence, and cleanupWithFences. html-to-markdown and storage-walker now delegate cleanup to the shared helpers, removing the drift hazard between the two converters.

Walker handleCode / handleMermaid use dynamic fenceLength against the entity-decoded body, fixing three latent fence bugs that #147 fixed in html-to-markdown but had not yet been ported to the walker:

- payload containing literal ``` closes its own fence
- payload with &#96; / &#x60; numeric entities sized before decode
- prose with mid-line ``` mis-paired with a real fence opening, dragging the code body into the outside-fence segment and regressing #146 indent preservation

Adds 36 new tests (7 walker regression + 29 helper unit tests). Total suite: 483 passing.

Closes #149
Closes #146.
Problem
htmlToMarkdown applied indent-stripping and multi-space-collapsing regexes (lib/html-to-markdown.js:145-150) to the entire output. Same shape of bug as #139, but on the legacy path still exposed via confluenceClient.htmlToMarkdown() (lib/confluence-client.js:1203).

Repro from #146:
Fix
Two changes, mirroring the walker fix from #139:
1. Convert <pre><code [class="language-X"]>…</code></pre> to fenced blocks. Runs before the inline <code> regex and the catch-all tag strip, so multi-line code reaches cleanup wrapped in triple-backticks rather than flattened to raw text.
2. Split on fence boundaries in cleanup. The cleanup chain (extracted into cleanupOutsideFence) mirrors StorageWalker.cleanup() post-#139 (Walker cleanup() strips leading whitespace inside fenced code blocks): split on /(```[\s\S]*?```)/g and apply rules only to non-fenced segments.

(1) is required because (2) alone is a no-op: the legacy pipeline never emitted fences before. This means the PR scope is slightly wider than #146's text (which pointed only at the cleanup chain), but the issue's reproduction can't be fixed without also wrapping <pre><code> into fences.

Behavior changes inside fenced code
Now consistent with the walker post-#139:
`#` lines in code

Test plan
- tests/html-to-markdown.test.js mirroring walker's coverage: 4-space leading indent, 8-space nested indent, tab indent, inline multi-space, no-language fence, non-fence cleanup intact, between-fence cleanup, `#` comment inside fence, trailing whitespace preserved, blank-line preservation, entity decoding inside fence, inline <code> regression
- npm run lint clean

Known limitation
Same as #139: a code body containing a literal triple-backtick would mis-pair fences. Out of scope.
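For reference, the split quoted in the Fix section, and the limitation it carries, can be sketched as below. The two cleanup rules are stand-ins chosen for illustration (the real chain in lib/html-to-markdown.js is not shown in this PR text), and cleanupWithNaiveFences is a hypothetical name.

```javascript
// Stand-in cleanup rules; the real cleanup chain is assumed to be richer.
function cleanupOutsideFence(segment) {
  return segment
    .replace(/^[ \t]+/gm, '') // strip leading indentation on each line
    .replace(/[ \t]{2,}/g, ' '); // collapse runs of spaces or tabs
}

// Split on the non-anchored fence pattern from the Fix section. With a
// capturing group, String.prototype.split keeps the fenced parts at odd
// indices, so only even-indexed (non-fenced) parts are cleaned.
function cleanupWithNaiveFences(markdown) {
  const naiveFence = /(`{3}[\s\S]*?`{3})/g;
  return markdown
    .split(naiveFence)
    .map((part, index) => (index % 2 === 1 ? part : cleanupOutsideFence(part)))
    .join('');
}
```

When the code body itself contains a triple-backtick run, the lazy match stops at that inner run, so the rest of the body is treated as non-fenced text and loses its indentation. That is the mis-pairing this section describes as out of scope.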