fix(walker): preserve indentation inside fenced code blocks #145
Merged
cleanup() applied global indent-stripping and multi-space collapsing across the entire output, which corrupted Python/Go/Mermaid code blocks (leading indent removed, inline multi-space collapsed). Split on triple-backtick fence boundaries and skip cleanup inside fenced segments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover two regression risks the initial fix exposed: (1) Python `#` comment lines inside a fenced block must not trigger the header blank-line rule, (2) trailing whitespace inside fenced code is intentionally preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backs the PR's behavior-change claim that 3+ consecutive blank lines inside a code block are no longer collapsed to two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🎉 This PR is included in version 2.1.4 🎉 The release is available on: Your semantic-release bot 📦🚀
pchuri added a commit that referenced this pull request Apr 29, 2026
…ticks
The new <pre><code> rule always emitted a 3-backtick fence, so a body
containing ``` (e.g. a Confluence page documenting Markdown syntax)
prematurely closed its own fence and produced malformed CommonMark
output.
Use the smallest fence length N≥3 such that the body contains no run
of N backticks (max run + 1, floor 3). The cleanup-split now uses a
backreference (`{3,})...\1 so it pairs the same N-length open/close,
matching the dynamically-sized fences emitted above.
Walker (#145) carries the same shape as a documented Known Limitation;
fixing it there is tracked separately and will land via the shared
cleanup helper in #149.
Repro: htmlToMarkdown('<pre><code class="language-md">before\n```\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```" (3 broken blocks)
- After: "````md\nbefore\n```\nafter\n````" (one valid fence)
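The sizing rule in this commit message ("max run + 1, floor 3") can be sketched as follows. This is a minimal illustration, not the project's code: the commit mentions a `fenceLength()` helper but does not show its signature, so the signature here and the `wrapInFence` wrapper are assumptions.

```javascript
// Hypothetical helper: choose a fence longer than any backtick run in the body.
function fenceLength(body) {
  let longest = 0;
  for (const run of body.match(/`+/g) || []) {
    longest = Math.max(longest, run.length);
  }
  // Longest run + 1, never shorter than the 3-backtick minimum.
  return Math.max(3, longest + 1);
}

// Hypothetical wrapper: emit opening and closing fences of the same length.
function wrapInFence(body, lang = '') {
  const fence = '`'.repeat(fenceLength(body));
  return fence + lang + '\n' + body + '\n' + fence;
}

// The repro body: a line of three backticks between two text lines.
const reproBody = ['before', '`'.repeat(3), 'after'].join('\n');
console.log(wrapInFence(reproBody, 'md')); // opens and closes with 4 backticks
```

Because the fence is always one backtick longer than the longest run in the body, no line of the body can be mistaken for the closing fence.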
pchuri added a commit that referenced this pull request Apr 29, 2026
…#147)
* fix(html-to-markdown): preserve indentation inside fenced code blocks
The legacy htmlToMarkdown regex pipeline applied indent-stripping and multi-space-collapsing rules to the entire output, corrupting code blocks exposed via confluenceClient.htmlToMarkdown(). Two changes mirror the walker fix from #139:
1. Convert <pre><code class="language-X">...</code></pre> to triple-backtick fences before the catch-all tag strip, so multi-line code reaches the cleanup chain wrapped rather than flattened to raw text.
2. Split on fence boundaries in cleanup (extracted as cleanupOutsideFence) so indent-stripping / multi-space-collapsing rules only apply to non-fenced segments.
(1) is required because (2) alone is a no-op on the legacy pipeline, which never emitted fences before. Closes #146.
* test(html-to-markdown): pin empty body and adjacent fenced blocks
Two coverage gaps surfaced by /review:
- empty <pre><code></code></pre> body emits an empty fence
- two back-to-back <pre><code> blocks split cleanly with one blank line
Both behaviors were already correct; tests pin them as regression guards for future cleanup-chain or fence-split tweaks.
* fix(html-to-markdown): size fence dynamically when body contains backticks
The new <pre><code> rule always emitted a 3-backtick fence, so a body containing ``` (e.g. a Confluence page documenting Markdown syntax) prematurely closed its own fence and produced malformed CommonMark output.
Use the smallest fence length N≥3 such that the body contains no run of N backticks (max run + 1, floor 3). The cleanup-split now uses a backreference (`{3,})...\1 so it pairs the same N-length open/close, matching the dynamically-sized fences emitted above.
Walker (#145) carries the same shape as a documented Known Limitation; fixing it there is tracked separately and will land via the shared cleanup helper in #149.
Repro: htmlToMarkdown('<pre><code class="language-md">before\n```\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```" (3 broken blocks)
- After: "````md\nbefore\n```\nafter\n````" (one valid fence)
* fix(html-to-markdown): account for entity-decoded backticks in fence sizing
fenceLength() previously ran on the raw body before the entity-decode pass, so payloads with backticks expressed as `&#96;` / `&#x60;` were sized as if the body had no backticks. The chosen 3-backtick fence then got broken by the decode pass that runs later in the pipeline.
Pre-decode numeric entities for sizing purposes only; the body is still emitted with entities preserved, since the pipeline's own decode pass produces the final output.
Repro: htmlToMarkdown('<pre><code class="language-md">before\n&#96;&#96;&#96;\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```" (3 broken blocks post-decode)
- After: "````md\nbefore\n```\nafter\n````" (one valid 4-backtick fence)
* fix(html-to-markdown): anchor fence detection to line boundaries
splitOnFences() matched any same-length backtick pair anywhere in the text, so prose containing mid-line ``` (e.g. a paragraph documenting Markdown syntax) was paired with the next code block's opening fence. That mis-classification dragged the code body into the "outside" segment where cleanup ran on it, dropping the separating space and, worse, re-stripping leading indentation inside the actual code block, fully regressing the original #146 fix for any page with prose ``` before a code block.
Anchor opening and closing to line boundaries (^/$ with m flag) per CommonMark: opens with up to 3 spaces of indent + 3+ backticks at line start, closes with equal-length backticks followed only by whitespace to line end.
Repro: htmlToMarkdown('<p>before ``` after</p><pre><code class="language-py">def foo():\n return 1</code></pre>')
- Before: "before``` after\n\n```py\ndef foo():\nreturn 1\n```" (space gone, indent gone)
- After: same input → "before ``` after\n\n```py\ndef foo():\n return 1\n```"
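The line-anchored pairing described in the last commit can be sketched roughly like this. The regex and the function names below are assumptions written from the commit's prose (up to 3 spaces of indent, 3+ backticks at line start, equal-length close via a backreference); they are not copied from `lib/html-to-markdown.js`.

```javascript
// Sketch only: an assumed regex, not the project's actual one.
// Opening fence: up to 3 spaces of indent, 3+ backticks, optional info string.
// Closing fence: same-length backtick run (\1), then only whitespace to line end.
const LINE_FENCE = /^ {0,3}(`{3,})[^`\n]*\n[\s\S]*?^ {0,3}\1[ \t]*$/gm;

function splitOnLineFences(markdown) {
  const segments = [];
  let last = 0;
  for (const match of markdown.matchAll(LINE_FENCE)) {
    if (match.index > last) {
      segments.push({ fenced: false, text: markdown.slice(last, match.index) });
    }
    segments.push({ fenced: true, text: match[0] });
    last = match.index + match[0].length;
  }
  if (last < markdown.length) {
    segments.push({ fenced: false, text: markdown.slice(last) });
  }
  return segments;
}

// Apply `cleanup` (any string -> string function) to non-fenced text only.
function cleanupOutsideFences(markdown, cleanup) {
  return splitOnLineFences(markdown)
    .map((seg) => (seg.fenced ? seg.text : cleanup(seg.text)))
    .join('');
}

// Prose with mid-line backticks, followed by an indented Python block.
const ticks = '`'.repeat(3);
const page = 'before ' + ticks + ' after\n\n' + ticks + 'py\ndef foo():\n    return 1\n' + ticks;
const stripIndent = (text) => text.replace(/^[ \t]+/gm, '');
console.log(cleanupOutsideFences(page, stripIndent) === page); // true: the fenced indent survives
```

Because the opening pattern only matches at the start of a line, the mid-line backticks in the prose are never treated as a fence, so the Python body stays in a fenced segment and is skipped by cleanup.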
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
## [2.1.5](v2.1.4...v2.1.5) (2026-04-29)
### Bug Fixes
* **html-to-markdown:** preserve indentation inside fenced code blocks ([#147](#147)) ([044e46b](044e46b)), closes [#139](#139) [#146](#146) [#145](#145) [#149](#149) [#96](#96) [#x60](https://github.com/pchuri/confluence-cli/issues/x60) [#96](#96) [#96](#96) [#96](#96)
Closes #139.
Problem
`StorageWalker.cleanup()` applied indent-stripping and multi-space-collapsing regexes to the entire converter output. This corrupted indentation-sensitive content inside fenced code blocks, for example:
- Python: "def foo():\n return 1" came out as "def foo():\nreturn 1" (leading indent stripped)
- Go: "func f() {\n\treturn 1\n}" (tab indent)
- inline multi-space: "a b c"
- Python comment: "# comment\nx = 1"
- Mermaid: "graph TD\n A --> B"
Python/Go/Mermaid pages stopped round-tripping to runnable/renderable markdown.
Fix
Split output on triple-backtick fence boundaries (/(```[\s\S]*?```)/g) and apply cleanup rules only to non-fenced segments. The walker only emits triple-backtick fences (`lib/storage-walker.js:263`, `:299`), so the split is unambiguous.
Cleanup rules now skipped inside fences:
- indent-stripping
- multi-space collapsing
- header blank-line rule (`#` lines in code no longer trigger it)
Behavior changes inside fenced code
- Trailing whitespace: `cleanup()` stripped trailing `[ \t]+` from every line including code. Inside fences this is now left untouched: code is the source of truth, not markdown convention.
- Consecutive blank lines: 3+ consecutive blank lines inside a code block are no longer collapsed to two.
These follow from the same principle: don't reshape code semantics under the guise of markdown cleanup.
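A minimal sketch of the split-then-clean approach, for illustration only. The split regex follows the one named in this description; the individual cleanup rules in `cleanupProse` are simplified stand-ins and are not the walker's actual regexes.

```javascript
// The split regex is built with RegExp() only to avoid literal backtick runs in this example.
const TICKS = '`'.repeat(3);
const FENCE_SPLIT = new RegExp('(' + TICKS + '[\\s\\S]*?' + TICKS + ')', 'g');

// Hypothetical stand-ins for the walker's cleanup rules.
function cleanupProse(text) {
  return text
    .replace(/[ \t]+$/gm, '')     // strip trailing whitespace
    .replace(/^[ \t]+/gm, '')     // strip leading indent
    .replace(/ {2,}/g, ' ')       // collapse multi-space runs
    .replace(/\n{3,}/g, '\n\n');  // collapse 3+ newlines to 2
}

function cleanup(output) {
  // split() with a capturing group keeps the fenced matches at odd indices.
  return output
    .split(FENCE_SPLIT)
    .map((part, i) => (i % 2 === 1 ? part : cleanupProse(part)))
    .join('');
}

const sample =
  '  intro   text\n\n' +
  TICKS + 'python\ndef foo():\n    return 1\n' + TICKS +
  '\n\n\n\n  outro';
console.log(cleanup(sample));
```

In the printed result the prose around the block is normalized, while the Python body keeps its four-space indent.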
Known limitation
If a code body itself contains a literal triple-backtick (e.g. a Confluence code macro that documents fenced-code syntax), the lazy split mis-pairs the inner backticks with the outer fence and applies cleanup to the misidentified "between" segment. This is a fundamental ambiguity of triple-backtick fences and would require stateful fence tracking to fully resolve. Not addressed here — extremely rare in practice and out of scope for this fix.
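The mis-pairing can be reproduced in a few lines. This is an illustration under the assumption that the split behaves like the lazy triple-backtick regex named in the Fix section; the sample text is made up.

```javascript
// Illustration of the mis-pairing with a lazy triple-backtick split.
const fence = '`'.repeat(3);
const lazyFenceSplit = new RegExp('(' + fence + '[\\s\\S]*?' + fence + ')');

// A fenced block whose body itself contains a line of three backticks.
const output = [fence + 'md', 'before', fence, '    after', fence].join('\n');

const parts = output.split(lazyFenceSplit);
parts.forEach((part, i) => {
  console.log(i % 2 === 1 ? 'fenced :' : 'outside:', JSON.stringify(part));
});
```

The opening fence pairs with the inner backtick line, so the indented "after" line and the real closing fence land in an "outside" segment where cleanup would strip the indent.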
Test plan
- Tests in `tests/macro-converter.test.js` cover: 4-space leading indent, 8-space nested indent, tab indent, inline multi-space, mermaid macro, non-fence still strips, cleanup still applies between fences, `#` comment inside fence, trailing whitespace preservation, consecutive blank-line preservation
- `npm run lint` clean
Out of scope
The same bug exists in `lib/html-to-markdown.js:146` (the legacy regex pipeline still exposed via `confluence-client.js:1203`). Tracked separately in #146 per #139's scope ("Source: lib/storage-walker.js").
🤖 Generated with Claude Code