fix(walker): preserve indentation inside fenced code blocks#145

Merged
pchuri merged 3 commits into main from fix/walker-cleanup-fenced-code-indent
Apr 29, 2026

@pchuri (Owner) commented Apr 29, 2026

Closes #139.

Problem

StorageWalker.cleanup() applied indent-stripping and multi-space-collapsing regexes to the entire converter output. This corrupted indentation-sensitive content inside fenced code blocks:

| Input (CDATA body) | Old output | New output |
| --- | --- | --- |
| `def foo():\n    return 1` | `def foo():\nreturn 1` | `def foo():\n    return 1` |
| `func f() {\n\treturn 1\n}` | tab dropped | tab preserved |
| `a  b  c` | `a b c` | `a  b  c` |
| `# comment\nx = 1` | extra blank line injected by header rule | preserved verbatim |
| Mermaid `graph TD\n    A --> B` | indent dropped | indent preserved |

Python/Go/Mermaid pages stopped round-tripping to runnable/renderable markdown.

Fix

Split output on triple-backtick fence boundaries (/(```[\s\S]*?```)/g) and apply cleanup rules only to non-fenced segments. The walker only emits triple-backtick fences (lib/storage-walker.js:263, :299), so the split is unambiguous.
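
A minimal sketch of the split-and-skip approach described above. The rules inside `cleanupProse` are hypothetical stand-ins for the walker's real cleanup chain, and `cleanupOutsideFences` is an illustrative name here, not necessarily the walker's own. The fence string is built with `repeat` only so the example stays readable inside a fenced block.

```javascript
// Hypothetical stand-ins for the cleanup rules; the real rules live in
// StorageWalker.cleanup() and may differ in detail.
function cleanupProse(text) {
  return text
    .replace(/^[ \t]+/gm, '')     // strip leading indentation
    .replace(/[ \t]+$/gm, '')     // strip trailing whitespace
    .replace(/ {2,}/g, ' ')       // collapse inline multi-space runs
    .replace(/\n{3,}/g, '\n\n');  // collapse runs of blank lines
}

function cleanupOutsideFences(markdown) {
  const fence = '`'.repeat(3);
  // Lazy match from one triple-backtick fence to the next.
  const fencedBlock = new RegExp('(' + fence + '[\\s\\S]*?' + fence + ')', 'g');
  // The capture group makes split() keep the fenced blocks in the result:
  // even indexes are prose, odd indexes are fenced code.
  return markdown
    .split(fencedBlock)
    .map((segment, i) => (i % 2 === 1 ? segment : cleanupProse(segment)))
    .join('');
}
```

Because fenced segments are passed through untouched, every rule in the chain is skipped for code at once, including any rule added later.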

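Cleanup rules now skipped inside fences include leading-indent stripping, inline multi-space collapsing, and the header blank-line rule, along with the two behavior changes below.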

Behavior changes inside fenced code

  • Trailing whitespace is now preserved. Previously cleanup() stripped trailing [ \t]+ from every line including code. Inside fences this is now left untouched — code is the source of truth, not markdown convention.
  • Blank-line collapsing no longer applies. A code block with three or more consecutive blank lines keeps them.

These follow from the same principle: don't reshape code semantics under the guise of markdown cleanup.

Known limitation

If a code body itself contains a literal triple-backtick (e.g. a Confluence code macro that documents fenced-code syntax), the lazy split mis-pairs the inner backticks with the outer fence and applies cleanup to the misidentified "between" segment. This is a fundamental ambiguity of triple-backtick fences and would require stateful fence tracking to fully resolve. Not addressed here — extremely rare in practice and out of scope for this fix.
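
The mis-pairing can be reproduced in a few lines. This is only an illustration of the limitation, assuming the same lazy pattern as in the Fix section; `splitLazy` is a hypothetical helper name.

```javascript
// Illustrative only: split on the lazy triple-backtick pattern and return
// the raw segments (even indexes are treated as "outside" a fence).
function splitLazy(markdown) {
  const fence = '`'.repeat(3);
  const fencedBlock = new RegExp('(' + fence + '[\\s\\S]*?' + fence + ')');
  return markdown.split(fencedBlock);
}

// A single fenced block whose body contains a literal triple backtick.
const tick3 = '`'.repeat(3);
const segments = splitLazy(tick3 + 'md\nbefore\n' + tick3 + '\n    after\n' + tick3);

// segments[1] ends at the inner backticks, so the indented "after" line
// lands in segments[2], an even index that cleanup would treat as prose.
```

Here the lazy match closes the fence at the inner backticks, which leaves the rest of the code body, and the real closing fence, in a segment that cleanup is allowed to rewrite.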

Test plan

  • Existing 413 tests pass
  • 10 new tests in tests/macro-converter.test.js cover: 4-space leading indent, 8-space nested indent, tab indent, inline multi-space, mermaid macro, non-fence still strips, cleanup still applies between fences, # comment inside fence, trailing whitespace preservation, consecutive blank-line preservation
  • npm run lint clean
  • All CI checks green (Node 18/20/22 + security)

Out of scope

The same bug exists in lib/html-to-markdown.js:146 (the legacy regex pipeline still exposed via confluence-client.js:1203). Tracked separately in #146 per #139's scope ("Source: lib/storage-walker.js").

🤖 Generated with Claude Code

pchuri and others added 3 commits April 29, 2026 20:35
cleanup() applied global indent-stripping and multi-space collapsing
across the entire output, which corrupted Python/Go/Mermaid code
blocks (leading indent removed, inline multi-space collapsed). Split
on triple-backtick fence boundaries and skip cleanup inside fenced
segments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover two regression risks the initial fix exposed: (1) Python `#`
comment lines inside a fenced block must not trigger the header
blank-line rule, (2) trailing whitespace inside fenced code is
intentionally preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backs the PR's behavior-change claim that 3+ consecutive blank lines
inside a code block are no longer collapsed to two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pchuri pchuri merged commit 709aaee into main Apr 29, 2026
6 checks passed
@pchuri pchuri deleted the fix/walker-cleanup-fenced-code-indent branch April 29, 2026 11:50
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
## [2.1.4](v2.1.3...v2.1.4) (2026-04-29)

### Bug Fixes

* **walker:** preserve indentation inside fenced code blocks ([#145](#145)) ([709aaee](709aaee)), closes [#139](#139)
@github-actions

🎉 This PR is included in version 2.1.4 🎉

The release is available on:

Your semantic-release bot 📦🚀

pchuri added a commit that referenced this pull request Apr 29, 2026
…#147)

* fix(html-to-markdown): preserve indentation inside fenced code blocks

The legacy htmlToMarkdown regex pipeline applied indent-stripping and
multi-space-collapsing rules to the entire output, corrupting code
blocks exposed via confluenceClient.htmlToMarkdown(). Two changes
mirror the walker fix from #139:

1. Convert <pre><code class="language-X">...</code></pre> to triple-
   backtick fences before the catch-all tag strip, so multi-line code
   reaches the cleanup chain wrapped rather than flattened to raw text.
2. Split on fence boundaries in cleanup (extracted as
   cleanupOutsideFence) so indent-stripping / multi-space-collapsing
   rules only apply to non-fenced segments.

(1) is required because (2) alone is a no-op on the legacy pipeline,
which never emitted fences before.

Closes #146.
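
Why step (1) has to run before the tag strip can be sketched as follows. Both functions are simplified, hypothetical stand-ins for the legacy pipeline: the regexes, the surrounding blank lines, and the names `convertPreCode` and `stripTags` are assumptions for illustration only.

```javascript
// Hypothetical rule: wrap <pre><code> bodies in a triple-backtick fence,
// carrying over the language from class="language-X" when present.
function convertPreCode(html) {
  const fence = '`'.repeat(3);
  return html.replace(
    /<pre><code(?: class="language-([\w-]+)")?>([\s\S]*?)<\/code><\/pre>/g,
    (match, lang = '', body) => '\n\n' + fence + lang + '\n' + body + '\n' + fence + '\n\n'
  );
}

// Hypothetical catch-all: drop every remaining tag.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, '');
}

// Order matters: fence first, strip second. Reversing the calls would
// flatten the code to raw text before any fence exists to protect it.
function toMarkdown(html) {
  return stripTags(convertPreCode(html));
}
```

With the fence in place before the strip, the fence-aware cleanup from step (2) has something to split on.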

* test(html-to-markdown): pin empty body and adjacent fenced blocks

Two coverage gaps surfaced by /review:
- empty <pre><code></code></pre> body emits an empty fence
- two back-to-back <pre><code> blocks split cleanly with one blank line

Both behaviors were already correct; tests pin them as regression guards
for future cleanup-chain or fence-split tweaks.

* fix(html-to-markdown): size fence dynamically when body contains backticks

The new <pre><code> rule always emitted a 3-backtick fence, so a body
containing ``` (e.g. a Confluence page documenting Markdown syntax)
prematurely closed its own fence and produced malformed CommonMark
output.

Use the smallest fence length N≥3 such that the body contains no run
of N backticks (max run + 1, floor 3). The cleanup-split now uses a
capture group with a backreference, (`{3,})...\1, so it pairs the same
N-length open/close, matching the dynamically sized fences emitted above.

Walker (#145) carries the same shape as a documented Known Limitation;
fixing it there is tracked separately and will land via the shared
cleanup helper in #149.

Repro: htmlToMarkdown('<pre><code class="language-md">before\n```\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```"  (3 broken blocks)
- After:  "````md\nbefore\n```\nafter\n````"  (one valid fence)
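
The sizing rule (max run + 1, floor 3) is small enough to sketch directly. This is an illustrative version under that stated rule, not the project's actual code; `fenceCode` is a hypothetical wrapper added for the usage example.

```javascript
// Smallest fence length N >= 3 such that the body has no run of N backticks.
function fenceLength(body) {
  const runs = body.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return Math.max(3, longest + 1);
}

// Hypothetical helper: wrap a body in a fence sized for its contents.
function fenceCode(body, lang = '') {
  const fence = '`'.repeat(fenceLength(body));
  return fence + lang + '\n' + body + '\n' + fence;
}
```

For the repro body above, the longest run is three backticks, so the fence grows to four and the inner run can no longer close it.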

* fix(html-to-markdown): account for entity-decoded backticks in fence sizing

fenceLength() previously ran on the raw body before the entity-decode
pass, so payloads with backticks expressed as `&#96;` / `&#x60;` were
sized as if the body had no backticks. The chosen 3-backtick fence then
got broken by the decode pass that runs later in the pipeline.

Pre-decode numeric entities for sizing purposes only — the body is
still emitted with entities preserved, since the pipeline's own decode
pass produces the final output.

Repro: htmlToMarkdown('<pre><code class="language-md">before\n&#96;&#96;&#96;\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```"   (3 broken blocks post-decode)
- After:  "````md\nbefore\n```\nafter\n````" (one valid 4-backtick fence)
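
A sketch of sizing against the decoded body while leaving the emitted body untouched. The decoder below handles only numeric entities, as the message describes; the function names and the exact regex are assumptions, not the project's implementation.

```javascript
// Decode numeric character references (decimal and hex) for inspection only.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|\d+);/gi, (match, code) => {
    const isHex = code[0] === 'x' || code[0] === 'X';
    const codePoint = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
    // Leave out-of-range references as-is rather than throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

// Size the fence from the decoded view; the caller still emits the
// original, entity-encoded body.
function fenceLengthForEncodedBody(body) {
  const runs = decodeNumericEntities(body).match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return Math.max(3, longest + 1);
}
```

Decoding a copy just for measurement keeps the pipeline's single decode pass as the only place where the output text actually changes.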

* fix(html-to-markdown): anchor fence detection to line boundaries

splitOnFences() matched any same-length backtick pair anywhere in the
text, so prose containing mid-line ``` (e.g. a paragraph documenting
Markdown syntax) was paired with the next code block's opening fence.
That mis-classification dragged the code body into the "outside" segment
where cleanup ran on it, dropping the separating space and — worse —
re-stripping leading indentation inside the actual code block, fully
regressing the original #146 fix for any page with prose ``` before a
code block.

Anchor opening and closing to line boundaries (^/$ with m flag) per
CommonMark: opens with up to 3 spaces of indent + 3+ backticks at line
start, closes with equal-length backticks followed only by whitespace
to line end.

Repro:
- Before: htmlToMarkdown('<p>before ``` after</p><pre><code class="language-py">def foo():\n    return 1</code></pre>')
          → "before``` after\n\n```py\ndef foo():\nreturn 1\n```"  (space gone, indent gone)
- After:  same input → "before ``` after\n\n```py\ndef foo():\n    return 1\n```"
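
One way to express line-anchored, equal-length fence pairing is sketched below. The pattern and the segment shape are assumptions for illustration; the real splitOnFences() may be structured differently.

```javascript
// Opening fence: up to 3 spaces of indent, then 3+ backticks at line start,
// then an info string without backticks. Closing fence: the same number of
// backticks (backreference \1) followed only by whitespace to line end.
const FENCED_BLOCK = /^ {0,3}(`{3,})[^\n`]*\n[\s\S]*?^ {0,3}\1[ \t]*$/gm;

// Returns segments in document order, each tagged as fenced or not.
function splitOnFences(markdown) {
  const segments = [];
  let cursor = 0;
  for (const match of markdown.matchAll(FENCED_BLOCK)) {
    if (match.index > cursor) {
      segments.push({ fenced: false, text: markdown.slice(cursor, match.index) });
    }
    segments.push({ fenced: true, text: match[0] });
    cursor = match.index + match[0].length;
  }
  if (cursor < markdown.length) {
    segments.push({ fenced: false, text: markdown.slice(cursor) });
  }
  return segments;
}
```

Mid-line backticks in prose never satisfy the line-start anchor, so they stay in an unfenced segment, and a shorter backtick run inside a longer fence fails the backreference and cannot close it.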
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
## [2.1.5](v2.1.4...v2.1.5) (2026-04-29)

### Bug Fixes

* **html-to-markdown:** preserve indentation inside fenced code blocks ([#147](#147)) ([044e46b](044e46b)), closes [#139](#139) [#146](#146) [#145](#145) [#149](#149)
Linked issue: Walker cleanup() strips leading whitespace inside fenced code blocks