
fix(html-to-markdown): preserve indentation inside fenced code blocks #147

Merged
pchuri merged 5 commits into main from fix/html-to-markdown-fenced-code-indent
Apr 29, 2026

pchuri commented Apr 29, 2026

Closes #146.

Problem

htmlToMarkdown applied indent-stripping and multi-space-collapsing regexes (lib/html-to-markdown.js:145-150) to the entire output. Same shape of bug as #139, but on the legacy path still exposed via confluenceClient.htmlToMarkdown() (lib/confluence-client.js:1203).

Repro from #146:

const html = '<pre><code class="language-python">def foo():\n    return 1</code></pre>';
htmlToMarkdown(html);
// Old: "def foo():\nreturn 1"            // indent stripped
// New: "```python\ndef foo():\n    return 1\n```"

Fix

Two changes, mirroring the walker fix from #139:

  1. Convert <pre><code [class="language-X"]>…</code></pre> to fenced blocks. Runs before the inline <code> regex and the catch-all tag strip, so multi-line code reaches cleanup wrapped in triple-backticks rather than flattened to raw text.

  2. Split on fence boundaries in cleanup. The cleanup chain (extracted into cleanupOutsideFence) mirrors StorageWalker.cleanup() post-#139: split on /(```[\s\S]*?```)/g and apply rules only to non-fenced segments.

(1) is required because (2) alone is a no-op — the legacy pipeline never emitted fences before. This means the PR scope is slightly wider than #146's text (which pointed only at the cleanup chain), but the issue's reproduction can't be fixed without also wrapping <pre><code> into fences.
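
A minimal sketch of how the two changes fit together, not the code from lib/html-to-markdown.js. The names fencePreCode and htmlToMarkdownSketch are hypothetical, and the cleanup regexes are assumptions standing in for the real rules. Only the split regex is taken from the description above.

```javascript
// Sketch only: the real rules live in lib/html-to-markdown.js and may differ.

// Change 1 (hypothetical helper): wrap <pre><code> bodies in triple-backtick
// fences before any inline-<code> handling or catch-all tag stripping runs.
function fencePreCode(html) {
  return html.replace(
    /<pre>\s*<code(?:\s+class="language-([\w-]+)")?\s*>([\s\S]*?)<\/code>\s*<\/pre>/g,
    (_match, lang, body) => '\n\n```' + (lang || '') + '\n' + body + '\n```\n\n'
  );
}

// Assumed stand-ins for the indent-stripping and space-collapsing rules.
function cleanupOutsideFence(text) {
  return text
    .replace(/^[ \t]+/gm, '')    // strip leading indentation
    .replace(/[ \t]{2,}/g, ' ')  // collapse runs of spaces or tabs
    .replace(/\n{3,}/g, '\n\n'); // collapse 3+ consecutive newlines
}

// Change 2: split on fences. With one capture group, String.prototype.split
// returns the fenced blocks at odd indexes, so only even indexes are cleaned.
function cleanupWithFences(markdown) {
  return markdown
    .split(/(```[\s\S]*?```)/g)
    .map((segment, i) => (i % 2 === 1 ? segment : cleanupOutsideFence(segment)))
    .join('')
    .trim();
}

// Hypothetical end-to-end wrapper for the two steps above.
function htmlToMarkdownSketch(html) {
  return cleanupWithFences(fencePreCode(html));
}
```

Applied to the repro from #146, this sketch keeps the four-space indent, and without fencePreCode the split would find no fences to protect.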

Behavior changes inside fenced code

Now consistent with the walker post-#139:

  • Trailing whitespace preserved
  • 3+ blank lines no longer collapsed
  • Header blank-line rule no longer applies to # lines in code
  • Inline multi-space preserved
  • Leading whitespace preserved

Test plan

  • All 435 tests pass (423 existing + 12 new)
  • 12 new tests in tests/html-to-markdown.test.js mirroring walker's coverage: 4-space leading indent, 8-space nested indent, tab indent, inline multi-space, no-language fence, non-fence cleanup intact, between-fence cleanup, # comment inside fence, trailing whitespace preserved, blank-line preservation, entity decoding inside fence, inline <code> regression
  • npm run lint clean
  • Manual verification of issue repro

Known limitation

Same as #139: a code body containing a literal triple-backtick would mis-pair fences. Out of scope.

pchuri added 2 commits April 29, 2026 21:00

The legacy htmlToMarkdown regex pipeline applied indent-stripping and
multi-space-collapsing rules to the entire output, corrupting code
blocks exposed via confluenceClient.htmlToMarkdown(). Two changes
mirror the walker fix from #139:

1. Convert <pre><code class="language-X">...</code></pre> to triple-
   backtick fences before the catch-all tag strip, so multi-line code
   reaches the cleanup chain wrapped rather than flattened to raw text.
2. Split on fence boundaries in cleanup (extracted as
   cleanupOutsideFence) so indent-stripping / multi-space-collapsing
   rules only apply to non-fenced segments.

(1) is required because (2) alone is a no-op on the legacy pipeline,
which never emitted fences before.

Closes #146.

Two coverage gaps surfaced by /review:
- empty <pre><code></code></pre> body emits an empty fence
- two back-to-back <pre><code> blocks split cleanly with one blank line

Both behaviors were already correct; tests pin them as regression guards
for future cleanup-chain or fence-split tweaks.

pchuri added 3 commits April 29, 2026 21:24
…ticks

The new <pre><code> rule always emitted a 3-backtick fence, so a body
containing ``` (e.g. a Confluence page documenting Markdown syntax)
prematurely closed its own fence and produced malformed CommonMark
output.

Use the smallest fence length N≥3 such that the body contains no run
of N backticks (max run + 1, floor 3). The cleanup-split now uses a
backreference, (`{3,})...\1, so it pairs the same N-length open/close,
matching the dynamically-sized fences emitted above.

Walker (#145) carries the same shape as a documented Known Limitation;
fixing it there is tracked separately and will land via the shared
cleanup helper in #149.

Repro: htmlToMarkdown('<pre><code class="language-md">before\n```\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```"  (3 broken blocks)
- After:  "````md\nbefore\n```\nafter\n````"  (one valid fence)
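
The sizing rule from this commit can be sketched as follows. fenceLength is the name used in this PR, but the body is a guess at the logic rather than the shipped code, and fenceBlock is a hypothetical wrapper added for illustration.

```javascript
// Sketch of the sizing rule: longest backtick run in the body + 1, floor 3.
function fenceLength(body) {
  const runs = body.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return Math.max(3, longest + 1);
}

// Hypothetical wrapper: emit the body inside a fence of the computed length.
function fenceBlock(body, lang = '') {
  const fence = '`'.repeat(fenceLength(body));
  return fence + lang + '\n' + body + '\n' + fence;
}
```

A body with no backticks, or only single inline backticks, still gets the usual three-backtick fence. The fence grows only when the body contains a run of three or more.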
…sizing

fenceLength() previously ran on the raw body before the entity-decode
pass, so payloads with backticks expressed as `&#96;` / `&#x60;` were
sized as if the body had no backticks. The chosen 3-backtick fence then
got broken by the decode pass that runs later in the pipeline.

Pre-decode numeric entities for sizing purposes only — the body is
still emitted with entities preserved, since the pipeline's own decode
pass produces the final output.

Repro: htmlToMarkdown('<pre><code class="language-md">before\n&#96;&#96;&#96;\nafter</code></pre>')
- Before: "```md\nbefore\n```\nafter\n```"   (3 broken blocks post-decode)
- After:  "````md\nbefore\n```\nafter\n````" (one valid 4-backtick fence)
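
A sketch of the "decode for sizing only" idea described above, under stated assumptions: decodeNumericEntities, codePointToChar, fenceLengthForRawBody, and fenceRawBody are hypothetical names, and the real pipeline's decode pass may cover more entity forms than the two shown here.

```javascript
// Guard against out-of-range references so String.fromCodePoint never throws.
function codePointToChar(code, original) {
  return Number.isInteger(code) && code <= 0x10ffff
    ? String.fromCodePoint(code)
    : original;
}

// Decode decimal (&#96;) and hex (&#x60;) numeric references, for measuring only.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-fA-F]+);/g, (m, hex) => codePointToChar(parseInt(hex, 16), m))
    .replace(/&#(\d+);/g, (m, dec) => codePointToChar(parseInt(dec, 10), m));
}

// Size the fence against the decoded text: longest backtick run + 1, floor 3.
function fenceLengthForRawBody(rawBody) {
  const decoded = decodeNumericEntities(rawBody);
  const runs = decoded.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return Math.max(3, longest + 1);
}

// The body is emitted with entities intact; a later decode pass (not shown)
// turns them into literal backticks inside the already-wide-enough fence.
function fenceRawBody(rawBody, lang = '') {
  const fence = '`'.repeat(fenceLengthForRawBody(rawBody));
  return fence + lang + '\n' + rawBody + '\n' + fence;
}
```

Keeping the decoded string local to the sizing step avoids decoding the body twice, which matters for payloads that contain an escaped ampersand followed by entity-like text.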
splitOnFences() matched any same-length backtick pair anywhere in the
text, so prose containing mid-line ``` (e.g. a paragraph documenting
Markdown syntax) was paired with the next code block's opening fence.
That mis-classification dragged the code body into the "outside" segment
where cleanup ran on it, dropping the separating space and — worse —
re-stripping leading indentation inside the actual code block, fully
regressing the original #146 fix for any page with prose ``` before a
code block.

Anchor opening and closing to line boundaries (^/$ with m flag) per
CommonMark: opens with up to 3 spaces of indent + 3+ backticks at line
start, closes with equal-length backticks followed only by whitespace
to line end.

Repro:
- Before: htmlToMarkdown('<p>before ``` after</p><pre><code class="language-py">def foo():\n    return 1</code></pre>')
          → "before``` after\n\n```py\ndef foo():\nreturn 1\n```"  (space gone, indent gone)
- After:  same input → "before ``` after\n\n```py\ndef foo():\n    return 1\n```"
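
The line-anchored pairing can be sketched like this. splitOnFences and cleanupWithFences are names used in this PR and its follow-up, but the regex and the segment shape below are assumptions written from the description above, not the merged implementation.

```javascript
// Opening: line start, up to 3 spaces, 3+ backticks, then an info string with
// no backticks. Closing: a line with the same number of backticks followed
// only by spaces or tabs. The m flag makes ^ and $ match at line boundaries.
const FENCE_RE = /^ {0,3}(`{3,})[^`\n]*\n[\s\S]*?^ {0,3}\1[ \t]*$/gm;

// Return alternating outside/inside segments, each tagged with `fenced`.
function splitOnFences(text) {
  const segments = [];
  let last = 0;
  for (const match of text.matchAll(FENCE_RE)) {
    segments.push({ fenced: false, text: text.slice(last, match.index) });
    segments.push({ fenced: true, text: match[0] });
    last = match.index + match[0].length;
  }
  segments.push({ fenced: false, text: text.slice(last) });
  return segments;
}

// Apply a caller-supplied cleanup function to non-fenced segments only.
function cleanupWithFences(text, cleanup) {
  return splitOnFences(text)
    .map((seg) => (seg.fenced ? seg.text : cleanup(seg.text)))
    .join('');
}
```

Because the opening must start a line, a mid-line run of backticks in prose is never treated as a fence, so it cannot pair with the next real opening fence.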
@pchuri pchuri merged commit 044e46b into main Apr 29, 2026
6 checks passed
@pchuri pchuri deleted the fix/html-to-markdown-fenced-code-indent branch April 29, 2026 12:54
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
## [2.1.5](v2.1.4...v2.1.5) (2026-04-29)

### Bug Fixes

* **html-to-markdown:** preserve indentation inside fenced code blocks ([#147](#147)) ([044e46b](044e46b)), closes [#139](#139) [#146](#146) [#145](#145) [#149](#149) [#96](#96) [#x60](https://github.com/pchuri/confluence-cli/issues/x60) [#96](#96) [#96](#96) [#96](#96)
@github-actions

🎉 This PR is included in version 2.1.5 🎉

The release is available on:

Your semantic-release bot 📦🚀

pchuri added a commit that referenced this pull request Apr 29, 2026
…#151)

Add lib/markdown-cleanup.js exposing fenceLength, splitOnFences,
cleanupOutsideFence, and cleanupWithFences. html-to-markdown and
storage-walker now delegate cleanup to the shared helpers, removing
the drift hazard between the two converters.

Walker handleCode / handleMermaid use dynamic fenceLength against the
entity-decoded body, fixing three latent fence bugs that #147 fixed
in html-to-markdown but had not yet been ported to the walker:

- payload containing literal ``` closes its own fence
- payload with &#96; / &#x60; numeric entities sized before decode
- prose with mid-line ``` mis-paired with a real fence opening,
  dragging the code body into the outside-fence segment and
  regressing #146 indent preservation

Adds 36 new tests (7 walker regression + 29 helper unit tests).
Total suite: 483 passing.

Closes #149

Development

Successfully merging this pull request may close these issues.

html-to-markdown.js: same fenced-code indent corruption as walker (#139)
