fix: preserve line breaks when converting HTML to markdown #79

Merged
pchuri merged 1 commit into pchuri:main from hodlen:fix/md-line-break on Mar 19, 2026
hodlen (Contributor) commented Mar 16, 2026

Pull Request Template

Description

Fixes content being dropped and block elements running together when converting Confluence storage format to markdown (read --format markdown).

Root causes:

  • <p> regex was missing the s (dotAll) flag — paragraph content with embedded newlines was silently dropped
  • Block elements (paragraphs, code blocks, lists, tables) used no surrounding newlines, causing adjacent blocks to concatenate without separation

Fix: each block element now emits \n…\n, so adjacent blocks naturally produce a blank line between them.
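
A minimal sketch of the two root causes, using simplified stand-in patterns (the names `withoutDotAll` and `withDotAll` and the bare `<p>` regex are assumptions for illustration; the actual regexes in lib/confluence-client.js may differ). Without the `s` flag, `.` does not match a newline, so a multi-line paragraph is never matched by the conversion at all:

```javascript
// Hypothetical, simplified paragraph patterns (not copied from the project).
const withoutDotAll = /<p>(.*?)<\/p>/g;
const withDotAll = /<p>(.*?)<\/p>/gs;

const html = '<p>first line\nsecond line</p><p>next</p>';

// Each matched paragraph is emitted with a leading and trailing newline,
// so two adjacent paragraphs end up separated by a blank line.
const wrap = (_, text) => `\n${text}\n`;

// Without "s": the first paragraph contains "\n", so it is skipped
// and its tags are left in place for later passes to deal with.
const withoutFlag = html.replace(withoutDotAll, wrap);

// With "s": both paragraphs are converted.
const withFlag = html.replace(withDotAll, wrap);

console.log(JSON.stringify(withoutFlag));
console.log(JSON.stringify(withFlag.trim()));
```

How the unmatched paragraph ends up dropped depends on later passes in the pipeline, which this sketch does not model.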

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • Tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes Markdown output formatting issues when converting Confluence storage-format HTML to Markdown, specifically preserving multi-line paragraph content and ensuring block-level elements don’t concatenate without blank lines.

Changes:

  • Wrap Confluence code-macro conversions in surrounding newlines so adjacent blocks naturally separate.
  • Update <p> conversion to use the dotAll regex flag to preserve paragraph content containing embedded newlines, and emit surrounding newlines.
  • Add unit tests covering block separation (code/mermaid/lists/tables) and multi-line paragraph preservation for both storageToMarkdown() and htmlToMarkdown().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
tests/confluence-client.test.js | Adds regression tests for multi-line paragraphs and blank-line separation between block elements.
lib/confluence-client.js | Adjusts code-macro and paragraph conversions to preserve line breaks and introduce blank-line separation.

Comment on lines 1294 to 1302
  // Convert Confluence code macros to markdown
  markdown = markdown.replace(/<ac:structured-macro ac:name="code"[^>]*>[\s\S]*?<ac:parameter ac:name="language">([^<]*)<\/ac:parameter>[\s\S]*?<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>[\s\S]*?<\/ac:structured-macro>/g, (_, lang, code) => {
-   return `\`\`\`${lang}\n${code}\n\`\`\``;
+   return `\n\`\`\`${lang}\n${code}\n\`\`\`\n`;
  });

  // Convert code macros without language parameter
  markdown = markdown.replace(/<ac:structured-macro ac:name="code"[^>]*>[\s\S]*?<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>[\s\S]*?<\/ac:structured-macro>/g, (_, code) => {
-   return `\`\`\`\n${code}\n\`\`\``;
+   return `\n\`\`\`\n${code}\n\`\`\`\n`;
  });
Contributor Author

I think this is out of the change's scope.

pchuri (Owner) left a comment

Thanks for the PR! The dotAll flag fix on the <p> regex is a great catch — silently dropping multi-line paragraph content was a subtle but impactful bug. The test coverage is thorough too, with both per-element and complex integration cases.

A few observations:

1. Inconsistent block separation for lists and tables

Code blocks and <p> now emit \n…\n, but <ul>, <ol>, and <table> still use only a leading \n (e.g. '\n' + listItems). This works today because the preceding <p> contributes its trailing \n, but if two block elements appear back-to-back without a <p> in between (e.g. a list immediately followed by a table), there won't be a blank line separating them. Applying the same \n…\n pattern to all block elements would make the output more robust and the code more consistent.

2. Code block content can be mutated by htmlToMarkdown()

(Also flagged by Copilot) storageToMarkdown() converts code macros into fenced Markdown blocks before htmlToMarkdown() runs its catch-all HTML tag stripping (/<(?!\/?(details|summary)\b)[^>]+>/g). This means any <div>, <span>, etc. inside code examples will be silently removed. Not necessarily in scope for this PR, but worth a follow-up — e.g. replacing fenced blocks with placeholder tokens before the HTML strip pass and restoring them afterward.
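
The placeholder idea suggested above could be sketched roughly as follows. This is not code from the project: `stripTagsPreservingFences`, the `FENCE` constant, and the token format are all hypothetical, and only the catch-all strip regex is taken from the comment above.

```javascript
// Three backticks, built with repeat() so this example can itself sit
// inside a fenced block without ambiguity.
const FENCE = '`'.repeat(3);

// Hypothetical helper: protect fenced code blocks during the HTML strip pass.
function stripTagsPreservingFences(markdown) {
  const saved = [];

  // 1. Swap every fenced block for a token that contains no angle brackets.
  const fencePattern = new RegExp(`${FENCE}[\\s\\S]*?${FENCE}`, 'g');
  const masked = markdown.replace(fencePattern, (block) => {
    saved.push(block);
    return `\u0000FENCE${saved.length - 1}\u0000`;
  });

  // 2. Run the catch-all tag strip quoted in the review comment.
  const stripped = masked.replace(/<(?!\/?(details|summary)\b)[^>]+>/g, '');

  // 3. Restore the original blocks. A replacer function is used so that
  //    "$" sequences inside code are not treated as replacement patterns.
  return stripped.replace(/\u0000FENCE(\d+)\u0000/g, (_, i) => saved[Number(i)]);
}

const input = `Intro <span>text</span>\n\n${FENCE}html\n<div>kept</div>\n${FENCE}\n`;
console.log(stripTagsPreservingFences(input));
```

The sketch assumes the token sequence never occurs in real page content; a production version would need a collision-safe token.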

3. Minor: leading \n on first <p>

Adding a leading \n to every <p> means the very first element produces an extra newline at the start of the output. The final markdown.trim() handles this, so there's no user-visible issue — just something to be aware of in the intermediate state.


Overall this is a solid fix. Once the list/table block separation consistency (item 1) is addressed (or confirmed acceptable), this looks good to merge.

hodlen (Contributor, Author) commented Mar 18, 2026

Thanks for the thorough review!

Regarding observation 1: the implicit trailing \n from each list item means '\n' + listItems already produces '\n…\n', the same shape as paragraphs and tables. Adjacent blocks (e.g. a list followed by a table) naturally combine to \n\n in between without a <p> intermediary. We can add a test to verify this behavior and guard against regressions if necessary.
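
As a rough illustration of that shape argument, assuming each list item and table row is emitted with its own trailing \n (`toList` and `toTable` are hypothetical stand-ins, not the project's converters, and the table output is simplified):

```javascript
// Hypothetical converters: a leading "\n" plus items that each end in "\n".
const toList = (items) => '\n' + items.map((item) => `- ${item}\n`).join('');
const toTable = (rows) =>
  '\n' + rows.map((cells) => `| ${cells.join(' | ')} |\n`).join('');

// A list immediately followed by a table, with no <p> in between.
const output = toList(['a', 'b']) + toTable([['x', 'y']]);

// The last item's trailing "\n" meets the table's leading "\n",
// which yields the blank line between the two blocks.
console.log(JSON.stringify(output));
```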

Observations 2 and 3 are both valid, but they're beyond the scope of this fix and would warrant a more substantial restructuring of the conversion pipeline. Happy to track them as separate issues if that's useful.

pchuri (Owner) left a comment

Good point on the list items — the implicit trailing newline does give us the same shape, so no change needed there. And agreed on 2 & 3 being separate concerns. LGTM!

@pchuri pchuri merged commit c39f388 into pchuri:main Mar 19, 2026
10 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 19, 2026
## [1.27.4](v1.27.3...v1.27.4) (2026-03-19)

### Bug Fixes

* preserve line breaks when converting HTML to markdown ([#79](#79)) ([c39f388](c39f388))
@github-actions

🎉 This PR is included in version 1.27.4 🎉

The release is available on:

Your semantic-release bot 📦🚀
