Markdown writer creates nested emph and strong sections #9521

CodeSandwich · 2024-02-27T15:10:47Z

Explain the problem.
The markdown writer doesn't catch nested emphasized and strong sections, leading to invalid formatting. Examples:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*A*B*C*
# What the result is for the markdown reader:
echo '*A*B*C*' | pandoc -f markdown -t html
<p><em>A</em>B<em>C</em></p>

echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
****A****
# What the result is for the markdown reader:
echo '****A****' | pandoc -f markdown -t html
<p>****A****</p>

echo '<em><em>A</em></em>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>

echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
**A*B*
# What the result is for the markdown reader:
echo '**A*B*' | pandoc -f markdown -t html
<p>**A<em>B</em></p>

Ideally the formatting state should be tracked and nested formatting that doesn't introduce any additional formatting should be a no-op.

Pandoc version?
Linux
pandoc 3.1.11.1
Features: +server +lua
Scripting engine: Lua 5.4

jgm · 2024-02-28T16:31:24Z

Note that these nestings will work for commonmark and derivatives (gfm etc.).
And we use the same writer (with parameters) for markdown and commonmark.

We could either try to make the markdown parser smarter about these nestings...
or adjust the writer so that, when it's producing pandoc markdown, it works around these issues, perhaps by using a _ for the outer emphasis.

If I recall correctly, the markdown parser was changed to ignore sequences of >= 4 *s in order to avoid exponential performance issues that can arise.
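
The writer-side workaround mentioned above (using a _ for the outer emphasis) could be sketched roughly as follows. This is only an illustration in Python, not pandoc's actual writer code: the Str and Emph classes and the render function are hypothetical stand-ins for pandoc's inline AST, and whether the markdown reader round-trips every output of this scheme is not verified here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


# Hypothetical, heavily simplified inline type (not pandoc's real AST).
Inline = Union[Str, Emph]


def render(inlines: List[Inline], depth: int = 0) -> str:
    """Render inlines, using '_' for the outermost emphasis and
    alternating with '*' at each deeper nesting level."""
    delim = "_" if depth % 2 == 0 else "*"
    parts = []
    for node in inlines:
        if isinstance(node, Str):
            parts.append(node.text)
        else:
            # Nested emphasis gets the other delimiter, so adjacent
            # markers from different levels never look identical.
            parts.append(delim + render(node.children, depth + 1) + delim)
    return "".join(parts)


# <em>A<em>B</em>C</em> from the first example
print(render([Emph([Str("A"), Emph([Str("B")]), Str("C")])]))  # prints _A*B*C_
```

Alternating delimiters keeps the nesting information in the output, at the cost of relying on the reader to treat _ and * as distinct emphasis markers.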

jgm · 2024-02-28T16:32:42Z

Note that the 3rd example will also cause problems for commonmark.

CodeSandwich · 2024-02-28T17:41:16Z

I don't think that it can be solved in the reader. The markdown syntax by design can't convey nested tags, and without a new syntax the meaning of * and ** can only be inferred from the context. For example, what does *A*B*C* mean? Should the A*B part go deeper into the nesting, or should it close the emphasized part? I think that the current approach, which is to close the emphasis, is the sane one.

I think that the writer can simply drop the inner formatting information. It will be lossy, but only for the structure, not for what the user will see after rendering. If this is the desired approach, then the above examples should behave like this, which IMO seems reasonable:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*ABC*
# What the result is for the markdown reader:
echo '*ABC*' | pandoc -f markdown -t html
<p><em>ABC</em></p>

echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>

echo '<em><em>A</em></em>' | pandoc -f html -t markdown
*A*
# What the result is for the markdown reader:
echo '*A*' | pandoc -f markdown -t html
<p><em>A</em></p>

echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
*AB*
# What the result is for the markdown reader:
echo '*AB*' | pandoc -f markdown -t html
<p><em>AB</em></p>
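
The "drop the inner formatting" idea above can be sketched as a pass that tracks which formatting is already active and splices redundant wrappers into their parent. This is a minimal Python illustration under assumptions, not pandoc's implementation: Str, Emph, Strong, flatten and render are hypothetical names modelling a tiny subset of an inline AST.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


@dataclass
class Strong:
    children: List["Inline"]


# Hypothetical, heavily simplified inline type (not pandoc's real AST).
Inline = Union[Str, Emph, Strong]


def flatten(inlines: List[Inline], active: FrozenSet[type] = frozenset()) -> List[Inline]:
    """Drop Emph/Strong wrappers whose formatting is already active
    in an enclosing node, keeping their children in place."""
    out: List[Inline] = []
    for node in inlines:
        if isinstance(node, Str):
            out.append(node)
        elif type(node) in active:
            # Redundant nesting: a no-op, so splice the children into the parent.
            out.extend(flatten(node.children, active))
        else:
            kind = type(node)
            out.append(kind(flatten(node.children, active | {kind})))
    return out


def render(inlines: List[Inline]) -> str:
    """Naive markdown rendering: '*' for Emph, '**' for Strong."""
    parts = []
    for node in inlines:
        if isinstance(node, Str):
            parts.append(node.text)
        elif isinstance(node, Emph):
            parts.append("*" + render(node.children) + "*")
        else:
            parts.append("**" + render(node.children) + "**")
    return "".join(parts)


# The four examples from this thread
print(render(flatten([Emph([Str("A"), Emph([Str("B")]), Str("C")])])))  # prints *ABC*
print(render(flatten([Strong([Strong([Str("A")])])])))                  # prints **A**
print(render(flatten([Emph([Emph([Str("A")])])])))                      # prints *A*
print(render(flatten([Emph([Emph([Str("A")]), Str("B")])])))            # prints *AB*
```

The pass is lossy only for structure: emphasis nested inside strong (or the reverse) is kept, and only a wrapper that repeats formatting already in effect is removed.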

CodeSandwich added the bug label Feb 27, 2024

tarleb added format:HTML reader labels Apr 30, 2024
