Markdown writer creates nested emph and strong sections #9521

Open
CodeSandwich opened this issue Feb 27, 2024 · 3 comments

CodeSandwich commented Feb 27, 2024

Explain the problem.
The markdown writer doesn't detect nested emphasized and strong sections, so it emits markup that the markdown reader parses differently from the original input. Examples:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*A*B*C*
# What the result is for the markdown reader:
echo '*A*B*C*' | pandoc -f markdown -t html
<p><em>A</em>B<em>C</em></p>
echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
****A****
# What the result is for the markdown reader:
echo '****A****' | pandoc -f markdown -t html
<p>****A****</p>
echo '<em><em>A</em></em>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>
echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
**A*B*
# What the result is for the markdown reader:
echo '**A*B*' | pandoc -f markdown -t html
<p>**A<em>B</em></p>

Ideally, the writer should track the current formatting state, and nested formatting that adds nothing new (emphasis inside emphasis, strong inside strong) should be a no-op.

Pandoc version?
Linux
pandoc 3.1.11.1
Features: +server +lua
Scripting engine: Lua 5.4

jgm (Owner) commented Feb 28, 2024

Note that these nestings will work for commonmark and derivatives (gfm etc.).
And we use the same writer (with parameters) for markdown and commonmark.

We could either try to make the markdown parser smarter about these nestings, or adjust the writer so that, when it's producing pandoc markdown, it works around these issues, perhaps by using a _ for the outer emphasis.

If I recall correctly, the markdown parser was changed to ignore sequences of >= 4 *s in order to avoid exponential performance issues that can arise.
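
The writer-side workaround mentioned above (a different delimiter for the outer emphasis) could look roughly like the sketch below. It is Python rather than pandoc's Haskell, and `Str`, `Emph` and `render_alt` are hypothetical stand-ins, not pandoc's AST or writer API. It switches between `_` and `*` at each nesting level and only uses `_` at the top level when nested emphasis is actually present. Whether pandoc's markdown reader accepts the result in every context (intraword underscores, adjacent sibling emphasis) is not checked here.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


# Hypothetical miniature inline model: plain text and emphasis only.
Inline = Union[Str, Emph]


def render_alt(inlines: List[Inline], parent_delim: Optional[str] = None) -> str:
    """Render emphasis, alternating "_" and "*" between nesting levels."""
    out = []
    for node in inlines:
        if isinstance(node, Str):
            out.append(node.text)
            continue
        if parent_delim is None:
            # Top-level emphasis keeps "*" unless it directly contains
            # another Emph, in which case the outer one switches to "_".
            nested = any(isinstance(c, Emph) for c in node.children)
            delim = "_" if nested else "*"
        else:
            # Nested emphasis uses the delimiter its parent did not use.
            delim = "*" if parent_delim == "_" else "_"
        out.append(delim + render_alt(node.children, delim) + delim)
    return "".join(out)


# <em>A<em>B</em>C</em> from the first example in this issue
print(render_alt([Emph([Str("A"), Emph([Str("B")]), Str("C")])]))
```

Unlike dropping the inner emphasis, this keeps the nesting in the output, at the cost of relying on the reader treating `_` and `*` as independent delimiters.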

jgm (Owner) commented Feb 28, 2024

Note that the 3rd example will also cause problems for commonmark.

CodeSandwich (Author) commented Feb 28, 2024

I don't think that it can be solved in the reader. The markdown syntax by design can't convey nested tags, and without a new syntax the meaning of * and ** can only be inferred from the context. For example, what does *A*B*C* mean? Should the second * open a deeper level of nesting, or should it close the emphasized part? I think that the current approach, which is to close the emphasis, is the sane one.

I think that the writer can simply drop the inner formatting information. It will be lossy, but only for the structure, not for what the user will see after rendering. If this is the desired approach, then the above examples should behave like this, which IMO seems reasonable:

echo '<em>A<em>B</em>C</em>' | pandoc -f html -t markdown
*ABC*
# What the result is for the markdown reader:
echo '*ABC*' | pandoc -f markdown -t html
<p><em>ABC</em></p>
echo '<strong><strong>A</strong></strong>' | pandoc -f html -t markdown
**A**
# What the result is for the markdown reader:
echo '**A**' | pandoc -f markdown -t html
<p><strong>A</strong></p>
echo '<em><em>A</em></em>' | pandoc -f html -t markdown
*A*
# What the result is for the markdown reader:
echo '*A*' | pandoc -f markdown -t html
<p><em>A</em></p>
echo '<em><em>A</em>B</em>' | pandoc -f html -t markdown
*AB*
# What the result is for the markdown reader:
echo '*AB*' | pandoc -f markdown -t html
<p><em>AB</em></p>
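
The "drop the inner formatting" behaviour proposed above can be sketched as a renderer that carries the current formatting state down the tree. This is only an illustration in Python, not pandoc's Haskell writer: `Str`, `Emph`, `Strong` and `render` are hypothetical names, and the model ignores escaping and every other inline type.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


@dataclass
class Strong:
    children: List["Inline"]


# Hypothetical miniature inline model, not pandoc's real AST.
Inline = Union[Str, Emph, Strong]


def render(inlines: List[Inline], in_emph: bool = False,
           in_strong: bool = False) -> str:
    """Render to markdown; redundant nested Emph/Strong emit no delimiters."""
    out = []
    for node in inlines:
        if isinstance(node, Str):
            out.append(node.text)
        elif isinstance(node, Emph):
            inner = render(node.children, True, in_strong)
            # Already emphasized: the nested Emph is a no-op.
            out.append(inner if in_emph else "*" + inner + "*")
        elif isinstance(node, Strong):
            inner = render(node.children, in_emph, True)
            # Already strong: the nested Strong is a no-op.
            out.append(inner if in_strong else "**" + inner + "**")
    return "".join(out)


# The four inputs from the examples above, in the same order
print(render([Emph([Str("A"), Emph([Str("B")]), Str("C")])]))
print(render([Strong([Strong([Str("A")])])]))
print(render([Emph([Emph([Str("A")])])]))
print(render([Emph([Emph([Str("A")]), Str("B")])]))
```

On this toy model the four inputs render as `*ABC*`, `**A**`, `*A*` and `*AB*`, matching the outputs proposed above. The structure is lost, but the rendered appearance is not.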
