gfm parsing oddity with links and raw HTML #147

Closed
jgm opened this issue Feb 6, 2024 · 6 comments

jgm commented Feb 6, 2024

Note: this may only affect platforms with CR+LF line endings.

Discussed in jgm/pandoc#9406

Originally posted by TripleCamera February 3, 2024
Hi. I am using pandoc to convert markdown to html. For the following lines:

aaabbb

aaa<span></span>bbb

[link](https://baidu.com)aaabbb

[link](https://baidu.com)aaa<span></span>bbb

When the source language is commonmark, the raw HTML tags are preserved when following a link:

<p>aaabbb</p>
<p>aaa<span></span>bbb</p>
<p><a href="https://baidu.com">link</a>aaabbb</p>
<p><a href="https://baidu.com">link</a>aaa<span></span>bbb</p>

However, when the source language is gfm, they are escaped:

<p>aaabbb</p>
<p>aaa<span></span>bbb</p>
<p><a href="https://baidu.com">link</a>aaabbb</p>
<p><a
href="https://baidu.com">link</a>aaa&lt;span&gt;&lt;/span&gt;bbb</p>

I have read the specs and couldn't find any difference for links & raw HTML. Is this a bug in Pandoc?

jgm commented Feb 6, 2024

As noted in the linked discussion, this only affects parsing with CR+LF line endings.
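
To compare the two line-ending cases directly, the sample from the original post can be generated in both forms. This is a minimal Python sketch and not part of the original report; the name `build_sample` is hypothetical.

```python
# The four sample paragraphs from the original post.
SAMPLE_LINES = [
    "aaabbb",
    "aaa<span></span>bbb",
    "[link](https://baidu.com)aaabbb",
    "[link](https://baidu.com)aaa<span></span>bbb",
]


def build_sample(newline: str) -> bytes:
    """Return the sample document as bytes, using `newline` as the line ending.

    Paragraphs are separated by one blank line, and the document ends
    with a single line ending.
    """
    text = (newline * 2).join(SAMPLE_LINES) + newline
    return text.encode("utf-8")


lf_bytes = build_sample("\n")      # Unix-style line endings
crlf_bytes = build_sample("\r\n")  # Windows-style line endings
```

Writing each byte string to its own file in binary mode and passing the files to `pandoc -f gfm -t html` should show whether the escaping depends on the line ending.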

The issue may be related to #136

jgm changed the title from "gfm/commonmark parsing oddity with links and raw HTML" to "gfm parsing oddity with links and raw HTML" on Feb 6, 2024

jgm commented Feb 6, 2024

Observations:

  1. This bug appears with -f gfm but NOT -f commonmark. So it has to do with an extension. Need to isolate which extension with further testing.

  2. The bug is in commonmark-hs, not pandoc itself.

    % echo -e "[link](https://baidu.com)aaa<span></span>bbb\n" | commonmark -xgfm
    <p><a href="https://baidu.com">link</a>aaa&lt;span&gt;&lt;/span&gt;bbb</p>
    
  3. I can reproduce it even with LF line endings using commonmark-cli, so I'm not sure why things seem different with pandoc.

I will transfer this to commonmark-hs.

jgm transferred this issue from jgm/pandoc on Feb 6, 2024

jgm commented Feb 6, 2024

Using -xautolinks instead of -xgfm produces the issue. So it can be attributed to the autolinks extension.
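
The isolation step can be sketched as a loop that enables one extension at a time and checks whether the `<span>` comes out entity-escaped. This is a rough Python sketch under stated assumptions: it relies only on the `commonmark -x<name>` invocation shown in this thread, the helper names are hypothetical, and every extension name in `CANDIDATES` other than `autolinks` is an assumption that should be checked against the installed commonmark-cli.

```python
import subprocess
from typing import Callable, Iterable, List

# The failing input from this thread.
SAMPLE = "[link](https://baidu.com)aaa<span></span>bbb\n"

# Only "autolinks" is confirmed in this thread; the other names are assumptions.
CANDIDATES = ["autolinks", "pipe_tables", "strikethrough", "task_lists"]


def run_commonmark(extension: str, source: str) -> str:
    """Render `source` with commonmark-cli, enabling a single extension."""
    result = subprocess.run(
        ["commonmark", "-x" + extension],
        input=source,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def escapes_raw_html(html: str) -> bool:
    """Return True if the inline <span> tag was emitted as escaped text."""
    return "&lt;span&gt;" in html


def find_culprits(
    extensions: Iterable[str],
    render: Callable[[str, str], str] = run_commonmark,
    source: str = SAMPLE,
) -> List[str]:
    """Return the extensions whose output escapes the raw HTML in `source`."""
    return [ext for ext in extensions if escapes_raw_html(render(ext, source))]
```

With commonmark-cli on the `PATH`, `find_culprits(CANDIDATES)` would list the extensions that trigger the escaping. The `render` parameter is there so the check can be exercised with a stand-in renderer when the binary is not available.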

jgm added a commit that referenced this issue Feb 7, 2024

jgm commented Feb 7, 2024

The code for the autolinks extension is quite bad and needs work!
There is an extensive set of tests here that we might attend to.
And here is a syntax: https://unifiedjs.com/explore/package/micromark-extension-gfm-autolink-literal/#syntax

Some work in issue147 branch.

jgm closed this as completed in f0b9653 on Feb 7, 2024

TripleCamera commented Feb 12, 2024

Thank you. 🥰
