Incorrect smart quote conversion for input HTML with markup #3424

Open
lisaah opened this issue Feb 6, 2017 · 3 comments

lisaah commented Feb 6, 2017

Version: 1.19.2.1
Command: pandoc -o output.html input.html -S

input.html

"Hello <em>world</em>..." "Hello world..."

"Hello world..." "Hello world..."

output.html

&quot;Hello <em>world</em>…&quot; “Hello world…”

"Hello world…" “Hello world…”

Expected:

“Hello <em>world</em>…” “Hello world…”

“Hello world…” “Hello world…”

Having any HTML tag inside the quoted text seems to break the smart quote conversion. The ellipsis conversion seems fine. The smart quotes convert correctly when a Markdown version of this input is used (e.g. "Hello *world*..." "Hello world..."). Am I missing something, or is this a bug?

Owner

jgm commented Feb 6, 2017

As a workaround try converting to markdown without --smart, then converting the result with --smart.

Author

lisaah commented Feb 7, 2017

Hah, yep, that workaround will do for now. Thanks for looking into it!

Owner

jgm commented Jan 11, 2019

The issue is that in the HTML reader we apply smartPunctuation only in parsing tag contents. So it only works, as it were, between tags. The reasons for this are a bit complex: the HTML reader parses a string of tokens produced by an HTML5 tokenizer. So we can't use our existing smart punctuation code, which operates on strings, on that -- but we can use it on the tag contents.

One could ~~simply~~ duplicate the smart punctuation parsing logic using token parsers in the HTML reader, for a better solution. (Crossed out 'simply' because it's not that simple; I guess it would require splitting tag contents so that quotes were separately recognizable tokens.)
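
To make the explanation above concrete, here is a rough toy model in Python. It is not pandoc's code: `smart_text`, `per_text_node`, and `across_tags` are hypothetical names, the regex-based tag splitting stands in for a real HTML tokenizer, and the naive open/close alternation is an assumption made only for illustration. The first function applies string-based smart punctuation to each run of text between tags, which leaves quotes straight when a tag sits between the opening and closing quote. The second splits text so that each quote is its own token and pairs quotes across tags.

```python
import re

TAG = re.compile(r'(<[^>]+>)')  # crude tag matcher, good enough for the sample


def smart_text(text: str) -> str:
    """Toy string-based smart punctuation (not pandoc's implementation)."""
    # Convert a quote pair only when both quotes occur in this one string:
    # the opening quote must be followed by a non-space character and the
    # closing quote must be preceded by one.
    text = re.sub(r'"(?=\S)([^"]*)(?<=\S)"', r'“\1”', text)
    return text.replace("...", "…")


def per_text_node(html: str) -> str:
    """Apply smart_text to each run of text between tags, independently."""
    parts = TAG.split(html)  # tags are kept as separate items
    return "".join(p if p.startswith("<") else smart_text(p) for p in parts)


def across_tags(html: str) -> str:
    """Treat every straight quote as its own token and pair quotes across tags."""
    out, inside = [], False
    for part in TAG.split(html):
        if part.startswith("<"):
            out.append(part)  # tags pass through untouched
            continue
        for tok in re.split(r'(")', part):
            if tok == '"':
                # Assumption: quotes simply alternate between opening and closing.
                out.append("”" if inside else "“")
                inside = not inside
            else:
                out.append(tok.replace("...", "…"))
    return "".join(out)


sample = '"Hello <em>world</em>..." "Hello world..."'
print(per_text_node(sample))  # "Hello <em>world</em>…" “Hello world…”
print(across_tags(sample))    # “Hello <em>world</em>…” “Hello world…”
```

The first result mirrors the reported output (straight quotes around the span containing `<em>`, correct ellipses), and the second matches the expected output. A real fix would also have to handle apostrophes, single quotes, and nesting, which this sketch ignores.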
