HTML entities without semicolon not parsed when written in one word with other text #53

Open
vshabanov opened this issue May 6, 2016 · 3 comments

Comments

@vshabanov
Contributor

I'm not sure whether this is HTML5 compatible, but text like &micrometer and rock&amproll should be parsed as µmeter and rock&roll. All major browsers do this.

You can see some test cases in the testUnescapeHtml function in https://github.com/vshabanov/fast-tagsoup/blob/master/Text/HTML/TagSoup/Test.hs
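
For a quick cross-check outside the browser, here is a minimal sketch using Python's standard-library `html.unescape`, which is documented as following the HTML 5 rules for both valid and invalid character references. It is not tagsoup and says nothing about tagsoup's internals; it only shows the decoding the two examples above are expected to produce. The `samples` and `decoded` names, and the semicolon-terminated control case, are illustrative additions rather than part of the linked test suite.

```python
import html

# Inputs from this issue, plus a semicolon-terminated control case
# (the control case is an illustrative addition).
samples = ["&micrometer", "rock&amproll", "&micro;meter"]

# html.unescape resolves legacy names such as "micro" and "amp" even
# when the semicolon is missing and more letters follow the name.
decoded = {s: html.unescape(s) for s in samples}

for source, result in decoded.items():
    print(f"{source!r} -> {result!r}")
```

All three inputs decode to text with the entity replaced, so the missing semicolon does not stop the reference from being recognised.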

@ndmitchell
Owner

Thanks for the info. Confirmed in all HTML5-compliant browsers, so I expect the spec does say that.

@seagreen
Contributor

seagreen commented Jun 8, 2016

For reasons described here (scroll down to "Errors involving fragile syntax constructs"), using a named character reference not followed by a semicolon is an error in HTML 5. Here's the actual part of the document describing how to interpret character references -- it definitely requires the following semicolon.

However, since tagsoup is designed to deal with real-world HTML, and Firefox and Chromium both unescape &micrometer to µmeter for me, it seems reasonable that tagsoup should do the same.

@ndmitchell
Owner

The hope was that tagsoup would follow HTML5, and all browser authors would also follow HTML5 - that's the whole purpose of HTML5 - specifying the corner cases. But I guess this is one case where they diverged 😞
