--preserve-entities clashes with --quote-ampersand #207
Comments
The example, corrected:
<html><body>
a & b
</body></html> |
@pjuhasz yes, it feels a bit like a To look at it another way. If you want to just change the ampersand handling, the Maybe it could be changed to If you would like to find and fix this, will certainly consider the patch/PR. See lexer.c(1075), pprint.c(750-751)... Pages like this http://dev.w3.org/html5/html-author/charref show there are hundreds of entities. Maybe we should have What do others think about this? |
Is a naked ampersand a valid entity? AFAIK in HTML 4.x and below, naked ampersands are not allowed, they have to be escaped as
So there is also the question of differentiating between ambiguous and unambiguous ampersands, and perhaps treating them differently. |
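To make the ambiguous/unambiguous distinction concrete, here is a minimal sketch in Python. It assumes the HTML5 definition in which an ampersand is ambiguous when it is followed by one or more ASCII alphanumerics and a semicolon, and that run does not match a known named character reference. The function name `ambiguous_ampersands` is hypothetical, and this is not how tidy's lexer does it; it only illustrates the rule being discussed.

```python
import re
from html.entities import html5  # table of HTML5 named character references

# An ampersand, one or more ASCII alphanumerics, then a semicolon.
_CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def ambiguous_ampersands(text):
    """Return the entity-shaped runs in text that match no known HTML5 name.

    Hypothetical helper for illustration only; not part of tidy.
    """
    return [
        m.group(0)
        for m in _CANDIDATE.finditer(text)
        if m.group(1) + ";" not in html5
    ]
```

Under this definition the `a & b` and `P&<li>O</li>` inputs from this issue contain no ambiguous ampersands, while something like `&foo;` would be reported.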
@pjuhasz you make a very good point, but this is separate from Right now tidy defaults to html5++ mode, only falling back to legacy html4-- mode if a doctype indicates that... So it seems tidy5, still in html5++ mode, should differentiate between ambiguous and unambiguous ampersands, and probably not even issue a warning in the latter case. So maybe the default See lexer.c(2756) where the switch takes place, and tags.c(739) for the service that does this... And searching around found another definition an
Could you add the URL where you found your quote. And this would involve considering things like I will continue to read and look at this more soonest, unless you, or someone else, beat me to it with a PR ;=)) |
The source of my quote: https://mathiasbynens.be/notes/ambiguous-ampersands The author makes other observations about entities without trailing semicolons, encoded urls and ampersands inside attributes that make me want to curl up in a corner. I've tried to look at the code to try to fix the issue here, but apparently it's more complicated than a five minute touch-and-go, and unfortunately I won't have time for more next week. |
@pjuhasz, yes reading your link does make your head spin ;=)) But maybe we are getting too deep here! I constructed a bigger test sample, in_207-2.html. Passed it to the validator, adding what it said, and through Tidy, default config - it warned and fixed all...
In each case, tidy warns and fixes, and the validator agrees except in 3 cases - It should be noted that the current default output from tidy PASSES validation completely.
While it does not mean a lot, the view in a browser appears exactly the same for the input and the output. And interestingly, it appears the browser removes the escape from the links. So, if we concentrate on these three differences we should be a long way down the road. I too may not get much time to work on this this week, but will try some things soonest. It would be great if you or others could also experiment... |
html5 allows a naked ampersand unquoted, and now tidy will not issue a warning. This only deals with a & b, and P&<li>O</li>. More may need to be done for other cases.
@pjuhasz, ok firing the first salvo across the bows of these tricky ampersand problems ;=)) Here I ONLY deal with two(2) cases of what has been called an
Now, if still in html5 mode, tidy will NOT issue a warning for these when found first in lexer.c(1060), and pprint.c(989) will output them as is, whether --quote-ampersand is on or not. As iterated above there are still other cases to be tested and fixed, but this seems a good start. Have bumped the version to 4.9.37 for this change. And am closing this for now as feel this is the bulk of the ampersand problem, but feel free to re-open, or post other specific ampersand problems. |
Recent version 5.0.0 of tidy seems to treat ampersands a bit differently. It drops the escaping of non-ambiguous ampersands in particular. That is technically correct in HTML5, see the relevant spec definition in particular: http://www.w3.org/TR/html5/syntax.html#syntax-ambiguous-ampersand I would personally prefer tidy not to update entities that we chose to escape though. The "preserve-entities" setting does just that. This was initially raised in w3c#164 w3c#164 (comment) FWIW, the change in Tidy may be due to: htacg/tidy-html5#207
HTML5 defines an ampersand followed by whitespace to be unambiguously an ampersand, matching what browsers have always done in practice. As a result, tidy-html5 does not warn about them when the doctype is either HTML5 or missing (lack of a DOCTYPE is treated as HTML5, on the basis that HTML5 is a closer match for what browsers actually do than any previous standard). Discussion here: <htacg/tidy-html5#207> Adding the DOCTYPE throws off some of the line numbering, which needs adjusting. t/ignore-text.t also seems to rely on the missing DOCTYPE provoking a warning, which is obviously not going to happen now that we've added one, to be able to verify that case-insensitive ignoring can work. Add a new error so we can ignore that instead. Signed-off-by: Simon McVittie <smcv@debian.org>
Example HTML:
a & b
Command line:
tidy5 --quote-ampersand yes --preserve-entities yes foo.html > tidied.html
I would expect it to convert the unescaped & to &amp;, but it does not if --preserve-entities is set.
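For reference, the behaviour expected from combining the two options can be modelled with a short Python sketch: leave anything already shaped like a character reference alone, and escape every other ampersand. The function name `quote_bare_ampersands` is hypothetical, and the sketch treats any `&name;`, `&#NN;` or `&#xHH;` run as an entity to preserve without checking that the name is a known one. It describes the expected output only, not tidy's implementation.

```python
import re

# An ampersand that does NOT start something shaped like a character
# reference (named, decimal, or hexadecimal, each ending in a semicolon).
_BARE_AMP = re.compile(r"&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[0-9A-Za-z]+);)")


def quote_bare_ampersands(text):
    """Escape bare ampersands as &amp; and leave existing references as is.

    Hypothetical model of the expected result; not tidy's code.
    """
    return _BARE_AMP.sub("&amp;", text)
```

With the example input above, `a & b` would become `a &amp; b`, while an already escaped `a &amp; b` would pass through unchanged.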