Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--preserve-entities clashes with --quote-ampersand #207

Closed
pjuhasz opened this issue Apr 26, 2015 · 7 comments
Closed

--preserve-entities clashes with --quote-ampersand #207

pjuhasz opened this issue Apr 26, 2015 · 7 comments
Labels
Milestone

Comments

@pjuhasz
Copy link

pjuhasz commented Apr 26, 2015

Example HTML:

a & b

Command line:
tidy5 --quote-ampersand yes --preserve-entities yes foo.html > tidied.html

I would expect that it convert the unescaped & to & but it does not if --preserve-entities is set.

@pjuhasz
Copy link
Author

pjuhasz commented Apr 26, 2015

The example, corrected:

<html><body>
a & b
</body></html>

@geoffmcl
Copy link
Contributor

@pjuhasz yes, it feels a bit like a clash, but not really...

To look at it another way. --quote-ampersand yes is the DEFAULT, so when you add --preserve-entities yes you are asking all such entities be preserved, including the ampersand.

If you want to just change the ampersand handling, the --quote-ampersand no will do that.

Maybe it could be changed to --preserve-entities yes only effects all entities except the ampersand, and --quote-ampersand no|yes only effect one, the &.

If you would like to find and fix this, will certainly consider the patch/PR. See lexer.c(1075), pprint.c(750-751)...

Pages like this http://dev.w3.org/html5/html-author/charref show there are hundreds of entities. Maybe we should have --quote-ampersand ONLY deal with one, the &, while --preserve-entities deals with all the rest, so we have no overlap between these two options.

What do others think about this?

@pjuhasz
Copy link
Author

pjuhasz commented Apr 26, 2015

Is a naked ampersand a valid entity?

AFAIK in HTML 4.x and below, naked ampersands are not allowed, they have to be escaped as &amp;, while in HTML 5 they are allowed if they are not ambiguous.

An ambiguous ampersand is a U+0026 AMPERSAND (&) character
 that is not the last character in the file, that is not followed by a space character, 
that is not followed by a start tag that has not been omitted, 
and that is not followed by another U+0026 AMPERSAND (&) character.

So there is also the question of differentiating between ambiguous and unambiguous ampersands, and perhaps treating them differently.

@geoffmcl
Copy link
Contributor

@pjuhasz you make a very good point, but this is separate from clashing options...

Right now tidy defaults to html5++ mode, only falling back to legacy html4-- mode if a doctype indicates that...

So it seems tidy5, still in html5++ mode, should differentiate between ambiguous and unambiguous ampersands, and probably not even issue a warning in the latter case.

So maybe the default --quote-ampersand should be off for this case, and only toggled to on if switching to legacy mode.

See lexer.c(2756) where the switch takes place, and tags.c(739) for the service that does this...

And searching around found another definition an ambiguous ampersand here - http://www.w3.org/TR/html-markup/syntax.html#syntax-ambiguous-ampersand

An ambiguous ampersand is an "&" character followed by one or more characters
in the range "0" to "9", the range "a" to "z", or the range "A" to "Z", 
followed by a ";" (semicolon) character, where these characters do not match 
any of the names given in the “Named character references” section of the 
HTML5 specification [HTML5].

Could you add the URL where you found your quote.

And this would involve considering things like href="u.r.l?a=b&2", src="u.r.l/foo.jpg&h=96", and other examples...

I will continue to read and look at this more soonest, unless you, or someone else, beat me to it with a PR ;=))

@geoffmcl geoffmcl added the Bug label Apr 26, 2015
@geoffmcl geoffmcl added this to the 5.0.0 milestone Apr 26, 2015
@pjuhasz
Copy link
Author

pjuhasz commented Apr 26, 2015

The source of my quote:

https://mathiasbynens.be/notes/ambiguous-ampersands

The author makes other observations about entities without trailing semicolons, encoded urls and ampersands inside attribures that make me want to curl up in a corner.

I've tried to look at the code to try to fix the issue here, but apparently it's more complicated than a five minute touch-and-go, and unfortunately I won't have time for more next week.

@geoffmcl
Copy link
Contributor

@pjuhasz, yes reading your link does make your head spin ;=))

But maybe we are getting too deep here! I constructed a bigger test sample, in_207-2.html. Passed it to the validator, adding what it said, and through Tidy, default config - it warned and fixed all...

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>ambiguous/unambiguous ampersand #207</title>
</head>
<body>
<ul>
<li>Bare unambiguous ampersand - a & b - vok - TWF</li>
<li>Ambiguous or not?</li>
<li>
<ul>
<li>&123 - VF - TWF</li>
<li>foo&<i>bar</i> - vok - TWF</li>
<li>foo&&& bar - vok - TWF</li>
<li><a href="http://example.com/?x=l&y=2">foo</a> - VF - TWF</li>
<li><a href="http://example.com/?x=l&2">bar</a> - VF - TWF</li>
<li>foo &0 bar - VF - TWF</li>
<li>foo &lolwat bar - VF - TWF</li>
</ul>
</li>
<li>Missing semi-colon</li>
<li>
<ul>
<li> &amp - VF - TWF</li>
<li> &copy - VF - TWF</li>
<li> &not - VF - TWF</li>
</ul>
</li>
</ul>
<p>Key - VF=Validator Fail, vok=validator ok, TWF=Tidy Warn and fix</p>
</body>
</html>

In each case, tidy warns and fixes, and the validator agrees except in 3 cases - a & b, foo&<i>bar</i>, and foo&&& bar, which the validator accepts. So it seems initially ONLY in these three cases should we try to fix tidy.

It should be noted that the current default output from tidy PASSES validation completely.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>ambiguous/unambiguous ampersand #207</title>
</head>
<body>
<ul>
<li>Bare unambiguous ampersand - a &amp; b - vok - TWF</li>
<li>Ambiguous or not?</li>
<li>
<ul>
<li>&amp;123 - VF - TWF</li>
<li>foo&amp;<i>bar</i> - vok - TWF</li>
<li>foo&amp;&amp;&amp; bar - vok - TWF</li>
<li><a href="http://example.com/?x=l&amp;y=2">foo</a> - VF -
TWF</li>
<li><a href="http://example.com/?x=l&amp;2">bar</a> - VF - TWF</li>
<li>foo &amp;0 bar - VF - TWF</li>
<li>foo &amp;lolwat bar - VF - TWF</li>
</ul>
</li>
<li>Missing semi-colon</li>
<li>
<ul>
<li>&amp; - VF - TWF</li>
<li>© - VF - TWF</li>
<li>¬ - VF - TWF</li>
</ul>
</li>
</ul>
<p>Key - VF=Validator Fail, vok=validator ok, TWF=Tidy Warn and
fix</p>
</body>
</html>

While it does not mean a lot, the view in a browser appears exactly the same for the input and the output. And interestingly, if appears the browser removes the escape from the links.

So, if we concentrate on these three differences we should be a long way down the road.

I too may not get much time to work on this this week, but will try some things soonest. It would be great if you or others could also experiment...

geoffmcl added a commit that referenced this issue Jun 24, 2015
html5 allows a naked ampersand unquoted, and now tidy will not issue a
warning. This only deals with a & b, and P&<li>O</li>

More may need to be done for other cases.
geoffmcl added a commit that referenced this issue Jun 24, 2015
@geoffmcl
Copy link
Contributor

@pjuhasz, ok firing the first salvo across the bows of this tricky ampersand problems ;=))

Here I ONLY deal with two(2) cases of what has been called an unambiguous ampersand, which are allowed in html5 mode without quoting. The two cases are shown in this sample.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test 3 - unambiguous ampersands</title>
</head>
<body>
<p>f&<i>b</i></p>
<p>A & B</p>
</body>
</html>

Now, if still in html5 mode, tidy will NOT issue a warning for these when found first in lexer.c(1060), and pprint.c(989) will output them as is, whether --quote-ampersand is on or not.

As iterated above there are still other cases to be tested and fixed, but this seems a good start.

Have bumped the version to 4.9.37 for this change.

And am closing this for now as feel this is the bulk of the ampersand problem, but feel free to re-open, or post other specific amerpsand problems.

tidoust added a commit to tidoust/presentation-api that referenced this issue Aug 21, 2015
Recent version 5.0.0 of tidy seems to treat ampersands a bit differently. It
drops the escaping of non-ambiguous ampersands in particular.

That is technically correct in HTML5, see the relevant spec definition in
particular:
http://www.w3.org/TR/html5/syntax.html#syntax-ambiguous-ampersand

I would personally prefer tidy not to update entitites that we chose to escape
though. The "preserve-entities" setting does just that.

This was initially raised in w3c#164
w3c#164 (comment)

FWIW, the change in Tidy may be due to:
htacg/tidy-html5#207
smcv added a commit to smcv/html-tidy that referenced this issue Jul 22, 2016
HTML5 defines an ampersand followed by whitespace to be unambiguously
an ampersand, matching what browsers have always done in practice.
As a result, tidy-html5 does not warn about them when the doctype
is either HTML5 or missing (lack of a DOCTYPE is treated as HTML5,
on the basis that HTML5 is a closer match for what browsers actually
do than any previous standard). Discussion here:
<htacg/tidy-html5#207>

Adding the DOCTYPE throws off some of the line numbering, which needs
adjusting.

t/ignore-text.t also seems to rely on the missing DOCTYPE provoking a
warning, which is obviously not going to happen now that we've
added one, to be able to verify that case-insensitive ignoring
can work. Add a new error so we can ignore that instead.

Signed-off-by: Simon McVittie <smcv@debian.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants