Tidy fails if html contains a section <![endif]—> #487
Comments
@rkopaliani thanks for your issue, but I am confused over a few things here...
After a few attempts I was able to construct an input which passes W3C validation:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<title>Issue #487-2</title>
<!--[if mso]>
<![end if]-->
<body>
<p> Yada yada yada </p>
</body>
</html>
And the latest Tidy also has no problem with this:
tidy input5\in_487-2.html
Info: Doctype given is "-//W3C//DTD HTML 4.01 Transitional//EN"
Info: Document content looks like HTML 4.01 Strict
No warnings or errors were found.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.3.15">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Issue #487-2</title>
<!--[if mso]>
<![end if]-->
</head>
<body>
<p>Yada yada yada</p>
</body>
</html>
About HTML Tidy: https://github.com/htacg/tidy-html5
etc...
But could never get [...]
And as you point out, tidy just eats all the [...]
Tidy also has no problem if I put the [...]
I have also re-tested issue #153 with current tidy, and there still seems to be no problem there, so maybe these are not related...
So I am afraid you really need to clarify what you want to do! The first thing would be to give a simple sample document that passes https://validator.w3.org/#validate_by_upload, then show what tidy does wrong, and what you expect of tidy... thanks...
Hi again @geoffmcl
I am totally new to Tidy, so just to clarify: is Tidy expected to cut out the content in body after [...]
because this HTML is itself not valid?
@rkopaliani, are you trying to suggest that Tidy should catch instances of em-dash where hyphens are supposed to be used? Are you using some non-ASCII text editor that's converting [...]
I agree that Tidy should report some type of error when encountering what is actually garbage, though.
@balthisar yep, that's actually what I'm trying to suggest, sorry for being unclear. I am trying to use tidylib to clean up HTML for my project, and unfortunately I don't have any control over what kind of HTML I receive.
@rkopaliani, @balthisar, thanks for the added comments...
I have now had time to do a careful MSVC debug session, using your sample case, for a few hours, very slow, meticulous and tedious, and here is what I found, in detail...
In lexer.c, in GetTokenFromStream(), on finding a [...]
Now if the next character is a hyphen, [...]
But in this case it is an em dash, dec 151, 0x97, which, if the [...]
If this next had been an [...]
But with this [...]
On reaching the next [...]
Tidy will now swallow the newline, and leading spaces to get to the next char, which is another [...]
Since the next is a [...]
The next character is an [...]
So on getting the [...]
So the 0x97, dec 151, em dash becomes Win2Unicode[151 - 128], i.e. 0x2014. I do not know if the table is right or wrong, but this page - http://www.fileformat.info/info/unicode/char//2014/index.htm - seems to suggest that it is correct...
Now the weird part! Tidy just puts this 0x2014 back - BUT the problem, in this case, is it has already passed the [...]
At that time tidy was in [...]
So, in addition to the above warning about eaten content, I can at least see another warning, something like, "Reached EOF in the get 'section' state, while looking for [...]
Or alternatively, know, or keep the memory, that we have seen a [...]
And also need to ponder on why the first, agreed badly formed so called [...]
On some other thing read in the comments...
More on item 2. There are some utf-8 checkers out there. I have some messy code in https://github.com/geoffmcl/utf8-test, but I am sure there may be others...
Tidy defaults to utf-8, and will scream if passed an invalid sequence, 0x97 in this case, unless explicitly configured to expect other than fully correct utf-8 sequences. So you would need some pre-file processing before passing this to tidy, with the correct [...]
So yes, I now see this as a bug, and am marking it so. But as stated, I need to think a little about a solution, or at least a clear warning or two, about why most of the document was thrown away...
Further comments would be appreciated... thanks...
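As a rough illustration of the "pre-file processing" idea, here is a minimal Python sketch. It assumes that any input which is not valid UTF-8 is really Windows-1252; the helper name `decode_for_tidy` is made up for this example and is not part of Tidy. It only repairs the encoding: the 0x97 byte still ends up as an em dash (U+2014), so the malformed delimiter itself is unchanged.

```python
def decode_for_tidy(raw: bytes) -> str:
    """Decode raw bytes so they can be re-encoded as clean UTF-8."""
    try:
        # Valid UTF-8 passes through untouched.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # Bytes that cp1252 leaves undefined become U+FFFD.
        return raw.decode("cp1252", errors="replace")


# A lone 0x97 byte is not a valid UTF-8 sequence, so the fallback is used.
sample = b"<![end if]\x97>"
text = decode_for_tidy(sample)
utf8_bytes = text.encode("utf-8")  # what would then be handed to Tidy
print(utf8_bytes)
```

The same decode step is also a quick cross-check that 0x97 maps to 0x2014 in Windows-1252, which agrees with the Win2Unicode result described above.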
@rkopaliani, @balthisar, @ralfjunker, I just noticed this is virtually the same as #462... and there a solution has been suggested, namely that tidy should allow [...]
Now that we have a cross-reference, I think one or the other should be closed, while we decide on a good solution for this...
Seek ideas, comments, patches or PR... thanks...
As we are about to release 5.4, moving this out to [...]
As this still appears to be open, moving the milestone out yet again...
@rkopaliani, @balthisar, @ralfjunker, please note the [...]
Maybe this will fix, change, this? Seek ideas, comments, testing, confirmation, patches, PR, etc... thanks...
@rkopaliani, @balthisar, @ralfjunker, have done some preliminary re-testing... with good results...
WOW, re-running the first given sample, in_478.html, using an issue #462 patched tidy... a surprise! This turns out to be TWO, 2, different [...]
So the patch in #462 solves the second, and this fuller, additional patch solves BOTH:
diff --git a/src/lexer.c b/src/lexer.c
index ef70e13..49b74f5 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -2777,6 +2777,7 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
}
+ TY_(Report)(doc, NULL, NULL, MALFORMED_COMMENT_DROPPING); /* Is. #487 */
/* else swallow characters up to and including next '>' */
while ((c = TY_(ReadChar)(doc->docIn)) != '>')
@@ -3340,7 +3341,11 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
}
}
- if (c != ']')
+ if (c == '>')
+ {
+ /* Is. #462 - reached '>' before ']' */
+ TY_(UngetChar)(c, doc->docIn);
+ } else if (c != ']')
continue;
/* now look for '>' */
While one might argue, quibble, ... that the [...]
Will try to get around to incorporating this into a combined PR... unless someone beats me to it... could use some help here...
Look forward to further feedback, comments, testing, confirmation, other/alternate patches, PR, etc... thanks...
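To make the effect of the second hunk easier to see, here is a simplified Python model of the "look for ']' then '>'" loop. It is only a sketch, not the Tidy code: `scan_section` and its `patched` flag are hypothetical names, and CDATA handling and all other lexer state are left out. It assumes scanning starts just after the `<![`.

```python
def scan_section(text: str, pos: int, patched: bool) -> int:
    """Scan a '<![' section whose body starts at text[pos].

    Returns the index just past the end of the section, or len(text)
    if the end of input is reached first.
    """
    n = len(text)
    while pos < n:
        c = text[pos]
        if patched and c == ">":
            # Is. #462: reached '>' before ']', so the section ends here.
            return pos + 1
        pos += 1
        if c != "]":
            continue
        # Seen ']': now look for '>'.
        if pos < n and text[pos] == ">":
            return pos + 1
        # Not '>': keep scanning for another ']'.
    return n


# Input as it would look right after the '<![' of the reported sample.
sample = "end if]\u2014>\n<body>\n<p> Yada yada yada </p>\n</body>\n</html>\n"
unpatched_end = scan_section(sample, 0, patched=False)
patched_end = scan_section(sample, 0, patched=True)
print(repr(sample[unpatched_end:]))  # what is left for the lexer without the patch
print(repr(sample[patched_end:]))    # what is left with the patch
```

In this model the unpatched scan runs to the end of input, because the em dash sits between the `]` and the `>`, which matches the "rest of the document is eaten" symptom. With the extra `'>'` check the scan stops at the first `>` and the body survives.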
The warning message could perhaps be better worded, and maybe there should be another msg when a '>' is encountered while looking for a ']' in a MS Word section, and perhaps the section should be discarded... And perhaps it should be an error, to force the user to fix... But the fix is good as it is, and these issues can be dealt with later... And this fix is piggy backed on this PR, but it is likewise related to 'word-2000' option...
* Is. #896 - make 'bear' docs match code
* Is. #487 #462 add warn msg and do not get stuck until eof
@rkopaliani fix now merged, so closing this... thanks...
Hello. First of all, thank you for the great work.
I found a little bug almost identical to the old one, #153, but here the problem looks to be with the em dash.
I've constructed a little example:
<html>
<title> Some title </title>
<!—[if mso]>
<![end if]—>
<body>
<p> Yada yada yada </p>
</body>
</html>
In this case all content in body will be cut off.
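Since the HTML comes from sources outside my control, one possible stopgap is to flag these delimiters before the document ever reaches tidylib. The following Python sketch is only an illustration: the pattern and the function name `find_em_dash_delimiters` are my own, and it assumes the input has already been decoded to text. It looks for `<!` followed by an em dash (U+2014), or an em dash followed by `>`.

```python
import re

# "<!" followed by an em dash, or an em dash followed by ">".
EM_DASH_DELIMITER = re.compile("<!\u2014|\u2014>")


def find_em_dash_delimiters(html: str):
    """Return (offset, matched_text) for each suspicious delimiter."""
    return [(m.start(), m.group()) for m in EM_DASH_DELIMITER.finditer(html)]


sample = (
    "<html>\n"
    "<title> Some title </title>\n"
    "<!\u2014[if mso]>\n"
    "<![end if]\u2014>\n"
    "<body>\n"
    "<p> Yada yada yada </p>\n"
    "</body>\n"
    "</html>\n"
)
hits = find_em_dash_delimiters(sample)
for offset, matched in hits:
    print(offset, repr(matched))
```

This only reports where the suspicious sequences are. Whether to rewrite them to `--` or reject the document is a separate decision, since an em dash followed by `>` could in principle be legitimate text.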