"Malformed" Word 2000 sequence may cause Tidy to skip document content #462
Comments
@ralfjunker, thanks for the report... Dealing with If possible, please give a minimal Not sure why you include other Tidy parsers, like ASP, JSTE, PHP, ... advise more... But more specific information would also be helpful, like where, how, was this
Here is a minimal HTML document example:
<!-- Silence warnings. Doctype and title do not matter to the problem. -->
<!DOCTYPE html>
<title>Word 2000 Problem</title>
<![endif] EXTRA >
<p>Content</p>
I run Tidy 5.2.0 on Windows without options, but Git trunk behaves the same:
tidy.exe file.htm
This is the output I receive:
Notice that the The actual problem is in Lines 3196 to 3236 in fd0ccb2
I was able to work around the problem by introducing the previously mentioned new
@@ -3195,6 +3195,11 @@
}
}
+ if (!cfgBool(doc, TidyParseWord2000)) {
+ lexer->lexsize--;
+ TY_(AddCharToLexer)(lexer, '<');
+ TY_(AddCharToLexer)(lexer, '!');
+ TY_(AddCharToLexer)(lexer, '[');
+ TY_(AddCharToLexer)(lexer, c);
+ lexer->state = LEX_CONTENT;
+ continue;
+ }
+
if (c != ']')
continue;
Changes to
PS: The problem was presented to me by a customer as part of a larger document. I have no idea how it was generated. However, the origin of the document should not matter: such documents exist (or eventually will exist) in the wild, so we should be prepared to handle them well.
@ralfjunker just noted that this topic has more or less been raised again in #487... As suggested there one or the other should be closed while we seek a suitable solution for this And note maybe there can also be Seek ideas, comments, patches or PR... thanks... |
As we are about to release 5.4, moving this out to |
Although no further comments in a long time, appears still open, although also seems duplicate of #487, so moving out the milestone again... |
@ralfjunker took another look at this... and have another idea... Now do not think there is any need for a I was stuck by the comment, in the code, after receiving
Now the issue here is a malformed word 2000 embedded In the process can pass over multiple So on re-reading the this isn't quite right comment, decided maybe this is a BUG, and looked a Came up with a
diff --git a/src/lexer.c b/src/lexer.c
index ef70e13..61c28eb 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -3340,7 +3340,11 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
}
}
- if (c != ']')
+ if (c == '>')
+ {
+ /* Is. #462 - reached '>' before ']' */
+ TY_(UngetChar)(c, doc->docIn);
+ } else if (c != ']')
continue;
/* now look for '>' */
Simply, if we reach a With that patch, now the output changes, and, as can be seen, the trailing content of the example is not
<!-- Silence warnings. Doctype and title do not matter to the problem. -->
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.7.35.I896">
<title>Word 2000 Problem</title>
<![endif] EXTRA ]>
</head>
<body>
<p>Content</p>
</body>
</html>
Please ignore the tidy version number. I piggy backed this patch onto the So, I am removing the After some more testing, will try to get around to setting up a PR for this, to get it in to Meantime, look forward to further feedback, comments, patches, PR, testing, regressions, etc, etc... thanks...
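The effect of the patch above can be modelled in isolation. The following is only a sketch, not Tidy code: `skip_section_patched` is a hypothetical helper that takes the text immediately after `<![` and returns the offset of the first character following the section. It assumes a plain NUL-terminated string instead of Tidy's input stream, and it models `TY_(UngetChar)` by stepping the index back.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the patched section scan (not part of Tidy).
 * `s` is the text right after "<![". Returns the offset just past the
 * section, or the string length if no '>' is found.
 */
size_t skip_section_patched(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        char c = s[i++];

        if (c == '>') {
            i--;            /* "unget" the '>' so the check below sees it */
        } else if (c != ']') {
            continue;       /* ordinary character inside the section */
        }
        /* now look for '>' */
        if (s[i] == '>')
            return i + 1;   /* section ends here */
        /* ']' not followed by '>': keep scanning */
    }
    return i;               /* ran out of input */
}
```

For the input `endif] EXTRA >` followed by a newline and `<p>Content</p>`, this model ends the section at the first `>`, so the paragraph that follows remains available as content.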
@ralfjunker see the fuller, more complete, patch in #487, that solves a 2nd problem as well... Look forward to a PR to put all this in |
Thanks, @geoffmcl, for catching up on this issue again! As of today, my original issue report dates back over 4 years. In these 4 years, just two people, @geoffmcl and I, have commented on it. The issue is about code initially intended to clean up non-standard MS Word 2000 sections. Word 2000 is now over 21 years old, and was superseded by Word 2002 in 2001. Without knowing when the issue entered the Tidy code, it might have gone unnoticed for up to 17 years. These are very long time spans for a specification which has officially been a "living standard" since 2019. Given these facts, I dare to assume that interest in MS Word 2000 sections in Tidy has dropped to practically zero by now. As MS Word 2000 sections were never part of the HTML standard, we can only guess how to handle them correctly. I am sure you are aware that your proposed patch in #462 (comment) adds an extra To cut a long story short: MS Word 2000 sections in Tidy are (a) non-standardized, (b) buggy, (c) clutter up the code, (d) complicate further development, and (e) probably not needed any more. I suggest we (1) implement a simple fix, (2) declare MS Word 2000 section support as deprecated, and (3) remove the related code sooner rather than later. HTML as a living standard is developing too fast for us to keep busy with outdated MS Word 2000 sections. Time is better spent implementing new and more useful features.
@ralfjunker thank you for the feedback, most of which I fully agree with... except perhaps the conclusions... I treat your snippet, just as another block of html code, and take the view that And like other So this is not only about And the patch addition, to output a warning message, is again part of being a robust html parser... warn on confusion... so the user can locate, and fix... If you are suggesting the I assume there are still
Could not agree more... curious, what I must get around to adding these small patches, and will probably piggy back it to #898, just for convenience... seems too small to create a new branch, etc, ... and it is related to Feels good to get some issues closed... thanks...
The warning message could perhaps be better worded, and maybe there should be another msg when a '>' is encountered while looking for a ']' in a MS Word section, and perhaps the section should be discarded... And perhaps it should be an error, to force the user to fix... But the fix is good as it is, and these issues can be dealt with later... And this fix is piggy backed on this PR, but it is likewise related to 'word-2000' option...
* Is. #896 - make 'bear' docs match code
* Is. #487 #462 add warn msg and do not get stuck until eof
@ralfjunker fix now merged, so closing this... thanks... |
The following "malformed" Word 2000 sequence causes Tidy to skip document content (notice the extra characters): <![endif]extra>
The reason is that when Tidy sees <![ not followed by CDATA[, it expects a Word 2000 sequence like this: <![endif]>
In particular, Tidy expects the above sequence to terminate in ]> or ]-->, which neither the HTML specification nor any modern browser does.
As a result, Tidy skips content as it looks for ]>, possibly until the end of the document.
Without testing, the code in lexer.c suggests that similar "malformed" ASP, JSTE, and PHP sequences might likewise throw Tidy off track.
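The scanning behaviour described above can be modelled with a few lines of C. This is only a sketch, not the actual lexer.c code: `skip_section_unpatched` is a hypothetical helper that takes the text immediately after `<![` and returns the offset just past the section, assuming a plain NUL-terminated string and a section that ends only at `]>`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the section scan described in this report
 * (not part of Tidy). `s` is the text right after "<![". Only the
 * two-character sequence "]>" ends the section.
 */
size_t skip_section_unpatched(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        char c = s[i++];

        if (c != ']')
            continue;       /* keep skipping until a ']' shows up */
        if (s[i] == '>')
            return i + 1;   /* found "]>": section ends here */
        /* ']' not followed by '>': keep scanning */
    }
    return i;               /* no "]>" found: everything was consumed */
}
```

With a well-formed `endif]>` the model stops right after the `>`. With `endif] EXTRA >` followed by further markup that contains no `]>`, it runs to the end of the input, which corresponds to the skipped content reported here.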
AFAIK, none of the four sequences have ever been covered by any of the HTML specs. I strongly recommend options to disable parsing them. Suggestions: