Closed
Description
The following "malformed" Word 2000 sequence causes Tidy to skip document content (notice the extra
characters):
<![endif]extra>
The reason is that when Tidy sees <![ not followed by CDATA[, it expects a Word 2000 sequence like this:
<![endif]>
In particular, Tidy expects the above sequence to terminate in ]> or ]-->, an expectation that neither the HTML specifications nor modern browsers share.
As a result, Tidy skips content while it looks for ]>, possibly until the end of the document.
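To illustrate the effect, here is a minimal sketch of a scanner that skips ahead to the next ]> once it has seen <![. This is not the actual code from lexer.c: skip_section is a hypothetical helper, and for brevity it only looks for ]> and ignores the ]--> variant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT Tidy's actual lexer code.
 * Given the offset of a "<![" that is not followed by "CDATA[", return the
 * offset at which scanning would resume if the scanner simply skips ahead
 * to the next "]>". If no "]>" exists, the rest of the input is consumed.
 */
size_t skip_section(const char *doc, size_t start)
{
    assert(doc != NULL);
    assert(start <= strlen(doc));

    const char *end = strstr(doc + start, "]>");
    if (end == NULL)
        return strlen(doc);             /* no terminator: rest of document is lost */
    return (size_t)(end - doc) + 2;     /* resume just past "]>" */
}
```

With the well-formed input <![endif]><p>kept</p> the scan resumes right after the sequence. With <![endif]extra><p>lost</p> there is no ]> anywhere, so the sketch consumes everything up to the end of the input, which matches the behaviour described above.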
I have not tested this, but the code in lexer.c suggests that similar "malformed" ASP, JSTE, and PHP sequences might likewise throw Tidy off track.
AFAIK, none of the four sequences has ever been covered by any of the HTML specs. I strongly recommend adding options to disable parsing them. Suggestions:
- TidyParseWord2000
- TidyParseASP
- TidyParseJSTE
- TidyParsePHP
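A rough sketch of how such options could gate the special-case handling follows. Everything in it is hypothetical: ParseOptions, SeqKind, and classify are illustrative names, not part of Tidy's API. The opening delimiters for ASP (<%), JSTE (<#), and PHP (<?php) are my assumption about what the lexer looks for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags mirroring the four suggested options. */
typedef struct {
    bool parse_word2000;
    bool parse_asp;
    bool parse_jste;
    bool parse_php;
} ParseOptions;

typedef enum { SEQ_NONE, SEQ_WORD2000, SEQ_ASP, SEQ_JSTE, SEQ_PHP } SeqKind;

static bool starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Decide whether the text at s opens one of the four special sequences.
 * A sequence whose option is disabled is reported as SEQ_NONE, so the
 * caller would treat it as ordinary markup instead of scanning ahead for
 * a terminator. The delimiters used here are assumptions.
 */
SeqKind classify(const char *s, const ParseOptions *opt)
{
    assert(s != NULL && opt != NULL);

    if (opt->parse_word2000 && starts_with(s, "<![") &&
        !starts_with(s + 3, "CDATA["))
        return SEQ_WORD2000;
    if (opt->parse_asp && starts_with(s, "<%"))
        return SEQ_ASP;
    if (opt->parse_jste && starts_with(s, "<#"))
        return SEQ_JSTE;
    if (opt->parse_php && starts_with(s, "<?php"))
        return SEQ_PHP;
    return SEQ_NONE;
}
```

With all four flags off, no input triggers the skip-ahead scanning, so a stray <![endif]extra> could no longer swallow the rest of the document.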