
"Malformed" Word 2000 sequence may cause Tidy to skip document content #462

Closed
@ralfjunker


The following "malformed" Word 2000 sequence causes Tidy to skip document content (notice the extra characters):

<![endif]extra>

The reason is that when Tidy sees <![ not followed by CDATA[, it expects a Word 2000 sequence like this:

<![endif]>

In particular, Tidy expects the above sequence to terminate in ]> or ]-->, which neither the HTML specifications nor modern browsers require.

As a result, Tidy skips content while it looks for ]>, possibly until the end of the document.

I have not tested this, but the code in lexer.c suggests that similar "malformed" ASP, JSTE, and PHP sequences might likewise throw Tidy off track.

AFAIK, none of the four sequences have ever been covered by any of the HTML specs. I strongly recommend options to disable parsing them. Suggestions:

  • TidyParseWord2000
  • TidyParseASP
  • TidyParseJSTE
  • TidyParsePHP
