"Malformed" Word 2000 sequence may cause Tidy to skip document content #462

Closed
ralfjunker opened this issue Oct 23, 2016 · 10 comments

ralfjunker commented Oct 23, 2016

The following "malformed" Word 2000 sequence causes Tidy to skip document content (notice the extra characters):

<![endif]extra>

The reason is that when Tidy sees <![ not followed by CDATA[, it expects a Word 2000 sequence like this:

<![endif]>

In particular, Tidy expects the above sequence to terminate in ]> or ]-->, a requirement found in no HTML specification and not implemented by modern browsers: per the HTML parsing algorithm, <![ not followed by CDATA[ opens a "bogus comment", which simply ends at the first >.

As a result, Tidy skips content while it scans for ]>, possibly until the end of the document.
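
Here is a minimal, self-contained sketch of the scan described above (not Tidy's actual code, and ignoring the ]--> variant for brevity); it compresses the LEX_SECTION logic down to "consume everything until ] followed by >" and shows the trailing content being swallowed:

    #include <stdio.h>

    /* Scan for the "]>" terminator the way the lexer does: every
       character that is not ']' is consumed, including '>' and '<'. */
    static const char *scan_section(const char *p)
    {
        while (*p)
        {
            if (*p++ != ']')
                continue;       /* swallow anything that is not ']' */
            if (*p == '>')
                return p + 1;   /* found "]>": the section ends here */
        }
        return p;               /* EOF: the rest of the input is lost */
    }

    int main(void)
    {
        /* the lexer has already consumed the leading "<![" here */
        const char *rest = scan_section("endif] EXTRA ><p>Content</p>");
        printf("lexing resumes at: \"%s\"\n", rest);  /* prints "" */
        return 0;
    }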

Without testing, the code in lexer.c suggests that similar "malformed" ASP, JSTE, and PHP sequences might likewise throw Tidy off track.

AFAIK, none of the four sequences has ever been covered by any of the HTML specs. I strongly recommend adding options to disable parsing them. Suggestions (see the sketch after the list):

  • TidyParseWord2000
  • TidyParseASP
  • TidyParseJSTE
  • TidyParsePHP
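
For illustration only, here is a sketch of how these suggested options might be declared in the TidyOptionId enum in include/tidyenum.h, following the doc-comment style used there; the names are just the suggestions above and exist in no Tidy release:

    /* Hypothetical additions to the TidyOptionId enum (include/tidyenum.h) */
    TidyParseWord2000,  /**< Parse <![if ...]> ... <![endif]> Word 2000 sections */
    TidyParseASP,       /**< Parse <% ... %> ASP pseudo-sections */
    TidyParseJSTE,      /**< Parse <# ... #> JSTE pseudo-sections */
    TidyParsePHP,       /**< Parse <?php ... ?> PHP sections */
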
@ralfjunker ralfjunker changed the title "Malformed" Word 2000 sequence may cause Tidy to skip rest of document "Malformed" Word 2000 sequence may cause Tidy to skip document content Oct 23, 2016
@geoffmcl
Contributor

@ralfjunker, thanks for the report...

Dealing with Word 2000 was a very specific set of clutter to remove, and I think it could easily be encouraged to skip any chars between <![endif] and >, if desired...

If possible, please give a minimal sample html, tidy version, config, etc...

Not sure why you include the other Tidy parsers, like ASP, JSTE, PHP, ... advise more...

But more specific information would also be helpful, like where, how, was this Word 2000 html document generated... thanks...

@geoffmcl geoffmcl added this to the 5.3 milestone Oct 31, 2016
Author

ralfjunker commented Oct 31, 2016

Here is a minimal HTML document example:

<!-- Silence warnings. Doctype and title do not matter to the problem.  -->
<!DOCTYPE html>
<title>Word 2000 Problem</title>

<![endif] EXTRA >

<p>Content</p>

I run Tidy 5.2.0 on Windows without options, but Git trunk behaves the same:

tidy.exe file.htm

This is the output I receive:

Info: Document content looks like HTML5
No warnings or errors were found.

<!-- Silence warnings. Doctype and title do not matter to the problem.  -->
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Windows version 5.2.0">
<title>Word 2000 Problem</title>
</head>
<body>
</body>
</html>

Notice that the <p>Content</p> block is completely removed, despite Tidy explicitly reporting no warnings or errors. The size and content of the block do not matter; Tidy removes it regardless.

The actual problem is in lexer.c, here (tidy-html5/src/lexer.c, lines 3196 to 3236 at fd0ccb2):

if (c != ']')
    continue;

/* now look for '>' */
c = TY_(ReadChar)(doc->docIn);
lexdump = 1;
if (c != '>')
{
    /* Issue #153 - can also be ]'-->' */
    if (c == '-')
    {
        c = TY_(ReadChar)(doc->docIn);
        if (c == '-')
        {
            c = TY_(ReadChar)(doc->docIn);
            if (c != '>')
            {
                TY_(UngetChar)(c, doc->docIn);
                TY_(UngetChar)('-', doc->docIn);
                TY_(UngetChar)('-', doc->docIn);
                continue;
            }
            /* this failed!
               TY_(AddCharToLexer)(lexer, '-'); TY_(AddCharToLexer)(lexer, '-'); lexdump = 0;
               got output <![endif]--]> - needs further fix in pprint section output
            */
        }
        else
        {
            TY_(UngetChar)(c, doc->docIn);
            TY_(UngetChar)('-', doc->docIn);
            continue;
        }
    }
    else
    {
        TY_(UngetChar)(c, doc->docIn);
        continue;
    }
}

I was able to work around the problem by introducing the previously mentioned new TidyParseWord2000 option. When disabled, it causes the lexer to interpret Word 2000 sequences as processing instructions instead. Note that the TidyXmlPIs option still differentiates between ?> and > as the terminator. Here are my changes to lexer.c:

@@ -3195,6 +3195,16 @@
                     }
                 }

+                if (!cfgBool(doc, TidyParseWord2000)) {
+                    lexer->lexsize--;
+                    TY_(AddCharToLexer)(lexer, '<');
+                    TY_(AddCharToLexer)(lexer, '!');
+                    TY_(AddCharToLexer)(lexer, '[');
+                    TY_(AddCharToLexer)(lexer, c);
+                    lexer->state = LEX_CONTENT;
+                    continue;
+                }
+
                 if (c != ']')
                     continue;

Changes to include/tidyenum.h and src/config.c are obvious, so they are excluded here for brevity.
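
Since Tidy accepts any configuration option as an --option value pair on the command line, the new option could then be disabled like any other; the config name parse-word-2000 below is only my guess, mirroring the existing word-2000 option:

    tidy --parse-word-2000 no file.htm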

PS: The problem was presented to me by a customer as part of a larger document. I have no idea how it was generated. However, the document's origin should not matter: such documents do (or eventually will) exist in the wild, so we should be prepared to handle them well.

@geoffmcl
Contributor

@ralfjunker just noted that this topic has more or less been raised again in #487...

As suggested there, one or the other should be closed while we seek a suitable solution for this extra data before the closing ]> of an endif.

And note there may also be extra data in the opening <!xxx[...

Seeking ideas, comments, patches or a PR... thanks...

@geoffmcl
Contributor

As we are about to release 5.4, moving this out to 5.5... and as stated, maybe we should close one of these two issues, which refer to the same problem...

@geoffmcl geoffmcl modified the milestones: 5.5, 5.3 Feb 27, 2017
@geoffmcl
Contributor

Although there have been no further comments in a long time, this appears still open, although it also seems a duplicate of #487, so moving out the milestone again...

@geoffmcl geoffmcl modified the milestones: 5.5, 5.7 Nov 29, 2017
Contributor

geoffmcl commented Oct 6, 2020

@ralfjunker took another look at this... and have another idea...

Now I do not think there is any need for a Feature to be added...

I was struck by the comment in the code, after receiving <! from the stream, suggesting there may be more to do here, and by the if tree that follows - code omitted, some code relined for clarity -

    /*
       look out for comments, doctype or marked sections
       this isn't quite right, but its getting there ...
    */
    if (c == '!') {
        c = TY_(ReadChar)(doc->docIn);
        if (c == '-') {
            /* deal with a `<!-` comment */
            lexer->state = LEX_COMMENT;  /* comment */
            /* ... code omitted ... */
        } else if (c == 'd' || c == 'D') {
            /* todo: check for complete "<!DOCTYPE" not just <!D */
            lexer->state = LEX_DOCTYPE; /* doctype */
            /* ... code omitted ... */
        } else if (c == '[') {
            /* Word 2000 embeds <![if ...]> ... <![endif]> sequences */
            lexer->state = LEX_SECTION;
            /* ... code omitted ... */
        } else {
            /* else swallow characters up to and including next '>' */
            while ((c = TY_(ReadChar)(doc->docIn)) != '>') {
                /* or eof */
            }
            lexer->state = LEX_CONTENT;
        }
    }

Now the issue here is a malformed embedded Word 2000 section (LEX_SECTION), like <![endif] EXTRA >, which gets locked in searching for another ] character, until EOF if need be... as happens with the given example code...

In the process it can pass over multiple > chars, and <, and in fact everything that is NOT a ]! In the example above, that means swallowing the > that should close the section, the whole <p>Content</p> block, and everything else to the end of the file.

So on re-reading the this isn't quite right comment, I decided maybe this is a BUG, and looked for a fix...

Came up with a patch for lexer.c -

diff --git a/src/lexer.c b/src/lexer.c
index ef70e13..61c28eb 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -3340,7 +3340,11 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
                     }
                 }
 
-                if (c != ']')
+                if (c == '>')
+                {
+                    /* Is. #462 - reached '>' before ']' */
+                    TY_(UngetChar)(c, doc->docIn);
+                } else if (c != ']')
                     continue;
 
                 /* now look for '>' */

Simply, if we reach a >, let that close this section, store the section, go back to lexer->state = LEX_CONTENT, and continue to complete the document...

With that patch the output changes, and, as can be seen, the trailing content of the example is no longer lost...

<!-- Silence warnings. Doctype and title do not matter to the problem.  -->
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.7.35.I896">
<title>Word 2000 Problem</title>
<![endif] EXTRA ]>
</head>
<body>
<p>Content</p>
</body>
</html>

Please ignore the tidy version number. I piggybacked this patch onto the issue-896 branch...

So, I am removing the Feature Request label, replacing it with Bug, which I now think it is...

After some more testing, I will try to get around to setting up a PR for this, to get it into next... unless someone beats me to it...

Meantime, look forward to further feedback, comments, patches, PR, testing, regressions, etc, etc... thanks...

@geoffmcl geoffmcl removed this from the 5.7 milestone Oct 6, 2020
Contributor

geoffmcl commented Oct 7, 2020

@ralfjunker see the fuller, more complete patch in #487, which solves a 2nd problem as well...

Look forward to a PR to put all this in next... thanks...

@ralfjunker
Author

Thanks, @geoffmcl, for catching up on this issue again!

As of today, my original issue report dates back over 4 years. In these 4 years, just two people, @geoffmcl and I, have commented on it.

The issue is about code initially intended to clean up non-standard MS Word 2000 sections. Word 2000 is now over 21 years old and was superseded by Word 2002 in 2001. Without knowing when the issue entered the Tidy code, it might have gone unnoticed for up to 17 years.

These are very long time spans for a specification which has officially been a "living standard" since 2019.

Given these facts, I dare to assume that interest in MS Word 2000 sections in Tidy has dropped to practically zero by now. As MS Word 2000 sections were never part of the HTML standard, we can only guess how to handle them correctly.

I am sure you are aware that your proposed patch in #462 (comment) adds an extra ]. This is certainly better than skipping document content, but is it also "correct"? Or would the malformed section better be parsed as text, without the extra ]? Lacking a standard, we cannot know and must simply guess.

To cut a long story short: MS Word 2000 sections in Tidy are (a) non-standardized, (b) buggy, (c) cluttering the code, (d) complicating further development, and (e) probably not needed any more.

I suggest we (1) implement a simple fix, (2) declare MS Word 2000 section support deprecated, and (3) remove the related code sooner rather than later.

HTML as a living standard is developing too fast to justify keeping oneself busy with outdated MS Word 2000 sections. Time is better spent implementing new and more useful features.

Contributor

geoffmcl commented Oct 8, 2020

@ralfjunker thank you for the feedback, most of which I fully agree with... except perhaps the conclusions...

I treat your snippet just as another block of HTML code, and take the view that tidy simply cannot be allowed to consume the balance of the document searching for a ]... that is a parsing bug that has to be addressed...

And like other malformed HTML code, it tries to fix it as best it can... maybe not always 100% correctly... i.e. an additional ], in this case... and maybe there should be a warning issued... but that can be left for now, as the patches are an incremental improvement...

So this is not only about MS Word 2000 section support, but about a parser that should never get locked in a loop...

And the patch addition, to output a warning message, is again part of being a robust HTML parser... warn on confusion... so the user can locate, and fix...

If you are suggesting the word-2000 option be deprecated, let alone removing the code, that seems a moot point... maybe I misunderstand you here...

I assume there are still MS Word users out there... have seen (unverified) numbers like 1.2 billion users worldwide - also see the recent #885, #896, #898, and others - so I think it should stay... it's been there nigh on 20 years... doing its job... not much maintenance...

Time is better spent implementing new and more useful features.

Could not agree more... curious what new and more useful features you are thinking of...

I must get around to adding these small patches, and will probably piggyback them onto #898, just for convenience... seems too small to create a new branch, etc. ... and it is related to the word-2000 option... time will tell...

Feels good to get some issues closed... thanks...

geoffmcl added a commit that referenced this issue Oct 9, 2020
The warning message could perhaps be better worded, and maybe there
should be another msg when a '>' is encountered while looking for a ']'
in a MS Word section, and perhaps the section should be discarded...

And perhaps it should be an error, to force the user to fix...

But the fix is good as it is, and these issues can be dealt with
later...

And this fix is piggy backed on this PR, but it is likewise related to
'word-2000' option...
geoffmcl added a commit that referenced this issue Nov 21, 2020
* Is. #896 - make 'bear' docs match code

* Is. #487 #462 add warn msg and do not get stuck until eof

@geoffmcl
Contributor

@ralfjunker fix now merged, so closing this... thanks...
