
Tidy fails if html contains a section <![endif]—> #487

Closed
rkopaliani opened this issue Feb 6, 2017 · 11 comments

@rkopaliani

rkopaliani commented Feb 6, 2017

Hello. First of all, thank you for the great work.
I found a little bug almost identical to the old issue #153, but it looks like here the problem is with an em dash.
I've constructed a little example:
<html>
<title> Some title </title>
<!—[if mso]>
<![end if]—>
<body>
<p> Yada yada yada </p>
</body>
</html>

In this case, all the content in the body is cut off.

@geoffmcl
Contributor

geoffmcl commented Feb 7, 2017

@rkopaliani thanks for your issue, but I am confused over a few things here...

  1. A problem with the [if mso] comment?
  2. What is the DOCTYPE you want?
  3. What character set are you using?

After a few attempts I was able to construct an input which passes W3C validation -

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">
<html>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<title>Issue #487-2</title>
  <!--[if mso]>
  <![end if]-->
<body>
<p> Yada yada yada </p>
</body>
</html>

And the latest Tidy also has no problem with this -

tidy input5\in_487-2.html
Info: Doctype given is "-//W3C//DTD HTML 4.01 Transitional//EN"
Info: Document content looks like HTML 4.01 Strict
No warnings or errors were found.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Windows version 5.3.15">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Issue #487-2</title>
<!--[if mso]>
  <![end if]-->
</head>
<body>
<p>Yada yada yada</p>
</body>
</html>

About HTML Tidy: https://github.com/htacg/tidy-html5
etc...

But I could never get <!—[if mso]><![end if]—>, ie using the em dash, to pass the validator - I get a warning and two errors -

uploaded file in_487.html
 1. Warning: Legacy encoding windows-1252 used. Documents should use UTF-8.
 2. Error: Bogus comment.
    At line 5, column 7 <!—[if mso]>
 3. Error: Bogus comment.
    At line 6, column 7 <![end if]—>

And as you point out, tidy just eats all the <body>...

Tidy also has no problem if I put the em dash in the body text, like <p> Yada — yada — yada </p>, and add the option --char-encoding win1252, except that it will convert the output to entities, like <p>Yada &mdash; yada &mdash; yada</p>.

I have also re-tested issue #153 with current tidy, and there still seems to be no problem there, so maybe these are not related...

So I am afraid you really need to clarify what you want to do!

The first thing would be to give a simple sample document that passes https://validator.w3.org/#validate_by_upload, then show what tidy does wrong, and what you expect of tidy... thanks...

@geoffmcl geoffmcl added this to the 5.3 milestone Feb 7, 2017
@rkopaliani
Author

Hi again @geoffmcl
Sorry, it looks like I've really wasted your time. Anyway, I will try to explain briefly what I meant; maybe it helps somehow. This issue is not directly related to #153, but for some reason I thought they looked quite similar – both were spotted when using [if mso], and both lead to some portion of the content being cut out.
To answer your questions:

  1. Problem is indeed with [if mso] comments
  2. DOCTYPE html
  3. utf8

I am totally new to Tidy, so just to clarify: is it expected that Tidy cuts out the content of the body, after tidyCleanAndRepair is called, against something like this

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Some title</title>
<!—[if mso]>
<![end if]—>
</head>
<body>
<p>Some text to cut out </p>
</body>
</html>

because this html is not valid in itself?

@balthisar
Member

@rkopaliani, are you trying to suggest that Tidy should catch instances of em-dash where hyphens are supposed to be used? Are you using some non-ASCII text editor that's converting -- to an em-dash character? Your sample code includes em-dashes for some reason...

I agree that Tidy should report some type of error when encountering what is actually garbage, though.

@rkopaliani
Author

@balthisar yep, that's actually what I'm trying to suggest; sorry for being unclear. I am trying to use tidylib to clean up HTML for my project, and unfortunately I don't have any control over what kind of HTML I receive.

@geoffmcl
Contributor

geoffmcl commented Feb 8, 2017

@rkopaliani, @balthisar, thanks for the added comments...

I have now had time to do a careful MSVC debug session using your sample case - a few hours of slow, meticulous and tedious work - and here is what I found, in detail...

In lexer.c, in GetTokenFromStream(), on finding a < char tidy switches state - lexer->state = LEX_GT; continue; - to get the next char. When in this state, if the next character is !, then tidy immediately reads the next character. There is a quite telling comment in the code at this point, which indicates maybe more could, or needs to, be added -

                /*
                   look out for comments, doctype or marked sections
                   this isn't quite right, but its getting there ...
                */

Now if the next character is a hyphen, -, then tidy gets one more, and if two found, then changes state lexer->state = LEX_COMMENT; /* comment */. Easy, no problem...

But in this case it is an em dash, dec 151, 0x97, which, with --char-encoding win1252, tidy will convert into Unicode 0x2014. Why, I do not yet know. Need to explore... see more below on this...

If this next had been an [ char, then the state would be set lexer->state = LEX_SECTION;... continue;

But with this em dash, tidy will now eat characters until it finds a > character... the code comment is /* else swallow characters up to and including next '>' */...

On reaching the next >, tidy will scrub out the <! in the lexer, and switch the state - lexer->state = LEX_CONTENT; ... continue; - so in the sample given, the first <!—[if mso]> will just be wholly discarded, without a warning or any indication of lost content...
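That swallow-up-to-'>' behaviour can be modelled in isolation. A minimal sketch - bogus_comment_span is a hypothetical helper, not tidy's actual code - assuming the input pointer sits just after the rejected <!:

```c
#include <string.h>

/* Model of "swallow characters up to and including next '>'".
 * 'p' points just after a "<!" that did not begin a comment,
 * doctype or marked section.  Returns the number of bytes that
 * would be silently discarded, or -1 if no '>' is found.
 * A simplified sketch, not tidy's actual lexer code. */
static int bogus_comment_span(const char *p)
{
    const char *gt = strchr(p, '>');
    return gt ? (int)(gt - p) + 1 : -1;
}
```

With the win1252 em dash as a single 0x97 byte, the whole of —[if mso]>, 10 bytes, is discarded in one step.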

Tidy will now swallow the newline, and leading spaces to get to the next char, which is another <, so switches state again to lexer->state = LEX_GT; continue; and get next...

Since the next is a !, then as above, tidy immediately reads the next, which this time is a '[', so tidy clears the <! from the lexer, and sets lexer->state = LEX_SECTION; continue;... the comment here is /* Word 2000 embeds <![if ...]> ... <![endif]> sequences */...

The next character is an e of endif, and the code comment is case LEX_SECTION: /* seen "<![" so look for "]>" */

So on getting the ] tidy looks for >, or as per #153 also -->, but what it gets is the em dash, transformed by DecodeWin1252 into 0x2014, per a table with the comment -

/* Mapping for Windows Western character set CP 1252
** (chars 128-159/U+0080-U+009F) to Unicode.
*/
static const uint Win2Unicode[32] =
{
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

So the 0x97, dec 151, em dash becomes Win2Unicode[151 - 128], ie 0x2014. I do not know if the table is right or wrong, but this page - http://www.fileformat.info/info/unicode/char//2014/index.htm - seems to suggest that it is correct...
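As a quick sanity check of that mapping, a small sketch - the table is copied from the quote above, and decode_win1252 is a hypothetical stand-in for tidy's DecodeWin1252:

```c
/* CP-1252 chars 128-159 to Unicode, as quoted from tidy's lexer.c above */
static const unsigned int Win2Unicode[32] =
{
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* Decode one CP-1252 byte to a Unicode code point; bytes outside
 * the 128-159 window map to themselves in CP-1252. */
static unsigned int decode_win1252(unsigned char c)
{
    return (c >= 128 && c < 160) ? Win2Unicode[c - 128] : c;
}
```

So decode_win1252(0x97) does indeed give 0x2014, EM DASH.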

Now the weird part! Tidy just puts this 0x2014 back - TY_(UngetChar)(c, doc->docIn); continue; - and stays in the LEX_SECTION state. So when it gets this 0x2014 back, and adds it to the lexer as utf8, e2 80 94, it then goes on looking for the ] char!

BUT the problem, in this case, is that it has already passed the ], so it will reach the end of the document, stuck in this LEX_SECTION state, and at EOF will discard everything... and return a NULL node!

At that time tidy was in ParseHead(), which on getting a NULL will exit there, with, in this case, the whole document used up and discarded...

So, in addition to the above warning about eaten content, I can at least see another warning, something like, "Reached EOF in the get 'section' state, while looking for ]>, or ]-->".

Or, alternatively, remember that we have seen a ], and stop at the next >, regardless... Need to think about this...

And I also need to ponder why the first, admittedly badly formed, so-called section content was discarded. Is there a way to keep it, but still warn about it being malformed? Maybe even correct it...

On some other things read in the comments...

  1. You want DOCTYPE html. That is good. We can forget old, complicated legacy doctypes! Go all html5 - yay!
  2. utf8 - That is also good, but then you can not have a raw em dash byte, a win1252 character, in the document!
  3. This has nothing to do with tidyCleanAndRepair(), which is a later stage. The body content is lost during the lexer parsing phase.

More on item 2: there are some utf-8 checkers out there. I have some messy code in https://github.com/geoffmcl/utf8-test, but I am sure there are others...

Tidy defaults to utf-8, and will scream if passed an invalid sequence - 0x97 in this case - unless explicitly configured to expect something other than fully correct utf-8 sequences. So you would need some pre-processing of the file before passing it to tidy, with the correct --char-encoding xxx config.
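To illustrate why the raw 0x97 byte upsets a UTF-8 reader - a rough sketch, not tidy's actual validation code - bytes of the form 10xxxxxx can only continue a multi-byte sequence, never start one:

```c
/* Rough check of whether a byte can legally begin a UTF-8 sequence.
 * 0x97 (the CP-1252 em dash) fails: 10xxxxxx bytes are continuation
 * bytes only.  A sketch, not tidy's validation code. */
static int is_utf8_lead_byte(unsigned char c)
{
    if (c < 0x80)           return 1;           /* ASCII */
    if ((c & 0xE0) == 0xC0) return c >= 0xC2;   /* 2-byte lead, no overlongs */
    if ((c & 0xF0) == 0xE0) return 1;           /* 3-byte lead */
    if ((c & 0xF8) == 0xF0) return c <= 0xF4;   /* 4-byte lead, <= U+10FFFF */
    return 0;               /* continuation byte (0x80-0xBF) or invalid */
}
```

So 0x97 is rejected outright, while the UTF-8 encoding of the em dash, e2 80 94, begins with a valid 3-byte lead.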

So yes, I now see this as a bug, and am marking it so. But as stated, I need to think a little about a solution, or at least a clear warning or two about why most of the document was thrown away...

Further comments would be appreciated... thanks...

@geoffmcl geoffmcl added the Bug label Feb 8, 2017
@geoffmcl
Contributor

@rkopaliani, @balthisar, @ralfjunker, I just noticed this is virtually the same as #462... and there a solution has been suggested, namely that tidy should allow <![endif] EXTRA >, which of course would include your <![end if]—>, at least if TidyWord2000 (--word-2000 on) is set, but maybe even without this...

Now that we have a cross-reference, I think one or the other should be closed, while we decide on a good solution for this...

Seek ideas, comments, patches or PR... thanks...

@geoffmcl
Contributor

As we are about to release 5.4, moving this out to next 5.5... thanks

@geoffmcl geoffmcl modified the milestones: 5.5, 5.3 Feb 27, 2017
@geoffmcl
Contributor

As this appears to be still open, moving the milestone out yet again...

@geoffmcl geoffmcl modified the milestones: 5.5, 5.7 Nov 28, 2017
@geoffmcl
Contributor

geoffmcl commented Oct 6, 2020

@rkopaliani, @balthisar, @ralfjunker, please note the patch suggested in #462...

Maybe this will fix, change, this?

Seek ideas, comments, testing, confirmation, patches, PR, etc... thanks...

@geoffmcl geoffmcl removed this from the 5.7 milestone Oct 6, 2020
@geoffmcl
Contributor

geoffmcl commented Oct 7, 2020

@rkopaliani, @balthisar, @ralfjunker, have done some preliminary re-testing... with good results...

WOW, re-running the first given sample, in_487.html, using an issue #462 patched tidy... a surprise!

This turns out to be TWO, 2, different libTidy problems -

  1. The first <!—[if mso]> is discarded, without a warning, and
  2. The second <![end if]—> consumes the rest of the file, ie the content in the body is cut off.

So the patch in #462 solves the second, and this fuller, additional patch solves BOTH -

diff --git a/src/lexer.c b/src/lexer.c
index ef70e13..49b74f5 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -2777,6 +2777,7 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
                     }
 
 
+                    TY_(Report)(doc, NULL, NULL, MALFORMED_COMMENT_DROPPING); /* Is. #487 */
 
                     /* else swallow characters up to and including next '>' */
                     while ((c = TY_(ReadChar)(doc->docIn)) != '>')
@@ -3340,7 +3341,11 @@ static Node* GetTokenFromStream( TidyDocImpl* doc, GetTokenMode mode )
                     }
                 }
 
-                if (c != ']')
+                if (c == '>')
+                {
+                    /* Is. #462 - reached '>' before ']' */
+                    TY_(UngetChar)(c, doc->docIn);
+                } else if (c != ']')
                     continue;
 
                 /* now look for '>' */
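The effect of that second hunk can be modelled standalone. A hypothetical sketch of the section scan, assuming input starting just after <![ - before the patch a bare > without a preceding ] was ignored; after it, the scan stops there too:

```c
/* Model of the patched LEX_SECTION scan: look for the "]>" that
 * closes a "<![...]>" marked section, but also stop on a bare '>'
 * instead of consuming input to EOF.  Returns the index just past
 * the terminating '>', or -1 if input ends first.  A simplified
 * sketch, not tidy's actual lexer code. */
static int scan_section(const char *s)
{
    for (int i = 0; s[i] != '\0'; i++)
    {
        if (s[i] == ']' && s[i + 1] == '>')
            return i + 2;   /* the normal "]>" terminator */
        if (s[i] == '>')
            return i + 1;   /* patched: a bare '>' also ends the scan */
    }
    return -1;              /* unpatched tidy stayed in LEX_SECTION until EOF */
}
```

For end if]—> the em dash sits between ] and >, so only the bare-'>' branch can fire; before the patch that meant scanning on to the end of the document.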

While one might argue, quibble... that the MALFORMED_COMMENT_DROPPING message might not convey the exact problem tidy is experiencing, I think its general idea is correct... improving this could be a new DOCS issue...

I will try to get around to incorporating this into a combined PR... unless someone beats me to it... I could use some help here...

Look forward to further feedback, comments, testing, confirmation, other/alternate patches, PR, etc... thanks...

geoffmcl added a commit that referenced this issue Oct 9, 2020
The warning message could perhaps be better worded, and maybe there
should be another msg when a '>' is encountered while looking for a ']'
in a MS Word section, and perhaps the section should be discarded...

And perhaps it should be an error, to force the user to fix...

But the fix is good as it is, and these issues can be dealt with
later...

And this fix is piggy backed on this PR, but it is likewise related to
'word-2000' option...
geoffmcl added a commit that referenced this issue Nov 21, 2020
* Is. #896 - make 'bear' docs match code

* Is. #487 #462 add warn msg and do not get stuck until eof

The warning message could perhaps be better worded, and maybe there
should be another msg when a '>' is encountered while looking for a ']'
in a MS Word section, and perhaps the section should be discarded...

And perhaps it should be an error, to force the user to fix...

But the fix is good as it is, and these issues can be dealt with
later...

And this fix is piggy backed on this PR, but it is likewise related to
'word-2000' option...
@geoffmcl
Contributor

@rkopaliani fix now merged, so closing this... thanks...
