Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy seems to get confused by HTML strings in JavaScript blocks. #700

Closed
petdance opened this issue Mar 19, 2018 · 10 comments
Closed

Tidy seems to get confused by HTML strings in JavaScript blocks. #700

petdance opened this issue Mar 19, 2018 · 10 comments
Labels
Milestone

Comments

@petdance
Copy link
Contributor

I believe that the <script> tag should be telling tidy not to check its contents, but tidy complains about the </p> in a string.

$ cat foo.html
<!DOCTYPE html>
<html>
    <head>
        <title>scratch</title>
    </head>
    <body>
<script><!--
str = '</p>';
--></script>
    </body>
</html>
$ tidy -q -e foo.html
line 8 column 9 - Warning: '<' + '/' + letter not allowed here
@petdance petdance added the Bug label Mar 19, 2018
@geoffmcl
Copy link
Contributor

@petdance thank you for the issue.

Rather than confused I think this is yet another case where tidy has yet to fully catch up to HTML5!

Try passing the following to the validator, and you may see what I mean -

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
           "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>Is #700-1</title>
    </head>
    <body>
<script type="text/javascript"><!--
str = '</p>';
--></script>
    </body>
</html>

So yes, it seems a tidy bug in a html5 document, but fully correct in legacy html4 documents...

As stated seems yet another case where the tidy behaviour should be modified, depending on the current document mode, established by the doctype...

I am sure there are probably more of these html4 vs html5 parsing as tidy tries to catch up and deal with both fully correctly in one library...

Look forward to further feedback, a patch or a PR... should not be too difficult to fix... thanks...

@geoffmcl geoffmcl added this to the 5.7 milestone Mar 19, 2018
@petdance
Copy link
Contributor Author

I'm not sure what you're saying here. Are you saying that the behavior should change if it's HTML 4 vs. HTML 5?

@geoffmcl
Copy link
Contributor

@petdance in essence, yes...

And internally in the library we have many cases of this behavior changing, depending whether tidy detects a legacy html4 or earlier doctype, vs. a html5 or no doctype, which defaults to html5 mode...

See ResetTags which resets to the default html5 mode, but if a legacy doctype is found see AdjustTags, but that is mainly only for main tag table values...

But there are other places where IsHTML5Mode is called, but that was only added relatively recently, and in other places things like if (TY_(HTMLVersion)(doc) == HT50) or Bool htmlIs5 = (doc->lexer->doctype & VERS_HTML5) > 0;, and probably some other indicators as well...

Unfortunatley, this html4 vs. html5 mode has not been standardized in the code! Maybe it should be IsHTML5Mode in all case, but is presently quite a can of worms! Sorry about that...

You may remember back in the history there was a discussion if we should have two separate libraries, legacy vs. html5, like the legacy validator, vs. nu, but still feel it was the right choice to try to put it all in one...

But still some obvious work to do on this... thanks...

@geoffmcl
Copy link
Contributor

@petdance started looking at this, and noted we already have an option, --escape-script no, TidyEscapeScripts, to avoid this...

The code is in lexer.c, in the service GetCDATA, which is obviously used for other than just gathering a CDATA node text for the <script> tag -

    /*\ if javascript insert backslash before / 
     *  Issue #348 - Add option, escape-scripts, to skip
    \*/
    if ((TY_(IsJavaScript)(container)) && cfgBool(doc, TidyEscapeScripts))
    {
        /* Issue #281 - only warn if adding the escape! */
        TY_(Report)(doc, NULL, NULL, BAD_CDATA_CONTENT);

Thus extending that code a little, with -

diff --git a/src/lexer.c b/src/lexer.c
index 3d6a489..ca66aee 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -2384,7 +2384,8 @@ static Node *GetCDATA( TidyDocImpl* doc, Node *container )
                 /*\ if javascript insert backslash before / 
                  *  Issue #348 - Add option, escape-scripts, to skip
                 \*/
-                if ((TY_(IsJavaScript)(container)) && cfgBool(doc, TidyEscapeScripts))
+                if ((TY_(IsJavaScript)(container)) && cfgBool(doc, TidyEscapeScripts) &&
+                    !TY_(IsHTML5Mode)(doc) )    /* Is #700 - This only applies to legacy html4 mode */
                 {
                     /* Issue #281 - only warn if adding the escape! */
                     TY_(Report)(doc, NULL, NULL, BAD_CDATA_CONTENT);

Still testing this, but looks good... Will get to pushing it to an issue-700 branch, and setting up a PR...

Meantime, appreciate it if others could apply the patch, build, test and report... thanks

@geoffmcl
Copy link
Contributor

@petdance have pushed this fix to an issue-700 branch...

Have also now run our regression tests, and find a difference in one(1) test, case 443576...

It is a test where the script element contains document.write('<script></script>'); which used to trick tidy into thinking it had found the </script> close tag with very bad consequences. It has its own config file with skip-nested: no...

But the input does not have a doctype so tidy-5.7.3.I700 stays in html5 mode, thus the html output is no longer escaped, and there is no warning about the '<' + '/' + letter not allowed here... so both the html output and error files are changed...

The W3C validator chooses to default to HTML 4.01 Transitional in this case, thus as well as other things, finds a 2nd end tag </script> when it thinks none are open... so it is also fooled by this unescaped end tag within the script...

Adding a <!DOCTYPE html> doctype pushes it to nu that does not have any problem, except with <script language="JavaScript1.1">, but reducing that to just <script> gives the document a green light...

So I see no problem adjusting our test 443576, to either -

  1. add a html 4.01 doctype, or
  2. accepting this patched tidy output as the new expected output.

Given the legacy nature of our tests I think adding a doctype is best...

Look forward to further feedback on this and testing of the issue-700 branch... thanks...

@petdance
Copy link
Contributor Author

This seems related to #475.

@geoffmcl
Copy link
Contributor

@petdance it certainly is... in a way the same, but there the use of --escape-scripts no was accepted as a solution, although the difference between legacy html4 and html5 mode was mentioned...

Here this patch completely drops the warning and the escaping if still in html5 mode, where it is not needed... but keeps it if in legacy html4 mode, as indicated by the doctype, unless the option given... remembering that in tidy no doctype assumes html5...

Have create PR #703, to merge this issue-700 branch to next, and created PR 27 in the regression tests that needs to be done at the same time... in the regression tests added a html 4.01 doctype to the case-443576 input, and adjusted the expects accordingly..

As always, look forward to further feedback on this and testing of the issue-700 branch, including the regression tests... thanks...

@petdance
Copy link
Contributor Author

I don't know what I would be doing to test the branch's regression tests. Point me in the right direction and I'll see what I can do.

@geoffmcl
Copy link
Contributor

@petdance the regression tests are in a separate repo - https://github.com/htacg/tidy-html5-tests

In re-reading the README.md there I am not sure we have kept the master branch, ie. the last release, completely in sync with the latest master release branch in the tidy source. I should check this, and maybe add an issue to update it, if not...

But it should be true if you set both to the default next branches...

Then as the README.md states, there is a directory tools-cmd for Windows, and a tools-sh for unix, each with their own README.md, which further describes the testing process...

Of course the tools for windows are *.bat files, and for unix *.sh scripts, but both use environment variables to set up what to do, the most important is TY_TIDY_PATH. This is the full path to the tidy to use, and must be set by you...

This is another reason why by default the console app tidy is linked with the static library, so you can test with various version of tidy without concern about what shared library is installed.

And in this issue-700 case, first a tidy-700 needs to be built using that branch in the source. And that version of tidy run on the issue-700 branch of the regression tests to get a successful result...

The tests are in sort of two phases. The first is to run tidy on a test case input, with a config in some cases, and write an ouput html and error file, and the tidy exit is checked, but this is only the first part of the test. Then the current output is compared to an expects output, and there must be no differences in either phase for a SUCCESSFUL test run...

There are also several sets of tests, but only the testbase cases have been fully checked.

There are also xml, access, and special cases but these probably need some work. See even issue 20 , for xml, which has been open for many months...

But we do encourage devs to run the testbase cases for each source fix, as they are a good test for backward compatibility, ie. regression testing... Advise if you need further help... thanks...

geoffmcl added a commit that referenced this issue Apr 19, 2018
Is #700 - change script parsing if in html5 mode - PR #703
geoffmcl added a commit that referenced this issue Apr 19, 2018
@geoffmcl
Copy link
Contributor

@petdance have now merged PR #703, and in the regression test, merged PR 27 to match...

Have re-run the regression tests next branch, version 5.7.8, using the now next branch here 5.7.8 and all passes...

Acordingly closing this, but if I missed something, please feel free to re-open, or open a new issue... thanks...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants