Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected parsing with uppercase DOCTYPE #815

Closed
da2x opened this issue Apr 23, 2019 · 6 comments
Closed

Unexpected parsing with uppercase DOCTYPE #815

da2x opened this issue Apr 23, 2019 · 6 comments
Labels

Comments

@da2x
Copy link

da2x commented Apr 23, 2019

From my blog post detailing the issue. Both of the following two test case documents trigger Info: Document content looks like HTML5 message but produce different results (see below).

Test case 1 (lower case doctype):

echo '<!DOCTYPE html><a href="#"><p>Text</p></a>' | tidy

Test case 1 output (works as expected):

<!DOCTYPE html>
<html>
<body>
<a href="#">
<p>Text</p>
</a>
</body>
</html>

Test case 2 (upper case doctype):

echo '<!DOCTYPE HTML><a href="#"><p>Text</p></a>' | tidy

Test case 2 output (broken; uses legacy pre-html5 parsing):

<!DOCTYPE html>
<html>
<body>
<a href="#"></a>
<p>Text</p>
</body>
</html>

The spec clearly states that the DOCTYPE should be parsed case-insensitively.

(Output in both examples have been trimmed to focus on the differences.)

@geoffmcl
Copy link
Contributor

@da2x thank you for the issue... it seems a good catch...

The clue was your mention of uses legacy pre-html5 parsing... yes, that is what is happening...

Have experimented with a simple fix, to code I added through commit 0dc68d6, Fri Mar 6 12:49:30 2015 +0100, as follows -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..fc2a4e3 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -1828,7 +1828,11 @@ static uint FindGivenVersion( TidyDocImpl* doc, Node* doctype )
 
     if (!fpi || !fpi->value) 
     {
-        if (doctype->element && (TY_(tmbstrcmp)(doctype->element,"html") == 0))
+        /*\
+         * Is. #815 - change to case-insensitive test
+         * See REC: https://www.w3.org/TR/html5/syntax.html#the-doctype
+        \*/
+        if (doctype->element && (TY_(tmbstrcasecmp)(doctype->element,"html") == 0))
         {
             return VERS_HTML5;  /* TODO: do we need to check MORE? */
         }

Will try to get around to putting this fix in a branch and create a PR, for wider testing... unless someone beats me to it... maybe remind me if taking too long...

Meantime, look forward to any further, other feedback... thanks...

@geoffmcl
Copy link
Contributor

@da2x, and/or others, please checkout branch issue-815, test, and report... thanks...

@da2x
Copy link
Author

da2x commented May 3, 2019

The branch resolves the described issue and I’ve not noticed any regressions.

@geoffmcl
Copy link
Contributor

@da2x got around to creating the #832 PR...

Have not had time to do the regression tests... have you?

@dd8
Copy link

dd8 commented Aug 21, 2019

We've just been bitten with this as well - and are pushing the fix into production (caveat - we only use tidy for pretty printing)

These might be useful - reduced test cases for HTML5 mode:

<!DOCTYPE html>
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>
<!DOCTYPE HTML>
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>
<!DOCTYPE html SYSTEM "about:legacy-compat">
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>
<!DOCTYPE HTML SYSTEM "about:legacy-compat">
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>
<!DOCTYPE html SYSTEM 'about:legacy-compat'>
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>
<!DOCTYPE HTML SYSTEM 'about:legacy-compat'>
<html lang='en'>
<body>
   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>
</body>
</html>

without the fix AdjustTags transforms:

   <a href="/" onclick="">
      <p class="description">Social Media Policy</p>
   </a>

into

   <a href="/" onclick=""></a>
   <p class="description">Social Media Policy</p>

geoffmcl added a commit that referenced this issue Sep 28, 2020
Is #815 - Use case-insensitive test 'html'
@geoffmcl
Copy link
Contributor

@da2x @dd8 got around to doing the regression testing, but nothing popped...

Have not yet got around to adding some tests for this, but at least added in_815.html and in_815-1.html legacy doc, to my sample repo... base on your 2nd sample...

Any help getting these into the regression repo would be most appreciated... thanks...

But meantime closing this as fixed... thanks...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants