Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespace removed after <a>...</a> #20

Closed
ralfjunker opened this issue Mar 19, 2012 · 8 comments
Closed

Whitespace removed after <a>...</a> #20

ralfjunker opened this issue Mar 19, 2012 · 8 comments

Comments

@ralfjunker
Copy link

Current matster (4ff3234) removes white space following <a>...</a>. This HTML

<html>
<head>
</head>
<body>
<a href="n" name="n">one</a> two
</body>
</html>

is cleaned up to

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<a href="n" name="n" id="n">one</a>two
</body>
</html>

Notice that the single white space before "two" is missing in the output.

This is a regression from the original Tidy from SourceForge.

@thieso2
Copy link
Contributor

thieso2 commented Mar 20, 2012

This patch fixes it:

https://github.com/thieso2/tidy-html5/commit/0ef710fe90119e204f01adc6a9fc3635b5ab15cb

It doesn't seem to break

b0997b2

which introduces the regression.

If only I knew how to add testcases to this project I'd write a test and send a pull-request!

@ralfjunker
Copy link
Author

Patch working fine.Thanks!

@sideshowbarker
Copy link
Contributor

I cherry-picked the change and pushed it to the w3c/tidy-html5.

I would like to add testcases too but this whole thing's experimental at this point. It somebody feels really strongly that we shouldn't be making code changes without adding test cases, there's nothing stopping anybody from volunteering to go through the changelog and write up tests.

@sideshowbarker
Copy link
Contributor

See a3d49a7#commitcomment-1143626

I need to back out the change.

The markup fix for this is to put the element inside something other than

or

or whatever.

sideshowbarker added a commit that referenced this issue Mar 29, 2012
@sideshowbarker
Copy link
Contributor

Note that you will see this same behavior with <ins> and <del> also, if you do <body><ins>one</ins> two</body>.

It's a consequence of the fact that in HTML5, the <a> element, like the <ins> and <del> can contain block content, but it can also still be used inline. But that doesn't fit well with the model that Tidy uses for whitespace handling.

We'll have to figure out a better fix for this later, but for the time being, the workaround is to not have the <a> element as a child of the <body> element.

@sideshowbarker
Copy link
Contributor

The root cause of this is in the part of the parser code that parses the contents of the body element:

https://github.com/w3c/tidy-html5/blob/master/src/parser.c#L3291

@sideshowbarker
Copy link
Contributor

So the specific cause for this is at line 3528 of parser.c:

https://github.com/w3c/tidy-html5/blob/master/src/parser.c#L3528

if ( TY_(nodeHasCM)(node, CM_INLINE) && !TY_(nodeHasCM)(node, CM_MIXED) )

Removing the && !TY_(nodeHasCM)(node, CM_MIXED) from that statement will cause the whitespace for this case to be preserved.

I don't really understand why that condition is there, but it seems it might relate to the comment just after it:

/* HTML4 strict doesn't allow inline content here */
/* but HTML2 does allow img elements as children of body */

...which is true: in HTML4 strict, <body><a> is invalid. But I don't see what that has to do with how the whitespace is handled.

I could try removing the && !TY_(nodeHasCM)(node, CM_MIXED) but I took a look at the code and I notice it's always been that way, since r1.1 of the original Tidy sources in CVS. So I'm a bit reluctant to change it for fear of introducing regressions.

Will need to think more about what to do for this. I guess introducing a new option might be one choice.

@sideshowbarker
Copy link
Contributor

Tried a few more cases and I still really can't see what was the original intent of having the code ignore whitespace after these CM_MIXED elements. So I'm going ahead to check in this change. If this breaks anything, it will be restricted to the ins, del, a, script, and noscript elements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants