x/net/html: unexpected whitespace rendering of html #30772
Comments
cc: @bradfitz |
Nit: I think this is related to |
A minimal example input:
<html><head></head><body></body>
</html>
Outputs in Go:
<html><head></head><body>
</body></html>
Chrome gives identical output:
<html><head></head><body>
</body></html>
As does Firefox:
<html><head></head><body>
</body></html> |
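To make the difference concrete, here is a small standard-library sketch that compares the input and the Go output quoted above. The helper names `stripSpace` and `firstDiff` are made up for illustration and are not part of `x/net/html`; the two strings are copied from the example. It only suggests that the two serializations differ in where the newline sits, not in anything else.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripSpace removes every whitespace rune from s.
// Hypothetical helper: far too coarse for a real test, since it also
// removes whitespace inside text nodes.
func stripSpace(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1 // drop the rune
		}
		return r
	}, s)
}

// firstDiff returns the byte offset of the first difference between
// a and b, or -1 if the strings are equal.
func firstDiff(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	// Input and Go output as quoted in the comment above.
	in := "<html><head></head><body></body>\n</html>"
	out := "<html><head></head><body>\n</body></html>"

	fmt.Println("byte-identical:", in == out)
	fmt.Println("identical ignoring whitespace:", stripSpace(in) == stripSpace(out))
	fmt.Println("first differing offset:", firstDiff(in, out))
}
```

The first differing offset falls right after `<body>`, which is where the newline that originally followed `</body>` ends up.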
Things get more interesting with comment blocks, where Go, Chrome and Firefox all disagree...
Input:
<html><head></head><body>
<!--a-->
<!--b-->
</body>
</html>
<!--c-->
Outputs in Go, which changes whitespace but keeps the final comment in the same relative place:
<html><head></head><body>
<!--a-->
<!--b-->
</body></html><!--c-->
Chrome moves the final comment inside the body:
<head></head><body>
<!--a-->
<!--b-->
<!--c--></body></html>
Firefox does the same as Go with slightly different spacing:
<html><head></head><body>
<!--a-->
<!--b-->
</body></html>
<!--c--> |
According to Section 12.2.6.4.22, I believe Chrome is at fault with regard to the final comment positioning. However, I think that is a separate issue from the one originally reported here. |
There are a couple of facets to this IMO:
So what to do? Given that browser implementations are out of scope to fix, it seems sane to just ignore them. The easiest "solution" would be to just change the docs to reflect 2). Given that I am rather an advocate of a proper parsing-serialization round trip I'd love to see at least a test case based on the above example. But in the end it comes down to what |
I agree with you @tcurdt that the browser implementations shouldn't necessarily impact the approach used in However, the spec is very complex, and Chrome & Firefox are two well-maintained implementations that can serve as a cross-reference. In the original report above, Chrome & Firefox (and also html5lib) all agree with The documentation mentions Are we confident there is a mistake in the implementation of the spec? |
@TomAnthony but you left out the most relevant part for this issue when quoting
which gives this a completely different spin. TBH I don't have enough knowledge of the spec to make any educated comments - but I do think it's surprising that there are whitespace changes happening on a In my use case I wanted to parse an HTML file, change something and then write it back. Since humans would not write whitespace the way it is rendered, this is a problem for the task at hand. Ideally the round trip would not change a thing. As a second choice I would use my own renderer of the DOM tree and so be in control of the rendered output - but I didn't see an obvious way to do this yet. |
@tcurdt - yeah, I definitely see your point and I agree it seems surprising. However, I am unsure exactly how to define 'well-formed'... The spec states However, rather than being 'dropped on the floor' the white space between your Either way, I think the white space between I think the behaviour that is interrupting your process is entirely prompted by whitespace outside of the |
If there isn't really a good path in code to solve this, I think the "well-formed" part should be removed from the documentation - especially if there is uncertainty about the definition. That would create much less confusion given the start of the sentence. As for the actual behaviour, it kind of means there are two representations of the DOM: the input as usually given by humans, and the rendered version that conforms to the spec. So unless a non-standard DOM is allowed to exist, there is no way to make the round trip work without a hitch. It is indeed mostly the white space outside the body element that is giving me grief. Working around this by dealing with fragments and then combining the strings feels like a hack and quite fragile. I think there is a place for both the spec-compliant and the non-spec-compliant representation. But that's just not supported. I guess I either need to make my peace with the change of whitespace or write my own parser/DOM. A bit of a shame. Since I doubt the |
I'm just a 'visitor' here - I recently had an issue in I disagree that the docs should be updated (or at least the segment you referenced) as that talks about the rendering step, but your issue occurs in the parsing step. The output of render produces an identical tree if fed back into parse. I actually think if you want to maintain "ill-formed" whitespace outside of the body, and your changes are all in the body, that the fragment approach would be an ok approach. I think trying to create your own parser/renderer would be a recipe for pain! Sorry I cannot be more help! 😞 |
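For reference, the fragment approach discussed here could look roughly like the sketch below. It is a naive illustration only: `splitBody` is a hypothetical helper, it assumes a single lowercase `<body ...>` start tag with no `>` inside attribute values and a literal `</body>` end tag, and the `strings.ReplaceAll` call stands in for whatever parse, modify and render step would actually be applied to the body content (for instance via `html.ParseFragment` and `html.Render` from `x/net/html`).

```go
package main

import (
	"fmt"
	"strings"
)

// splitBody cuts doc into three pieces: everything up to and including
// the <body ...> start tag, the body content, and everything from
// </body> onwards. Hypothetical helper based on plain string search;
// it does not understand HTML beyond finding the two tags.
func splitBody(doc string) (prefix, body, suffix string, ok bool) {
	open := strings.Index(doc, "<body")
	if open < 0 {
		return "", "", "", false
	}
	gt := strings.Index(doc[open:], ">")
	if gt < 0 {
		return "", "", "", false
	}
	start := open + gt + 1
	end := strings.LastIndex(doc, "</body>")
	if end < start {
		return "", "", "", false
	}
	return doc[:start], doc[start:end], doc[end:], true
}

func main() {
	doc := "<html><head></head><body><p>hi</p></body>\n</html>\n"

	prefix, body, suffix, ok := splitBody(doc)
	if !ok {
		fmt.Println("no body element found")
		return
	}

	// Stand-in for the real edit of the body content.
	edited := strings.ReplaceAll(body, "hi", "bye")

	// Prefix and suffix are copied through byte for byte, so the
	// whitespace outside the body is left exactly as written.
	fmt.Print(prefix + edited + suffix)
}
```

Because only the body content goes through the edit, the newline between `</body>` and `</html>` is never seen by the parser and cannot be moved. The fragility mentioned above is real, though: the string search breaks on input that does not match the stated assumptions.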
Well, the "how this should even work" is probably the least of the problems. Without even looking at the implementation I am pretty sure that somewhere along the way of building the DOM some nodes are left out or fixed for compliance reasons. A simple KISS parser would just build the "illegal" DOM as is. You have a point on the rendering step though.
it does not make any claims on parsing the original input but just the rendering of the input. In hindsight, looking at the documentation of
From this it should be clear that a round trip cannot be ensured. Writing a parser that just builds the given HTML into a (non-compliant) DOM is not rocket science. I assume one could even use the existing tokenizer. As hinted above, it will be much simpler than what the current implementation does. I just hoped to avoid it. |
The problem with creating a "non-compliant DOM" is the DOM is designed to be a tree structure. Some badly formed HTML cannot form a tree: <form><div></form></div> This cannot be turned into a tree and re-serialised as it is. The HTML 5 spec has algorithms designed to handle this sort of case and re-nest things correctly. There are many other such cases. However assuming your input is guaranteed to be well-formed (in terms of tag nesting, rather than what tags/tokens are allowed where), then you may find an existing parser that works (maybe lxml/BeautifulSoup?). Good luck! 🙂 |
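As a rough illustration of the nesting point, the following standard-library sketch checks whether start and end tags pair up like a stack, which is the minimum needed before markup can be read as a tree without repair. `properlyNested` and `tagRe` are hypothetical names, and the regular expression is a deliberately crude stand-in for a real tokenizer: it ignores void elements such as `<br>`, optional end tags, and tags hidden inside comments or scripts.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches a start or end tag and captures the optional slash
// and the tag name. Crude approximation, not a real HTML tokenizer.
var tagRe = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// properlyNested reports whether every end tag closes the most
// recently opened element and nothing is left open at the end.
func properlyNested(src string) bool {
	var stack []string
	for _, m := range tagRe.FindAllStringSubmatch(src, -1) {
		name := strings.ToLower(m[2])
		if m[1] == "" {
			// Start tag: remember it as the innermost open element.
			stack = append(stack, name)
			continue
		}
		// End tag: it must match the innermost open element.
		if len(stack) == 0 || stack[len(stack)-1] != name {
			return false
		}
		stack = stack[:len(stack)-1]
	}
	return len(stack) == 0
}

func main() {
	fmt.Println(properlyNested("<form><div></form></div>")) // misnested
	fmt.Println(properlyNested("<form><div></div></form>")) // nests cleanly
}
```

Input that fails this kind of check is exactly where a parser has to re-nest elements, so a byte-for-byte round trip is not possible for it.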
Yes, one has to draw a line somewhere. It's gotta be a tree :) |
@tcurdt
import lxml.html
foo = """<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>
<body>
body content <p>more content</p>
</body>
</html>
"""
bar = lxml.html.fromstring(foo)
print(lxml.html.tostring(bar)) |
...but that's not in go :) |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (go env)?
What did you do?
I am parsing and then rendering html.
What did you expect to see?
With the docs saying:
Given that the HTML is well-formed, I'd expect the output to be the same as the input.
What did you see instead?
Instead I am seeing changes in whitespace:
IMO there should be a test case verifying that the output matches the input for the documented case.