New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: Self-closing element results in wrong parsing #22834

Closed
ernsheong opened this Issue Nov 21, 2017 · 6 comments

Comments

Projects
None yet
5 participants
@ernsheong
Copy link

ernsheong commented Nov 21, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.9.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/ernsheong/Gocode"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.9.2/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.9.2/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/s5/6jqwjnjd46g_pd35sxdfh_700000gn/T/go-build017081441=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Okay. Browsers are very forgiving when it comes to rendering HTML. A canvas element should be declared as <canvas width="100" height="100"></canvas>. However, if it is rendered as just <canvas width="100" height="100">, the browser would be fine with it as well. But the html.Parse() function is not as forgiving and parses this wrongly.

node, err := html.Parse(strings.NewReader(
  `<html>
	<head>
	</head>
	<body>
	<div class="foo">
                <!-- CANVAS 1 -->
		<canvas></canvas>
		<div>BAR</div>
                <!-- CANVAS 2 -->
		<canvas>
		<p>BAZ</p>
	</div>
	</body>
</html>`
))

// ... traverse through all nodes until we reach the canvas node.

What did you expect to see?

The Node for CANVAS 1 (above) behaves as expected, it has a NextSibling != nil

What did you see instead?

The Node for CANVAS 2 (above) does not behave as expected, it has a NextSibling == nil. Instead, it has a FirstChild != nil and FirstChild.NextSibling != nil that holds that value of the correct NextSibling

I believe this is also related to #22298

@gopherbot gopherbot added this to the Unreleased milestone Nov 21, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

ianlancetaylor commented Nov 21, 2017

@bradfitz

This comment has been minimized.

Copy link
Member

bradfitz commented Nov 21, 2017

I know nothing about this package, and I believe @tombergan doesn't either.

@tombergan

This comment has been minimized.

Copy link
Contributor

tombergan commented Nov 22, 2017

Well, I know a little bit....

If I run the following program:

package main

import (
    "os"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    node, _ := html.Parse(strings.NewReader(
        `<html>
            <head>
            </head>
            <body>
            <div class="foo">
                <!-- CANVAS 1 -->
                <canvas></canvas>
                <div>BAR</div>
                <!-- CANVAS 2 -->
                <canvas>
                <p>BAZ</p>
            </div>
            </body>
        </html>`))
    html.Render(os.Stdout, node)
}

I get:

<html><head>
	</head>
	<body>
	<div class="foo">
		<!-- CANVAS 1 -->
		<canvas></canvas>
		<div>BAR</div>
		<!-- CANVAS 2 -->
		<canvas>
		<p>BAZ</p>
	</canvas></div>
			
</body></html>

If I save your original HTML to a local file, load that file in Chrome, then run document.body.outerHTML in the devtools console, I get the following, which is exactly the same as Go's output:

"<body>
	<div class="foo">
		<!-- CANVAS 1 -->
		<canvas></canvas>
		<div>BAR</div>
		<!-- CANVAS 2 -->
		<canvas>
		<p>BAZ</p>
	</canvas></div>
		
</body>"

The <canvas>'s NextSibling is nil because the second <canvas> is the last child of the <div>.
I don't see a bug. Everything looks correct. Please reopen if you disagree.

@tombergan tombergan closed this Nov 22, 2017

@ernsheong

This comment has been minimized.

Copy link

ernsheong commented Nov 22, 2017

@tombergan Please reopen. I am unable to reopen.

From your testing code above, you missed out a detail.

Note the input has no canvas at the end. <p>BAZ</p> is the final element in the div:

func main() {
    node, _ := html.Parse(strings.NewReader(
        `<html>
            <head>
            </head>
            <body>
            <div class="foo">
                <!-- CANVAS 1 -->
                <canvas></canvas>
                <div>BAR</div>
                <!-- CANVAS 2 -->
                <canvas>
                <p>BAZ</p> <!-- </CANVAS> is not present -->
            </div>
            </body>
        </html>`))
    html.Render(os.Stdout, node)
}

Note output, </canvas> magically reappeared again at the end, this time enclosing <p>BAZ</p>, which is the same issue as #22298 involving the iframe element:

<html><head>
	</head>
	<body>
	<div class="foo">
		<!-- CANVAS 1 -->
		<canvas></canvas>
		<div>BAR</div>
		<!-- CANVAS 2 -->
		<canvas>
		<p>BAZ</p>
	</canvas></div> <!-- </CANVAS> reappears!, enclosing <P> -->
			
</body></html>

The issue is that the parser is expecting a closing </canvas> tag, but there is none, and thus treats a first non-canvas tag it sees as a child (should canvas nodes have children?).

@tombergan

This comment has been minimized.

Copy link
Contributor

tombergan commented Nov 22, 2017

Note output, magically reappeared again at the end, this time enclosing

BAZ

It magically appears because the second <canvas> is never closed. The HTML spec has rules about when never-closed elements get closed. Given that Chrome and Go are both doing the same thing, I'm guessing that both are correct. It seems the rule is that </canvas> should be inserted before the containing </div>. I'm far from an expert here, but I bet the answer is buried somewhere in this section:
https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser

If you can quote section and verse from the spec showing why the above behavior is wrong, we'll reopen this issue.

which is the same issue as #22298 involving the iframe element

These are separate issues. #22298 is about a self-closing element being translated to separate start/end elements. This issue is about a closing element being added for an element that was never closed.

@ernsheong

This comment has been minimized.

Copy link

ernsheong commented Dec 8, 2017

Here's an example out in the wild, by Evernote (Chrome 62 on macOS):
image

Notice the canvas is self-closing, without a closing </canvas> tag. This was the example that caused me to have to code a workaround. Chrome/Safari is graceful about this, but the Go parser is strict, although Go parser is merely obeying spec. Firefox actually adds the closing </canvas> tag upon inspection.

Here is a real example: https://www.evernote.com/shard/s5/sh/d53a3747-d849-4b10-b17e-dbc9dee0383c/7417220218a9f310 I note that the canvas was likely JS-injected.

@golang golang locked and limited conversation to collaborators Dec 8, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.