html: html.UnescapeString("(&#9)") does not decode single character number #66058

Open
yffrankwang opened this issue Mar 1, 2024 · 8 comments · May be fixed by #66140
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@yffrankwang

Go version

go version go1.21.4 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/ubuntu/.cache/go-build'
GOENV='/home/ubuntu/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/ubuntu/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/ubuntu/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/gol.21.4'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/gol.21.4/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4093406638=/tmp/go-build -gno-record-gcc-switches'

What did you do?

The html.UnescapeString() function does not decode a single-digit numeric character reference correctly.

package main

import (
	"fmt"
	"html"
)

func main() {
	fmt.Println(html.UnescapeString("(&#9)"))
}

https://go.dev/play/p/kCEC5INrCNt

What did you see happen?

output:

(&#9)

I think the "No characters matched." check in html.unescapeEntity() is incorrect.

https://cs.opensource.google/go/go/+/refs/tags/go1.22.0:src/html/escape.go;l=107
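A minimal sketch of why a fixed-index check could misfire here. If the "No characters matched." test compares the scan index with a constant, as the linked line appears to, then "&#9" (no "x", one digit, no semicolon) ends at the same index as an input with no digits at all. The function below is hypothetical and simplified, not the standard library's code: it records where the digits start and checks whether the index moved past that point. Code point validation and the replacement table are left out.

```go
package main

import "fmt"

// decodeNumericRef is a hypothetical, simplified decoder for a numeric
// character reference at the start of s. It is NOT the html package's code.
// It returns the decoded rune, the number of bytes consumed, and whether
// at least one digit was matched. Range checks on the code point are omitted.
func decodeNumericRef(s string) (r rune, n int, ok bool) {
	if len(s) < 3 || s[0] != '&' || s[1] != '#' {
		return 0, 0, false
	}
	i := 2
	hex := false
	base := rune(10)
	if s[i] == 'x' || s[i] == 'X' {
		hex = true
		base = 16
		i++
	}
	start := i // index where the first digit is expected

loop:
	for ; i < len(s); i++ {
		c := s[i]
		var d rune
		switch {
		case '0' <= c && c <= '9':
			d = rune(c - '0')
		case hex && 'a' <= c && c <= 'f':
			d = rune(c-'a') + 10
		case hex && 'A' <= c && c <= 'F':
			d = rune(c-'A') + 10
		default:
			break loop
		}
		r = r*base + d
	}

	// Compare against the digit start instead of a fixed constant, so the
	// decimal and hexadecimal forms are treated the same way.
	if i == start {
		return 0, 0, false
	}
	// The trailing semicolon is consumed when present but not required.
	if i < len(s) && s[i] == ';' {
		i++
	}
	return r, i, true
}

func main() {
	for _, s := range []string{"&#9)", "&#9;)", "&#x9)", "&#33)", "&#)"} {
		r, n, ok := decodeNumericRef(s)
		fmt.Printf("%-8q rune=%q consumed=%d ok=%v\n", s, r, n, ok)
	}
}
```

With this shape, "&#9" and "&#x9" both count as one matched digit, and "&#" followed by a non-digit is still rejected.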

What did you expect to see?

want:

(\t)

Edge, Chrome, and Firefox all render the following HTML with a tab character between the parentheses.

<pre>(&#9)</pre>

https://jsfiddle.net/Lkm6jy3c/

@seankhliao
Member

entities need to end with a semicolon
https://go.dev/play/p/KysJYYkgzbD

@seankhliao closed this as not planned Mar 1, 2024
@yffrankwang
Author

Entities do not need to end with a semicolon.

package main

import (
	"fmt"
	"html"
)

func main() {
	fmt.Println(html.UnescapeString("(&#33)"))
}

output:

(!)

https://go.dev/play/p/1EawuIJeFka

Edge, Chrome, and Firefox all accept an entity without a semicolon.
This should be a bug, right?

@seankhliao
Member

mdn says starts with & and ends with ;
https://developer.mozilla.org/en-US/docs/Glossary/Entity

whatwg html spec agrees
https://html.spec.whatwg.org/#character-references

Decimal numeric character reference
The ampersand must be followed by a U+0023 NUMBER SIGN character (#), followed by one or more ASCII digits, representing a base-ten integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

Hexadecimal numeric character reference
The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more ASCII hex digits, representing a hexadecimal integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).
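The authoring grammar quoted above can be sketched as a regular expression. This only checks the written form (ampersand, number sign, digits, required semicolon). It does not check whether the code point is allowed, and it says nothing about how a parser should recover when the semicolon is missing. The name `isStrictNumericRef` is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// strictNumericRef matches a whole string that is a decimal or hexadecimal
// numeric character reference in the strict form quoted above:
// "&#" + digits + ";" or "&#x"/"&#X" + hex digits + ";".
var strictNumericRef = regexp.MustCompile(`^&#(?:[0-9]+|[xX][0-9a-fA-F]+);$`)

// isStrictNumericRef is a hypothetical helper that reports whether s is
// written in that strict form. It does not validate the code point.
func isStrictNumericRef(s string) bool {
	return strictNumericRef.MatchString(s)
}

func main() {
	for _, s := range []string{"&#9;", "&#9", "&#x9;", "&#X1F;", "&#x;"} {
		fmt.Printf("%-9q strict=%v\n", s, isStrictNumericRef(s))
	}
}
```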

@yffrankwang
Author

Yes, the current HTML5 specification states that an entity must end with a semicolon.
BUT HTML 4.01 allows an entity without a semicolon.

https://www.w3.org/TR/1999/REC-html401-19991224/charset.html#entities

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

If html.UnescapeString() follows the HTML5 specification, how do you explain the output of the following code?

package main

import (
	"fmt"
	"html"
)

func main() {
	fmt.Println(html.UnescapeString("(&#33)"))
}

output:

(!)

@seankhliao
Member

see previously #21563

@yffrankwang
Author

#21563 (comment)

says:
This means that

<html><body>&#1;</body></html>

and

<html><body>&#1</body></html>

parse to the same document.


BUT the current html.UnescapeString() behaves differently for the two forms.
https://go.dev/play/p/45xKzS91su5

package main

import (
	"fmt"
	"html"
)

func main() {
	fmt.Printf("%q\n", html.UnescapeString("(&#1;)"))
	fmt.Printf("%q\n", html.UnescapeString("(&#1)"))
}

output:

"(\x01)"
"(&#1)"

If

<html><body>&#1;</body></html>
<html><body>&#1</body></html>

parse to the same document, then

html.UnescapeString("(&#1;)")

should equal

html.UnescapeString("(&#1)")

@yffrankwang
Author

@seankhliao Please reopen this issue.

This is clearly a bug. This code explains everything.

package main

import (
	"fmt"
	"html"
)

func main() {
	fmt.Printf("%q\n", html.UnescapeString("(&#x9)"))
	fmt.Printf("%q\n", html.UnescapeString("(&#9)"))
}

output:

"(\t)"
"(&#9)"

https://go.dev/play/p/3izgk-VqKG4

@seankhliao added the NeedsInvestigation label Mar 4, 2024
@seankhliao reopened this Mar 4, 2024
AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Mar 6, 2024
Fix handling of "&#9" and add tests for other single-digit cases.

Fixes golang#66058
Updates golang#21563
@gopherbot
Contributor

Change https://go.dev/cl/569456 mentions this issue: html: handle single digit decimal numeric entities without semicolon
