html: UnescapeString unescapes HTML character references without a final semicolon #21563

@stjj89

What version of Go are you using (go version)?

go version go1.9rc2_cl165246139 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

html.UnescapeString treats HTML character references that are missing the final ; as valid and unescapes them. For example, &#58 is unescaped to :.

https://play.golang.org/p/oyPAjmj0s_

The HTML5 specification states that all valid character references must be terminated by a ; character.

https://www.w3.org/TR/html5/syntax.html#character-references

Therefore, character references such as &#58 that are missing this semicolon should not be unescaped.
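To make the requested behaviour concrete, the following is a rough sketch of a caller-side wrapper that only unescapes references ending in a semicolon. The name strictUnescape and the regular expression are assumptions made for illustration; they are not part of the html package and not a proposed patch. Each match is still handed to html.UnescapeString, so the wrapper inherits whatever that function does with a terminated reference.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// terminatedRef matches decimal, hexadecimal, and named character
// references that end with a semicolon. (Hypothetical helper pattern.)
var terminatedRef = regexp.MustCompile(`&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);`)

// strictUnescape is a hypothetical wrapper: it unescapes only the
// semicolon-terminated references and leaves everything else untouched.
func strictUnescape(s string) string {
	return terminatedRef.ReplaceAllStringFunc(s, html.UnescapeString)
}

func main() {
	fmt.Printf("%q\n", strictUnescape("&#58;"))         // prints ":"
	fmt.Printf("%q\n", strictUnescape("&#58"))          // prints "&#58"
	fmt.Printf("%q\n", strictUnescape("a &amp; b &lt")) // prints "a & b &lt"
}
```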

Note: the authors of this function probably intended to accept unterminated character references (see this test case). This was probably to handle an edge case mentioned in the HTML4 spec (https://www.w3.org/TR/html4/charset.html#entities):

In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
