Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input #29535

Open
danielbeaudreau opened this issue Jan 3, 2019 · 3 comments

Comments

@danielbeaudreau
Copy link

commented Jan 3, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

amd64, linux

What did you do?

  1. Call charset.Lookup("us-ascii") to get an encoding
    https://godoc.org/golang.org/x/net/html/charset#Lookup

e, _ := charset.Lookup("us-ascii")

  1. Call encoding.NewDecoder().String(string([]byte{0x80})) and get the result

res, _ := e.NewDecoder().String(string([]byte{0x80}))

  1. Print the result

What did you expect to see?

I expect to see a � character, since the value is outside of the range of US-ASCII (most significant bit is 1)

What did you see instead?

I will see the character € from the Windows 1252 encoding instead. This is caused because go is re-using Windows 1252 for US-ASCII. Similar issues arise for out of bounds characters in other charactersets, for example tis-620 maps to windows874. Now if I want to correctly parse the text I need to read through the decoded runes and test if any of them are out of bounds. If I want to use windows874 for just tis-620 characters, I would have to do a similar manual exclusion of out of bounds characters. I do not know of a way to create my own characterset so that these problems can be avoided.

@Wessie

This comment has been minimized.

Copy link

commented Jan 3, 2019

this seems to follow https://www.w3.org/TR/encoding/#names-and-labels where us-ascii is an alias for windows 1252

@danielbeaudreau

This comment has been minimized.

Copy link
Author

commented Jan 3, 2019

Interesting, but US-ASCII and Windows1252 are different charactersets (windows 1252 is a superset).
See https://en.wikipedia.org/wiki/ASCII#Character_set
and https://en.wikipedia.org/wiki/Windows-1252#Character_set

Other languages like Java allow users to differentiate between these character sets. For decoding legacy text, it is not ideal to use a superset characterset like windows1252, because invalid characters which are not expressible in the subset character set can be inserted into the result if the user's text is invalid. This requires developers to implement workarounds to ensure invalid text is not contained in the result.

@katiehockman katiehockman added this to the Unplanned milestone Jan 4, 2019

@katiehockman

This comment has been minimized.

Copy link
Contributor

commented Jan 4, 2019

/cc @rsc

@katiehockman katiehockman changed the title Go's reuse of character sets causes incorrect decoding of invalid input src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019

@katiehockman katiehockman changed the title src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019

@agnivade agnivade changed the title encoding: Go's reuse of character sets causes incorrect decoding of invalid input x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 5, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.