x/text: "€" is incorrectly encoded as "\x80" in GB18030 (should have been "\xA2\xE3") #48691

Closed
kennytm opened this issue Sep 30, 2021 · 3 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

kennytm commented Sep 30, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Playground.

https://play.golang.org/p/ELVa9gu69JX

What did you do?

Encode "€" (U+20AC).

package main

import (
	"fmt"
	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	s, _ := simplifiedchinese.GB18030.NewEncoder().String("€")
	if s != "\xA2\xE3" {
		panic(fmt.Sprintf("%x", s))
	}
}

What did you expect to see?

Pass.

What did you see instead?

panic: 80

goroutine 1 [running]:
main.main()
	/tmp/sandbox2937657549/prog.go:11 +0xb9

The character "€" is encoded as "\x80". This is a proprietary Microsoft extension in CP936 (GBK). In the GB18030 standard, the proper encoding is "\xA2\xE3".

While the GB18030 decoder recognizes both "\x80" and "\xA2\xE3" as "€" (as it should), the encoder should prefer to generate "\xA2\xE3" over the non-standard "\x80".
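To make the requested asymmetry concrete, here is a minimal stdlib-only sketch, not the x/text implementation. The helpers `decodeEuro` and `encodeEuro` are hypothetical names introduced for illustration, and they handle only the euro sign: the decode side accepts both byte sequences, while the encode side emits only the standard two-byte form.

```go
package main

import "fmt"

// euro is the code point U+20AC discussed in this issue.
const euro rune = 0x20AC

// decodeEuro is a hypothetical helper (not part of x/text). It reports
// whether b starts with either byte sequence that a lenient GB18030
// decoder accepts for the euro sign, and how many bytes were consumed.
func decodeEuro(b []byte) (r rune, size int, ok bool) {
	switch {
	case len(b) >= 1 && b[0] == 0x80:
		// Non-standard single-byte form from Microsoft's CP936/GBK.
		return euro, 1, true
	case len(b) >= 2 && b[0] == 0xA2 && b[1] == 0xE3:
		// Standard two-byte GB18030 form.
		return euro, 2, true
	}
	return 0, 0, false
}

// encodeEuro is a hypothetical helper that always emits the standard
// GB18030 form, which is the behavior this issue asks the encoder to prefer.
func encodeEuro() []byte {
	return []byte{0xA2, 0xE3}
}

func main() {
	for _, in := range [][]byte{{0x80}, {0xA2, 0xE3}} {
		r, size, ok := decodeEuro(in)
		fmt.Printf("% X -> %U ok=%v size=%d\n", in, r, ok, size)
	}
	fmt.Printf("encode -> % X\n", encodeEuro())
}
```

With this shape, any input that decodes to "€" re-encodes to "\xA2\xE3", so a round trip normalizes the non-standard "\x80" away.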

@gopherbot gopherbot added this to the Unreleased milestone Sep 30, 2021
@ALTree ALTree added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 30, 2021
AlexanderYastrebov added a commit to AlexanderYastrebov/text that referenced this issue Oct 3, 2021
The euro sign is an exception which is given a single byte code of 0x80
in Microsoft's later versions of CP936/GBK and a two byte code of A2 E3
in GB18030. https://en.wikipedia.org/wiki/GB_18030#cite_note-4

Fixes golang/go#48691
@gopherbot
Contributor

Change https://golang.org/cl/353712 mentions this issue: encoding/simplifiedchinese: Fixes € encoding in GB18030

@AlexanderYastrebov
Contributor

According to https://encoding.spec.whatwg.org/#gbk-decoder and https://encoding.spec.whatwg.org/#gb18030-decoder

GBK’s decoder is gb18030’s decoder.
If byte is 0x80, return code point U+20AC.

That is, both decoders should accept \x80 as well as \xA2\xE3 for U+20AC.

According to https://encoding.spec.whatwg.org/#gbk-encoder and https://encoding.spec.whatwg.org/#gb18030-encoder

GBK’s encoder is gb18030’s encoder with its is GBK set to true.
If is GBK is true and code point is U+20AC, return byte 0x80.
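The encoder rule quoted above can be sketched as a single function with a flag. This is a rough stdlib-only illustration, not the WHATWG algorithm in full and not the x/text code: `encodeEuroSign` and its `isGBK` parameter are hypothetical names, only U+20AC is handled, and the `A2 E3` result for the non-GBK case is taken from this issue's description of GB18030.

```go
package main

import "fmt"

// encodeEuroSign is a hypothetical helper covering only the euro-sign
// special case. It returns ok=false for any other code point, where a
// real encoder would fall through to its index lookup.
func encodeEuroSign(r rune, isGBK bool) (out []byte, ok bool) {
	if r != 0x20AC {
		return nil, false
	}
	if isGBK {
		// GBK: the single-byte Microsoft extension.
		return []byte{0x80}, true
	}
	// gb18030: the standard two-byte sequence.
	return []byte{0xA2, 0xE3}, true
}

func main() {
	for _, isGBK := range []bool{true, false} {
		out, _ := encodeEuroSign(0x20AC, isGBK)
		fmt.Printf("isGBK=%v -> % X\n", isGBK, out)
	}
}
```

Read this way, producing "\x80" is only expected from the GBK encoder; a gb18030 encoder has the flag unset and should produce "\xA2\xE3".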

@golang golang locked and limited conversation to collaborators Oct 4, 2022
xhit pushed a commit to xhit/text that referenced this issue Oct 10, 2022

Change-Id: I6a4460274d4313ad1d03bcd8070373af674691eb
GitHub-Last-Rev: acbbc50
GitHub-Pull-Request: golang#26
Reviewed-on: https://go-review.googlesource.com/c/text/+/353712
Reviewed-by: Nigel Tao <nigeltao@golang.org>
Trust: Nigel Tao <nigeltao@golang.org>
Trust: Alberto Donizetti <alb.donizetti@gmail.com>
Run-TryBot: Nigel Tao <nigeltao@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>