encoding/json: Unmarshal isn't recognizing endashes in keys / struct tags #35287

Closed
agocs opened this issue Oct 31, 2019 · 5 comments

@agocs agocs commented Oct 31, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13.3 darwin/amd64

Does this issue reproduce with the latest release?

Yep

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/cagocs/Library/Caches/go-build"
GOENV="/Users/cagocs/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/cagocs/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.13.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.13.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/cagocs/go/src/github.com/nautiluslabsco/foo/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/t2/wsplkh4d38q5g7yj_10qfb300000gn/T/go-build633074058=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version go1.13.3 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.13.3
uname -v: Darwin Kernel Version 19.0.0: Wed Sep 25 20:18:50 PDT 2019; root:xnu-6153.11.26~2/RELEASE_X86_64
ProductName:	Mac OS X
ProductVersion:	10.15
BuildVersion:	19A583
lldb --version: lldb-1001.0.13.3
  Swift-5.0

What did you do?

https://play.golang.org/p/El7R8jXFCRA

I have some JSON data that looks like this:

{
  "struct – tag": "endash",
  "struct - tag": "regular dash"
}

and a type called SomeStruct defined like this:

type SomeStruct struct {
	SomeString    string `json:"struct – tag"`
	AnotherString string `json:"struct - tag"`
}

where the struct tag on SomeString contains an En Dash, U+2013, and the struct tag on AnotherString contains a Hyphen Minus, U+002D.

I'm calling json.Unmarshal() on my JSON data, trying to unmarshal it into a SomeStruct.

What did you expect to see?

I'd expect to see a SomeStruct equal to

SomeStruct {
  SomeString: "endash",
  AnotherString: "regular dash",
}

What did you see instead?

Instead, I see a SomeStruct equal to

SomeStruct {
  SomeString: "",
  AnotherString: "regular dash",
}

I suspect this is because json.Unmarshal is having trouble with the en dash, either in the JSON key "struct – tag" or in the struct tag json:"struct – tag".
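The report can be reproduced with a short program along the following lines. This is a minimal sketch, not the contents of the playground link: the `decode` helper is a hypothetical name introduced here, and the second call assumes that encoding/json falls back to the Go field name when it does not accept a tag name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SomeStruct mirrors the type from the report.
type SomeStruct struct {
	SomeString    string `json:"struct – tag"` // en dash, U+2013
	AnotherString string `json:"struct - tag"` // hyphen-minus, U+002D
}

// decode is a hypothetical helper that unmarshals data into a SomeStruct.
func decode(data string) (SomeStruct, error) {
	var s SomeStruct
	err := json.Unmarshal([]byte(data), &s)
	return s, err
}

func main() {
	// The JSON from the report: SomeString stays empty, AnotherString is set.
	s, err := decode(`{"struct – tag": "endash", "struct - tag": "regular dash"}`)
	fmt.Printf("%+v %v\n", s, err)

	// Using the Go field name as the key does populate SomeString,
	// which suggests the en dash tag name is being ignored.
	s, err = decode(`{"SomeString": "via field name"}`)
	fmt.Printf("%+v %v\n", s, err)
}
```

If the second call fills in SomeString, the tag name is being discarded rather than mismatched byte for byte.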

@tandr tandr commented Oct 31, 2019

So you have a struct with 2 fields with the same json tag, am I reading it correctly?

@agocs agocs commented Oct 31, 2019

Good question. The struct tag on SomeString contains an En Dash, U+2013. The struct tag on AnotherString contains a Hyphen Minus, U+002D.

Hyphen Minus is the dash we're used to seeing on a typical keyboard, between 0 and =.

@tandr tandr commented Oct 31, 2019

Oh... (The information you have provided should be part of the ticket description, imho. I missed the Unicode part.)

Is it possible that the editor (or source control) you used at some point got too smart and replaced the en dash with a minus, because it thought you were opening an ASCII file instead of a Unicode one? Visually they are (almost) the same...

If anything, there is more than one intermediary between the code and the running binary: editors, the compiler that understands the tags, the json library itself, maybe some others. Sorry, I don't know how JSON is treated (by the JSON spec, or by libraries): does it allow UTF-8 tags?

@agocs agocs commented Oct 31, 2019

I realized my ticket was too vague and added notes about the codepoints to the ticket. Thank you for the feedback.

If you run https://play.golang.org/p/VxJcyJXHyAd, I have it dump the raw bytes of the JSON. The en dash is UTF-8 encoded, and comes up as e2 80 93. I've used the reflect package to get the raw bytes of the struct tag on SomeString, and you'll notice the same e2 80 93 dash.
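A byte dump of this kind could look roughly like the sketch below. It is not the code behind the playground link; `hexOf` and `tagHex` are hypothetical helper names, and the struct is copied from the report.

```go
package main

import (
	"fmt"
	"reflect"
)

// SomeStruct mirrors the type from the report.
type SomeStruct struct {
	SomeString    string `json:"struct – tag"` // en dash, U+2013
	AnotherString string `json:"struct - tag"` // hyphen-minus, U+002D
}

// hexOf returns the bytes of s as space-separated hex pairs.
func hexOf(s string) string {
	return fmt.Sprintf("% x", s)
}

// tagHex looks up the json tag of the named field via reflection
// and returns its bytes as hex. It returns "" for an unknown field.
func tagHex(field string) string {
	f, ok := reflect.TypeOf(SomeStruct{}).FieldByName(field)
	if !ok {
		return ""
	}
	return hexOf(f.Tag.Get("json"))
}

func main() {
	fmt.Println("JSON key:  ", hexOf("struct – tag"))
	fmt.Println("struct tag:", tagHex("SomeString"))
	fmt.Println("hyphen tag:", tagHex("AnotherString"))
}
```

Both the JSON key and the tag on SomeString should contain the sequence e2 80 93, while the tag on AnotherString contains the single byte 2d in that position.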

According to the JSON spec at https://www.json.org/, the key to a JSON object must be a string, defined as:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

However, if we look at https://golang.org/src/encoding/json/encode.go?s=6471:6514#L837, there's an isValidTag function, which checks that each rune in a tag is either a Unicode letter, a Unicode digit, or one of a small set of allowed ASCII punctuation characters. En dash is neither a Unicode letter nor a digit, and it is not in that ASCII set, but it should be a valid JSON key.
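The check can be paraphrased as the sketch below. The function name `validTagName` is hypothetical, and the body is a reconstruction of the unexported standard-library function as it stood around Go 1.13, so the linked source should be treated as authoritative.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// validTagName is a sketch of the tag-name check in encoding/json
// (a paraphrase, not the library's exported API).
func validTagName(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		switch {
		case strings.ContainsRune("!#$%&()*+-./:;<=>?@[]^_{|}~ ", c):
			// Allowed ASCII punctuation and space. Quotation marks,
			// backslash, and comma are deliberately absent.
		case !unicode.IsLetter(c) && !unicode.IsDigit(c):
			// Anything else must be a Unicode letter or digit.
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validTagName("struct - tag")) // hyphen-minus, U+002D
	fmt.Println(validTagName("struct – tag")) // en dash, U+2013
}
```

Under this rule the hyphen-minus name is accepted and the en dash name is rejected, because U+2013 is punctuation outside the ASCII allow-list.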

@agocs agocs changed the title json.Unmarshal isn't handling endashes in keys / struct tags correctly json.Unmarshal isn't recognizing endashes in keys / struct tags Oct 31, 2019
@FiloSottile FiloSottile changed the title json.Unmarshal isn't recognizing endashes in keys / struct tags encoding/json: Unmarshal isn't recognizing endashes in keys / struct tags Nov 5, 2019
@FiloSottile FiloSottile commented Nov 5, 2019

This is documented in the json.Marshal docs (referenced by the Unmarshal ones):

The key name will be used if it's a non-empty string consisting of only Unicode letters, digits, and ASCII punctuation except quotation marks, backslash, and comma.
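The same rule is visible on the encoding side. The sketch below marshals the struct from the report; `encode` is a hypothetical helper name, and the struct is copied from the issue description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SomeStruct mirrors the type from the report.
type SomeStruct struct {
	SomeString    string `json:"struct – tag"` // en dash, U+2013
	AnotherString string `json:"struct - tag"` // hyphen-minus, U+002D
}

// encode is a hypothetical helper that marshals s to a JSON string.
func encode(s SomeStruct) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := encode(SomeStruct{SomeString: "endash", AnotherString: "regular dash"})
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	// The en dash tag name is not used, so the key for SomeString
	// falls back to the Go field name.
	fmt.Println(out)
}
```

The output key for SomeString is the field name rather than the tag, which matches what Unmarshal expects on input.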

@FiloSottile FiloSottile closed this Nov 5, 2019