encoding/json: unexpected handling of non-UTF-8 input #16282

Closed
adityats opened this issue Jul 6, 2016 · 11 comments

@adityats
commented Jul 6, 2016

  1. What version of Go are you using (go version)?
    1.6.2
  2. What operating system and processor architecture are you using (go env)?
    GOARCH="amd64"
    GOBIN=""
    GOEXE=""
    GOHOSTARCH="amd64"
    GOHOSTOS="linux"
    GOOS="linux"
    GOPATH="*****/src/avi/go"
    GORACE=""
    GOROOT="/usr/local/go"
    GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
    GO15VENDOREXPERIMENT="1"
    CC="gcc"
    GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
    CXX="g++"
    CGO_ENABLED="1"
  3. What did you do?
    I am describing the problem at a high level; if need be, I will provide more details (which are complicated and hard to list here).

I have a protobuf called Pool (it is in a file called pool.pb.go).
It is stored like this: the marshalled Pool protobuf is wrapped as
{"obj": <serialized/marshalled string of the Pool protobuf>} and this wrapper is serialized and stored as one big JSON blob:

"{"obj":"\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\u0012\ttest-pool(P@\u0001H\nP\u0000Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\u0011\n\r10.160.161.11\u0010\u0000\u0010P\u001A\r10.160.161.11 \u0001(\u0001X\u0000h\u0000r\u0000\u0080\u0001\u0000h\u0001\u00A0\u0001\u0001\u00B0\u0002\u0001\u00B8\u0002\u0000\u00C2\u0002\u0002\b\u0003\u00D0\u0002\u0000\u00F0\u0002\u0000\u00A2\u0003/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab\u00B0\u0003\n\u00B8\u0003\u0001\u00E8\u0003\u0000\u00F0\u0003\u0080\u0001\u0082\u0004\u0006\b\u0000\u0018\u0004 \u0000\u0098\u0004\u0000\u00A8\u0004\u0001\u00A2\u0006\u0005admin\u00AA\u0006*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\u0082\u00A6\u001D\npool.proto\u008A\u00A6\u001D\u0004Pool"}"

  4. What did you expect to see?
    When I unmarshal the above blob, I expect to see this:

    "obj": "\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\x12\ttest-pool(P@\x01H\nP\x00Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\x11\n\r10.160.161.11\x10\x00\x10P\x1a\r10.160.161.11 \x01(\x01X\x00h\x00r\x00\x80\x01\x00h\x01\xa0\x01\x01\xb0\x02\x01\xb8\x02\x00\xc2\x02\x02\x08\x03\xd0\x02\x00\xf0\x02\x00\xa2\x03/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab\xb0\x03\n\xb8\x03\x01\xe8\x03\x00\xf0\x03\x80\x01\x82\x04\x06\x08\x00\x18\x04 \x00\x98\x04\x00\xa8\x04\x01\xa2\x06\x05admin\xaa\x06*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\x82\xa6\x1d\npool.proto\x8a\xa6\x1d\x04Pool"

Here every \u escape is decoded to the corresponding hex (\x) byte.

  5. What did you see instead?

"obj": "\n)pool-6b9bad47-65ac-477c-a32a-6fd5c7a98c91\x12\ttest-pool(P@\x01H\nP\x00Z2healthmonitor-665dbe8c-aecb-49df-97b1-a87214e65270Z2healthmonitor-f6052e41-616d-46d6-989d-0fdfe0c5a403b1\n\x11\n\r10.160.161.11\x10\x00\x10P\x1a\r10.160.161.11 \x01(\x01X\x00h\x00r\x00\u0080\x01\x00h\x01\u00a0\x01\x01°\x02\x01¸\x02\x00Â\x02\x02\b\x03Ð\x02\x00ð\x02\x00¢\x03/vrfcontext-bfec9a97-7df7-4cfd-bd6d-ee84415ed1ab°\x03\n¸\x03\x01è\x03\x00ð\x03\u0080\x01\u0082\x04\x06\b\x00\x18\x04 \x00\u0098\x04\x00¨\x04\x01¢\x06\x05adminª\x06*cloud-2c2c062a-9d10-4d3b-a2e5-6ac9af08a4c4\u0082¦\x1d\npool.proto\u008a¦\x1d\x04Pool"

Only Unicode characters less than \u0080 are decoded to the corresponding hex escapes. The rest are kept as they are (look for \u in the above string; in the correct case all of these would be decoded to the relevant hex values). It looks as if the Unmarshal function in encoding/json decodes an escape only if the value fits in 2 bytes?

@ianlancetaylor

Contributor

commented Jul 6, 2016

Can you show us a complete standalone program that demonstrates the error? It's fine to start with a constant string, ideally one shorter than the ones above.

@ianlancetaylor changed the title from "Protobuf unmarshalling fails with unexpected EOF" to "encoding/json: unexpected handling of Unicode characters" Jul 6, 2016

@ianlancetaylor added this to the Go1.8Maybe milestone Jul 6, 2016

@adityats

Author

commented Jul 7, 2016

Here's a rudimentary program that captures the issue I was talking about - https://play.golang.org/p/uFcs2-0yEr

@ianlancetaylor

Contributor

commented Jul 7, 2016

Thanks for the test case. When I run that program, I get
"ª¨"
What should it print instead?

@adityats

Author

commented Jul 7, 2016

I cannot reproduce the exact issue with this program, but the output should roughly be "\xaa\xa8", because in the actual use case I unmarshal the JSON first (please refer to the dumps in my earlier comment) and protobuf decodes the result afterwards. Since it resolves to the characters you gave instead, the protobuf Unmarshal bails out.

@adityats closed this Jul 7, 2016

@adityats reopened this Jul 7, 2016

@ianlancetaylor

Contributor

commented Jul 7, 2016

In your test program, the resulting string (out.Obj) == "\u00AA\u00A8". That makes sense to me. I don't see why it should be "\xaa\xa8". The input characters are handled as Unicode characters represented in UTF-8.
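The playground program itself is not reproduced in this thread, so the following is only a minimal sketch of the behaviour described above. The `payload` type and the `decodeObj` helper are hypothetical names chosen for illustration. It decodes the same two escapes, compares the result with both candidate strings, and prints it with different `fmt` verbs, which also suggests why the dumps in the report mix `\u0080`-style escapes with literal characters such as `°`: `%q` escapes only non-printable code points.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload mirrors the {"obj": ...} wrapper from the report.
// The type name is hypothetical.
type payload struct {
	Obj string `json:"obj"`
}

// decodeObj unmarshals a JSON document and returns the decoded "obj" string.
func decodeObj(doc string) (string, error) {
	var p payload
	if err := json.Unmarshal([]byte(doc), &p); err != nil {
		return "", err
	}
	return p.Obj, nil
}

func main() {
	// The raw string keeps the backslashes, so these are JSON \u escapes.
	s, err := decodeObj(`{"obj":"\u00AA\u00A8"}`)
	if err != nil {
		panic(err)
	}

	fmt.Println(s == "\u00AA\u00A8") // true: two code points, stored as UTF-8
	fmt.Println(s == "\xaa\xa8")     // false: that literal is two raw bytes

	fmt.Printf("%q\n", s)  // "ª¨"             (printable runes shown as-is)
	fmt.Printf("%+q\n", s) // "\u00aa\u00a8"   (ASCII-only quoting)
	fmt.Printf("% x\n", s) // c2 aa c2 a8      (the bytes actually stored)

	// U+0080 is a control character, so %q escapes it; U+00B0 is printable.
	t, err := decodeObj(`{"obj":"\u0080\u00B0"}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", t) // "\u0080°"
}
```

In other words, the decoded string is the same in every case; only the way it is printed differs.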

@robpike

Contributor

commented Jul 7, 2016

https://play.golang.org/p/JWgGUhzYBj (which is the same except for the output format) shows exactly what I would expect the program to produce.

@adityats

Author

commented Jul 7, 2016

Thanks for getting back. The example above was a very basic one and did not capture the issue I reported at first. Here is a snippet from my earlier comment:

\u0018\u0004 \u0000\u0098\u0004\u0000\u00A8: when we unmarshal this string, it should ideally resolve to \x18\x04 \x00\x98\x04\x00\xa8. Instead it is converted only partially, to \x18\x04 \x00\u0098\x04\x00¨, where anything less than \u0080 is converted to the corresponding hex escape but anything higher is preserved as is (\u0098 is preserved in the final output and ideally should have been \x98). I suspect the unmarshalling library only converts 2-byte values to hex and leaves anything bigger as is.

Sorry, I am not sure what else I can provide to reproduce this apart from these details.

@robpike

Contributor

commented Jul 7, 2016

Please provide a complete working example. Without one it's difficult to understand the problem you are seeing.

I trust you know that \u0098 is a Unicode character that when stored as text will not be the bytes \x00\x98 but instead the UTF-8 encoding \xc2\x98. When you say "ideally" \u0098 will become \x00\x98 you appear to be missing this point.

To put it another way, in a string \uNNNN does not represent a byte, it represents the byte sequence necessary to represent the Unicode code point NNNN.
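A short sketch of that distinction in Go. It also includes `latin1ToBytes`, a hypothetical helper that is not part of any library: it assumes the producer of the JSON wrote each raw byte as the code point with the same value (the report does not confirm this), and under that assumption it maps the decoded code points back to single bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToBytes is a hypothetical helper. It assumes every code point in s
// stands for one original byte of the same value (0x00 to 0xFF) and fails
// on anything larger, including U+FFFD produced by invalid UTF-8.
func latin1ToBytes(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("code point %U does not fit in one byte", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	cp := "\u0098" // the code point U+0098
	fmt.Printf("% x\n", cp)           // c2 98: its UTF-8 encoding
	fmt.Println(utf8.ValidString(cp)) // true

	raw := "\x98" // a single byte, which is not valid UTF-8 on its own
	fmt.Printf("% x\n", raw)           // 98
	fmt.Println(utf8.ValidString(raw)) // false

	// The snippet quoted earlier in the thread, as decoded code points.
	decoded := "\u0018\u0004 \u0000\u0098\u0004\u0000\u00A8"
	b, err := latin1ToBytes(decoded)
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b) // 18 04 20 00 98 04 00 a8
}
```

Such a conversion only works if the writer really used that byte-to-code-point mapping; it is a workaround, not something encoding/json does.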

@adityats

Author

commented Jul 7, 2016

I understand the UTF-8 encoding, but in the working case I do not see it being applied. In the example above, \u0098 apparently resolves to \x98, and when I send this data to proto.Unmarshal it decodes correctly to the intended protobuf, hence the confusion.

I will try to get an isolated example with the actual protobuf I am using.

@rsc

Contributor

commented Oct 20, 2016

JSON is not appropriate for encoding arbitrary binary data. You need to do something else to it first, like base64 encode it.
JSON only encodes Unicode data: it has no \x, only \u. Invalid UTF-8 will not survive the round trip into \u. See http://www.json.org for the definition.
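A minimal sketch of the round trip mentioned above, assuming the binary data is held in a Go `string`. The `roundTrip` helper is a hypothetical name. With encoding/json, a byte that is not valid UTF-8 is replaced by U+FFFD when marshalling, so the original byte cannot be recovered.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip marshals s to JSON and unmarshals it back into a string.
func roundTrip(s string) (string, error) {
	data, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(data, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	in := "\x98\x04" // 0x98 alone is not valid UTF-8
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+q\n", out) // "\ufffd\x04": the invalid byte became U+FFFD
	fmt.Println(in == out)   // false

	// A real code point survives, because it is valid UTF-8.
	ok, err := roundTrip("\u0098")
	if err != nil {
		panic(err)
	}
	fmt.Println(ok == "\u0098") // true
}
```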

@rsc changed the title from "encoding/json: unexpected handling of Unicode characters" to "encoding/json: unexpected handling of non-UTF-8 input" Oct 20, 2016

@rsc closed this Oct 20, 2016

@rsc

Contributor

commented Oct 20, 2016

Actually, if you make your field have type []byte instead of type string, json will do the base64 for you.
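A minimal sketch of that suggestion. The `blob` type and the `encode`/`decode` helpers are hypothetical names, and the sample bytes are taken from the snippet quoted earlier in the thread. With a `[]byte` field, encoding/json writes the value as a base64 string and decodes it back to the same bytes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// blob is a hypothetical wrapper whose payload is raw bytes, not text.
type blob struct {
	Obj []byte `json:"obj"`
}

// encode wraps raw bytes in the JSON document; json base64-encodes them.
func encode(raw []byte) (string, error) {
	data, err := json.Marshal(blob{Obj: raw})
	return string(data), err
}

// decode extracts the raw bytes from the JSON document again.
func decode(doc string) ([]byte, error) {
	var b blob
	if err := json.Unmarshal([]byte(doc), &b); err != nil {
		return nil, err
	}
	return b.Obj, nil
}

func main() {
	raw := []byte("\x18\x04 \x00\x98\x04\x00\xa8")

	doc, err := encode(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc) // {"obj":"GAQgAJgEAKg="}

	back, err := decode(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(raw, back)) // true: every byte survives
}
```

The bytes returned by `decode` could then be passed to the protobuf unmarshaller unchanged, provided the writer of the JSON uses the same base64 representation.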

@golang locked and limited conversation to collaborators Oct 20, 2017
