
Comment about jq and binary #29

Closed

wader opened this issue Mar 30, 2023 · 3 comments

Comments

@wader

wader commented Mar 30, 2023

Hi! Not sure if you saw my comment about "Binary data in jq" https://fosstodon.org/@wader/110105732157084520. I think the issue is that JSON strings are not binary safe and that \u00e8 means Unicode code point 232 and not the byte 0xe8; converted to ISO-8859-1, however, it is encoded as that byte.
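
As a minimal sketch of this (using jq's -j raw, separator-free output and xxd for the hex dump), the escape \u00e8 comes back as the UTF-8 encoding of U+00E8 (0xc3 0xa8), not as the single byte 0xe8:

# the escape is parsed as code point U+00E8, then re-emitted as UTF-8
$ printf '%s' '["\u00e8"]' | jq -j '.[0]' | xxd
00000000: c3a8                                     ..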

@polettix
Owner

Hi! I think that the fosstodon server is semi-blocked from the server I'm on, so I didn't see your comment.

Thanks for the observation, because I think it made me realize one thing that was not clear to me and that will force me to reconsider one thing in Romeo.

TL;DR: expecting jq to read my mind and give me exactly the representation I'd like is foolish.

Now with the long part.

As I understand the matter (and I have to admit that I do not fully understand it), Unicode lives in its own universe made of characters, and sometimes we want to represent those characters as binary strings, chunked into bytes. This is where encoding comes in. When we want the characters back, we have to know which encoding was used, then apply its rules in reverse to get back to the characters.

This being the "main" use case for encoding (and decoding), it makes sense to ask whether the round-trip loses anything in the process or whether we can go there and back again.

In this case, though, I was attempting something different: start from a stream of bytes (not characters), pretend it encodes a very small subset of Unicode (the 256 code points 0x00 to 0xFF) and figure out a different representation for them, which is just another encoding with \uXXXX sequences inside. Then I am expecting to receive that initial "encoding" back from jq, but I get a different one instead.

This makes sense, because...

  • the encoding-to-encoding transformation at the beginning was not done by jq, nor does jq know anything about it
  • jq reads a stream of bytes and decodes it into a stream of characters
  • last, jq re-encodes the stream of characters to produce a stream of bytes (which are the "lingua franca" across processes). It chooses to be consistent and to produce this stream encoded in UTF-8.

So I see no fault in jq per se; I'm just pointing out that although we have a way to encode a byte stream into something that can be put into a string valid for JSON and jq, we are not going to get it back afterwards, so we have to look for encodings that are designed expressly and explicitly for representing binary data.
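
One such binary-safe representation is base64; here is a minimal sketch (assuming a shell printf that understands \x escapes and a base64 tool with a -d decode flag) where three arbitrary bytes survive the trip through a JSON string and jq:

# encode the bytes as base64 text, carry them through JSON, decode them back
$ printf '\xe8\x01\xff' | base64
6AH/
$ printf '["%s"]' "$(printf '\xe8\x01\xff' | base64)" | jq -r '.[0]' | base64 -d | xxd
00000000: e801 ff                                  ...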

Put another way, the JSON strings "𝄞" and "\uD834\uDD1E" are two different representations of the same string of one Unicode character. When jq reads them, it turns them into characters (whatever internal representation it might have, of course) and it's perfectly fine that it forgets what the initial representation was. When it gives the string back, it does so in another perfectly valid representation of that character, which happens to be the same as the first one:

$ printf '["%s"]' '𝄞' | jq .
[
  "𝄞"
]

$ printf '["%s"]' '\uD834\uDD1E' | jq .
[
  "𝄞"
]

So while the round-trip works for characters->bytes->characters, it does not necessarily work for bytes->characters->bytes.
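
For the failing direction, a minimal sketch (same printf \x assumption): the lone byte 0xe8 is not valid UTF-8, so jq replaces it on the way in and the original byte is gone on the way out:

$ printf '\xe8' | jq -Rsj . | xxd
00000000: efbf bd                                  ...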

@wader
Author

wader commented Mar 31, 2023

> Hi! I think that the fosstodon server is semi-blocked from the server I'm on, so I didn't see your comment.

Aha, no worries! That is good to know.

> Thanks for the observation, because I think it made me realize one thing that was not clear to me and that will force me to reconsider one thing in Romeo.
>
> TL;DR: expecting jq to read my mind and give me exactly the representation I'd like is foolish.
>
> Now with the long part.
>
> As I understand the matter (and I have to admit that I do not fully understand it), Unicode lives in its own universe made of characters, and sometimes we want to represent those characters as binary strings, chunked into bytes. This is where encoding comes in. When we want the characters back, we have to know which encoding was used, then apply its rules in reverse to get back to the characters.
>
> This being the "main" use case for encoding (and decoding), it makes sense to ask whether the round-trip loses anything in the process or whether we can go there and back again.

Yep, that is my understanding of the relation between Unicode and UTF/other encodings as well.

> In this case, though, I was attempting something different: start from a stream of bytes (not characters), pretend it encodes a very small subset of Unicode (the 256 code points 0x00 to 0xFF) and figure out a different representation for them, which is just another encoding with \uXXXX sequences inside. Then I am expecting to receive that initial "encoding" back from jq, but I get a different one instead.
>
> This makes sense, because...
>
>   • the encoding-to-encoding transformation at the beginning was not done by jq, nor does jq know anything about it
>   • jq reads a stream of bytes and decodes it into a stream of characters
>   • last, jq re-encodes the stream of characters to produce a stream of bytes (which are the "lingua franca" across processes). It chooses to be consistent and to produce this stream encoded in UTF-8.
>
> So I see no fault in jq per se; I'm just pointing out that although we have a way to encode a byte stream into something that can be put into a string valid for JSON and jq, we are not going to get it back afterwards, so we have to look for encodings that are designed expressly and explicitly for representing binary data.
>
> Put another way, the JSON strings "𝄞" and "\uD834\uDD1E" are two different representations of the same string of one Unicode character. When jq reads them, it turns them into characters (whatever internal representation it might have, of course) and it's perfectly fine that it forgets what the initial representation was. When it gives the string back, it does so in another perfectly valid representation of that character, which happens to be the same as the first one:
>
> $ printf '["%s"]' '𝄞' | jq .
> [
>   "𝄞"
> ]
>
> $ printf '["%s"]' '\uD834\uDD1E' | jq .
> [
>   "𝄞"
> ]
>
> So while the round-trip works for characters->bytes->characters, it does not necessarily work for bytes->characters->bytes.

Yes, jq is "stuck" with JSON's string representation. But I'm not sure what the UTF-8 specification says about invalid byte combinations; I guess it's up to each implementation? Out of curiosity I tried to decode every possible byte with jq to see how it behaves:

# using fq to generate a blob of 0-255 bytes
# -Rs is raw slurp: read the whole input as UTF-8 text instead of JSON
# -r is raw output: print the string as UTF-8 instead of as a JSON value
# -j uses no output separator, to skip the trailing newline
$ fq -n '[range(256)] | tobytes' | jq -Rsrj | xxd
00000000: 0001 0203 0405 0607 0809 0a0b 0c0d 0e0f  ................
00000010: 1011 1213 1415 1617 1819 1a1b 1c1d 1e1f  ................
00000020: 2021 2223 2425 2627 2829 2a2b 2c2d 2e2f   !"#$%&'()*+,-./
00000030: 3031 3233 3435 3637 3839 3a3b 3c3d 3e3f  0123456789:;<=>?
00000040: 4041 4243 4445 4647 4849 4a4b 4c4d 4e4f  @ABCDEFGHIJKLMNO
00000050: 5051 5253 5455 5657 5859 5a5b 5c5d 5e5f  PQRSTUVWXYZ[\]^_
00000060: 6061 6263 6465 6667 6869 6a6b 6c6d 6e6f  `abcdefghijklmno
00000070: 7071 7273 7475 7677 7879 7a7b 7c7d 7e7f  pqrstuvwxyz{|}~.
00000080: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000090: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
000000a0: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
000000b0: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
000000c0: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
000000d0: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
000000e0: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
000000f0: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000100: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
00000110: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000120: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000130: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
00000140: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000150: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000160: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
00000170: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000180: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000190: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
000001a0: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
000001b0: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
000001c0: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................
000001d0: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
000001e0: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
000001f0: bdef bfbd efbf bdef bfbd efbf bdef bfbd  ................

The invalid UTF-8 bytes (all of them?) get converted into the UTF-8 sequence 0xef 0xbf 0xbd, which is Unicode U+FFFD REPLACEMENT CHARACTER.
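
As a quick check of that (reusing the same fq pipeline as above), jq's explode builtin turns the string into its code point numbers; 65533 is U+FFFD, and since the bytes 0x80-0xff are all invalid UTF-8 on their own, the expected count is 128:

$ fq -n '[range(256)] | tobytes' | jq -Rs '[explode[] | select(. == 65533)] | length'
128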

And a round-trip of ISO-8859-1 through UTF-8 and back:

$ fq -n '[range(256)] | tobytes' | iconv -f iso-8859-1 -t utf-8 | jq -Rsjr | iconv -f utf-8 -t iso8859-1 | xxd
00000000: 0001 0203 0405 0607 0809 0a0b 0c0d 0e0f  ................
00000010: 1011 1213 1415 1617 1819 1a1b 1c1d 1e1f  ................
00000020: 2021 2223 2425 2627 2829 2a2b 2c2d 2e2f   !"#$%&'()*+,-./
00000030: 3031 3233 3435 3637 3839 3a3b 3c3d 3e3f  0123456789:;<=>?
00000040: 4041 4243 4445 4647 4849 4a4b 4c4d 4e4f  @ABCDEFGHIJKLMNO
00000050: 5051 5253 5455 5657 5859 5a5b 5c5d 5e5f  PQRSTUVWXYZ[\]^_
00000060: 6061 6263 6465 6667 6869 6a6b 6c6d 6e6f  `abcdefghijklmno
00000070: 7071 7273 7475 7677 7879 7a7b 7c7d 7e7f  pqrstuvwxyz{|}~.
00000080: 8081 8283 8485 8687 8889 8a8b 8c8d 8e8f  ................
00000090: 9091 9293 9495 9697 9899 9a9b 9c9d 9e9f  ................
000000a0: a0a1 a2a3 a4a5 a6a7 a8a9 aaab acad aeaf  ................
000000b0: b0b1 b2b3 b4b5 b6b7 b8b9 babb bcbd bebf  ................
000000c0: c0c1 c2c3 c4c5 c6c7 c8c9 cacb cccd cecf  ................
000000d0: d0d1 d2d3 d4d5 d6d7 d8d9 dadb dcdd dedf  ................
000000e0: e0e1 e2e3 e4e5 e6e7 e8e9 eaeb eced eeef  ................
000000f0: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff  ................
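
This round-trip is lossless because ISO-8859-1 maps byte n straight to code point n; a small sketch reusing the same pipeline, with jq's explode builtin, confirms that the code points jq sees are exactly the original byte values 0..255:

$ fq -n '[range(256)] | tobytes' | iconv -f iso-8859-1 -t utf-8 | jq -Rs 'explode == [range(256)]'
true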

Thanks for the long reply!

@wader
Author

wader commented Apr 2, 2023

Just read https://github.polettix.it/ETOOBUSY/2023/04/01/encoding-is-hard/. Well put, and thanks for the conversation!

@wader wader closed this as completed Apr 2, 2023