Unicode 8.0 defines u+1f917 but ... #2005

Closed
pkoppstein opened this issue Nov 3, 2019 · 1 comment

Comments

@pkoppstein
Contributor

Describe the bug

http://www.unicode-symbol.com/u/1F917.html gives details about 🤗 - hugging face (U+1F917).
jq can handle it in the sense that:

$ jq -n '"🤗"'
"🤗"

However jq seems to be quite confused about the details:

$ jq --version
jq-master-2e01ff1 # Release jq-1.6 of Nov 1, 2018

$ jq -n '"🤗" | explode'
[
129303
]

$ jq -n '[12903] | implode'
"㉧"

The following also does not look right:
$ jq -n '"\u1f917" | explode'
[
8081,
55
]
$ jq -n '[8081,55] | implode'
"ᾑ7"

Environment (please complete the following information):

  • OS and Version: macOS High Sierra
@wtlangford
Contributor

$ jq -n '[12903] | implode'
"㉧"

This should be:

$ jq -n '[129303] | implode'
"🤗"

The behavior of "\u1f917" is also correct, though surprising. The JSON spec (RFC 7159), section 7 (Strings), states:

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point

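🤗 (U+1F917) is outside the BMP, so "\u1f917" is parsed as the four-digit escape "\u1f91" (U+1F91, code point 8081) followed by a literal 7 (code point 55). Characters outside the BMP need a different escape format, which the RFC goes on to describe: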

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E"

Following the UTF-16 instructions for constructing the surrogate pair (it's the UTF-16BE format in your link) for U+1F917 gives us this pair: \ud83e\udd17.

$ jq -n '"\ud83e\udd17"'
"🤗"
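
For anyone who wants to check the arithmetic, here is a minimal sketch in Python (not jq) of the surrogate-pair construction described above. The helper names `to_surrogate_pair` and `json_escape` are made up for illustration and are not part of jq or any library. Only the standard `json` module is used, to confirm that a JSON parser decodes both escapes the same way jq does.

```python
import json
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def json_escape(code_point: int) -> str:
    """Return the JSON escape text for a code point (hypothetical helper)."""
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04x}"
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04x}\\u{low:04x}"


escape = json_escape(0x1F917)
print(escape)  # \ud83e\udd17

# A JSON parser joins the surrogate pair back into one code point.
decoded = json.loads(f'"{escape}"')
print([ord(c) for c in decoded])  # [129303]

# With five hex digits, only the first four belong to the escape.
short = json.loads('"\\u1f917"')
print([ord(c) for c in short])  # [8081, 55]
```

The last two results match what `explode` reports in jq: 129303 for the surrogate pair, and 8081 followed by 55 for the five-digit form.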
