Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow full unicode range #687

Closed
andimarek opened this issue Feb 8, 2020 · 7 comments · Fixed by #849
Closed

allow full unicode range #687

andimarek opened this issue Feb 8, 2020 · 7 comments · Fixed by #849
Labels
💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md)

Comments

@andimarek
Copy link
Contributor

andimarek commented Feb 8, 2020

These are my proposed changes to the spec to allow for full unicode range (currently it is restricted to BMP code points. See SourceCharacter)

1. Change SourceCharacter to allow also code points between 0xFFFF and 0x10FFFF (outside of the BMP):

SourceCharacter ::
  - "U+0009"
  - "U+000A"
  - "U+000D"
  - "U+0020–U+10FFFF"

This does not cover all unicode code points: most of the Control Characters are not allowed. This is the same behavior as now and I don't see a reason to change it: the only places where Control Characters could be allowed are inside comments or String literals. Inside Strings you can escape them and inside comments they don't really make sense or you can easily work around it.

Changing it to allow for Control Characters to be included would also add an additional burden on systems processing GraphQL documents. Most importantly JSON also requires Control Characters to be escaped (https://tools.ietf.org/html/rfc8259#section-7).

2. Allow surrogate code pair escapes in standard quoted strings:

Currently standard quoted strings allow for BMP code points to be escaped. (Via \u<4-digit-hex-value>.) In order to align this with the SourceCharacter change above the spec should allow also code points outside of the BMP to be escaped. Surrogate Pairs are the most direct way to allow for that.

For example the unicode code point U+1F37A ( 🍺 ) which is outside of the BMP can be escaped as \ud83c\udf7a

There are other escapes sequences used for code points outside of the BMP. For example JS and others allow for \u{1F37A}. But this would introduce a new syntax. I argue that surrogate code pairs are the most compatible and simplest option. JSON for example understands surrogate code pairs but not \u{1F37A}.

One small open question is how illegal surrogate pairs should be handled:
For example \ud83c\u0020 or \uDEAD is such an illegal pair.

The JS spec says:

A code unit that is a leading surrogate or trailing surrogate, but is not part of a surrogate pair, is interpreted as a code point with the same value.

The JSON spec notes:

However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.

I would recommend to add a section that servers should try to reject illegal surrogate pairs if possible in order to avoid unexpected behavior.

Previous discussion

Previous Issue about it: #214
Previous PR which was not merged: #231

Please comment, leave feedback.

The JS PR for this change is here: graphql/graphql-js#2449

@wonderlandpark
Copy link

Any Updates for it? I really need this

@acao
Copy link
Member

acao commented Sep 6, 2020

@wonderlandpark given it's current stage, it warrants further discussion. If you want to advocate for this, you can get together with other proposal authors/advocates, ask if there are ways to help advance it, and most importantly, propose adding it to the next graphql working group agenda and attend the next monthly working group meeting!

@dkbarn
Copy link

dkbarn commented Apr 12, 2021

One point that I'm still confused about is the use of surrogate pairs for unicode characters outside the Basic Multilingual Plane. According to the JSON spec, it already encodes such characters using UTF-16 surrogates. It seems that it does this for historical reasons, for compatibility with older implementations of Javascript. (See: https://stackoverflow.com/a/38552626/6686740 )

If GraphQL is using JSON to serialize string literals, then it's already producing surrogates for characters outside the BMP. So in what respects is GraphQL not yet compatible with these characters?

An issue that I've hit with the Python graphql-core (see linked ticket above), is that json.dumps() is used to serialize string literals in print_ast(), but its counterpart json.loads() is not used when parsing the serialized string back into an AST. This means that the surrogate pairs written out by json.dumps() are left as surrogate pairs and do not get converted back into the original unicode character, which the JSON library would do.

Is this correct behavior? Is the idea that the GraphQL implementation itself is supposed to interpret the escaped characters produced by JSON serialization? Or should the JSON library be used in both directions?

leebyron added a commit that referenced this issue Apr 13, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.
leebyron added a commit that referenced this issue Apr 13, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 13, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
@leebyron
Copy link
Collaborator

If GraphQL is using JSON to serialize string literals, then it's already producing surrogates for characters outside the BMP. So in what respects is GraphQL not yet compatible with these characters?

The serialized data likely already is supporting non-BMP characters since it's typically directly translating internal string values to JSON.

The proposal here is for strings within the actual GraphQL document text, for example: { field(arg: "\ud83c\udf7a") } - in the current spec there is no defined behavior for this case (or actually more technically, current string semantics produces two invalid surrogate code points in the string - but in practice some underlying implementations may handle the surrogate pairs themselves).

leebyron added a commit that referenced this issue Apr 13, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 13, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 15, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 15, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 15, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 15, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Apr 15, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue May 18, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue May 27, 2021
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
@slaratte
Copy link

slaratte commented Jul 1, 2021

Hello there, I don't if this the right place, but I'm facing an issue with the current version of library where special characters are not being properly encoded, my application receives a lot of Spanish strings with ñ or other accented characters and those get lost between the client and the API. I just want to know if that would solve my issue. Thanks

@leebyron
Copy link
Collaborator

leebyron commented Jul 1, 2021

@slaratte that seems unrelated to this spec advancement. I'd check with an issue on whatever specific software or framework you happen to be using. It sounds like bad content encoding somewhere in your stack

@andimarek
Copy link
Contributor Author

andimarek commented Jan 10, 2022

This is actually implemented in the spec now.

leebyron added a commit that referenced this issue Jun 2, 2022
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Jun 2, 2022
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
leebyron added a commit that referenced this issue Jun 3, 2022
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)

Co-authored-by: Andreas Marek <andimarek@fastmail.fm>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants