allow full unicode range #687

andimarek · 2020-02-08T05:29:59Z

These are my proposed changes to the spec to allow for full unicode range (currently it is restricted to BMP code points. See SourceCharacter)

1. Change SourceCharacter to allow also code points between 0xFFFF and 0x10FFFF (outside of the BMP):

SourceCharacter ::
  - "U+0009"
  - "U+000A"
  - "U+000D"
  - "U+0020–U+10FFFF"

This does not cover all unicode code points: most of the Control Characters are not allowed. This is the same behavior as now and I don't see a reason to change it: the only places where Control Characters could be allowed are inside comments or String literals. Inside Strings you can escape them and inside comments they don't really make sense or you can easily work around it.

Changing it to allow for Control Characters to be included would also add an additional burden on systems processing GraphQL documents. Most importantly JSON also requires Control Characters to be escaped (https://tools.ietf.org/html/rfc8259#section-7).

2. Allow surrogate code pair escapes in standard quoted strings:

Currently standard quoted strings allow for BMP code points to be escaped. (Via \u<4-digit-hex-value>.) In order to align this with the SourceCharacter change above the spec should allow also code points outside of the BMP to be escaped. Surrogate Pairs are the most direct way to allow for that.

For example the unicode code point U+1F37A ( 🍺 ) which is outside of the BMP can be escaped as \ud83c\udf7a

There are other escapes sequences used for code points outside of the BMP. For example JS and others allow for \u{1F37A}. But this would introduce a new syntax. I argue that surrogate code pairs are the most compatible and simplest option. JSON for example understands surrogate code pairs but not \u{1F37A}.

One small open question is how illegal surrogate pairs should be handled:
For example \ud83c\u0020 or \uDEAD is such an illegal pair.

The JS spec says:

A code unit that is a leading surrogate or trailing surrogate, but is not part of a surrogate pair, is interpreted as a code point with the same value.

The JSON spec notes:

However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.

I would recommend to add a section that servers should try to reject illegal surrogate pairs if possible in order to avoid unexpected behavior.

Previous discussion

Previous Issue about it: #214
Previous PR which was not merged: #231

Please comment, leave feedback.

The JS PR for this change is here: graphql/graphql-js#2449

The text was updated successfully, but these errors were encountered:

wonderlandpark · 2020-09-05T02:08:46Z

Any Updates for it? I really need this

acao · 2020-09-06T15:14:20Z

@wonderlandpark given it's current stage, it warrants further discussion. If you want to advocate for this, you can get together with other proposal authors/advocates, ask if there are ways to help advance it, and most importantly, propose adding it to the next graphql working group agenda and attend the next monthly working group meeting!

dkbarn · 2021-04-12T18:34:38Z

One point that I'm still confused about is the use of surrogate pairs for unicode characters outside the Basic Multilingual Plane. According to the JSON spec, it already encodes such characters using UTF-16 surrogates. It seems that it does this for historical reasons, for compatibility with older implementations of Javascript. (See: https://stackoverflow.com/a/38552626/6686740 )

If GraphQL is using JSON to serialize string literals, then it's already producing surrogates for characters outside the BMP. So in what respects is GraphQL not yet compatible with these characters?

An issue that I've hit with the Python graphql-core (see linked ticket above), is that json.dumps() is used to serialize string literals in print_ast(), but its counterpart json.loads() is not used when parsing the serialized string back into an AST. This means that the surrogate pairs written out by json.dumps() are left as surrogate pairs and do not get converted back into the original unicode character, which the JSON library would do.

Is this correct behavior? Is the idea that the GraphQL implementation itself is supposed to interpret the escaped characters produced by JSON serialization? Or should the JSON library be used in both directions?

This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well. (As a bonus, this removes the last instance of a regex in the lexer grammar!)

leebyron · 2021-04-13T17:22:42Z

If GraphQL is using JSON to serialize string literals, then it's already producing surrogates for characters outside the BMP. So in what respects is GraphQL not yet compatible with these characters?

The serialized data likely already is supporting non-BMP characters since it's typically directly translating internal string values to JSON.

The proposal here is for strings within the actual GraphQL document text, for example: { field(arg: "\ud83c\udf7a") } - in the current spec there is no defined behavior for this case (or actually more technically, current string semantics produces two invalid surrogate code points in the string - but in practice some underlying implementations may handle the surrogate pairs themselves).

This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well. (As a bonus, this removes the last instance of a regex in the lexer grammar!)

slaratte · 2021-07-01T03:25:44Z

Hello there, I don't if this the right place, but I'm facing an issue with the current version of library where special characters are not being properly encoded, my application receives a lot of Spanish strings with ñ or other accented characters and those get lost between the client and the API. I just want to know if that would solve my issue. Thanks

leebyron · 2021-07-01T04:08:23Z

@slaratte that seems unrelated to this spec advancement. I'd check with an issue on whatever specific software or framework you happen to be using. It sounds like bad content encoding somewhere in your stack

andimarek · 2022-01-10T00:43:04Z

This is actually implemented in the spec now.

This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well. (As a bonus, this removes the last instance of a regex in the lexer grammar!)

This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well. (As a bonus, this removes the last instance of a regex in the lexer grammar!) Co-authored-by: Andreas Marek <andimarek@fastmail.fm>

This was referenced Feb 8, 2020

Allow unicode characters outside the Basic Multilingual Plane #214

Closed

clarification about unicode encoding #675

Closed

add full unicode support graphql/graphql-js#2449

Closed

leebyron added the 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) label Mar 5, 2020

dotansimha mentioned this issue Apr 20, 2020

Parameters with non-latin names getting lost ardatan/graphql-mesh#259

Closed

Cito mentioned this issue Apr 10, 2021

Unicode characters get transformed into surrogate pairs by graphql.print_ast() graphql-python/graphql-core#128

Closed

leebyron mentioned this issue Apr 13, 2021

RFC: Allow full unicode range #849

Merged

dondonz mentioned this issue Jun 28, 2021

Support full unicode in parser graphql-java/graphql-java#2404

Merged

andimarek closed this as completed Jan 10, 2022

saihaj mentioned this issue Jul 7, 2023

请问字段可否支持多语言 Can the field support multiple languages? graphql/graphql-js#3933

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow full unicode range #687

allow full unicode range #687

andimarek commented Feb 8, 2020 •

edited

Loading

wonderlandpark commented Sep 5, 2020

acao commented Sep 6, 2020 •

edited

Loading

dkbarn commented Apr 12, 2021

leebyron commented Apr 13, 2021

slaratte commented Jul 1, 2021

leebyron commented Jul 1, 2021

andimarek commented Jan 10, 2022 •

edited

Loading

allow full unicode range #687

allow full unicode range #687

Comments

andimarek commented Feb 8, 2020 • edited Loading

1. Change SourceCharacter to allow also code points between 0xFFFF and 0x10FFFF (outside of the BMP):

2. Allow surrogate code pair escapes in standard quoted strings:

Previous discussion

wonderlandpark commented Sep 5, 2020

acao commented Sep 6, 2020 • edited Loading

dkbarn commented Apr 12, 2021

leebyron commented Apr 13, 2021

slaratte commented Jul 1, 2021

leebyron commented Jul 1, 2021

andimarek commented Jan 10, 2022 • edited Loading

andimarek commented Feb 8, 2020 •

edited

Loading

acao commented Sep 6, 2020 •

edited

Loading

andimarek commented Jan 10, 2022 •

edited

Loading