allow full unicode range #687
Comments
Any updates on this? I really need it.
@wonderlandpark given its current stage, it warrants further discussion. If you want to advocate for this, you can get together with other proposal authors/advocates, ask if there are ways to help advance it, and most importantly, propose adding it to the next GraphQL working group agenda and attend the next monthly working group meeting!
One point that I'm still confused about is the use of surrogate pairs for Unicode characters outside the Basic Multilingual Plane. According to the JSON spec, JSON already encodes such characters using UTF-16 surrogates. It seems to do this for historical reasons, for compatibility with older implementations of JavaScript. (See: https://stackoverflow.com/a/38552626/6686740 ) If GraphQL is using JSON to serialize string literals, then it's already producing surrogates for characters outside the BMP. So in what respects is GraphQL not yet compatible with these characters?
An issue that I've hit with the Python graphql-core (see linked ticket above), is that
Is this correct behavior? Is the idea that the GraphQL implementation itself is supposed to interpret the escaped characters produced by JSON serialization? Or should the JSON library be used in both directions?
This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes:
1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.
(As a bonus, this removes the last instance of a regex in the lexer grammar!)
The serialized data likely already supports non-BMP characters, since it typically translates internal string values directly to JSON. The proposal here is for strings within the actual GraphQL document text, for example:
Hello there, I don't know if this is the right place, but I'm facing an issue with the current version of the library where special characters are not being properly encoded. My application receives a lot of Spanish strings with ñ or other accented characters, and those get lost between the client and the API. I just want to know if this would solve my issue. Thanks
@slaratte that seems unrelated to this spec advancement. I'd check with an issue on whatever specific software or framework you happen to be using. It sounds like bad content encoding somewhere in your stack.
This is actually implemented in the spec now.
This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes: 1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF. 2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error. 3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well. (As a bonus, this removes the last instance of a regex in the lexer grammar!) Co-authored-by: Andreas Marek <andimarek@fastmail.fm>
These are my proposed changes to the spec to allow for the full Unicode range (currently it is restricted to BMP code points; see SourceCharacter).
1. Change SourceCharacter to also allow code points above 0xFFFF, up to 0x10FFFF (outside of the BMP):
This does not cover all unicode code points: most of the Control Characters are not allowed. This is the same behavior as now and I don't see a reason to change it: the only places where Control Characters could be allowed are inside comments or String literals. Inside Strings you can escape them and inside comments they don't really make sense or you can easily work around it.
Changing it to allow for Control Characters to be included would also add an additional burden on systems processing GraphQL documents. Most importantly JSON also requires Control Characters to be escaped (https://tools.ietf.org/html/rfc8259#section-7).
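To make the first change concrete, here is a minimal sketch of the check a lexer might perform. It assumes the BMP-only rule is tab, line feed, carriage return, and U+0020 through U+FFFF, and that the proposal only raises the upper bound to U+10FFFF. The function name `is_source_character` and the `full_range` flag are made up for this illustration and are not taken from any GraphQL implementation.

```python
def is_source_character(code_point: int, full_range: bool = True) -> bool:
    """Hypothetical helper: is this code point allowed in a GraphQL document?

    full_range=False models the BMP-only rule (assumed: tab, LF, CR,
    U+0020..U+FFFF). full_range=True models the proposed extension,
    which only raises the upper bound to U+10FFFF.
    """
    # Tab, line feed and carriage return are the only control characters
    # below U+0020 that are allowed under either rule.
    if code_point in (0x0009, 0x000A, 0x000D):
        return True
    upper_bound = 0x10FFFF if full_range else 0xFFFF
    return 0x0020 <= code_point <= upper_bound


if __name__ == "__main__":
    beer = ord("\U0001F37A")
    print(is_source_character(beer, full_range=False))
    print(is_source_character(beer, full_range=True))
```

Under this sketch the beer mug U+1F37A is rejected by the BMP-only rule and accepted by the extended one, while a control character such as U+0000 is rejected either way.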
2. Allow surrogate code pair escapes in standard quoted strings:
Currently standard quoted strings allow for BMP code points to be escaped (via `\u<4-digit-hex-value>`). In order to align this with the `SourceCharacter` change above, the spec should also allow code points outside of the BMP to be escaped. Surrogate pairs are the most direct way to allow for that. For example, the Unicode code point U+1F37A (🍺), which is outside of the BMP, can be escaped as `\ud83c\udf7a`.
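For readers unfamiliar with where `\ud83c\udf7a` comes from, the following sketch applies the standard UTF-16 surrogate arithmetic to a code point outside the BMP. The helper names `to_surrogate_pair` and `escape_non_bmp` are invented for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def escape_non_bmp(code_point: int) -> str:
    """Render the pair as two 4-digit \\uXXXX escapes (hypothetical helper)."""
    high, low = to_surrogate_pair(code_point)
    return "\\u{:04x}\\u{:04x}".format(high, low)


if __name__ == "__main__":
    print(escape_non_bmp(0x1F37A))  # \ud83c\udf7a
```

For U+1F37A the offset is 0xF37A, which splits into 0x3C and 0x37A, giving the high surrogate 0xD83C and the low surrogate 0xDF7A.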
There are other escape sequences used for code points outside of the BMP. For example, JS and others allow `\u{1F37A}`. But this would introduce a new syntax. I argue that surrogate code pairs are the most compatible and simplest option. JSON, for example, understands surrogate code pairs but not `\u{1F37A}`.
One small open question is how illegal surrogate pairs should be handled:
For example, `\ud83c\u0020` or `\uDEAD` is such an illegal pair.
The JS spec says:
The JSON spec notes:
I would recommend adding a section stating that servers should try to reject illegal surrogate pairs if possible, in order to avoid unexpected behavior.
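As an illustration of what rejecting illegal pairs could look like, here is a rough sketch of a decoder for `\uXXXX` escapes that combines valid surrogate pairs and raises on lone or mismatched surrogates. It is not taken from graphql-js or any other implementation; the function name `decode_unicode_escapes` is hypothetical, and the sketch deliberately ignores other escapes such as `\\` or `\n`.

```python
import re

# Matches one 4-digit escape such as \u0041 at a given position.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Decode \\uXXXX escapes, rejecting illegal surrogate pairs (sketch only)."""
    out = []
    pos = 0
    while pos < len(text):
        match = _ESCAPE.match(text, pos)
        if match is None:
            # Not an escape: copy the character through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        unit = int(match.group(1), 16)
        pos = match.end()
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next escape must be a low surrogate.
            trail = _ESCAPE.match(text, pos)
            low = int(trail.group(1), 16) if trail else None
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(
                    "high surrogate U+{:04X} is not followed by a low surrogate".format(unit)
                )
            out.append(chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)))
            pos = trail.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError("lone low surrogate U+{:04X}".format(unit))
        else:
            out.append(chr(unit))
    return "".join(out)


if __name__ == "__main__":
    print(decode_unicode_escapes(r"\ud83c\udf7a") == "\U0001F37A")
```

With this sketch, `\ud83c\udf7a` decodes to U+1F37A, while both `\ud83c\u0020` and `\uDEAD` raise a `ValueError` instead of producing an ill-formed string.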
Previous discussion
Previous Issue about it: #214
Previous PR which was not merged: #231
Please comment, leave feedback.
The JS PR for this change is here: graphql/graphql-js#2449