Unicode Codepoint Escape Syntax #918

Merged
merged 1 commit into from Dec 19, 2014

Conversation

hikari-no-yume
Contributor

@masakielastic
Contributor

The range between U+D800 and U+DFFF should be invalid.

RFC 3629

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

http://tools.ietf.org/html/rfc3629

The Unicode Standard 7.0, Chapter 3

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.

D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7.

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF       80..BF
U+0800..U+0FFF        E0           A0..BF        80..BF
U+1000..U+CFFF        E1..EC       80..BF        80..BF
U+D000..U+D7FF        ED           80..9F        80..BF
U+E000..U+FFFF        EE..EF       80..BF        80..BF
U+10000..U+3FFFF      F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF      F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF    F4           80..8F        80..BF       80..BF

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf
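
As a rough illustration of the table above, the sketch below encodes a code point with the usual UTF-8 bit layout and separately reports whether it is a Unicode scalar value. The function names `is_scalar_value` and `encode_utf8` are made up for this example; this is not the code from the patch.

```php
<?php
// Hypothetical helpers for illustration only; not taken from the patch.

// True for code points in 0..D7FF or E000..10FFFF (definition D76).
function is_scalar_value(int $cp): bool
{
    return ($cp >= 0 && $cp <= 0xD7FF) || ($cp >= 0xE000 && $cp <= 0x10FFFF);
}

// Encode a code point using the UTF-8 bit layout. Surrogates are NOT
// rejected here, so the caller decides whether to allow them.
function encode_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp <= 0x7F) {
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encode_utf8(0x2603)), "\n";  // e29883
echo bin2hex(encode_utf8(0x1F602)), "\n"; // f09f9882
var_dump(is_scalar_value(0xD800));        // bool(false)
```

Keeping the scalar-value check separate from the encoder means the same encoder can serve either policy: reject surrogates up front, or let them through.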

@mathiasbynens

@masakielastic It really depends. It’s true that lone surrogates are technically invalid, and UTF-8 doesn’t allow them to be encoded, but some languages (like JavaScript and JSON) still allow lone surrogates in strings. I could imagine a PHP script that outputs JSON that includes strings containing lone surrogates in some testing scenarios (although that may be enough of an edge case to ignore it).

Essentially this boils down to the question: should we use UTF-8 or WTF-8? (I don’t have a strong opinion on this.)

@hikari-no-yume
Contributor Author

I don't think we should disallow producing UTF-8-encoded surrogates; there might be legitimate applications for this.

@hikari-no-yume
Contributor Author

But if we are to allow them (it'd make CESU-8 handling easier, heh), that'd need to have a test and be in the language spec.
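
To make the CESU-8 remark concrete: CESU-8 represents a supplementary code point as its UTF-16 surrogate pair, with each surrogate written as its own three-byte sequence. Assuming the escape accepts surrogates (which is the option being discussed here), the two forms of U+1F602 would differ roughly as sketched below. `surrogate_pair` is a hypothetical helper, and the snippet needs a PHP version that supports the `\u{...}` syntax.

```php
<?php
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogates.
function surrogate_pair(int $cp): array
{
    $offset = $cp - 0x10000;
    return [0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF)];
}

[$high, $low] = surrogate_pair(0x1F602);
printf("%04X %04X\n", $high, $low);      // D83D DE02

// Standard UTF-8: one four-byte sequence.
echo bin2hex("\u{1F602}"), "\n";         // f09f9882

// CESU-8 style: each surrogate written as its own three-byte sequence.
// This only works if the escape does not reject surrogates.
echo bin2hex("\u{D83D}\u{DE02}"), "\n";  // eda0bdedb882
```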

var_dump("\u{FF}"); // y with diaeresis
var_dump("\u{ff}"); // case-insensitive
var_dump("\u{2603}"); // Unicode snowman
var_dump("\u{1F602}"); // FACE WITH TEARS OF JOY emoji

Let’s add a test for leading zeroes? E.g. \u{0000000000001F602}.

Contributor Author

Done!

Member

Please also add tests to make sure that \u{+123} and \u{-123} are invalid. I suspect that right now the former will be accepted, while the latter will be rejected with a non-obvious error message.

Member

Never mind, those are rejected as you're validating the string before calling strtol. A test wouldn't hurt though :)

Contributor Author

Now that you mention it, I ought to use strtoul not strtol.

Contributor Author

Alright, done :)
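
The point about validating before conversion can be sketched in PHP. C's `strtol` accepts optional leading whitespace and a `+` or `-` sign, so checking that the text between the braces consists only of hex digits is what rules out `\u{+123}` and `\u{-123}`. The function `parse_codepoint_escape` below is hypothetical and only mirrors the idea; the real check lives in the C lexer and may differ in detail.

```php
<?php
// Hypothetical sketch of the validation order discussed above.
// Returns the code point, or null if the escape body is invalid.
function parse_codepoint_escape(string $body): ?int
{
    // Only hex digits are allowed: no sign, no whitespace, not empty.
    if (preg_match('/\A[0-9A-Fa-f]+\z/', $body) !== 1) {
        return null;
    }
    // Leading zeroes carry no value; drop them before the length check.
    $digits = ltrim($body, '0');
    if (strlen($digits) > 6) {
        return null; // more than six significant digits exceeds U+10FFFF
    }
    $cp = (int) hexdec($digits === '' ? '0' : $digits);
    return $cp <= 0x10FFFF ? $cp : null;
}

var_dump(parse_codepoint_escape('0000000000001F602')); // int(128514)
var_dump(parse_codepoint_escape('+123'));              // NULL
var_dump(parse_codepoint_escape('-123'));              // NULL
var_dump(parse_codepoint_escape('110000'));            // NULL
```

Stripping leading zeroes before the length check lets arbitrarily long zero-padded input through while still rejecting values that could overflow the conversion.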

@smalyshev smalyshev added the RFC label Nov 28, 2014
@rquadling
Contributor

Just a bit of info ... http://graphemica.com/%F0%9F%98%82 ... we are matching Ruby.

@hikari-no-yume
Contributor Author

Ooh, I didn't know Ruby did that. I'll add that to the RFC.

@mathiasbynens

@rquadling Graphemica isn’t the most reliable source – it displays the escape sequences for JavaScript, JSON, C, C++, Java, Python incorrectly. Better link (IMHO): http://codepoints.net/U+1F602

@mathiasbynens

Looks like Ruby supports multiple code points between the braces (e.g. \u{20AC A3 A5}) so unlike ES6, it’s not exactly the same syntax.

@hikari-no-yume
Contributor Author

It also supports leading zeroes, as proposed for PHP. Does ES6?

@mathiasbynens

> It also supports leading zeroes, as proposed for PHP. Does ES6?

@TazeTSchnitzel Yes.

@hikari-no-yume
Contributor Author

Aha! :)

@hikari-no-yume
Contributor Author

The spec patch now states:

Implementations MUST allow Unicode codepoints that are not Unicode scalar values, such as high and low surrogates.
