Unicode Codepoint Escape Syntax #918

hikari-no-yume · 2014-11-24T21:59:45Z

https://wiki.php.net/rfc/unicode_escape

masakielastic · 2014-11-25T06:17:18Z

The range between U+D800 and U+DFFF should be invalid.

RFC 3629

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

http://tools.ietf.org/html/rfc3629

The Unicode Standard 7.0, Chapter 3

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.

D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-6 and Table 3-7.

Table 3-7. Well-Formed UTF-8 Byte Sequences

| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
| --- | --- | --- | --- | --- |
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf
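
For illustration, a minimal PHP sketch of a strict encoder that follows the ranges in Table 3-7 and refuses surrogates. The function name `utf8_encode_scalar` is hypothetical and is not part of this patch; it only shows what "invalid" would mean under the strict reading.

```php
<?php
// Hypothetical helper, not part of the patch: encode one Unicode scalar
// value as UTF-8. Surrogates (U+D800..U+DFFF) are rejected, as in D76/D92.
function utf8_encode_scalar(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('code point out of range');
    }
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        throw new InvalidArgumentException('surrogates are not scalar values');
    }
    if ($cp <= 0x7F) {
        // One byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8_encode_scalar(0x2603)), "\n";  // e29883
echo bin2hex(utf8_encode_scalar(0x1F602)), "\n"; // f09f9882
```

Calling `utf8_encode_scalar(0xD800)` throws in this sketch, which is the behaviour the strict definition would require.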

mathiasbynens · 2014-11-25T07:38:41Z

@masakielastic It really depends. It’s true that lone surrogates are technically invalid, and UTF-8 doesn’t allow them to be encoded, but some languages (like JavaScript and JSON) still allow lone surrogates in strings. I could imagine a PHP script that outputs JSON that includes strings containing lone surrogates in some testing scenarios (although that may be enough of an edge case to ignore it).

Essentially this boils down to the question: should we use UTF-8 or WTF-8? (I don’t have a strong opinion on this.)

hikari-no-yume · 2014-11-25T09:55:26Z

I don't think we should disallow producing UTF-8-encoded surrogates; there might be legitimate applications for this.

hikari-no-yume · 2014-11-25T10:41:02Z

But if we are to allow them (it'd make CESU-8 handling easier, heh), that'd need to have a test and be in the language spec.

mathiasbynens · 2014-11-27T08:57:53Z

tests/lang/string/unicode_escape.phpt

+var_dump("\u{FF}"); // y with diaeresis
+var_dump("\u{ff}"); // case-insensitive
+var_dump("\u{2603}"); // Unicode snowman
+var_dump("\u{1F602}"); // FACE WITH TEARS OF JOY emoji


Let’s add a test for leading zeroes? E.g. \u{0000000000001F602}.

Please also add tests to make sure that \u{+123} and \u{-123} are invalid. I suspect that right now the former will be accepted, while the latter will be rejected with a non-obvious error message.

Nevermind, those are rejected as you're validating the string before calling strtol. A test wouldn't hurt though :)

Now that you mention it, I ought to use strtoul not strtol.

Alright, done :)
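
As a rough user-land model of the check discussed above (the actual patch does this in C inside the scanner), the sketch below validates the digits before converting them, so a leading `+` or `-` never reaches the numeric conversion. The function name `parse_codepoint_digits` and the error messages are assumptions made for illustration only.

```php
<?php
// Hypothetical model of the validation order described in this thread:
// 1. accept hex digits only, 2. convert, 3. range-check.
function parse_codepoint_digits(string $digits): int
{
    // Rejects the empty string, "+123", "-123", spaces, and so on.
    if (preg_match('/\A[0-9A-Fa-f]+\z/', $digits) !== 1) {
        throw new InvalidArgumentException('invalid codepoint escape');
    }
    // Leading zeroes carry no value, so any number of them is accepted.
    $hex = ltrim($digits, '0');
    // More than six significant hex digits cannot fit in U+10FFFF.
    if (strlen($hex) > 6) {
        throw new InvalidArgumentException('codepoint too large');
    }
    $cp = ($hex === '') ? 0 : (int) hexdec($hex);
    if ($cp > 0x10FFFF) {
        throw new InvalidArgumentException('codepoint too large');
    }
    return $cp;
}

var_dump(parse_codepoint_digits('0000000000001F602') === 0x1F602); // bool(true)

foreach (['+123', '-123'] as $bad) {
    try {
        parse_codepoint_digits($bad);
    } catch (InvalidArgumentException $e) {
        echo $bad, ': ', $e->getMessage(), "\n";
    }
}
```

Validating first is what makes the choice between a signed and an unsigned conversion harmless for signed input: the sign is rejected before conversion in either case.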

rquadling · 2014-11-28T16:32:41Z

Just a bit of info ... http://graphemica.com/%F0%9F%98%82 ... we are matching Ruby.

hikari-no-yume · 2014-11-28T16:41:25Z

Ooh, I didn't know Ruby did that. I'll add that to the RFC.

mathiasbynens · 2014-11-28T16:53:45Z

@rquadling Graphemica isn’t the most reliable source – it displays the escape sequences for JavaScript, JSON, C, C++, Java, Python incorrectly. Better link (IMHO): http://codepoints.net/U+1F602

mathiasbynens · 2014-11-28T16:56:29Z

Looks like Ruby supports multiple code points between the braces (e.g. \u{20AC A3 A5}) so unlike ES6, it’s not exactly the same syntax.
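
For comparison, a small sketch of how the same three characters would be written in PHP, assuming the syntax as proposed here (one code point per pair of braces):

```php
<?php
// One escape per code point: EURO SIGN, POUND SIGN, YEN SIGN.
$currencies = "\u{20AC}\u{A3}\u{A5}";

echo bin2hex($currencies), "\n"; // e282acc2a3c2a5
```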

hikari-no-yume · 2014-11-28T16:57:38Z

It also supports leading zeroes, as proposed for PHP. Does ES6?

mathiasbynens · 2014-11-28T16:58:02Z

> It also supports leading zeroes, as proposed for PHP. Does ES6?

@TazeTSchnitzel Yes.

hikari-no-yume · 2014-11-28T16:58:23Z

Aha! :)

hikari-no-yume · 2014-12-16T12:14:24Z

The spec patch now states:

Implementations MUST allow Unicode codepoints that are not Unicode scalar values, such as high and low surrogates.
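
A short sketch of what that wording means in practice, assuming a PHP build that includes this patch: the surrogate escape yields the three-byte generalized encoding, which a strict UTF-8 validator (here PCRE's `u` modifier, used only as a convenient checker) still rejects.

```php
<?php
// A lone high surrogate is accepted by the escape syntax and encoded
// with the ordinary three-byte pattern.
$high = "\u{D800}";
echo bin2hex($high), "\n"; // eda080

// The resulting bytes are not well-formed UTF-8, so matching in
// UTF-8 mode fails for the surrogate but succeeds for a scalar value.
var_dump(preg_match('//u', $high));        // bool(false)
var_dump(preg_match('//u', "\u{1F602}"));  // int(1)
```

Scripts that need well-formed UTF-8 therefore still have to validate their own output; the escape syntax does not do it for them.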

mathiasbynens reviewed Nov 27, 2014

smalyshev added the RFC label Nov 28, 2014

jpauli added the PHP7 label Dec 12, 2014

hikari-no-yume force-pushed the unicodeEscape branch from e8a1885 to 0f1ecda Compare December 16, 2014 07:56

hikari-no-yume force-pushed the unicodeEscape branch from 0f1ecda to ce0cdf6 Compare December 17, 2014 20:12

Unicode Codepoint Escape Syntax

bae46f3

hikari-no-yume force-pushed the unicodeEscape branch from b3e88b6 to bae46f3 Compare December 19, 2014 00:41

php-pulls merged commit bae46f3 into php:master Dec 19, 2014
