Unicode Codepoint Escape Syntax #918
Code points in the surrogate range, U+D800 through U+DFFF, should be invalid. See RFC 3629 and The Unicode Standard 7.0, Chapter 3.
@masakielastic It really depends. It’s true that lone surrogates are technically invalid, and UTF-8 doesn’t allow them to be encoded, but some languages (like JavaScript and JSON) still allow lone surrogates in strings. I could imagine a PHP script that outputs JSON that includes strings containing lone surrogates in some testing scenarios (although that may be enough of an edge case to ignore it). Essentially this boils down to the question: should we use UTF-8 or WTF-8? (I don’t have a strong opinion on this.)
I don't think we should disallow producing UTF-8-encoded surrogates; there might be legitimate applications for this.
But if we are to allow them (it'd make CESU-8 handling easier, heh), that would need a test and would have to be in the language spec.
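To make the UTF-8 versus WTF-8 question concrete, here is a minimal sketch of an encoder in which the surrogate policy is a single check. It is not taken from the patch: the function name `encode_utf8` and the `allow_surrogates` flag are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the patch's code: encodes one code point into
 * out (which must hold at least 4 bytes). Returns the number of bytes
 * written, or 0 if the code point is rejected. */
static size_t encode_utf8(uint32_t cp, int allow_surrogates, unsigned char *out)
{
    if (cp > 0x10FFFF) {
        return 0; /* beyond the Unicode code space */
    }
    if (!allow_surrogates && cp >= 0xD800 && cp <= 0xDFFF) {
        return 0; /* strict UTF-8 (RFC 3629) forbids surrogates */
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Surrogates land here when allowed (WTF-8 style output). */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With `allow_surrogates` set to 0, U+D800 is rejected; with it set to 1, the same code point is emitted as the three bytes ED A0 80, which a strict UTF-8 decoder would refuse.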
var_dump("\u{FF}"); // y with diaeresis
var_dump("\u{ff}"); // case-insensitive
var_dump("\u{2603}"); // Unicode snowman
var_dump("\u{1F602}"); // FACE WITH TEARS OF JOY emoji
Let’s add a test for leading zeroes? E.g. \u{0000000000001F602}.
Done!
Please also add tests to make sure that \u{+123} and \u{-123} are invalid. I suspect that right now the former will be accepted, while the latter will be rejected with a non-obvious error message.
Never mind, those are rejected because you're validating the string before calling strtol. A test wouldn't hurt though :)
Now that you mention it, I ought to use strtoul, not strtol.
Alright, done :)
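For readers following the review thread, the validate-then-`strtoul` approach discussed above can be sketched roughly as follows. This is an illustration and not the patch itself: the helper name `parse_codepoint`, its signature, and the U+10FFFF upper bound are assumptions.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not the patch's code: parses the NUL-terminated
 * text found between the braces of \u{...}. Returns true and stores the
 * code point on success. */
static bool parse_codepoint(const char *digits, uint32_t *out)
{
    const char *p;
    unsigned long value;

    if (*digits == '\0') {
        return false; /* "\u{}" has no digits */
    }
    /* Validate first: strtoul on its own would skip leading whitespace
     * and accept a sign or a "0x" prefix, so "+123" and "-123" have to
     * be rejected here. */
    for (p = digits; *p != '\0'; p++) {
        if (!isxdigit((unsigned char)*p)) {
            return false;
        }
    }

    errno = 0;
    value = strtoul(digits, NULL, 16); /* leading zeroes are harmless */
    if (errno == ERANGE || value > 0x10FFFF) {
        return false; /* assumed upper bound: end of the Unicode range */
    }
    *out = (uint32_t)value;
    return true;
}
```

Under these assumptions, `1F602` and `0000000000001F602` both parse to U+1F602, while `+123`, `-123`, and the empty string are rejected before `strtoul` is ever called.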
Just a bit of info ... http://graphemica.com/%F0%9F%98%82 ... we are matching Ruby.
Ooh, I didn't know Ruby did that. I'll add that to the RFC.
@rquadling Graphemica isn’t the most reliable source – it displays the escape sequences for JavaScript, JSON, C, C++, Java, Python incorrectly. Better link (IMHO): http://codepoints.net/U+1F602
Looks like Ruby supports multiple code points between the braces (e.g. |
It also supports leading zeroes, as proposed for PHP. Does ES6?
Aha! :)
The spec patch now states:
https://wiki.php.net/rfc/unicode_escape