Strings Escaped characters section uses {} inconsistently, unclear #2778

Closed
boxcleverliam opened this issue Sep 20, 2023 · 3 comments · Fixed by #2793
Comments

@boxcleverliam
Contributor

The descriptions of the escape sequences include regular expressions to describe all possible octal, hexadecimal, and Unicode patterns. Curly braces { and } are used with different meanings, making it unclear how to use these.

https://www.php.net/manual/en/language.types.string.php

For octal and hexadecimal, the curly braces { and } are part of the regular expression. They indicate the number of allowed occurrences from the preceding set. So for octal, 1 to 3 characters from the set [0-7], and for hexadecimal, 1 to 2 characters from the set [0-9A-Fa-f].

However, for Unicode it shows \u{[0-9A-Fa-f]+}. Here the curly braces { and } are NOT part of the regular expression: they are literal characters that must be written in the string as part of the escape sequence. This is not clear, and there are also no examples on that page, even in the comments.

I think we should clarify which part is the regular expression by repeating it in the description, and give examples of each.

My suggestion:

  • Octal: the sequence of characters matching the regular expression [0-7]{1,3} is a character in octal notation (e.g. "\101" === "A"), which silently overflows to fit in a byte (e.g. "\400" === "\000")
  • Hexadecimal: the sequence of characters matching the regular expression [0-9A-Fa-f]{1,2} is a character in hexadecimal notation (e.g. "\x41" === "A")
  • Unicode: the sequence of characters matching the regular expression [0-9A-Fa-f]+ is a Unicode codepoint, which will be output to the string as that codepoint's UTF-8 representation. The braces are required in the sequence (e.g. "\u{41}" === "A")
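
To sanity-check the examples in the suggestion, here is a minimal sketch, assuming PHP 7.0 or later (where the \u{...} escape is available). The `$checks` array and its keys are only illustrative names, not anything from the manual.

```php
<?php
// Each entry compares an escape sequence in a double-quoted string
// against the value the suggested wording says it should produce.
$checks = [
    // Octal: 1 to 3 digits from [0-7] after the backslash.
    'octal'          => "\101" === "A",
    // Octal values above \377 silently overflow to fit in one byte.
    'octal overflow' => "\400" === "\000",
    // Hexadecimal: 1 to 2 digits from [0-9A-Fa-f] after \x.
    'hexadecimal'    => "\x41" === "A",
    // Unicode: the braces are literal and required.
    'unicode'        => "\u{41}" === "A",
    // A codepoint outside ASCII becomes its multi-byte UTF-8 encoding.
    'unicode utf-8'  => "\u{1F4B5}" === "\xF0\x9F\x92\xB5",
    // Single-quoted strings do not interpret these sequences at all.
    'single quotes'  => strlen('\u{41}') === 6,
];

foreach ($checks as $name => $ok) {
    echo $name, ': ', $ok ? 'true' : 'false', "\n";
}
```

Every line should report true. The last check is there only to show that the sequences apply to double-quoted (and heredoc) strings, not single-quoted ones.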
@Girgias
Member

Girgias commented Sep 22, 2023

Providing a PR with the suggestion would make it easier to see exactly what you want.

While the suggestion seems sensible, I have no idea how you would want to implement/render it right now.

@damianwadley
Member

> While the suggestion seems sensible, I have no idea how you would want to implement/render it right now.

I figure either we (a) go straight PCRE and escape the literal {s, which would be a very easy thing to do to close this out, or (b) consider that many people aren't particularly well versed in regular expressions and so adopt a more human-friendly ABNF-style syntax instead.

The latter is, of course, a little more involved: the docs use regexes for lots of these things (not that they'd all have to be fixed at once), never mind that it should probably get a discussion and consensus before someone dives into changing everything. But that's where my vote would go.
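
For illustration, option (a) might look roughly like the sketch below. The patterns are hypothetical, written for this thread rather than taken from the manual, and they describe the escape sequence as it appears in source text, with the literal braces of the Unicode form escaped as \{ and \}.

```php
<?php
// Hypothetical "straight PCRE" descriptions of each escape sequence.
// The patterns are single-quoted, so \\\\ becomes \\ in the regex
// (one literal backslash) and \\{ becomes \{ (one literal brace).
$patterns = [
    'octal'   => '/^\\\\[0-7]{1,3}$/',
    'hex'     => '/^\\\\x[0-9A-Fa-f]{1,2}$/',
    'unicode' => '/^\\\\u\\{[0-9A-Fa-f]+\\}$/',
];

// Single-quoted samples hold the raw source text of each sequence.
$samples = ['\101', '\x41', '\u{41}', '\u41'];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = null;
    foreach ($patterns as $name => $pattern) {
        if (preg_match($pattern, $sample) === 1) {
            $results[$sample] = $name;
            break;
        }
    }
    echo $sample, ' => ', $results[$sample] ?? 'no match', "\n";
}
```

The last sample, \u41 without braces, matches none of the patterns, which is the distinction the current wording leaves unclear.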

@boxcleverliam
Contributor Author

@Girgias I made a pull request here: #2793
I haven't worked with this file format before, so I hope that it is useful.

@damianwadley Maybe there is value in improving this piece of documentation on its own as it is an early topic in learning the language. There may be a more human-friendly syntax for these sequences, but I believe that some examples would show it best. Perhaps a fuller example like this:

echo "The \"banknote\" emoji\n\t\u{1F4B5}\n has a \$ symbol on it.";
// Output:
//The "banknote" emoji
//	💵
// has a $ symbol on it.
