Preg split not splitting some unicodes #665

Open
viraj-bookanna opened this issue Jun 7, 2021 · 4 comments
Labels
enhancement New feature or request Extension: pcre

Comments

@viraj-bookanna

From manual page: https://php.net/function.preg-split


These JSON-encoded Unicode characters in a string are not split by preg_split:
\ud876\ude54
preg_split('//u', $str, null, PREG_SPLIT_NO_EMPTY);
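The report does not say whether `$str` holds the escape sequences as literal text or the already decoded character. As a minimal sketch of the first reading (the variable names `$raw` and `$pieces` are illustrative, not from the report), the escaped text is just twelve ASCII characters, and the call splits it into twelve pieces:

```php
<?php
// Assumption: the string holds the escapes as literal text. Single quotes
// keep PHP from interpreting the backslashes, so this is 12 ASCII bytes.
$raw = '\ud876\ude54';

// -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops the empty pieces that the
// empty pattern produces at the start and end of the subject.
$pieces = preg_split('//u', $raw, -1, PREG_SPLIT_NO_EMPTY);

echo strlen($raw), "\n";            // 12
echo count($pieces), "\n";          // 12
echo implode(' ', $pieces), "\n";   // \ u d 8 7 6 \ u d e 5 4
```

The sketch passes `-1` rather than `null` as the limit, since `-1` is the documented "no limit" value.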

@kamil-tekiela
Member

These are Unicode surrogate code points. They don't correspond to a valid Unicode character. This is mojibake. You can't split them meaningfully into characters if they are not characters in the first place. What exactly is your expectation here?

@cmb69
Contributor

cmb69 commented Jun 7, 2021

From the PCRE docs:

In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.
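The check described above can be observed from PHP. The following is a sketch under one assumption: the subject contains a lone surrogate (U+D876) written out as the three bytes a naive UTF-8 encoder would produce, which is not valid UTF-8. The variable names are illustrative.

```php
<?php
// Assumption: a lone high surrogate U+D876 forced into a 3-byte UTF-8 shape.
// 0xD876 = 1101 100001 110110 -> ED A1 B6. PCRE rejects this in UTF mode.
$lone = "\xED\xA1\xB6";

$result = preg_split('//u', $lone, -1, PREG_SPLIT_NO_EMPTY);

// preg_split() reports the failure through its return value and
// preg_last_error() rather than by splitting anything.
var_dump($result);                                    // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

Checking `preg_last_error()` after a `false` result distinguishes a malformed subject from other failures.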

Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)
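To make the pairing concrete, here is a short sketch of the standard UTF-16 arithmetic applied to the two values from the report. The helper name `surrogatePairToCodePoint` is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): combine a UTF-16 surrogate pair
// into the single code point above U+FFFF that it stands for.
function surrogatePairToCodePoint(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('not a high/low surrogate pair');
    }
    // Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$codePoint = surrogatePairToCodePoint(0xD876, 0xDE54);
printf("U+%X\n", $codePoint); // U+2DA54
```

So `\ud876\ude54` denotes one code point, U+2DA54, not two separate characters.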

We may want to document that surrogates are not supported; converting to UTF-8 first may yield the desired result.
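One possible way to do that conversion, assuming the input is the escaped text from the report, is to let `json_decode()` turn the escapes into UTF-8 before splitting. This is a sketch, not necessarily what the reporter needs; the variable names are illustrative.

```php
<?php
// Assumption: the input is the escaped text exactly as quoted in the report.
$escaped = '\ud876\ude54';

// Wrapping the text in double quotes makes it a JSON string literal;
// json_decode() combines the surrogate pair into one UTF-8 character.
$str = json_decode('"' . $escaped . '"');

$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo strlen($str), "\n";        // 4 (bytes of UTF-8)
echo count($chars), "\n";       // 1 (one character)
echo bin2hex($chars[0]), "\n";  // f0ada994
```

Wrapping in quotes only works when the text contains no unescaped double quotes or control characters; for arbitrary input, decode the enclosing JSON document instead.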

@viraj-bookanna
Author

viraj-bookanna commented Jun 7, 2021 via email

@cmb69
Contributor

cmb69 commented Jun 7, 2021

Java works with UTF-16; PHP's PCRE works with UTF-8.
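The difference can be sketched in plain PHP by computing the UTF-16 code units that a UTF-16 based string would hold for the same character. The helper name `utf16CodeUnits` is hypothetical, and the code point U+2DA54 is the one the escapes in the report decode to.

```php
<?php
// Hypothetical helper (not a PHP built-in): the UTF-16 code units that
// represent one Unicode code point.
function utf16CodeUnits(int $codePoint): array
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF
        || ($codePoint >= 0xD800 && $codePoint <= 0xDFFF)) {
        throw new InvalidArgumentException('not a Unicode scalar value');
    }
    if ($codePoint < 0x10000) {
        return [$codePoint];            // fits in a single code unit
    }
    $v = $codePoint - 0x10000;          // 20 bits, split 10 + 10
    return [0xD800 | ($v >> 10), 0xDC00 | ($v & 0x3FF)];
}

$units = utf16CodeUnits(0x2DA54);
printf("%04X %04X\n", $units[0], $units[1]); // D876 DE54
```

A UTF-16 string therefore sees two code units for this character, while a UTF-8 aware split sees one character of four bytes.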

@Girgias Girgias added enhancement New feature or request Extension: pcre labels Sep 13, 2022

4 participants