[5.5] Add multibyte functions where needed in Support/Str #21207
This is a breaking change. https://3v4l.org/3fMUV
Multibyte strings weren’t being correctly handled by these methods before this change, so I look at it as a bug fix rather than a breaking change.
Thank you, especially for the tests that demonstrate the usefulness of the multibyte functions.
These new tests worked with the old implementation too. They do not provide anything in the way of regression testing, nor do they stop anyone from going back to the old strlen/substr calls.
Ah, worth having a deeper look at it, then. If it's not possible to make the single-byte functions fail, the use of the multibyte functions is neither justified nor advisable.
Hmm, yeah. I had written the test case strings to prove negative against the underlying If no one can make this fail with the current implementation (I can't find a failing example string), then we can hold this off until 5.6 if @taylorotwell wants. Thanks for the awesome help @sisve, et al., you all continue making the framework amazing!
If no one can figure out a way to make the single-byte functions fail (with proper strings, not with crafted bytes as in sisve's example), then this PR should be reverted, as both the code changes and the added tests are technically useless and may give a false impression of added robustness. I'm not an expert at UTF-8, but apparently encoded characters just cannot overlap: a lead byte never looks like a continuation byte, so a valid sequence cannot match starting in the middle of another character.
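The claim that the single-byte and multibyte checks agree on well-formed strings can be tried out with a small script. This is only a sketch: `startsWithBytes` and `startsWithMb` are hypothetical helpers written for this illustration (they are not the framework's code), the mbstring extension is assumed to be loaded, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helpers for illustration only; not the framework's actual code.

// Byte-oriented prefix check built on strlen/substr.
function startsWithBytes(string $haystack, string $needle): bool
{
    return $needle !== '' && substr($haystack, 0, strlen($needle)) === $needle;
}

// Code-point-oriented prefix check built on mb_strlen/mb_substr.
function startsWithMb(string $haystack, string $needle): bool
{
    return $needle !== ''
        && mb_substr($haystack, 0, mb_strlen($needle, 'UTF-8'), 'UTF-8') === $needle;
}

// Valid UTF-8 haystack/needle pairs, including near misses such as 'O' vs 'Ö'.
$cases = [
    ['Malmö', 'Malm'],
    ['Malmö', 'Malmö'],
    ['Öland', 'Ö'],
    ['Öland', 'O'],
    ['日本語のテキスト', '日本'],
    ['日本語のテキスト', '本'],
];

foreach ($cases as [$haystack, $needle]) {
    // Report whether the two variants give the same answer for this pair.
    var_dump(startsWithBytes($haystack, $needle) === startsWithMb($haystack, $needle));
}
```

A handful of agreeing samples is not a proof, of course. The argument rests on the UTF-8 property mentioned above: if the first bytes of a valid haystack equal a valid needle, those bytes necessarily end on a character boundary.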
Agreed. I’ll start a thread in internals. This was only my first class to audit for multibyte support on these kinds of string functions, so I’m going to be looking more closely at how they’re used and when they might break across the framework, to help shore up any weaknesses with respect to multibyte strings. Thanks all for the help.
Fun fact (I have to make knowing these things sound fun since I spend way too much time on it ...): the native functions (strlen) count bytes and the multibyte functions (mb_strlen) count code points, assuming a Unicode encoding, while people usually describe what the grapheme functions (grapheme_strlen) do. Things like "How many characters are in $input?" Well, still assuming Unicode, there are many ways to write the same character. A simple answer is that it could be "the character Ö", or it could be "the character O followed by a character two-dots-on-top-of-the-previous-character". These things are most easily handled by a normalization middleware that transforms all your incoming strings into the form you expect. The character "Ö", in normalization form NFD, becomes the Unicode code points U+004F and U+0308. That is "LATIN CAPITAL LETTER O" followed by "COMBINING DIAERESIS". The UTF-8 representation of this becomes the three bytes 0x4F 0xCC 0x88, so strlen reports 3, mb_strlen reports 2 and grapheme_strlen reports 1.
This shows that mb_strlen counts code points, not characters. And grapheme_strlen counts graphemes, which is closer to counting characters than counting code points, but not really the same thing, and somewhere around here the Unicode Monster comes for you. Also, that COMBINING DIAERESIS can be copy/pasted. Perfect if you're b̈ö̈r̈̈̈ë̈̈̈d̈̈̈̈.
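The three ways of counting described in this comment can be reproduced with a few lines of PHP. This is a minimal sketch, assuming PHP 7.0 or later (for the `\u{...}` escape) with the mbstring and intl extensions loaded. The variable names are made up for the illustration.

```php
<?php
// "Ö" in decomposed (NFD) form: LATIN CAPITAL LETTER O + COMBINING DIAERESIS.
$decomposed = "O\u{308}";

var_dump(bin2hex($decomposed));               // string(6) "4fcc88"
var_dump(strlen($decomposed));                // int(3)  bytes
var_dump(mb_strlen($decomposed, 'UTF-8'));    // int(2)  code points
var_dump(grapheme_strlen($decomposed));       // int(1)  grapheme clusters

// The precomposed form U+00D6 decomposes to the same two code points under NFD.
$precomposed = "\u{D6}";
var_dump(Normalizer::normalize($precomposed, Normalizer::FORM_D) === $decomposed); // bool(true)
```

Normalizing incoming strings to one form, as the comment suggests, makes the code point count predictable, but it still does not turn mb_strlen into a count of user-perceived characters.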
Any news on this? Does everyone agree this PR can and should be reverted? It would be great not to slow down these Str methods needlessly.
Was this not reverted in another PR?
I don’t think it was ever reverted, but #21722 is doing that now.
This reverts #21207. Parsing with the multibyte functions is actually useless here: nobody in that thread found a proper UTF-8 string on which the single-byte functions fail.
This PR updates string support functions to use multibyte-safe versions of substr and strlen. Additional test assertions are also added to validate multibyte string operations.
The four methods revised are:
Str::endsWith
Str::replaceFirst
Str::replaceLast
Str::startsWith
Let me know if there are any questions. Thanks!