Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_str_split() added #3715

Closed
wants to merge 31 commits into from

Conversation

@legale
Copy link
Contributor

legale commented Dec 23, 2018

@legale legale mentioned this pull request Dec 23, 2018
@petk petk added the Enhancement label Dec 23, 2018
.gitignore Outdated Show resolved Hide resolved
.gitignore Outdated Show resolved Hide resolved
Co-Authored-By: legale <32621719+legale@users.noreply.github.com>
@cmb69

This comment has been minimized.

Copy link
Contributor

cmb69 commented Dec 24, 2018

Since the function is part of ext/mbstring, it should support all MB character encodings. As it is now, it only supports UTF-8.

This said, I wonder whether splitting Unicode strings is actually useful, unless grapheme clusters are taken into account. It seems that this could be accomplished in userland quite simple and efficiently by using grapheme_extract().

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Dec 24, 2018

To support other encodings, it might be useful to take a look at some of the libmbfl functions, such as mbfl_strlen: https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L658

As you can see, there are basically three cases: Fixed-length encodings, where you can simply collect characters in groups of 1, 2 or 4. Then UTF-8, which is the mblen_table case and corresponds to what you implemented here, and finally the use of a mbfl_convert_filter, which handles the case of arbitrary encodings. In this case the filter will feed you characters into a callback, from which you can push into an array.

This said, I wonder whether splitting Unicode strings is actually useful, unless grapheme clusters are taken into account.

I don't see the immediate use either, but I guess it would make some sense from an API parity point of view, if nothing else.

It's also worth noting that the case where you want to split by single codepoints is already available through IntlBreakIterator::createCodePointInstance(), albeit in the intl rather than mbstring extension.

@petk petk removed the Waiting on Review label Dec 24, 2018
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 25, 2018

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 25, 2018

I’m sure that there are many other ways to split multibyte string at the moment. preg_split() for example. But specialized functions usually are more efficient than not specilized. Now we’ve got str_split() but haven’t got multibyte analog.

@cmb69

This comment has been minimized.

Copy link
Contributor

cmb69 commented Dec 25, 2018

Now we’ve got str_split() but haven’t got multibyte analog.

The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 25, 2018

Now we’ve got str_split() but haven’t got multibyte analog.

The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

I'll try do add additional codepages support.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 28, 2018

New review needed.

@cmb69

This comment has been minimized.

Copy link
Contributor

cmb69 commented Dec 28, 2018

@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow to override the internal encoding for a single function call.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 28, 2018

@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow to override the internal encoding for a single function call.

Thanks for your kind words. This is my first collaboration experience.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 28, 2018

Something wrong with this test on windows:
Bug #55509 (segfault on x86_64 using more than 2G memory) [C:\projects\php-src\Zend\tests\bug55509.phpt]
I compared the code that passed the test and the one that did not pass it. The only difference is data types. const char * fails the test and char * pass it. Could this be a cause?

@cmb69

This comment has been minimized.

Copy link
Contributor

cmb69 commented Dec 28, 2018

@legale This test is failing not rarely. Not sure why, but almost certainly it is unrelated to this pull request.

I'll have a closer look at the pull request as soon as possible.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Dec 28, 2018

@legale This test is failing not rarely. Not sure why, but almost certainly it is unrelated to this pull request.

I'll have a closer look at the pull request as soon as possible.

I've changed const char * to char * and test passed.

legale added 2 commits Jan 10, 2019
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Show resolved Hide resolved
ext/mbstring/tests/mb_str_split_jp.phpt Outdated Show resolved Hide resolved
ext/mbstring/tests/mb_str_split_jp.phpt Outdated Show resolved Hide resolved
ext/mbstring/tests/mb_str_split_jp.phpt Outdated Show resolved Hide resolved
legale added 6 commits Jan 13, 2019
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Jan 14, 2019

New review required.

ext/mbstring/libmbfl/mbfl/mbfilter.h Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Jan 15, 2019

ext/mbstring/mbstring.c Outdated Show resolved Hide resolved
* | second scenario: "2- or 4-bytes width encodings UTF-16LE UTF-16BE |
* +----------------------------------------------------------------------+
*/
} else if (mbfl_encoding->flag & ( MBFL_ENCTYPE_MWC2LE | MBFL_ENCTYPE_MWC2BE )) {

This comment has been minimized.

Copy link
@nikic

nikic Jan 18, 2019

Member

If we want to have this kind of optimization, it should be part of libmbfl, rather than only this one function.

Personally, I don't think this is necessary. Thankfully UTF-16 is not an important encoding in PHP and it's not an issue if it does not have a very fast implementation.

This comment has been minimized.

Copy link
@legale

legale Jan 18, 2019

Author Contributor

I could try to implement it for the libmbfl if more experienced developers approve of this approach. I mean 2 relatively large tables of 65k each.

This comment has been minimized.

Copy link
@legale

legale Jan 18, 2019

Author Contributor

This way is at least twice faster than libmfl way.

This comment has been minimized.

Copy link
@legale

legale Jan 18, 2019

Author Contributor

Do i need to create new RFC for this?

legale added 3 commits Jan 18, 2019
…bmbfl/mbfl/ reverted to master branch commit ecd533d
@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Jan 22, 2019

i've implemented utf-16 optimization to the whole mbfl library.

@legale

This comment has been minimized.

Copy link
Contributor Author

legale commented Feb 8, 2019

@KalleZ,
@derickr,
@nikic,
@Agares,
@rybakit,
@petk,
@cmb69,
@carusogabriel,
dear colleagues, can someone of you merge this PR? Is there somethind need to be done before?

@nikic

This comment has been minimized.

Copy link
Member

nikic commented Feb 12, 2019

Closing this PR in favor of #3808.

@nikic nikic closed this Feb 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.