-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Major overhaul of mbstring (part 26) #9592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In case they are interested... @Girgias @kocsismate |
ext/mbstring/php_unicode.c
Outdated
}; | ||
|
||
static int convert_case_filter(int c, void *void_data) | ||
MBSTRING_API zend_string *php_unicode_convert_case(int case_mode, const char *srcstr, size_t in_len, const mbfl_encoding *src_encoding, int illegal_mode, int illegal_substchar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case mode, would it make sense to add an enum which contains all of the different modes?
And are the illegal_mode
and illegal_substchar
really ints? And maybe not bool, and a char? (Again I know very little about character encodings so the current API might be sensible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an enum
for case mode is a good idea.
illegal_mode
is essentially an enum
; there are 3-4 valid values. illegal_substchar
is a Unicode codepoint, and therefore could be uint32_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It seems that introducing an enum
for illegal_mode
may require reorganizing the header files for mbstring. Depending on which header file I define it in, there are a lot of issues with the order of header file inclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just converting illegal_substchar
to uint32_t
.
For illegal_mode
, I would like to handle this in the future. It seems to open a "can of worms" which is not directly related to this PR.
Hi Alex, I am doing some light testing and I cannot figure out why I get different outputs for Element 0 is input codepoint, element 2 is output codepoint.
Edit: I am running PHP 8.0 and while writing this, I realized I haven't compared it against 8.1. https://3v4l.org/M9sqM So it's not this PR. |
@kamil-tekiela Thank you very much for testing! Could you share your test code? |
I only did a simple loop |
@kamil-tekiela Then |
It says UTF-8 |
@kamil-tekiela I am trying to investigate your findings, but am not getting the same results. Could you kindly share more of your testing code? |
As an example:
gives me:
|
@kamil-tekiela Sorry, I just saw the link you kindly provided to |
Hmm. U+2C5F is Glagolitic Small Letter Caudate Chrivi, U+2C2F is Glagolitic Capital Letter Caudate Chivi. So it looks like PHP 8.1 is doing the right thing there. Looks like this difference must be caused by fe36b81. |
Just pushed 2 commits based on the helpful comments from @Girgias. Any further comments? Thank you, everyone! |
Failure on Travis is spurious (it's on a test for |
From my PoV, this looks OK, but as I said far from an expert :) |
…mb_strtolower Speed increase is only about 50% for title casing, but 2-3x for other forms of case conversion.
This value is a wchar, so the best type for it is uint32_t.
c7b8cf7
to
beae80b
Compare
Merged. If anyone notices anything that can be improved later on, I can still revise these changes. |
Use the new (faster) encoding conversion code for case conversion functions like
mb_convert_case
,mb_strtoupper
, andmb_strtolower
. Speed increase is only about 50% for title casing, but 2-3x for other types of case conversion.Fuzzed with libfuzzer. One bug in my first draft of the implementation was found, and a regression test added.
Note: the signature of one function with public symbol (
php_unicode_convert_case
) is changed. This could break C extensions which link directly to mbstring and call this function. However, none of the PECL extensions do so.FYA @cmb69 @nikic @kamil-tekiela
Perhaps @mvorisek might be interested. Recently he raised some suggestions about how to make
mb_strtoupper
andmb_strtolower
faster. This PR does not close the performance gap withstrtoupper
andstrtolower
, but at least makes it much smaller than it was.