Performance optimizations for mb_strlen and encoding conversion of SJIS, UHC, CP936, BIG5 text #10211
Conversation
@alexdowad I have a question.
@youkidearitai This is a good point! Should I explain the reason for the difference?
I would be grateful if you could explain this difference.
@youkidearitai OK. Just to be clear, I asked whether I should explain because I don't know what the reviewers already understand, and I didn't want to launch into a long explanation of what everyone already knows. We can start from here: php-src/ext/mbstring/libmbfl/filters/mbfilter_sjis.c Lines 64 to 81 in e7a0a2b
This is the mblen_table for SJIS; for each possible lead byte value, it gives the expected length of the character in bytes. When the table is consulted for the byte in your example, it gives one answer...
On the other hand, when the conversion code actually processes that byte...
Hmm. Have you noticed something here? ... Look at the legacy code for converting SJIS to codepoints. What does it do with that byte? php-src/ext/mbstring/libmbfl/filters/mbfilter_sjis.c Lines 301 to 317 in e7a0a2b
Both the legacy SJIS decoder and the new one have never treated that byte the way the mblen_table says.

@youkidearitai You discovered a pre-existing bug in mbstring! Nice job. I believe we can find more sample strings where the output before and after this PR will be different, but first let me add a commit which fixes the mblen_table.
Hmm, I see another reason why...
One thing I think is becoming clearer to me... it may be a bad idea for SJIS, MacJapanese, SJIS-{DOCOMO,KDDI,SoftBank}, and SJIS-2004 to all share the same mblen_table.
OK, I have pushed a fix for the issue with that byte. In SJIS-2004 and the mobile variants, the relevant bytes behave differently...
Before I start deep-diving into the commit history: if the last one is an actual bug fix, it may need to be backported, or if not, have an entry in UPGRADING. So could you please split that specific one into its own PR?
Ah, good point.
Commit 2: seems reasonable, but I just don't know what a "PUA code" actually is.
Commit 3 (GB18030 to Unicode):
Looks like black magic to me, but if tests pass, LGTM.
I know this is existing code, and this is me being opinionated, but flipping the conditions to detect error states and continue, as is done in the previous encoding conversion, would IMHO make it more readable.
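For illustration, the suggested shape would be something like this. It is only a sketch of the style being described, using a made-up 1/2-byte toy encoding and made-up names (BAD_INPUT, decode_toy); it is not the actual converter code from this PR:

```c
#include <stdint.h>
#include <stddef.h>

#define BAD_INPUT 0xFFFDu  /* placeholder error marker, not mbstring's real one */

/* Decode a toy 1/2-byte encoding. Error states are detected up front and
 * handled with `continue`, so the happy path is not nested inside conditions. */
static size_t decode_toy(const unsigned char *p, const unsigned char *e, uint32_t *out)
{
	uint32_t *start = out;

	while (p < e) {
		unsigned char c = *p++;

		if (c < 0x80) {              /* single-byte character */
			*out++ = c;
			continue;
		}

		if (p == e || *p < 0x40) {   /* truncated character or bad trailing byte */
			*out++ = BAD_INPUT;
			continue;                /* error handled; go on to the next byte */
		}

		*out++ = ((uint32_t)c << 8) | *p++;  /* happy path, no extra nesting */
	}

	return out - start;  /* number of codepoints produced */
}
```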
Private Use Area: Unicode codepoint numbers which are 'reserved' and which the Unicode Consortium will never use for any purpose, not because they are 'illegal' or 'invalid' like U+D800-U+DFFF, but so that proprietary software systems can use them internally with their own internal meaning. Similar concept to "private" IP address ranges like 10.0.0.0/8, which are not used on the public Internet. I think lots of other protocols include "private" identifier ranges as well.

Some of the legacy text encodings supported by mbstring include characters which are not in Unicode. In some cases, mbstring maps these to PUA codepoints. I don't know all the history behind this. Maybe the original author of mbstring did this because some legacy software systems actually used those PUA codepoints to represent those particular characters. Or maybe it was just done to facilitate conversions between non-Unicode charsets which include those characters.
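For reference, the PUA ranges themselves are easy to test for. A minimal sketch of the concept (this is just an illustration, not how mbstring itself checks for PUA codepoints):

```c
#include <stdint.h>
#include <stdbool.h>

/* The three Private Use Area ranges defined by Unicode. */
static bool is_private_use(uint32_t cp)
{
	return (cp >= 0xE000   && cp <= 0xF8FF)     /* BMP PUA */
	    || (cp >= 0xF0000  && cp <= 0xFFFFD)    /* Plane 15 (PUA-A) */
	    || (cp >= 0x100000 && cp <= 0x10FFFD);  /* Plane 16 (PUA-B) */
}
```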
Sorry, which lines are you referring to for the GB18030 conversion?
Just running tests locally before pushing an update...
Commits 4, 5, and 7 are definitely okay.
For the other ones, I'm still on the fence about the readability trade-off (especially the last BIG5 one, for 1-2%).
😆 Feel free to veto it, I'll delete it from the patch series.
So it looks like I have a bit of homework here...
• Look into (perhaps) making other adjustments to the SJIS mblen_table.
I have merged the commits which @Girgias judged as "definitely okay". The fix for the SJIS table has been removed from this PR; I'll open another one soon with that fix.
The first 3 commits are also good to land for me. The 4th one, which optimizes the out-of-bounds check for UHC, is very much worth the trade-off IMHO, so I think that can also land. The 5th and 6th ones, which give a 9% improvement, also sound worth the trade-off.

For SJIS, may I be annoying and ask what the performance gain is from the changes which don't affect the out-of-bounds check? Compared to the other encodings, it seems it was all done together.

The last commit is similar to the other ones and is very readable now that I know more about what's going on.

Basically, I think all can land :-)
Thanks for the great review! I have landed a few more commits. Still holding on to the change to...
Rebased on top of #10230. @youkidearitai Do you have any intention of testing this change further, to see if any other interesting test strings can be found?
Hmm, I just discovered another reason to use the decoding filters for mb_strlen: php-src/ext/mbstring/libmbfl/filters/mbfilter_sjis.c Lines 1231 to 1237 in 3ab72a4
Note what happens with the byte handled in that snippet. Using the decoding filters, if a single byte converts to 2 codepoints, then mb_strlen will count it as 2. I guess there's a question here of whether that is really the 'right' answer...

But I think that was already resolved a long time ago. For the majority of legacy text encodings, mb_strlen returns the number of Unicode codepoints which the string decodes to. For a few text encodings, like MacJapanese, it would return the character length in a different charset and not in codepoints. This was yet another weird inconsistency, which will be gone after this PR is merged.

When I get around to working on the documentation, I will have to clarify what mb_strlen actually returns.
Hmm, just thinking about this some more. I am really wondering if we should remove the mblen_table for MacJapanese.

The reason is this: if this PR is merged, mb_strlen will report the number of codepoints which a MacJapanese string decodes to. The fact that MacJapanese can sometimes convert 1 byte to 2 codepoints, or 2 bytes to 3/4/5 codepoints (which is not expressed by the mblen_table), means the table can disagree with that count.

Hmm.
Lesson for library writers here: When defining your new library's API, be very specific about what the output of each function is supposed to be. Much of the mbstring documentation is quite vague.
After some more thought, I have decided to leave the MacJapanese implementation "as is" for now. Yes, the behavior of mb_strlen and mb_substr on MacJapanese text is a bit surprising in places. I might add a couple of unit tests for this, though.
For now, I can't think of anything else.
…y a slow path)

Various mbstring legacy text encodings have what is called an 'mblen_table'; a table which gives the length of a multi-byte character using a lookup on the first byte value. Several mbstring functions have a 'fast path' which uses this table when it is available.

However, it turns out that iterating through a string using the mblen_table is surprisingly slow. I found that by deleting this 'fast path' from mb_strlen, while mb_strlen becomes a few percent slower on very small strings (0-5 bytes), very large performance gains can be achieved on medium to long input strings. Part of the reason for this is because our text decoding filters are so much faster now.

Here are some benchmarks:

EUC-KR, short (0-5 chars) - master faster by 11.90% (0.0000 vs 0.0000)
EUC-JP, short (0-5 chars) - master faster by 10.88% (0.0000 vs 0.0000)
BIG-5, short (0-5 chars) - master faster by 10.66% (0.0000 vs 0.0000)
UTF-8, short (0-5 chars) - master faster by 8.91% (0.0000 vs 0.0000)
CP936, short (0-5 chars) - master faster by 6.27% (0.0000 vs 0.0000)
UHC, short (0-5 chars) - master faster by 5.38% (0.0000 vs 0.0000)
SJIS, short (0-5 chars) - master faster by 5.20% (0.0000 vs 0.0000)
UTF-8, medium (~100 chars) - new faster by 127.51% (0.0004 vs 0.0002)
UTF-8, long (~10000 chars) - new faster by 87.94% (0.0319 vs 0.0170)
UTF-8, very long (~100000 chars) - new faster by 88.25% (0.3199 vs 0.1699)
SJIS, medium (~100 chars) - new faster by 208.89% (0.0004 vs 0.0001)
SJIS, long (~10000 chars) - new faster by 253.57% (0.0319 vs 0.0090)
CP936, medium (~100 chars) - new faster by 126.08% (0.0004 vs 0.0002)
CP936, long (~10000 chars) - new faster by 200.48% (0.0319 vs 0.0106)
EUC-KR, medium (~100 chars) - new faster by 146.71% (0.0004 vs 0.0002)
EUC-KR, long (~10000 chars) - new faster by 212.05% (0.0319 vs 0.0102)
EUC-JP, medium (~100 chars) - new faster by 186.68% (0.0004 vs 0.0001)
EUC-JP, long (~10000 chars) - new faster by 295.37% (0.0320 vs 0.0081)
BIG-5, medium (~100 chars) - new faster by 173.07% (0.0004 vs 0.0001)
BIG-5, long (~10000 chars) - new faster by 269.19% (0.0319 vs 0.0086)
UHC, medium (~100 chars) - new faster by 196.99% (0.0004 vs 0.0001)
UHC, long (~10000 chars) - new faster by 256.39% (0.0323 vs 0.0091)

This does raise the question: is using the 'mblen_table' worthwhile for other mbstring functions, such as mb_str_split? The answer is yes, it is worthwhile; you see, while mb_strlen only needs to decode the input string but not re-encode it, when mb_str_split is implemented using the conversion filters, it needs to both decode the string and then re-encode it. This means that there is more potential to gain performance by using the 'mblen_table'. Benchmarking shows that in a few cases, mb_str_split becomes faster when the 'mblen_table fast path' is deleted, but in the majority of cases, it becomes slower.
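For readers who haven't looked at this code, the 'mblen_table fast path' is essentially the following kind of loop. This is only an illustrative sketch with made-up names, not the actual mbstring implementation:

```c
#include <stddef.h>

/* Count characters by looking up the length of each character from its first
 * byte; no actual decoding is done. A real mblen_table has 256 entries, one
 * per possible lead byte value. The real code must also guard against a
 * truncated final character whose table length runs past the end of the buffer. */
static size_t count_chars_with_mblen_table(const unsigned char *p, size_t len,
                                           const unsigned char mblen_table[256])
{
	const unsigned char *e = p + len;
	size_t chars = 0;

	while (p < e) {
		p += mblen_table[*p];  /* advance by the length given for this lead byte */
		chars++;
	}

	return chars;
}
```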
MacJapanese has a somewhat unusual feature that when mapped to Unicode, many characters map to sequences of several codepoints. Add test cases demonstrating how mb_str_split and mb_substr behave in this situation.

When adding these tests, I found the behavior of mb_substr was wrong due to an inconsistency between the string "length" as measured by mb_strlen and the number of native MacJapanese characters which mb_substr would count when iterating over the string using the mblen_table. This has been fixed.

I believe that mb_strstr will also return wrong results in some cases for MacJapanese. I still need to come up with unit tests which demonstrate the problem and figure out how to fix it.
Please see the added commit; I added some more unit tests for MacJapanese.
Also, if you found a bug, it probably should be backported.
Well, the bug only applies to this branch. Older versions of PHP should be fine.
ACK
Thanks, everyone.
The reviewers will notice one optimization is applied here to the decoding routines for about 3 different legacy text encodings. It involves decrementing the pointer which marks the end of the input string before entering the main loop, and then fixing it back up before exiting... the point of all this being that we can remove one p < e guard from the body of the main loop. If a lot of work is being done in the main loop, this might not make a noticeable difference, but if the main loop is already very tight, optimizing out just one such check can make it close to 10% faster.
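To make the pattern concrete, here is a sketch of the "dance" being described, using a made-up 1/2-byte toy encoding; this is not the actual code from the PR:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode a toy encoding where every character is 1 or 2 bytes. The end pointer
 * is pulled back by one before the main loop, so inside the loop at least two
 * bytes are always available and the `p < e` check before reading a trailing
 * byte can be dropped. The last byte (or two) is handled separately afterwards. */
static void decode_all(const unsigned char *p, size_t len, uint32_t *out)
{
	const unsigned char *e = p + len;

	if (len) {
		e--;  /* main loop stops one byte early */

		while (p < e) {
			unsigned char c = *p++;
			if (c < 0x80) {
				*out++ = c;
			} else {
				/* no bounds check needed here; a trailing byte is guaranteed */
				*out++ = ((uint32_t)c << 8) | *p++;
			}
		}

		e++;  /* fix the end pointer back up before the final stretch */
	}

	/* handle whatever is left with full bounds checks */
	while (p < e) {
		unsigned char c = *p++;
		if (c < 0x80) {
			*out++ = c;
		} else if (p < e) {
			*out++ = ((uint32_t)c << 8) | *p++;
		} else {
			*out++ = 0xFFFDu;  /* truncated final character */
		}
	}
}
```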
Here is what I would love to hear some feedback on... doing this little "dance" to optimize out the p < e guard is not really necessary when the input string is a zend_string, because zend_strings always have one extra byte allocated after the end of the string content (used for a null terminator). Actually, in a lot of our legacy text decoding routines, we could use the zend_string null terminator as a sentinel to optimize out even more checks and gain more performance.

The problem is that these text decoding routines are not only used for mb_convert_encoding, but also by the PHP interpreter's scanner when reading in PHP scripts which are written in some funny text encoding... right here:

php-src/Zend/zend_language_scanner.l Lines 137 to 161 in 3b5072f

I don't know if I can rely on these input strings being null-terminated, but I guess the answer is probably no.
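For comparison, here is roughly what the same toy decoder could look like if it were allowed to rely on the guaranteed extra byte; a sketch only, under the assumption that p[len] really is a 0 byte (as it is for a zend_string):

```c
#include <stdint.h>
#include <stddef.h>

/* Decode the same toy 1/2-byte encoding, but use the one guaranteed extra byte
 * after the data (a zend_string's NUL terminator) instead of the end-pointer
 * "dance": the trailing byte can be read unconditionally, because in the worst
 * case the read only touches the terminator. */
static size_t decode_with_terminator(const unsigned char *p, size_t len, uint32_t *out)
{
	const unsigned char *e = p + len;  /* p[len] is assumed to be '\0' */
	uint32_t *start = out;

	while (p < e) {
		unsigned char c = *p++;

		if (c < 0x80) {
			*out++ = c;
		} else if (*p >= 0x40) {       /* safe even when p == e: reads the '\0' */
			*out++ = ((uint32_t)c << 8) | *p++;
		} else {
			*out++ = 0xFFFDu;          /* invalid or truncated character */
		}
	}

	return out - start;
}
```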
Any thoughts will be appreciated.
@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai