
Performance optimizations for mb_strlen and encoding conversion of SJIS, UHC, CP936, BIG5 text #10211


Closed

Conversation

alexdowad
Contributor

Reviewers will notice that one optimization is applied here to the decoding routines for about three different legacy text encodings. It involves decrementing the pointer which marks the end of the input string:

e--; /* Stop the main loop 1 byte short of the end of the input */

And then fixing this up before exiting:

/* Finish up last byte of input string if there is one */
if (p == e && out < limit) {
	unsigned char c = *p++;
	*out++ = (c < 0x80) ? c : MBFL_BAD_INPUT;
}

*in_len = e - p + 1;

...the point of all this being that we can remove one p < e guard from the body of the main loop. If a lot of work is being done in the main loop, this might not make a noticeable difference, but if the main loop is already very tight, optimizing out just one such check can make it close to 10% faster.
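
To make this concrete, here is a minimal sketch of the whole pattern (my own illustration, not the actual mbstring code; decode_2byte and BAD_INPUT are hypothetical stand-ins for the real per-encoding logic and MBFL_BAD_INPUT):

#include <stddef.h>
#include <stdint.h>

#define BAD_INPUT 0xFFFFFFFFu /* hypothetical stand-in for MBFL_BAD_INPUT */

/* Hypothetical stand-in for the real per-encoding 2-byte decoding logic */
static uint32_t decode_2byte(unsigned char c1, unsigned char c2)
{
	return ((uint32_t)c1 << 8) | c2;
}

static void convert_sketch(const unsigned char *p, const unsigned char *end,
                           uint32_t *out, uint32_t *limit, size_t *in_len)
{
	const unsigned char *e = end - 1; /* stop the main loop 1 byte short */

	while (p < e && out < limit) {
		unsigned char c = *p++;
		if (c < 0x80) {
			*out++ = c;
		} else {
			/* No inner p < e guard is needed before reading the trail byte;
			 * the loop condition guaranteed at least 2 bytes remained when
			 * this iteration started */
			unsigned char c2 = *p++;
			*out++ = decode_2byte(c, c2);
		}
	}

	/* Finish up last byte of input string if there is one */
	if (p == e && out < limit) {
		unsigned char c = *p++;
		*out++ = (c < 0x80) ? c : BAD_INPUT;
	}

	*in_len = e - p + 1; /* number of unconsumed input bytes */
}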

Here is what I would love to hear some feedback on... doing this little "dance" to optimize out the p < e guard is not really necessary when the input string is a zend_string, because zend_strings always have one extra byte allocated after the end of the string content (used for a null terminator).

Actually, in a lot of our legacy text decoding routines, we could use the zend_string null terminator as a sentinel to optimize out even more checks and gain more performance.
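
As a sketch of that idea (again not actual mbstring code, reusing the hypothetical decode_2byte from the sketch above): if e points at the zend_string's terminating '\0' byte, the trail-byte bounds checks can disappear completely, because '\0' is never a valid trail byte in any of these encodings:

/* e points at the zend_string's terminating '\0' byte */
while (p < e && out < limit) {
	unsigned char c = *p++;
	if (c < 0x80) {
		*out++ = c;
	} else {
		/* Safe to read *p even when p == e: at worst this reads the '\0'
		 * terminator, which is never a valid trail byte in these encodings,
		 * so the real decoder would flag the sequence as bad input anyway;
		 * no explicit bounds check and no e-- fixup are needed */
		unsigned char c2 = *p++;
		*out++ = decode_2byte(c, c2);
	}
}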

The problem is that these text decoding routines are not only used for mb_convert_encoding but also by the PHP interpreter's scanner when reading in PHP scripts which are written in some funny text encoding... right here:

static size_t encoding_filter_script_to_internal(unsigned char **to, size_t *to_length, const unsigned char *from, size_t from_length)
{
	const zend_encoding *internal_encoding = zend_multibyte_get_internal_encoding();
	ZEND_ASSERT(internal_encoding);
	return zend_multibyte_encoding_converter(to, to_length, from, from_length, internal_encoding, LANG_SCNG(script_encoding));
}

static size_t encoding_filter_script_to_intermediate(unsigned char **to, size_t *to_length, const unsigned char *from, size_t from_length)
{
	return zend_multibyte_encoding_converter(to, to_length, from, from_length, zend_multibyte_encoding_utf8, LANG_SCNG(script_encoding));
}

static size_t encoding_filter_intermediate_to_script(unsigned char **to, size_t *to_length, const unsigned char *from, size_t from_length)
{
	return zend_multibyte_encoding_converter(to, to_length, from, from_length,
		LANG_SCNG(script_encoding), zend_multibyte_encoding_utf8);
}

static size_t encoding_filter_intermediate_to_internal(unsigned char **to, size_t *to_length, const unsigned char *from, size_t from_length)
{
	const zend_encoding *internal_encoding = zend_multibyte_get_internal_encoding();
	ZEND_ASSERT(internal_encoding);
	return zend_multibyte_encoding_converter(to, to_length, from, from_length,
		internal_encoding, zend_multibyte_encoding_utf8);
}

I don't know if I can rely on these input strings being null-terminated, but I guess the answer is probably no.

Any thoughts will be appreciated.

@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

@youkidearitai
Contributor

youkidearitai commented Jan 3, 2023

@alexdowad I have a question. The result of mb_strlen(hex2bin("80a0a1ef"), "SJIS") seems to differ between this PR and PHP 8.x. 3v4l (https://3v4l.org/gJhln) outputs 3, but this PR outputs 4. "80a0a1ef" is an invalid string, but the 0xA1 it contains is valid Shift_JIS. What do you think?

@alexdowad
Contributor Author

@youkidearitai This is a good point!

Should I explain the reason for the difference?

@youkidearitai
Contributor

> Should I explain the reason for the difference?

I would be grateful if you could explain this difference.

@alexdowad
Copy link
Contributor Author

alexdowad commented Jan 4, 2023

@youkidearitai OK. Just to be clear, the reason I asked whether I should explain is that I don't know what the reviewers already understand, and I didn't want to launch into a long explanation of something everyone already knows.

We can start from here:

const unsigned char mblen_table_sjis[] = { /* 0x80-0x9f,0xE0-0xFF */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
};

This is the mblen_table for SJIS. To get the length of a valid SJIS character, whether single-byte or multi-byte, take the first byte value and use it as an index into this table.

When mb_strlen is implemented using the mblen_table, it does this on the sample string which you provided:

  1. Get the first byte of the string: 0x80. Get mblen_table[0x80]: it's 2. That means the first character is 2 bytes long. Jump forward 2 bytes.
  2. Get the 3rd byte of the string: 0xA1. Get mblen_table[0xA1]: it's 1. That means the second character is 1 byte long. Jump forward 1 byte.
  3. Get the 4th byte of the string: 0xEF. Get mblen_table[0xEF]: it's 1. That means the 3rd character is 1 byte long. Jump forward 1 byte.
  4. We are now past the end of the string, so we stop. We counted 3 characters, so return 3.
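
In code, the table-walk strategy of steps 1-4 looks roughly like this (a sketch of my own, not the actual mbstring implementation):

#include <stddef.h>

/* Count characters by repeatedly looking up the lead byte in the table */
static size_t strlen_via_mblen_table(const unsigned char *p, size_t len,
                                     const unsigned char mblen_table[256])
{
	const unsigned char *e = p + len;
	size_t chars = 0;
	while (p < e) {
		p += mblen_table[*p]; /* character length, indexed by first byte */
		chars++;
	}
	return chars;
}

For the bytes 80 A0 A1 EF and the mblen_table_sjis shown above, this advances by 2, 1, and 1 bytes and returns 3, matching the steps just described.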

On the other hand, when mb_strlen is implemented using the decoding filters, it does this:

  1. Convert 0x80A0A1EF to Unicode codepoints. If the mb_substitute_character is U+003F, we get U+003F U+003F U+FF61 U+003F.
  2. Return the number of codepoints, which is 4.

Hmm. Have you noticed something here?

...

Look at the legacy code for converting SJIS to codepoints. What does it do with byte 0x80?

int mbfl_filt_conv_sjis_wchar(int c, mbfl_convert_filter *filter)
{
	int s1, s2, w;

	switch (filter->status) {
	case 0:
		if (c >= 0 && c < 0x80) { /* ASCII */
			CK((*filter->output_function)(c, filter->data));
		} else if (c > 0xA0 && c < 0xE0) { /* Kana */
			CK((*filter->output_function)(0xFEC0 + c, filter->data));
		} else if (c > 0x80 && c < 0xF0 && c != 0xA0) { /* Kanji, first byte */
			filter->status = 1;
			filter->cache = c;
		} else {
			CK((*filter->output_function)(MBFL_BAD_INPUT, filter->data));
		}
		break;

Neither the legacy SJIS decoder nor the new one has ever treated 0x80 as the starting byte of a 2-byte character. It has always been treated as a single erroneous byte. That means... the mblen_table is wrong to have 2 in position 0x80!

@youkidearitai You discovered a pre-existing bug in mbstring! Nice job.

I believe we can find more sample strings where the output before and after this PR will be different, but first let me add a commit which fixes the mblen_table.

@alexdowad
Contributor Author

Hmm, I see another reason why sjis_mblen_table[0x80] == 2 is wrong: we are using the same mblen_table for MacJapanese, and in MacJapanese, 0x80 is not an erroneous byte but a valid one-byte character (not a 2-byte one).

@alexdowad
Contributor Author

One thing I think is becoming clearer to me... it may be a bad idea for SJIS, MacJapanese, SJIS-{DOCOMO,KDDI,SoftBank}, and SJIS-2004 to all share the same mblen_table.

@alexdowad
Contributor Author

OK, I have pushed a fix for the issue with byte 0x80. I am still thinking about whether to adjust other entries in the mblen_table for SJIS.

Bytes 0xF0-0xFF are not treated as valid for starting a 2-byte character in SJIS, so perhaps their entries should also be set to 1. MacJapanese is similar, but in MacJapanese, 0xFD-0xFF are valid 1-byte characters, so again, their entries should be 1.

In SJIS-2004 and the mobile variants, bytes 0xF0-0xFC are used to start 2-byte characters. But 0xFD-0xFF are not.

@Girgias
Member

Girgias commented Jan 4, 2023

Before I start deep-diving into the commit history: if the last one is an actual bug fix, it may need to be backported, or, if not, have an entry in UPGRADING. So could you please split that specific one into its own PR?

@alexdowad
Contributor Author

> Before I start deep-diving into the commit history: if the last one is an actual bug fix, it may need to be backported, or, if not, have an entry in UPGRADING. So could you please split that specific one into its own PR?

Ah, good point.

@Girgias
Member

Commit 2: seems reasonable but I just don't know what a "PUA code" actually is.

@Girgias
Member

Commit 3 (GB18030 to Unicode):
Looks like black magic to me, but if tests pass, LGTM.

I know this is existing code, and I'm being opinionated, but flipping the conditions to detect error states and continue, as is done in the previous encoding conversion, would IMHO make it more readable.

@alexdowad
Contributor Author

> Commit 2: seems reasonable but I just don't know what a "PUA code" actually is.

Private Use Area. Unicode codepoint numbers which are 'reserved' and which the Unicode Consortium will never use for any purpose, not because they are 'illegal' or 'invalid' like U+D800-U+DFFF, but so that proprietary software systems can use them internally with their own internal meaning.

Similar concept to "private" IP address ranges like 10.0.0.0/8, which are not used on the public Internet. I think lots of other protocols include "private" identifier ranges as well.

Some of the legacy text encodings supported by mbstring include characters which are not in Unicode. In some cases, mbstring maps these to PUA codepoints. I don't know all the history behind this. Maybe the original author of mbstring did this because some legacy software systems actually used those PUA codepoints to represent those particular characters. Or maybe it was just done to facilitate conversions between non-Unicode charsets which include those characters.
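
For reference, the Private Use Area ranges are fixed by the Unicode standard; a minimal check (my own illustration, not mbstring code) looks like:

#include <stdbool.h>
#include <stdint.h>

/* True if codepoint cp falls in one of Unicode's Private Use Areas */
static bool is_private_use(uint32_t cp)
{
	return (cp >= 0xE000   && cp <= 0xF8FF)     /* BMP PUA */
	    || (cp >= 0xF0000  && cp <= 0xFFFFD)    /* Plane 15 (PUA-A) */
	    || (cp >= 0x100000 && cp <= 0x10FFFD);  /* Plane 16 (PUA-B) */
}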

@alexdowad
Contributor Author

> Commit 3 (GB18030 to Unicode): Looks like black magic to me, but if tests pass, LGTM.
>
> I know this is existing code, and I'm being opinionated, but flipping the conditions to detect error states and continue, as is done in the previous encoding conversion, would IMHO make it more readable.

Sorry, which lines are you referring to for GB18030 conversion?

@alexdowad
Contributor Author

Just running tests locally before pushing an update...

@Girgias
Member

Commits 4, 5, and 7 are definitely okay.

For the other ones, I'm still on the fence about the readability trade-off (especially the last BIG5 one, for 1-2%)

@alexdowad
Contributor Author

> Commits 4, 5, and 7 are definitely okay.
>
> For the other ones, I'm still on the fence about the readability trade-off (especially the last BIG5 one, for 1-2%)

😆 Feel free to veto it, I'll delete it from the patch series.

@alexdowad
Contributor Author

So looks like I have a bit of homework here...

• Look into (perhaps) making other adjustments to the SJIS mblen_table
• Break those commits out into a separate PR, targeting PHP-8.1
• Experiment to see whether strings passed in from the lexer are null-terminated or not
• If they are, do away with the e-- nonsense

@alexdowad
Contributor Author

I have merged the commits which @Girgias judged as "definitely okay".

The fix for the SJIS table has been removed from this PR; I'll open another one soon with that fix.

@Girgias
Member

Girgias commented Jan 4, 2023

The first 3 commits are also good to land for me.

The 4th one, which optimizes the out-of-bounds check for UHC, is very much worth the trade-off IMHO, so I think that can also land.

The 5th and 6th ones, which give a 9% improvement, also sound worth the trade-off.

For SJIS, may I be annoying and ask what the performance gain is from the changes which don't affect the out-of-bounds check? Compared to the other encodings, it seems it was all done together.

The last commit is similar to the other ones and is very readable now that I know more about what's going on.

Basically, I think all can land :-)

@alexdowad
Contributor Author

Thanks for the great review! I have landed a few more commits.

Still holding on to the change to mb_strlen a bit more... I am curious to see if @youkidearitai will come up with anything else.

@alexdowad
Contributor Author

Rebased on top of #10230.

@youkidearitai Do you have any intention of testing this change more to see if any other interesting test strings will be found?

@alexdowad
Contributor Author

Hmm, I just discovered another reason to use the decoding filters for mb_strlen instead of the mblen_table...

} else if (c == 0xFF) {
	if ((limit - out) < 2) {
		p--;
		break;
	}
	*out++ = 0x2026;
	*out++ = 0xF87F;

Note that byte 0xFF in MacJapanese converts to 2 codepoints, not one. Actually, now that I mention it, there are some 2-byte characters in MacJapanese which convert to as many as 5 codepoints.

Using the decoding filters, if a single byte converts to 2 codepoints, then mb_strlen will say the "length" of the string is 2. There's no way to do that with the mblen_table, which would have said that the "length" is 1.

I guess there's a question here of whether mb_strlen should count string length in Unicode codepoints, or in the target encoding's native charset (which might contain characters which don't exist in Unicode).

...But I think that was already resolved a long time ago. For the majority of legacy text encodings, mb_strlen has always returned the length in Unicode codepoints.

For a few text encodings, like MacJapanese, it would return character length in a different charset and not in codepoints. This was yet another weird inconsistency, which will be gone after this PR is merged.

When I get around to working on the documentation, I will have to clarify the mb_strlen documentation on this point. Right now it's not very clear about what "character length" the function actually returns.

@alexdowad
Contributor Author

Hmm, just thinking about this some more.

I am really wondering if we should remove the mblen_table for MacJapanese.

The reason is this. If this PR is merged, mb_strlen will no longer be affected by the mblen_table. But mb_str_split, mb_substr, and a couple of other functions are.

The fact the MacJapanese can sometimes convert 1 byte to 2 codepoints, or 2 bytes to 3/4/5 codepoints (which is not expressed by the mblen_table) means that it is possible for mb_str_split or mb_substr to return funny results when such special byte sequences appear in the input string.

Hmm.

@alexdowad
Contributor Author

Lesson for library writers here: When defining your new library's API, be very specific about what the output of each function is supposed to be.

Much of the mbstring documentation is quite vague.

@alexdowad
Contributor Author

> Hmm, just thinking about this some more.
>
> I am really wondering if we should remove the mblen_table for MacJapanese.

After some more thought, I have decided to leave the MacJapanese implementation "as is" for now. Yes, the behavior of mb_str_split and mb_substr are a bit idiosyncratic for MacJapanese, but I'm not 100% sure what the "right" behavior is in this case. So it seems like the best thing to do at the moment is to leave the current behavior in place.

I might add a couple of unit tests for this though.

@youkidearitai
Contributor

> I am curious to see if @youkidearitai will come up with anything else.

For now, I can't think of anything else.

…y a slow path)

Various mbstring legacy text encodings have what is called an 'mblen_table':
a table which gives the length of a multi-byte character using a lookup on
the first byte value. Several mbstring functions have a 'fast path' which uses
this table when it is available.

However, it turns out that iterating through a string using the mblen_table
is surprisingly slow. I found that deleting this 'fast path' from mb_strlen
makes mb_strlen a few percent slower on very small strings (0-5 bytes), but
yields very large performance gains on medium to long input strings.

Part of the reason for this is that our text decoding filters are so much
faster now.

Here are some benchmarks:

    EUC-KR, short (0-5 chars)        - master faster by 11.90% (0.0000 vs 0.0000)
    EUC-JP, short (0-5 chars)        - master faster by 10.88% (0.0000 vs 0.0000)
    BIG-5, short (0-5 chars)         - master faster by 10.66% (0.0000 vs 0.0000)
    UTF-8, short (0-5 chars)         - master faster by 8.91% (0.0000 vs 0.0000)
    CP936, short (0-5 chars)         - master faster by 6.27% (0.0000 vs 0.0000)
    UHC, short (0-5 chars)           - master faster by 5.38% (0.0000 vs 0.0000)
    SJIS, short (0-5 chars)          - master faster by 5.20% (0.0000 vs 0.0000)

    UTF-8, medium (~100 chars)       - new faster by 127.51% (0.0004 vs 0.0002)
    UTF-8, long (~10000 chars)       - new faster by 87.94% (0.0319 vs 0.0170)
    UTF-8, very long (~100000 chars) - new faster by 88.25% (0.3199 vs 0.1699)

    SJIS, medium (~100 chars)        - new faster by 208.89% (0.0004 vs 0.0001)
    SJIS, long (~10000 chars)        - new faster by 253.57% (0.0319 vs 0.0090)

    CP936, medium (~100 chars)       - new faster by 126.08% (0.0004 vs 0.0002)
    CP936, long (~10000 chars)       - new faster by 200.48% (0.0319 vs 0.0106)

    EUC-KR, medium (~100 chars)      - new faster by 146.71% (0.0004 vs 0.0002)
    EUC-KR, long (~10000 chars)      - new faster by 212.05% (0.0319 vs 0.0102)

    EUC-JP, medium (~100 chars)      - new faster by 186.68% (0.0004 vs 0.0001)
    EUC-JP, long (~10000 chars)      - new faster by 295.37% (0.0320 vs 0.0081)

    BIG-5, medium (~100 chars)       - new faster by 173.07% (0.0004 vs 0.0001)
    BIG-5, long (~10000 chars)       - new faster by 269.19% (0.0319 vs 0.0086)

    UHC, medium (~100 chars)         - new faster by 196.99% (0.0004 vs 0.0001)
    UHC, long (~10000 chars)         - new faster by 256.39% (0.0323 vs 0.0091)

This does raise the question: is using the 'mblen_table' worthwhile for
other mbstring functions, such as mb_str_split? The answer is yes, it
is worthwhile; while mb_strlen only needs to decode the input string,
not re-encode it, mb_str_split implemented using the conversion filters
needs to both decode the string and then re-encode it. This means that
there is more potential to gain performance by using the 'mblen_table'.
Benchmarking shows that in a few cases, mb_str_split becomes faster when
the 'mblen_table fast path' is deleted, but in the majority of cases, it
becomes slower.

MacJapanese has a somewhat unusual feature: when mapped to Unicode, many
of its characters map to sequences of several codepoints. Add test cases
demonstrating how mb_str_split and mb_substr behave in this situation.

When adding these tests, I found the behavior of mb_substr was wrong
due to an inconsistency between the string "length" as measured by
mb_strlen and the number of native MacJapanese characters which
mb_substr would count when iterating over the string using the
mblen_table. This has been fixed.

I believe that mb_strstr will also return wrong results in some cases
for MacJapanese. I still need to come up with unit tests which
demonstrate the problem and figure out how to fix it.

@alexdowad
Contributor Author

Please see the added commit; I added some more unit tests for mb_str_split and mb_substr on MacJapanese encoding, discovered a bug, and fixed it.

@Girgias
Member

Girgias commented Jan 8, 2023

> Please see the added commit; I added some more unit tests for mb_str_split and mb_substr on MacJapanese encoding, discovered a bug, and fixed it.

Also, if you found a bug, it probably should be backported.

@alexdowad
Contributor Author

> Also, if you found a bug, it probably should be backported.

Well, the bug only applies to this branch. Older versions of PHP should be fine.

@Girgias
Member

Girgias commented Jan 8, 2023

> > Also, if you found a bug, it probably should be backported.
>
> Well, the bug only applies to this branch. Older versions of PHP should be fine.

ACK

@alexdowad alexdowad closed this Jan 8, 2023
@alexdowad alexdowad deleted the no_mbtable branch January 8, 2023 15:26
@alexdowad
Contributor Author

Thanks, everyone.
