Major overhaul of mbstring (part 28) #10099

alexdowad · 2022-12-13T20:21:57Z

This PR reimplements mb_substr_count.

The performance gain from this change depends on the text encoding and the input string size. For very small strings, fixed overheads swamp much of the gain, so the speedup is less than 2x. For medium-length strings (~100 bytes or so), the speedup is typically around 2.5x.

The greatest performance gains are for UTF-8 strings which have already been marked as valid (using the GC flags on the zend_string object); for those, the speedup is more than 10x in many cases. This is because we don't convert such strings at all, and just do a straight byte match (as a function which is not multibyte-encoding-aware would do).
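To illustrate what a "straight byte match" means here, below is a minimal sketch in C. It is not the code from this PR: `count_bytes` is a hypothetical helper, and it assumes that occurrences are counted without overlap. Matching raw bytes is safe for valid UTF-8 because a valid needle can only match at character boundaries in a valid haystack.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the function from this PR: counts non-overlapping
 * occurrences of needle in haystack by comparing raw bytes only.
 * No decoding is done, so both inputs are assumed to be valid UTF-8. */
static size_t count_bytes(const char *haystack, size_t hlen,
                          const char *needle, size_t nlen)
{
    size_t count = 0;
    size_t i = 0;

    if (nlen == 0 || nlen > hlen) {
        return 0;
    }
    while (i + nlen <= hlen) {
        if (memcmp(haystack + i, needle, nlen) == 0) {
            count++;
            i += nlen; /* skip past the match so matches never overlap */
        } else {
            i++;
        }
    }
    return count;
}
```

For example, searching the 15-byte UTF-8 string for "日本語日本" for the 6-byte needle "日本" with this helper finds two matches, without ever decoding a code point.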

The previous implementation first converted the haystack and needle to wchars, then searched for matches between the two sequences of wchars. Because we use -1 as an error marker when converting to wchars, error markers from invalid byte sequences in the haystack would match error markers from invalid byte sequences in the needle, even if the specific invalid byte sequences were different. I am not sure whether this behavior is really desirable, but the new implementation preserves it so as not to cause BC breaks.
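The error-marker behavior can be shown with a small sketch in C. This is a simplified illustration, not mbstring's conversion code: `to_wchars`, `count_wchars` and `BAD_INPUT` are hypothetical names, the decoder emits one marker per offending byte (mbstring may group invalid bytes differently), and it does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

#define BAD_INPUT ((uint32_t)-1) /* error marker: -1, as described above */

/* Simplified, hypothetical UTF-8 decoder. Writes code points to out and
 * returns how many were written. A byte which does not start a well-formed
 * sequence becomes one BAD_INPUT. out must have room for len entries. */
static size_t to_wchars(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t n = 0, i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k = 1;
        uint32_t cp;

        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else { out[n++] = BAD_INPUT; i++; continue; }

        /* Consume continuation bytes (10xxxxxx) while they are present */
        while (k <= extra && i + k < len && (s[i + k] & 0xC0) == 0x80) {
            cp = (cp << 6) | (s[i + k] & 0x3F);
            k++;
        }
        if (k <= extra) { /* truncated, or a bad continuation byte */
            out[n++] = BAD_INPUT;
            i++;
        } else {
            out[n++] = cp;
            i += extra + 1;
        }
    }
    return n;
}

/* Counts non-overlapping matches between two wchar sequences. BAD_INPUT is
 * compared like any other value, so any invalid input matches any other. */
static size_t count_wchars(const uint32_t *hay, size_t hlen,
                           const uint32_t *needle, size_t nlen)
{
    size_t count = 0, i = 0;

    if (nlen == 0) {
        return 0;
    }
    while (i + nlen <= hlen) {
        size_t k = 0;
        while (k < nlen && hay[i + k] == needle[k]) {
            k++;
        }
        if (k == nlen) { count++; i += nlen; } else { i++; }
    }
    return count;
}
```

With this sketch, a haystack containing the invalid bytes 0xFF and 0x80 and a needle consisting of the single invalid byte 0xFE both decode to `BAD_INPUT` values, so the needle is counted as matching both positions even though none of the bytes are equal.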

@nikic @cmb69 @Girgias @kamil-tekiela @youkidearitai

alexdowad · 2022-12-14T14:37:42Z

Next PR is ready once this one is reviewed...

Girgias

This looks reasonable to me.

alexdowad · 2022-12-15T06:10:17Z

@Girgias Thanks for the review... merging.

If anyone else has suggestions for improvement, I can still make amendments in a separate commit.

github-actions bot added the Extension: mbstring label Dec 13, 2022

alexdowad force-pushed the cleanup-mbstring-28 branch from 4b0e920 to c2c8f42 Compare December 13, 2022 20:26

Girgias reviewed Dec 15, 2022

alexdowad closed this Dec 15, 2022

alexdowad deleted the cleanup-mbstring-28 branch December 21, 2022 19:30
