Add specialized UTF-8 validation function for hosts with no SSE2/AVX2 support #10452
In a GitHub thread, Michael Voříšek and Kamil Tekiela mentioned that the PCRE2 function `pcre_match` can be used to validate UTF-8, and that historically it was more efficient than mbstring's `mb_check_encoding`.

`mb_check_encoding` is now much faster on hosts with SSE2, and much faster again on hosts with AVX2. However, while all x86-64 CPUs support at least SSE2, not all PHP users run their code on x86-64 hardware. For example, some use recent Macs with ARM CPUs.

Therefore, borrow PCRE2's UTF-8 validation function as a fallback for hosts with no SSE2/AVX2 support. On long UTF-8 strings, this code is 50% faster than mbstring's existing fallback code.
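For readers unfamiliar with what a scalar (non-SIMD) UTF-8 validator has to check, here is a minimal sketch in C. It is not the code added by this PR and not PCRE2's implementation; the function name `utf8_is_valid` is hypothetical, and the sketch favors clarity over speed. It decodes each sequence and rejects stray or missing continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not the function from this PR. */
static bool utf8_is_valid(const unsigned char *p, size_t len)
{
    const unsigned char *end = p + len;

    while (p < end) {
        unsigned char c = *p++;

        if (c < 0x80) {
            continue; /* ASCII byte, always valid */
        }

        size_t need;  /* number of continuation bytes expected */
        uint32_t min; /* smallest code point allowed for this length */
        uint32_t cp;  /* code point being decoded */

        if (c >= 0xC2 && c <= 0xDF) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            /* Stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
            return false;
        }

        if ((size_t)(end - p) < need) {
            return false; /* sequence truncated at end of input */
        }

        for (size_t i = 0; i < need; i++) {
            unsigned char cc = *p++;
            if ((cc & 0xC0) != 0x80) {
                return false; /* not a continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) {
            return false; /* overlong encoding */
        }
        if (cp > 0x10FFFF) {
            return false; /* beyond the Unicode range */
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false; /* UTF-16 surrogate, not allowed in UTF-8 */
        }
    }

    return true;
}
```

A caller would pass the raw bytes and their length, for example `utf8_is_valid((const unsigned char *)str, strlen(str))`. A production fallback would typically add a fast path that skips runs of ASCII bytes several at a time, which is where most of the gain on long strings tends to come from.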
FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai @mvorisek