Add specialized UTF-8 validation function for hosts with no SSE2/AVX2 support #10452

Closed
wants to merge 1 commit into from

alexdowad
Contributor

In a GitHub thread, Michael Voříšek and Kamil Tekiela mentioned that the PCRE2 function `pcre2_match` can be used to validate UTF-8, and that historically it was more efficient than mbstring's `mb_check_encoding`.

`mb_check_encoding` is now much faster on hosts with SSE2, and much faster again on hosts with AVX2. However, while all x86-64 CPUs support at least SSE2, not all PHP users run their code on x86-64 hardware. For example, some use recent Macs with ARM CPUs.
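
As a rough illustration of why non-x86 hosts end up on the fallback path, here is a minimal sketch of compile-time selection using the compiler's predefined macros. The function name `utf8_validation_path` is hypothetical and not taken from mbstring. The real code may also pick an implementation at runtime, which this sketch does not show.

```c
/* Hypothetical helper (not part of mbstring): reports which validation
 * path a build like this would select at compile time. */
const char *utf8_validation_path(void)
{
#if defined(__AVX2__)
    /* GCC/Clang define __AVX2__ when AVX2 code generation is enabled. */
    return "avx2";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    /* SSE2 is part of the x86-64 baseline; MSVC signals it differently. */
    return "sse2";
#else
    /* Everything else, including ARM builds, takes the scalar fallback. */
    return "scalar-fallback";
#endif
}
```

On an ARM build, none of the SSE2/AVX2 macros are defined, so this sketch returns `"scalar-fallback"`.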

Therefore, borrow PCRE2's UTF-8 validation function as a fallback for hosts with no SSE2/AVX2 support. On long UTF-8 strings, this code is 50% faster than mbstring's existing fallback code.
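
For readers unfamiliar with what a non-SIMD validator has to check, the following is a minimal scalar sketch based on the well-formedness rules in RFC 3629. It is not PCRE2's function and not the code in this PR; `utf8_is_valid` and `is_continuation` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for continuation bytes 0x80..0xBF. */
static bool is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical scalar validator: returns true if s[0..len) is
 * well-formed UTF-8 (no overlong forms, no surrogates, max U+10FFFF). */
bool utf8_is_valid(const uint8_t *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        if (b < 0x80) {
            /* Single-byte (ASCII) character. */
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF) {
            /* Two-byte sequence; lead bytes 0xC0 and 0xC1 would be overlong. */
            if (len - i < 2 || !is_continuation(s[i + 1]))
                return false;
            i += 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            /* Three-byte sequence. */
            if (len - i < 3 || !is_continuation(s[i + 1]) || !is_continuation(s[i + 2]))
                return false;
            if (b == 0xE0 && s[i + 1] < 0xA0)
                return false; /* overlong encoding */
            if (b == 0xED && s[i + 1] > 0x9F)
                return false; /* UTF-16 surrogate range U+D800..U+DFFF */
            i += 3;
        } else if (b >= 0xF0 && b <= 0xF4) {
            /* Four-byte sequence. */
            if (len - i < 4 || !is_continuation(s[i + 1]) ||
                !is_continuation(s[i + 2]) || !is_continuation(s[i + 3]))
                return false;
            if (b == 0xF0 && s[i + 1] < 0x90)
                return false; /* overlong encoding */
            if (b == 0xF4 && s[i + 1] > 0x8F)
                return false; /* above U+10FFFF */
            i += 4;
        } else {
            /* Stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF. */
            return false;
        }
    }
    return true;
}
```

A production fallback would typically add a fast path that skips runs of ASCII bytes several at a time; that is omitted here to keep the rules readable.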

FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai @mvorisek

@youkidearitai
Contributor

I tested this with Ubuntu on a Raspberry Pi and found no particular problems.

Member

@Girgias Girgias left a comment

Minor nit, but looks good otherwise. :-)

@alexdowad
Contributor Author

> I tested this with Ubuntu on a Raspberry Pi and found no particular problems.

Thanks for testing!

@alexdowad
Contributor Author

I'm just going to do a bit of fuzzing before landing this commit. I don't expect to find anything, but...

@alexdowad
Contributor Author

The fuzzer didn't find anything. I didn't run it for a very long time, but long enough to try a few million random test cases.
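
The kind of check described here can be sketched as differential fuzzing: feed the same random byte strings to two implementations and count how often they disagree. This is an assumption about the general approach, not the actual harness used for this PR. All names are hypothetical, and the three toy ASCII checkers only stand in for the real validators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*validator_fn)(const uint8_t *, size_t);

/* xorshift32 PRNG; the state must never be zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Runs both validators on random buffers of 0..16 bytes and returns
 * the number of inputs on which their verdicts differ. */
size_t count_disagreements(validator_fn candidate, validator_fn reference,
                           uint32_t seed, size_t iterations)
{
    uint8_t buf[16];
    uint32_t state = seed ? seed : 1;
    size_t disagreements = 0;
    for (size_t n = 0; n < iterations; n++) {
        size_t len = next_random(&state) % (sizeof buf + 1);
        for (size_t i = 0; i < len; i++)
            buf[i] = (uint8_t)(next_random(&state) >> 24);
        if (candidate(buf, len) != reference(buf, len))
            disagreements++;
    }
    return disagreements;
}

/* Toy reference: rejects any byte with the high bit set. */
bool ascii_only_bytewise(const uint8_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return false;
    return true;
}

/* Toy "optimized" version: ORs all bytes, then tests the high bit once. */
bool ascii_only_accumulate(const uint8_t *s, size_t len)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc |= s[i];
    return (acc & 0x80) == 0;
}

/* Deliberately buggy version: never inspects the last byte. */
bool ascii_only_skips_last(const uint8_t *s, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        if (s[i] >= 0x80)
            return false;
    return true;
}
```

Comparing `ascii_only_accumulate` against `ascii_only_bytewise` should report zero disagreements, while `ascii_only_skips_last` should be caught. Purely random bytes rarely form long valid multi-byte sequences, so a real harness would likely bias its inputs toward nearly valid UTF-8.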

@alexdowad alexdowad closed this Jan 26, 2023
@alexdowad alexdowad deleted the pcre branch January 26, 2023 19:00
@alexdowad
Contributor Author

Landed. Many thanks to all who reviewed.

5 participants