Implement NEON-accelerated version of BLOCKCONV for lowercasing and uppercasing strings #11161

Merged
merged 1 commit into from May 4, 2023

nielsdos
Member

Since lowercasing and uppercasing are common operations, both internally and in userland, it makes sense to implement a NEON-accelerated version of them.

@alexdowad
Contributor

@nielsdos Thank you very much!

Two questions: Do all ARM64 hosts support the NEON instruction set? And second, can you benchmark this code and let us know how much it improves performance?

FYI @youkidearitai @easyaspi314

@nielsdos
Member Author

Do all ARM64 hosts support the NEON instruction set?

The AArch64 version of the NEON instruction set (which is what this patch uses, and which is not the same as the original NEON instruction set) is supported on all ARM64 hosts. See https://en.wikipedia.org/wiki/AArch64, which says:

ARMv8-A makes VFPv3/v4 and advanced SIMD (Neon) standard

ARMv8-A is also the architecture version that introduced AArch64 mode, so NEON will always be available on an AArch64 host.

Benchmarks

And second, can you benchmark this code and let us know how much it improves performance?

Important: I don't own an ARM64 machine. This patch was developed and tested on an x86-64 host, using qemu-user to emulate an AArch64 build of PHP.
Hence, because of the emulation, the benchmark results are only a rough indication. An additional problem is that I'm benchmarking on a laptop, so the confidence intervals are rather large.
I only benchmarked strtolower() because the strtoupper() code is essentially the same, just with a different check condition, so their performance should be practically identical.
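
To illustrate why the two functions can share one code path, here is a scalar sketch. It is not the code from this patch: the function names are hypothetical, and it handles one byte per step where the vectorized code handles a whole block. It only shows the idea that lowercasing and uppercasing differ in the byte range being checked and in the delta that is added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model: every byte inside [start, end] gets delta added,
 * every other byte is copied unchanged. */
static void sketch_convert_range(char *dst, const char *src, size_t len,
                                 unsigned char start, unsigned char end, int delta)
{
    unsigned char *out = (unsigned char *)dst;

    assert(start <= end);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        /* Same shape as a SIMD compare result: all ones if in range, else zero. */
        unsigned char mask = (c >= start && c <= end) ? 0xFF : 0x00;
        /* Masking the delta makes the add a no-op for bytes outside the range. */
        out[i] = (unsigned char)(c + ((unsigned char)delta & mask));
    }
}

/* Lowercasing checks 'A'..'Z' and adds the distance to the lowercase letters. */
static void sketch_tolower(char *dst, const char *src, size_t len)
{
    sketch_convert_range(dst, src, len, 'A', 'Z', 'a' - 'A');
}

/* Uppercasing is the same loop with the other range and the opposite delta. */
static void sketch_toupper(char *dst, const char *src, size_t len)
{
    sketch_convert_range(dst, src, len, 'a', 'z', 'A' - 'a');
}
```

Calling sketch_tolower(out, "Hello, WORLD!", 13) writes the 13 bytes of "hello, world!" into out; the sketch does not append a terminating NUL.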

Benchmark: strtolower() Strings that are already all lowercase (1000 iterations)

For a string of size 100: 1.01 ± 0.10 times faster than old approach
For a string of size 1000: 1.34 ± 0.20 times faster than old approach
For a string of size 10000: 2.27 ± 0.22 times faster than old approach

Benchmark: strtolower() Strings where the first half is lowercase and second half is uppercase (1000 iterations)

For a string of size 100: 1.18 ± 0.15 times faster than old approach
For a string of size 1000: 2.28 ± 0.26 times faster than old approach
For a string of size 10000: 6.90 ± 0.65 times faster than old approach

Benchmark: strtolower() Strings that are all uppercase (1000 iterations)

For a string of size 100: 1.36 ± 0.18 times faster than old approach
For a string of size 1000: 3.79 ± 0.50 times faster than old approach
For a string of size 10000: 13.42 ± 0.85 times faster than old approach

@alexdowad
Contributor

@nielsdos Looks good to me!

Any comments from others?

@youkidearitai
Contributor

I'm not very familiar with the Zend Engine or SSE (and I'm even less confident with NEON). With NEON, BLOCKCONV_FOUND uses vmaxvq_u8, which returns the maximum value. The SSE version uses _mm_movemask_epi8, which Intel's documentation describes as follows:

Create mask from the most significant bit of each 8-bit element in a

Is this difference okay?

@nielsdos
Member Author

I'm not very familiar with the Zend Engine or SSE (and I'm even less confident with NEON). With NEON, BLOCKCONV_FOUND uses vmaxvq_u8, which returns the maximum value. The SSE version uses _mm_movemask_epi8, which Intel's documentation describes as follows:

Create mask from the most significant bit of each 8-bit element in a

Is this difference okay?

The goal of BLOCKCONV_FOUND is to detect whether the comparison was true for at least one character element.
Both the SSE and NEON versions use a "less than" instruction that sets an element to all one bits if the comparison is true, and to all zero bits if it is false. On SSE we use movemask to gather the highest-order bit of every element, which results in a value != 0 if at least one comparison was true. On NEON I take the max across the elements instead: if any comparison yielded all ones, the max will be all ones, otherwise it will be all zeros. Either way the result is nonzero exactly when at least one element matched, so the two versions are equivalent.
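
The equivalence can be checked with a small scalar model. This is only a sketch under stated assumptions: it models a 16-lane register as a plain array, the function names are hypothetical, and the helpers imitate what _mm_movemask_epi8 and vmaxvq_u8 compute rather than calling the real intrinsics.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LANES 16

/* Model of a lane-wise signed "less than": 0xFF where a[i] < b, else 0x00. */
static void sketch_cmplt(uint8_t mask[SKETCH_LANES],
                         const int8_t a[SKETCH_LANES], int8_t b)
{
    for (int i = 0; i < SKETCH_LANES; i++) {
        mask[i] = (a[i] < b) ? 0xFF : 0x00;
    }
}

/* Model of the SSE movemask: bit i of the result is the top bit of lane i. */
static unsigned sketch_movemask(const uint8_t mask[SKETCH_LANES])
{
    unsigned bits = 0;
    for (int i = 0; i < SKETCH_LANES; i++) {
        bits |= (unsigned)(mask[i] >> 7) << i;
    }
    return bits;
}

/* Model of the NEON across-vector max: the largest lane value. */
static uint8_t sketch_max_across(const uint8_t mask[SKETCH_LANES])
{
    uint8_t max = 0;
    for (int i = 0; i < SKETCH_LANES; i++) {
        if (mask[i] > max) {
            max = mask[i];
        }
    }
    return max;
}

/* Returns 1 if any lane is below the threshold, computed both ways. */
static int sketch_found(const int8_t lanes[SKETCH_LANES], int8_t threshold)
{
    uint8_t mask[SKETCH_LANES];
    sketch_cmplt(mask, lanes, threshold);

    int sse_found = sketch_movemask(mask) != 0;
    int neon_found = sketch_max_across(mask) != 0;
    /* Lanes are only ever 0x00 or 0xFF, so both reductions must agree. */
    assert(sse_found == neon_found);
    return neon_found;
}
```

The two reductions return different numbers (a 16-bit mask versus 0x00 or 0xFF), but BLOCKCONV_FOUND is only tested for zero versus nonzero, which is where they agree.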

@youkidearitai
Contributor

@nielsdos Thank you very much for the response. I understand now. Looks good to me.

@alexdowad
Contributor

Waiting to hear from @iluuu1994, if he feels like commenting.

@nielsdos, if there are no comments after a few days, please ping me and I will merge this.

@nielsdos
Member Author

@alexdowad I have merge access, so I'll wait a few days for more comments and, if it's all good, merge this myself :)

@nielsdos nielsdos merged commit a65cdd9 into php:master May 4, 2023
12 of 13 checks passed
@nielsdos
Member Author

nielsdos commented May 4, 2023

Thanks for the reviews :)
