8277168: AArch64: Enable arraycopy partial inlining with SVE #6444
Conversation
Arraycopy partial inlining is a C2 compiler technique that avoids stub call overhead in small-sized arraycopy operations by generating masked vector instructions. So far it works on x86 AVX512 only; this patch enables it on AArch64 with SVE.

We add an AArch64 matching rule for VectorMaskGenNode and refactor that node a little. The major change is moving the element type field into its TypeVectMask bottom type, because AArch64 vector masks differ across vector element types. E.g., an x86 AVX512 vector mask value masking the 3 least significant vector lanes (of any type) looks like

  0000 0000 ... 0000 0000 0000 0000 0111

On AArch64 SVE, that mask value can only be used for masking the 3 least significant lanes of bytes. For 3 lanes of ints, the value should be

  0000 0000 ... 0000 0000 0001 0001 0001

where the least significant bit of each lane matters. So the AArch64 matcher needs to know the vector element type to generate the right masks.

After this patch, the C2-generated code for copying a 50-byte array on AArch64 SVE looks like

  mov     x12, #0x32
  whilelo p0.b, xzr, x12
  add     x11, x11, #0x10
  ld1b    {z16.b}, p0/z, [x11]
  add     x10, x10, #0x10
  st1b    {z16.b}, p0, [x10]

We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on both x86 AVX512 and AArch64 SVE machines; no issue was found. We tested JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array size arguments on a 512-bit SVE-featured CPU and got the performance changes below.

  Benchmark                  (length)  Performance
  ArrayCopyAligned.testByte        10        -2.6%
  ArrayCopyAligned.testByte        20        +4.7%
  ArrayCopyAligned.testByte        30        +4.8%
  ArrayCopyAligned.testByte        40       +21.7%
  ArrayCopyAligned.testByte        50       +22.5%
  ArrayCopyAligned.testByte        60       +28.4%

The test machine has an SVE vector size of 512 bits, so we see performance gains for most array sizes below 64 bytes. For very small arrays we see a slight regression because a vector load/store may be a bit slower than 1 or 2 scalar loads/stores.
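In the snippet above, whilelo p0.b, xzr, x12 builds the predicate with the first 0x32 (50) byte lanes active, so the ld1b/st1b pair only touches the 50 valid bytes. To make the mask-encoding difference concrete, here is a small illustrative Java sketch (not part of the patch; the class and method names are made up) that computes both encodings for "3 active lanes":

  public class MaskEncodings {
      // x86 AVX512 style: one mask bit per vector lane, regardless of lane type.
      static long laneCountMask(int activeLanes) {
          return (1L << activeLanes) - 1;                  // 3 lanes -> ...0000_0111
      }

      // SVE predicate style: one bit per byte of the vector; only the lowest
      // bit of each lane's group is significant, so the lane size matters.
      static long svePredicateMask(int activeLanes, int laneBytes) {
          long mask = 0;
          for (int lane = 0; lane < activeLanes; lane++) {
              mask |= 1L << (lane * laneBytes);            // 3 int lanes -> ...0001_0001_0001
          }
          return mask;
      }

      public static void main(String[] args) {
          System.out.println(Long.toBinaryString(laneCountMask(3)));         // 111
          System.out.println(Long.toBinaryString(svePredicateMask(3, 4)));   // 100010001
      }
  }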
👋 Welcome back pli! A progress list of the required criteria for merging this PR into
/cc hotspot-compiler
@pfustc
The x86 failure is caused by a recent commit (see JDK-8277324) and is unrelated to this PR.
I'll have a look. It'll take me a little time to provision a suitable SVE-enabled AArch64 box.
I'm having a lot of difficulty understanding how this is supposed to work. Firstly, I'm not seeing a performance increase on a fujitsu-fx700.
Sorry, it looks like I managed to confuse myself. The top of the loop looks like:
... and the bottom
So only if the length is < 64 (i.e. 512 bits) do we branch back to B17 to do the
Hurrah! I have managed to duplicate your results. Old:
New:
... and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now.
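For readers who have not seen the benchmark, a JMH sketch along these lines (an assumed shape, not the actual ArrayCopyAligned source; the parameter values are just the ones from the table above) shows why: the measured method does little besides fetch arguments and call System.arraycopy, so for these sizes the call setup is a large share of the measured time.

  import org.openjdk.jmh.annotations.Benchmark;
  import org.openjdk.jmh.annotations.Param;
  import org.openjdk.jmh.annotations.Scope;
  import org.openjdk.jmh.annotations.Setup;
  import org.openjdk.jmh.annotations.State;

  @State(Scope.Thread)
  public class SmallArrayCopy {
      @Param({"10", "20", "30", "40", "50", "60"})
      int length;

      byte[] src;
      byte[] dst;

      @Setup
      public void setup() {
          src = new byte[length];
          dst = new byte[length];
      }

      @Benchmark
      public void testByte() {
          // The copy itself is at most 60 bytes; fetching the five arguments
          // and making the call accounts for much of the measured cost.
          System.arraycopy(src, 0, dst, 0, length);
      }
  }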
Hi @pfustc, common type system changes look good to me.
Thank you for looking at my PR. This C2 technique was originally developed by @jatin-bhateja from Intel to optimize small-sized memory copies with x86 AVX-512 masked vector instructions. Now I propose to enable it on AArch64 with SVE. Yes, it only has a benefit if the copy size is less than the size of a vector. That is 512 bits on x86, but on AArch64 SVE the maximum copy size that can benefit depends on the hardware's implementation of the scalable vector register (anywhere from 128 to 2048 bits). @theRealAph, do you approve this PR, or do you have any specific feedback or suggestions?
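As an aside, a rough way to see the vector width a particular machine exposes to the JIT, and hence the largest copy a single masked vector load/store could cover there, is the incubating Vector API. This is only an illustrative sketch and is unrelated to the patch itself:

  // Run with: java --add-modules jdk.incubator.vector PrintVectorWidth.java
  import jdk.incubator.vector.ByteVector;

  public class PrintVectorWidth {
      public static void main(String[] args) {
          int bits = ByteVector.SPECIES_MAX.vectorBitSize();
          System.out.println("Max vector width: " + bits + " bits ("
                  + (bits / 8) + " bytes per masked copy)");
      }
  }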
Hi @theRealAph, are you still looking at this? I have another big fix which depends on the vector mask change in this patch, so I hope this can be integrated soon.
I'm quite happy with the AArch64 parts, but I'm not familiar with that part of the C2 compiler. I think you need an additional reviewer, perhaps @rwestrel.
Thanks Andrew! Can any reviewer help look at the C2 mid-end part?
C2 platform-independent code looks good to me.
@pfustc This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 268 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
Thanks @rwestrel. I will integrate this.
/integrate |
Going to push as commit e7db581.
Your commit was automatically rebased without conflicts.
Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6444/head:pull/6444
$ git checkout pull/6444
Update a local copy of the PR:
$ git checkout pull/6444
$ git pull https://git.openjdk.java.net/jdk pull/6444/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 6444
View PR using the GUI difftool:
$ git pr show -t 6444
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6444.diff