Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8277168: AArch64: Enable arraycopy partial inlining with SVE #6444

Closed
wants to merge 1 commit into from

Conversation

pfustc
Copy link
Member

@pfustc pfustc commented Nov 18, 2021

Arraycopy partial inlining is a C2 compiler technique that avoids stub
call overhead in small-sized arraycopy operations by generating masked
vector instructions. So far it works on x86 AVX512 only and this patch
enables it on AArch64 with SVE.

We add AArch64 matching rule for VectorMaskGenNode and refactor that
node a little bit. The major change is moving the element type field
into its TypeVectMask bottom type. The reason is that AArch64 vector
masks are different for different vector element types.

E.g., an x86 AVX512 vector mask value masking 3 least significant vector
lanes (of any type) is like

0000 0000 ... 0000 0000 0000 0000 0111

On AArch64 SVE, this mask value can only be used for masking the 3 least
significant lanes of bytes. But for 3 lanes of ints, the value should be

0000 0000 ... 0000 0000 0001 0001 0001

where the least significant bit of each lane matters. So AArch64 matcher
needs to know the vector element type to generate right masks.

After this patch, the C2 generated code for copying a 50-byte array on
AArch64 SVE looks like

  mov     x12, #0x32
  whilelo p0.b, xzr, x12
  add     x11, x11, #0x10
  ld1b    {z16.b}, p0/z, [x11]
  add     x10, x10, #0x10
  st1b    {z16.b}, p0, [x10]

We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested
JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array
size arguments on a 512-bit SVE-featured CPU. We got below performance
data changes.

Benchmark                  (length)  (Performance)
ArrayCopyAligned.testByte        10          -2.6%
ArrayCopyAligned.testByte        20          +4.7%
ArrayCopyAligned.testByte        30          +4.8%
ArrayCopyAligned.testByte        40         +21.7%
ArrayCopyAligned.testByte        50         +22.5%
ArrayCopyAligned.testByte        60         +28.4%

The test machine has SVE vector size of 512 bits, so we see performance
gain for most array sizes less than 64 bytes. For very small arrays we
see a bit regression because a vector load/store may be a bit slower
than 1 or 2 scalar loads/stores.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8277168: AArch64: Enable arraycopy partial inlining with SVE

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6444/head:pull/6444
$ git checkout pull/6444

Update a local copy of the PR:
$ git checkout pull/6444
$ git pull https://git.openjdk.java.net/jdk pull/6444/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6444

View PR using the GUI difftool:
$ git pr show -t 6444

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6444.diff

Arraycopy partial inlining is a C2 compiler technique that avoids stub
call overhead in small-sized arraycopy operations by generating masked
vector instructions. So far it works on x86 AVX512 only and this patch
enables it on AArch64 with SVE.

We add AArch64 matching rule for VectorMaskGenNode and refactor that
node a little bit. The major change is moving the element type field
into its TypeVectMask bottom type. The reason is that AArch64 vector
masks are different for different vector element types.

E.g., an x86 AVX512 vector mask value masking 3 least significant vector
lanes (of any type) is like

  0000 0000 ... 0000 0000 0000 0000 0111

On AArch64 SVE, this mask value can only be used for masking the 3 least
significant lanes of bytes. But for 3 lanes of ints, the value should be

  0000 0000 ... 0000 0000 0001 0001 0001

where the least significant bit of each lane matters. So AArch64 matcher
needs to know the vector element type to generate right masks.

After this patch, the C2 generated code for copying a 50-byte array on
AArch64 SVE looks like

  mov     x12, #0x32
  whilelo p0.b, xzr, x12
  add     x11, x11, #0x10
  ld1b    {z16.b}, p0/z, [x11]
  add     x10, x10, #0x10
  st1b    {z16.b}, p0, [x10]

We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on
both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested
JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array
size arguments on a 512-bit SVE-featured CPU. We got below performance
data changes.

Benchmark                  (length)  (Performance)
ArrayCopyAligned.testByte        10          -2.6%
ArrayCopyAligned.testByte        20          +4.7%
ArrayCopyAligned.testByte        30          +4.8%
ArrayCopyAligned.testByte        40         +21.7%
ArrayCopyAligned.testByte        50         +22.5%
ArrayCopyAligned.testByte        60         +28.4%

The test machine has SVE vector size of 512 bits, so we see performance
gain for most array sizes less than 64 bytes. For very small arrays we
see a bit regression because a vector load/store may be a bit slower
than 1 or 2 scalar loads/stores.
@bridgekeeper
Copy link

bridgekeeper bot commented Nov 18, 2021

👋 Welcome back pli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 18, 2021
@openjdk
Copy link

openjdk bot commented Nov 18, 2021

@pfustc The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 18, 2021
@mlbridge
Copy link

mlbridge bot commented Nov 18, 2021

Webrevs

@pfustc
Copy link
Member Author

pfustc commented Nov 18, 2021

/cc hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 18, 2021
@openjdk
Copy link

openjdk bot commented Nov 18, 2021

@pfustc
The hotspot-compiler label was successfully added.

@pfustc
Copy link
Member Author

pfustc commented Nov 18, 2021

The x86 failure is caused by a recent commit (see JDK-8277324) and unrelated to this PR.

@theRealAph
Copy link
Contributor

I'll have a look. It'll take me a little time to provision a suitable SVE-enabled AArch64 box.

@theRealAph
Copy link
Contributor

I'm having a lot of difficulty understanding how this is supposed to work.

Firstly, I'm not seeing a performance increase on a fujitsu-fx700.
Secondly, I'm not surprised: looking at the results of JMH -prof:perfasm, it seems to me that the only SVE instructions being executed are outside the timing loop in the testByte_ArrayCopyAligned_testByte_jmhTest:avgt_jmhStub method. I'm baffled by what is going on.

@theRealAph
Copy link
Contributor

I'm baffled by what is going on.

Sorry, it looks like I managed to confuse myself. The top of the loop looks like:

10c     B17: #      out( B18 ) <- in( B27 )  Freq: 4.49963
10c     # castLL of R2
10c     sve_whilelo P0, zr, R2       # sve
110     sve_ldr V16, P0, [R0]       # load vector predicated (sve)
114     sve_str [R1], P0, V16       # store vector predicated (sve)
118     B18: #      out( B30 B19 ) <- in( B17 B28 B26 )  Freq: 8.99927
118
118     ldarb  R10, [R23]   # byte ! Field: volatile org/openjdk/jmh/runner/InfraControlL2.isDone

... and the bottom

1a0     cmp  R2, #64
1a4     bls  B17    # unsigned  P=0.500000 C=-1.000000
1a8     B28: #      out( B18 ) <- in( B27 )  Freq: 4.49963
1a8     CALL, runtime leaf nofp 0x0000ffff6d1058f8 jbyte_arraycopy
        No JVM State Info
        #
1b0     b  B18

So only if the length is < 64 (i.e. 512 bits) do we branch back to B17 to do the SVE WHILELO to set the predicate. This is confusing only because the code has been rearranged so that the test for < 64 bytes is at the bottom of the loop.

@theRealAph
Copy link
Contributor

theRealAph commented Nov 18, 2021

Hurrah! I have managed to duplicate your results.

Old:

Benchmark                       (length)  Mode  Cnt   Score   Error  Units
ArrayCopyAligned.testByte             40  avgt    5  23.332 ± 0.016  ns/op

New:

ArrayCopyAligned.testByte             40  avgt    5  18.092 ± 0.093  ns/op

... and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now.

@jatin-bhateja
Copy link
Member

Hi @pfustc , common type system changes looks good to me.

Copy link
Member

@jatin-bhateja jatin-bhateja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common type system changes looks good to me.

@pfustc
Copy link
Member Author

pfustc commented Nov 23, 2021

Thank you for looking at my PR. This C2 technique was originally developed by @jatin-bhateja from Intel to optimize small-sized memory copy with x86 AVX-512 masked vector instructions. Now I propose to enable it on AArch64 with SVE. Yes, it has benefit only if the copy size is less than the size of a vector. It's 512 bits on x86, but on AArch64 SVE the max copy size it can benefit depends on the hardware's implementation of the scalable vector register (from 128 bits to 2048 bits).

@theRealAph , do you approve this PR? or any specific feedback or suggestion?

@pfustc
Copy link
Member Author

pfustc commented Dec 6, 2021

Hi @theRealAph , are you still looking at this? I have another big fix which depends on the vector mask change inside this patch. So I hope this can be integrated soon.

@theRealAph
Copy link
Contributor

Hi @theRealAph , are you still looking at this? I have another big fix which depends on the vector mask change inside this patch. So I hope this can be integrated soon.

I'm quite happy with the AArch64 parts, but I'm not familiar with that part of the C2 compiler. I think you need an additional reviewer, perhaps @rwestrel .

@pfustc
Copy link
Member Author

pfustc commented Dec 6, 2021

Thanks Andrew!

Can any reviewer help look at the C2 mid-end part?

Copy link
Contributor

@rwestrel rwestrel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C2 platform independent code looks good to me.

@openjdk
Copy link

openjdk bot commented Dec 7, 2021

@pfustc This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8277168: AArch64: Enable arraycopy partial inlining with SVE

Reviewed-by: jbhateja, roland, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 268 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 7, 2021
@pfustc
Copy link
Member Author

pfustc commented Dec 8, 2021

Thanks @rwestrel . I will integrate this.

@pfustc
Copy link
Member Author

pfustc commented Dec 8, 2021

/integrate

@openjdk
Copy link

openjdk bot commented Dec 8, 2021

Going to push as commit e7db581.
Since your change was applied there have been 291 commits pushed to the master branch:

  • fb6d611: 8278276: G1: Refine naming of G1GCParPhaseTimesTracker::_must_record
  • d7ad546: 8276422: Add command-line option to disable finalization
  • ec7cb6d: 8276447: Deprecate finalization-related methods for removal
  • 3c2951f: 8275771: JDK source code contains redundant boolean operations in jdk.compiler and langtools
  • 3d61372: 8278363: Create extented container test groups
  • 716c2e1: 8278368: ProblemList tools/jpackage/share/MultiNameTwoPhaseTest.java on macosx-x64
  • a8a1fbc: 8278068: Fix next-line modifier (snippet markup)
  • 061017a: 8273175: Add @SInCE tags to the DocTree.Kind enum constants
  • d7c283a: 8275233: Incorrect line number reported in exception stack trace thrown from a lambda expression
  • 3955b03: 8277328: jdk/jshell/CommandCompletionTest.java failures on Windows
  • ... and 281 more: https://git.openjdk.java.net/jdk/compare/81938001f9bae56c59f4e18b7756089f2cf0bf74...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Dec 8, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated labels Dec 8, 2021
@openjdk openjdk bot removed the rfr Pull request is ready for review label Dec 8, 2021
@openjdk
Copy link

openjdk bot commented Dec 8, 2021

@pfustc Pushed as commit e7db581.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@pfustc pfustc deleted the arraycopy branch December 8, 2021 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
4 participants