Skip to content

8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions #3683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

dgbo
Copy link
Member

@dgbo dgbo commented Apr 26, 2021

On aarch64, current implementations of vector reduce_add2I, reduce_max2I, reduce_min2I can be optimized with NEON pairwise instructions:

## reduce_add2I, before
mov    w10, v19.s[0]
mov    w2, v19.s[1]
add    w10, w0, w10
add    w10, w10, w2
## reduce_add2I, optimized
addp   v23.2s, v24.2s, v24.2s
mov    w10, v23.s[0]
add    w10, w10, w2

## reduce_max2I, before
dup    v16.2d, v23.d[0]
sminv  s16, v16.4s
mov    w10, v16.s[0]
cmp    w10, w0
csel   w10, w10, w0, lt
## reduce_max2I, optimized
sminp  v16.2s, v23.2s, v23.2s
mov    w10, v16.s[0]
cmp    w10, w0
csel   w10, w10, w0, lt

I don't expect this to change anything of SuperWord, vectorizing reductions of two integers is disabled by [1].
This is useful for VectorAPI, tested benchmarks in [2], performance can improve ~51% and ~8% for Int64Vector.ADD and Int64Vector.MAX respectively.

Benchmark                   (size)   Mode  Cnt     Score    Error   Units
# optimized
Int64Vector.ADDLanes          1024  thrpt   10  2492.123 ± 23.561  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt   10  1825.882 ±  5.261  ops/ms
Int64Vector.MAXLanes          1024  thrpt   10  1921.028 ±  3.253  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt   10  1588.575 ±  3.903  ops/ms
Int64Vector.MINLanes          1024  thrpt   10  1923.913 ±  2.117  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt   10  1596.875 ±  2.163  ops/ms
# default
Int64Vector.ADDLanes          1024  thrpt   10  1644.223 ±  1.885  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt   10  1491.502 ± 26.436  ops/ms
Int64Vector.MAXLanes          1024  thrpt   10  1784.066 ±  3.816  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt   10  1494.750 ±  3.451  ops/ms
Int64Vector.MINLanes          1024  thrpt   10  1785.266 ±  8.893  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt   10  1499.233 ±  3.498  ops/ms

Verified correctness with tests test/jdk/jdk/incubator/vector/. Also tested linux-aarch64-server-fastdebug tier1-3.

[1]

// Length 2 reductions of INT/LONG do not offer performance benefits

[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/Int64Vector.java


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3683/head:pull/3683
$ git checkout pull/3683

Update a local copy of the PR:
$ git checkout pull/3683
$ git pull https://git.openjdk.java.net/jdk pull/3683/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3683

View PR using the GUI difftool:
$ git pr show -t 3683

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3683.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Apr 26, 2021

👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 26, 2021
@openjdk
Copy link

openjdk bot commented Apr 26, 2021

@dgbo The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 26, 2021
@dgbo
Copy link
Member Author

dgbo commented Apr 26, 2021

/label add hotspot
/label add hotspot-compiler

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Apr 26, 2021
@openjdk
Copy link

openjdk bot commented Apr 26, 2021

@dgbo
The hotspot label was successfully added.

@openjdk
Copy link

openjdk bot commented Apr 26, 2021

@dgbo The hotspot-compiler label was already applied.

@mlbridge
Copy link

mlbridge bot commented Apr 26, 2021

Webrevs

Copy link

@nsjian nsjian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks good to me. Thanks!

Comment on lines +2407 to +2408
INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S
INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit: do we need an assembler test case for them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added in the updated commit. Thanks.

@dgbo
Copy link
Member Author

dgbo commented May 6, 2021

Hi, @theRealAph. Could you please take a look at this PR? Thanks.

@dgbo
Copy link
Member Author

dgbo commented May 10, 2021

PING? Any comments/suggestions are appreciated.
Although this has been reviewed by Ningsheng, we still need help from reviewers here.

@theRealAph
Copy link
Contributor

Looking now. I can't quite understand how to run tests from panama-vector on JDK head. If the JMH test is relevant to JDK, not just panama-vector, why not add it to JDK?

@PaulSandoz
Copy link
Member

@theRealAph we are still working out how best to bring the vector performance tests over to the test/micro area of mainline. (Some preliminary work is here). The perf tests in the panama-vector are under a maven project and it should be possible to build/run that project with a mainline build of the JDK (the tests should be compatible).

@dgbo
Copy link
Member Author

dgbo commented May 11, 2021

@theRealAph we are still working out how best to bring the vector performance tests over to the test/micro area of mainline. (Some preliminary work is here). The perf tests in the panama-vector are under a maven project and it should be possible to build/run that project with a mainline build of the JDK (the tests should be compatible).

Hi, @theRealAph @PaulSandoz. Thanks for the comments.

Because [1] has not been merged into mainline, we cannot build the tests directly with mainline JDK:

[ERROR] panama-vector/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/utf8/DecodeBench.java:[354,31] cannot find symbol
  symbol:   method intoCharArray(char[],int)
  location: class jdk.incubator.vector.ShortVector

When I tested this, the incompatible DecodeBench.java was deleted first since it is all about ShortVector and ByteVector rather than Int64Vector.
Then the perf tests were built/run with mainline builds of JDK:

mvn install
java -jar target/vector-benchmarks.jar benchmark.jdk.incubator.vector.Int64Vector.["M"\|"ADD"]+[AXIN]*["Masked"]*Lanes -wi 10 -w 1000ms -f 1 -i 10 -r 1000ms

[1] openjdk/panama-vector#22

@mlbridge
Copy link

mlbridge bot commented May 14, 2021

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 5/10/21 6:55 AM, Dong Bo wrote:

PING? Any comments/suggestions are appreciated.
Although this has been reviewed by Ningsheng, we still need help from reviewers here.

I'm testing this now.

@dgbo
Copy link
Member Author

dgbo commented May 17, 2021

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 5/10/21 6:55 AM, Dong Bo wrote:

PING? Any comments/suggestions are appreciated.
Although this has been reviewed by Ningsheng, we still need help from reviewers here.

I'm testing this now.

PING? Any suggestions? Thanks.

@theRealAph
Copy link
Contributor

Mailing list message from Andrew Haley on hotspot-compiler-dev:

On 5/10/21 6:55 AM, Dong Bo wrote:

PING? Any comments/suggestions are appreciated.
Although this has been reviewed by Ningsheng, we still need help from reviewers here.

I'm testing this now.

I'm back on this, and I can't see how to run the tests. Paul Sandoz says the perf tests in the panama-vector are under a maven project, but which maven project is that?

You say "When I tested this, the incompatible DecodeBench.java was deleted first since it is all about ShortVector and ByteVector rather than Int64Vector." but I don't know what that means.

Please provide step-by-step instructions that allow anyone to reproduce your results.

@dgbo
Copy link
Member Author

dgbo commented May 24, 2021

Mailing list message from Andrew Haley on hotspot-compiler-dev:
On 5/10/21 6:55 AM, Dong Bo wrote:

PING? Any comments/suggestions are appreciated.
Although this has been reviewed by Ningsheng, we still need help from reviewers here.

I'm testing this now.

I'm back on this, and I can't see how to run the tests. Paul Sandoz says the perf tests in the panama-vector are under a maven project, but which maven project is that?

You say "When I tested this, the incompatible DecodeBench.java was deleted first since it is all about ShortVector and ByteVector rather than Int64Vector." but I don't know what that means.

Please provide step-by-step instructions that allow anyone to reproduce your results.

Hi, here are the instructions I used to run the benchmarks:

  1. Get the benchmark project in https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark (I do this via git):
## git clone https://github.com/openjdk/panama-vector.git
## cd panama-vector
## git checkout -b vectorIntrinsics remotes/origin/vectorIntrinsics
  1. Delete incompatible DecodeBench.java and compile the project with mainline JDK:
## <set JAVA_HOME to the mainline JDK path>
## cd test/jdk/jdk/incubator/vector/benchmark
## rm src/main/java/benchmark/utf8/DecodeBench.java
## mvn install
  1. Run tests:
    ## <PATH_TO_TEST_JDK>/bin/java -jar target/vector-benchmarks.jar benchmark.jdk.incubator.vector.Int64Vector.["M"\|"ADD"]+[AXIN]*["Masked"]*Lanes -wi 10 -w 1000ms -f 1 -i 10 -r 1000ms

Copy link
Contributor

@theRealAph theRealAph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Changes look good.

I managed to reproduce your results.


Benchmark                   (size)   Mode  Cnt     Score    Error   Units
Int64Vector.ADDLanes          1024  thrpt    3  4958.747 ± 54.225  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt    3  4769.759 ± 12.736  ops/ms
Int64Vector.MAXLanes          1024  thrpt    3  2957.985 ± 88.671  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt    3  2921.381 ± 45.408  ops/ms
Int64Vector.MINLanes          1024  thrpt    3  2965.392 ± 25.236  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt    3  2923.870 ± 53.270  ops/ms

Benchmark                   (size)   Mode  Cnt     Score    Error   Units
Int64Vector.ADDLanes          1024  thrpt    3  3560.100 ± 79.753  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt    3  3585.672 ± 57.203  ops/ms
Int64Vector.MAXLanes          1024  thrpt    3  2951.659 ±  9.577  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt    3  2876.957 ± 37.005  ops/ms
Int64Vector.MINLanes          1024  thrpt    3  2953.476 ±  3.446  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt    3  2878.942 ± 50.281  ops/ms

@openjdk
Copy link

openjdk bot commented May 24, 2021

@dgbo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions

Reviewed-by: njian, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 558 new commits pushed to the master branch:

  • b4d4884: 8267126: javadoc should show "line and caret" for diagnostics.
  • 461a3fe: 8261478: InstanceKlass::set_classpath_index does not match comments
  • de27da7: 8267431: Rename InstanceKlass::has_old_class_version to can_be_verified_at_dumptime
  • c519ba2: 8267614: Outline VarHandleGuards exact behavior checks
  • f690959: 8267446: Taskqueue code fails with assert(bottom_relaxed() == age_top_relaxed()) failed: not empty
  • ebc9357: 8267329: Modernize Javadoc code to use instanceof with pattern matching
  • 209769b: 8267347: CDS record_linking_constraint asserts with unregistered class
  • a5467ae: 8267423: Fix copyrights in jpackage tests
  • bb085f6: 8265362: java/net/Socket/UdpSocket.java fails with "java.net.BindException: Address already in use" (macos-aarch64)
  • 640a2af: 8267184: Add -Djava.security.manager=allow to tests calling System.setSecurityManager
  • ... and 548 more: https://git.openjdk.java.net/jdk/compare/787908c7788021395885280946de70d03b7ab39b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@nsjian, @theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 24, 2021
@dgbo
Copy link
Member Author

dgbo commented May 25, 2021

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label May 25, 2021
@openjdk
Copy link

openjdk bot commented May 25, 2021

@dgbo
Your change (at version 9d9ee01) is now ready to be sponsored by a Committer.

@RealFYang
Copy link
Member

/sponsor

@openjdk openjdk bot closed this May 25, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 25, 2021
@openjdk
Copy link

openjdk bot commented May 25, 2021

@RealFYang @dgbo Since your change was applied there have been 558 commits pushed to the master branch:

  • b4d4884: 8267126: javadoc should show "line and caret" for diagnostics.
  • 461a3fe: 8261478: InstanceKlass::set_classpath_index does not match comments
  • de27da7: 8267431: Rename InstanceKlass::has_old_class_version to can_be_verified_at_dumptime
  • c519ba2: 8267614: Outline VarHandleGuards exact behavior checks
  • f690959: 8267446: Taskqueue code fails with assert(bottom_relaxed() == age_top_relaxed()) failed: not empty
  • ebc9357: 8267329: Modernize Javadoc code to use instanceof with pattern matching
  • 209769b: 8267347: CDS record_linking_constraint asserts with unregistered class
  • a5467ae: 8267423: Fix copyrights in jpackage tests
  • bb085f6: 8265362: java/net/Socket/UdpSocket.java fails with "java.net.BindException: Address already in use" (macos-aarch64)
  • 640a2af: 8267184: Add -Djava.security.manager=allow to tests calling System.setSecurityManager
  • ... and 548 more: https://git.openjdk.java.net/jdk/compare/787908c7788021395885280946de70d03b7ab39b...master

Your commit was automatically rebased without conflicts.

Pushed as commit 123cdd1.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dgbo dgbo deleted the optimize_reduceAdd2I branch May 25, 2021 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

5 participants