8298244: AArch64: Optimize vector implementation of AddReduction for floating point #11663
Conversation
The patch optimizes floating-point AddReduction for the Vector API on NEON via `faddp` instructions [1].

Take AddReductionVF with 128-bit vectors as an example. Here is the assembly code before the patch:

```
fadd s18, s17, s16
mov v19.s[0], v16.s[1]
fadd s18, s18, s19
mov v19.s[0], v16.s[2]
fadd s18, s18, s19
mov v19.s[0], v16.s[3]
fadd s18, s18, s19
```

Here is the assembly code after the patch:

```
faddp v19.4s, v16.4s, v16.4s
faddp s18, v19.2s
fadd s18, s18, s17
```

As we can see, the patch adds all vector elements via `faddp` instructions and then adds the beginning value, unlike the old code, which adds the vector elements sequentially from beginning to end. This saves four instructions per AddReductionVF.

A possible concern is that the patch could cause precision loss and generate incorrect results if superword vectorized these Java operations, because Java specifies a clear precision standard for floating-point add reduction: the vector elements must be added sequentially from beginning to end. Fortunately, we can enjoy the benefit without paying for the precision loss, for two reasons:

1. JDK-8275275 disabled AddReductionVF/D for superword on NEON, since no direct NEON instructions support them and, consequently, it is not profitable to auto-vectorize them. So the vector implementation of these two vector nodes is only used by the Vector API.
2. The Vector API relaxes the floating-point precision requirement for `ADD` [2]: "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code."

To sum up, the Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, so we are free to generate the optimal machine code.

Tier 1~3 passed with no new failures on the Linux AArch64 platform.

Here is the perf data of the JMH benchmark [3] for the patch:

| Benchmark | size | Mode | Cnt | Before | After | Units |
| --- | --- | --- | --- | --- | --- | --- |
| Double128Vector.addReduction | 1024 | thrpt | 5 | 2167.146 | 2717.873 | ops/ms |
| Float128Vector.addReduction | 1024 | thrpt | 5 | 1706.253 | 4890.909 | ops/ms |
| Float64Vector.addReduction | 1024 | thrpt | 5 | 1907.425 | 2732.577 | ops/ms |

[1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar--
https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector--
[2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc
[3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316
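To make the ordering difference concrete, here is a small plain-Java sketch (a hypothetical illustration of the two evaluation orders, not code from the patch): the old sequence folds each lane into the accumulator left to right, while the `faddp` sequence sums the lanes pairwise and adds the accumulator last. With carefully chosen values the two orders round differently, which is exactly why superword may not reassociate but the Vector API may.

```java
public class PairwiseVsSequential {
    // Sequential reduction: ((((acc + v[0]) + v[1]) + v[2]) + v[3]),
    // strictly left to right, as Java source semantics require.
    static float sequentialReduce(float acc, float[] v) {
        float r = acc;
        for (float lane : v) {
            r += lane;
        }
        return r;
    }

    // Pairwise reduction, mimicking the faddp sequence:
    // faddp v19.4s computes (v[0]+v[1]) and (v[2]+v[3]),
    // faddp s18 adds the two pairs, and the accumulator is added last.
    static float pairwiseReduce(float acc, float[] v) {
        float pair0 = v[0] + v[1];
        float pair1 = v[2] + v[3];
        return (pair0 + pair1) + acc;
    }

    public static void main(String[] args) {
        // 1.0f is below the ulp of 1e8f (which is 8.0f), so the order
        // in which the cancellation happens changes the rounded result.
        float[] v = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("sequential: " + sequentialReduce(0.0f, v));
        System.out.println("pairwise:   " + pairwiseReduce(0.0f, v));
    }
}
```

Running this prints different values for the two orders, illustrating that the speedup is only legal because the Vector API leaves the reduction order unspecified.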
👋 Welcome back fgao! A progress list of the required criteria for merging this PR will be added to the body of your pull request.
Webrevs
> // Specially, the current vector implementation of Op_AddReductionVD works for
> // Vector API only because of the non-sequential order of element addition.

Suggested change:

> // Floating-point addition is not associative, so we cannot auto-vectorize
> // floating-point reduce-add. AddReductionVD is only generated by explicit
> // vector operations.
> // Specially, the current vector implementation of Op_AddReductionVF works for
> // Vector API only because of the non-sequential order of element addition.

Suggested change:

> // Floating-point addition is not associative, so we cannot auto-vectorize
> // floating-point reduce-add. AddReductionVF is only generated by explicit
> // vector operations.
> // Specially, the current vector implementation of Op_AddReductionVD/F works for
> // Vector API only. If re-enabling them for superword, precision loss will happen
> // because current generated code does not add elements sequentially from beginning to end.

Suggested change:

> // The vector implementation of Op_AddReductionVD/F is for the Vector API only.
> // It is not suitable for auto-vectorization because it does not add the elements
> // in the same order as sequential code, and FP addition is non-associative.
> // Specially, the current vector implementation of Op_AddReductionVD works for
> // Vector API only because of the non-sequential order of element addition.

Same here.
Thanks for your kind review and comments, @theRealAph. I addressed them in the new commit. Could you please help review it? Thanks for your time!
@fg1417 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, there had been 62 new commits pushed to the target branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @XiaohongGong), but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Thanks for your review, @theRealAph @XiaohongGong. I'll integrate it.

/integrate
/sponsor
Going to push as commit ba942c2.
Your commit was automatically rebased without conflicts.
Reviewing

Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11663/head:pull/11663
$ git checkout pull/11663
Update a local copy of the PR:
$ git checkout pull/11663
$ git pull https://git.openjdk.org/jdk pull/11663/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 11663
View PR using the GUI difftool:
$ git pr show -t 11663
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11663.diff