8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp #1925
dgbo wants to merge 7 commits into openjdk:master
Conversation
👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into
Webrevs
/label add hotspot-dev
@dgbo
#undef INSN
#define INSN(NAME, opc) \
This should be in an "AdvSIMD scalar pairwise" instruction group with faddp.
Done. The instructions faddp, fmax and fmin are put together in one group now.
Re-ran the test/jdk/jdk/incubator/vector/ tests and they passed.
@theRealAph Hi, are there any further suggestions? :)
@@ -0,0 +1,97 @@
/*
 * Copyright (c) 2021, Huawei Technologies Co., Ltd. All rights reserved.
Should each new source file also include Oracle copyright?
Fixed, thank you for mentioning this.
Should each new source file also include Oracle copyright?
I'm not sure.
But there are already some files which don't contain Oracle copyright.
@dholmes-ora , what do you think of this question?
Thanks.
float max = 0.0f;
for (int i = 0; i < COUNT; i++) {
    max = Math.max(max, floatsA[i] - floatsB[i]);
}
This test code looks a bit contrived. If you're looking for the largest delta it'd be
Math.max(max, Math.abs(floatsA[i] - floatsB[i]));
and if you're looking for the largest value it'd probably be
Math.max(max, floatsA[i]);
Do we gain any advantage with these?
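For reference, the three loop shapes under discussion can be sketched in plain Java. The class and method names here are hypothetical; only the field names floatsA/floatsB/COUNT follow the benchmark, and this is an illustration of the patterns, not the benchmark source:

```java
public class MaxReductionShapes {
    static final int COUNT = 512;

    // Largest delta between paired elements (the shape in the original test code).
    static float maxDelta(float[] a, float[] b) {
        float max = 0.0f;
        for (int i = 0; i < COUNT; i++) {
            max = Math.max(max, a[i] - b[i]);
        }
        return max;
    }

    // Largest absolute delta, as suggested in the review comment.
    static float maxAbsDelta(float[] a, float[] b) {
        float max = 0.0f;
        for (int i = 0; i < COUNT; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    // Plain max over a single array, the simplest reduction shape.
    static float maxValue(float[] a) {
        float max = 0.0f;
        for (int i = 0; i < COUNT; i++) {
            max = Math.max(max, a[i]);
        }
        return max;
    }
}
```

Whether C2 turns each loop into a Min/MaxReductionV node is exactly what the rest of this thread investigates.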
Hi,
As the experiment shows, we do not gain improvements with these.
For Math.max(max, Math.abs(floatsA[i] - floatsB[i])), the code is not vectorized if COUNT < 12.
When COUNT == 12, the node we have is not Max2F but Max4F.
The Math.max(max, floatsA[i] - floatsB[i]) version suffers from the same problem: it does not match Max2F with a small COUNT.
For Math.max(max, floatsA[i]), it is not auto-vectorized even with COUNT = 512.
I think the auto-vectorized optimization for this is disabled by JDK-8078563 [1].
One advantage Max2F with fmaxp does gain is for the Vector API; the test code is available in [2].
We witnessed about 12% improvements for reduceLanes(VectorOperators.MAX) of FloatVector.SPECIES_64:
Benchmark (COUNT) (seed) Mode Cnt Score Error Units
# Kunpeng 916, default
VectorReduction2FMinMax.maxRed2F 512 0 avgt 10 667.173 ± 0.576 ns/op
VectorReduction2FMinMax.minRed2F 512 0 avgt 10 667.172 ± 0.649 ns/op
# Kunpeng 916, with fmaxp/fminp
VectorReduction2FMinMax.maxRed2F 512 0 avgt 10 592.404 ± 0.885 ns/op
VectorReduction2FMinMax.minRed2F 512 0 avgt 10 592.293 ± 0.607 ns/op
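What fmaxp buys here is that the final across-lanes step of a 2-lane reduction collapses to a single pairwise instruction, instead of the fmaxv-plus-ins sequence used before. In scalar Java terms (a hypothetical sketch of the semantics, not the generated code; class and method names are invented):

```java
public class PairwiseMinMax {
    // Semantics of fmaxp on a 2-lane float vector: one pairwise max.
    static float fmaxp2(float lane0, float lane1) {
        return Math.max(lane0, lane1);
    }

    // Semantics of fminp on a 2-lane float vector: one pairwise min.
    static float fminp2(float lane0, float lane1) {
        return Math.min(lane0, lane1);
    }

    // A reduceLanes(MAX)-style loop over 2-lane chunks, mirroring
    // FloatVector.SPECIES_64 usage in the benchmark (data.length even).
    static float maxReduce(float[] data) {
        float acc = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < data.length; i += 2) {
            acc = Math.max(acc, fmaxp2(data[i], data[i + 1]));
        }
        return acc;
    }
}
```

The ~12% win above comes from replacing the two-instruction reduction tail with the single pairwise op on every loop iteration.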
I agree the test code for floats in VectorReductionFloatingMinMax.java is contrived.
Do you think we should replace the tests for MinMaxF in VectorReductionFloatingMinMax with tests in [2]?
[1] https://bugs.openjdk.java.net/browse/JDK-8078563
[2] VectorReduction2FMinMax.java.txt
I don't think the real problem is only the tests, it's that common cases don't get vectorized.
Can we fix this code so that it works with Math.abs() ?
Are there any examples of plausible Java code that benefit from this optimization?
According to the results of JMH perfasm, Math.max(max, Math.abs(floatsA[i] - floatsB[i])) is vectorized when COUNT=8 on an x86 platform.
While on aarch64, floatsB[i] = Math.abs(floatsA[i]) is not vectorized when COUNT = 10, and we cannot match VAbs2F either.
I am going to investigate the failed vectorization and see if we can have Max2F matched. Thanks.
Hi,
I made a mistake in saying that the code is not vectorized with COUNT < 12; it seems the percentage of vectorized code was too small to be caught by JMH perfasm.
To observe whether Min/MaxReductionV nodes are created or not, I added an explicit print in ReductionNode::make, like:
--- a/src/hotspot/share/opto/vectornode.cpp
+++ b/src/hotspot/share/opto/vectornode.cpp
@@ -961,7 +961,9 @@ ReductionNode* ReductionNode::make(int opc, Node *ctrl, Node* n1, Node* n2, Basi
case Op_MinReductionV: return new MinReductionVNode(ctrl, n1, n2);
- case Op_MaxReductionV: return new MaxReductionVNode(ctrl, n1, n2);
+ case Op_MaxReductionV:
+ warning("in ReductionNode::make, making a MaxReductionVNode, length %d", n2->bottom_type()->is_vect()->length());
+ return new MaxReductionVNode(ctrl, n1, n2);
case Op_AndReductionV: return new AndReductionVNode(ctrl, n1, n2);
In my observation, we have Max4F when COUNT >= 4; it is reasonable to create Max4F rather than Max2F there.
The Max2F is created with COUNT == 3 and -XX:-SuperWordLoopUnrollAnalysis.
But I did not find any noticeable improvement with such a small percentage.
The JMH has been updated; the performance results are:
Benchmark (COUNT_DOUBLE) (COUNT_FLOAT) (seed) Mode Cnt Score Error Units
# Kunpeng 916, default
VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 677.778 ± 0.694 ns/op
VectorReductionFloatingMinMax.maxRedF 512 3 0 avgt 10 21.016 ± 0.097 ns/op
VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 677.633 ± 0.664 ns/op
VectorReductionFloatingMinMax.minRedF 512 3 0 avgt 10 21.001 ± 0.019 ns/op
# Kunpeng 916, fmaxp/fminp
VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 425.776 ± 0.785 ns/op
VectorReductionFloatingMinMax.maxRedF 512 3 0 avgt 10 20.883 ± 0.033 ns/op
VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 426.177 ± 3.258 ns/op
VectorReductionFloatingMinMax.minRedF 512 3 0 avgt 10 20.871 ± 0.044 ns/op
Did you try Math.abs() for doubles?
The Math.abs(doublesA[i] - doublesB[i]) version shows ~36% improvement.
I updated the tests for doubles with Math.abs(); it looks more consistent. Thanks.
The JMH results of doubles with Math.abs():
Benchmark (COUNT_DOUBLE) (COUNT_FLOAT) (seed) Mode Cnt Score Error Units
# Kunpeng 916, default
VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 681.319 ± 0.658 ns/op
VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 682.596 ± 4.322 ns/op
# Kunpeng 916, fmaxp/fminp
VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 439.130 ± 0.450 ns/op => 35.54%
VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 439.105 ± 0.435 ns/op => 35.67%
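The updated double-precision shape that vectorizes and gains ~35% can be sketched as follows. The class name is invented, and the initial value of the min accumulator is an assumption; only the Math.abs(a[i] - b[i]) pattern is taken from the discussion above:

```java
public class MinMaxRedD {
    // Largest absolute delta between paired elements (maxRedD-style loop).
    static double maxRedD(double[] a, double[] b) {
        double max = 0.0d;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    // Smallest absolute delta (minRedD-style loop); the seed value here
    // is an assumption, chosen so every delta can lower it.
    static double minRedD(double[] a, double[] b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < a.length; i++) {
            min = Math.min(min, Math.abs(a[i] - b[i]));
        }
        return min;
    }
}
```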
For single-precision floating-point operands, as the experiments showed, we can have Max2F match only with COUNT == 3.
With such a small loop under the superword framework, it is difficult to tell how much improvement fmaxp/fminp gives over fmaxv+ins.
Although it sounds unreasonable for an application to use Float64Vector rather than Float128Vector,
the optimization is still useful for the Vector API Float64Vector.reduceLanes(VectorOperators.MAX), as mentioned previously.
Do you think we should remove the single-precision floating-point parts of this patch?
OK, I guess we'll keep both. Even though the acceleration for single-precision float is disappointing on these cores, it might well be useful for some future processor, and I do care about the Vector API.
@dgbo This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 157 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
/integrate
@theRealAph Thanks for the review. @pfustc @DamonFool Thank you for looking into this.
/sponsor |
@RealFYang @dgbo Since your change was applied there have been 158 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit ccac7aa. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
All linux-aarch64 builds have started failing in Oracle's CI testing after this change was pushed.
I'm really sorry for introducing a serious bug here. [1] #2052
This patch optimizes vectorial Min/Max reduction of two floating-point numbers on aarch64 with NEON instructions fmaxp and fminp.
Passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
Tests under test/jdk/jdk/incubator/vector/ run specially for correctness and passed.
Introduced a new JMH micro test/micro/org/openjdk/bench/vm/compiler/VectorReductionFloatingMinMax.java for performance testing.
Witnessed about 37% performance improvements on Kunpeng 916. The JMH results:
Progress
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/1925/head:pull/1925
$ git checkout pull/1925