8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp by dgbo · Pull Request #1925 · openjdk/jdk

dgbo · 2021-01-04T06:23:33Z

This patch optimizes vectorial Min/Max reduction of two floating-point numbers on aarch64 with NEON instructions fmaxp and fminp.

Passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
Tests under test/jdk/jdk/incubator/vector/ runned specially for the correctness and passed.

Introduced a new JMH micro test/micro/org/openjdk/bench/vm/compiler/VectorReductionFloatingMinMax.java for performance test.
Witnessed abount 37% performance improvements on Kunpeng916. The JMH Results:

Benchmark                (COUNT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReduction.maxRedD      512       0  avgt   10  678.126 ± 0.815  ns/op
VectorReduction.maxRedF      512       0  avgt   10  242.958 ± 0.212  ns/op
VectorReduction.minRedD      512       0  avgt   10  678.554 ± 0.824  ns/op
VectorReduction.minRedF      512       0  avgt   10  243.368 ± 0.205  ns/op

# Kunpeng 916, with fmaxp/fminp
VectorReduction.maxRedD      512       0  avgt   10  430.201 ± 0.353  ns/op => 36.56%
VectorReduction.maxRedF      512       0  avgt   10  243.404 ± 0.297  ns/op
VectorReduction.minRedD      512       0  avgt   10  427.805 ± 0.528  ns/op => 36.89%
VectorReduction.minRedF      512       0  avgt   10  242.963 ± 0.210  ns/op

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp

Reviewers

Andrew Haley (@theRealAph - Reviewer)

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1925/head:pull/1925
$ git checkout pull/1925

…/fmaxp

bridgekeeper · 2021-01-04T06:24:40Z

👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2021-01-04T06:25:50Z

@dgbo The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2021-01-04T06:29:09Z

Webrevs

dgbo · 2021-01-05T02:22:32Z

/label add hotspot-dev

openjdk · 2021-01-05T02:22:44Z

@dgbo
The hotspot label was successfully added.

theRealAph

This all looks reasonable. With the assembler change I think we should be good to go.

theRealAph · 2021-01-05T11:10:17Z

src/hotspot/cpu/aarch64/assembler_aarch64.hpp


 #undef INSN

+#define INSN(NAME, opc) \


This should be in an "AdvSIMD scalar pairwise" instruction group with faddp.

Done. The instructions faddp, fmax and fmin are put together in one group now.
Re-run the test/jdk/jdk/incubator/vector/ tests and passed.

dgbo · 2021-01-07T01:51:32Z

@theRealAph Hi, is there any further suggestions? :)

pfustc · 2021-01-07T03:03:31Z

test/micro/org/openjdk/bench/vm/compiler/VectorReductionFloatingMinMax.java

@@ -0,0 +1,97 @@
+/*
+ * Copyright (c) 2021, Huawei Technologies Co., Ltd. All rights reserved.


Should each new source file also include Oracle copyright?

Fixed, thank you for mentioning this.

Should each new source file also include Oracle copyright?

I'm not sure.
But there are already some files which don't contain Oracle copyright.

@dholmes-ora , what do you think of this question?
Thanks.

theRealAph · 2021-01-06T10:04:45Z

test/micro/org/openjdk/bench/vm/compiler/VectorReductionFloatingMinMax.java

+        float max = 0.0f;
+        for (int i = 0; i < COUNT; i++) {
+            max = Math.max(max, floatsA[i] - floatsB[i]);
+        }


This test code looks a bit contrived. If you're looking for the smallest delta it'd be

Math.max(max, Math.abs(floatsA[i] - floatsB[i]));

and if you're looking for the smallest value it'd probably be

Math.max(max, floatsA[i]);

Do we gain any advantage with these?

Hi,

As the experiment shows, we do not gain improvements with these.

For Math.max(max, Math.abs(floatsA[i] - floatsB[i])), the code is not vectorized if COUNT < 12.
When COUNT == 12, the node we have is not Max2F but Max4F.
The Math.max(max, floatsA[i] - floatsB[i]) suffers same problem that it does not match Max2F with small COUNT.

For Math.max(max, floatsA[i]), it is not auto-vectorized even with COUNT = 512.
I think the auto-vectorized optimization for this is disabled by JDK-8078563 [1].

One of the advantages of Max2F with fmaxp can gain for VectorAPI, the test code is available in [2].
We witnessed about 12% improvements for reduceLanes(VectorOperators.MAX) of FloatVector.SPECIES_64:

Benchmark (COUNT) (seed) Mode Cnt Score Error Units # Kunpeng 916, default VectorReduction2FMinMax.maxRed2F 512 0 avgt 10 667.173 ± 0.576 ns/op VectorReduction2FMinMax.minRed2F 512 0 avgt 10 667.172 ± 0.649 ns/op # Kunpeng 916, with fmaxp/fminp VectorReduction2FMinMax.maxRed2F 512 0 avgt 10 592.404 ± 0.885 ns/op VectorReduction2FMinMax.minRed2F 512 0 avgt 10 592.293 ± 0.607 ns/op

I agree the testcode for floats in VectorReductionFloatingMinMax.java is contrived.
Do you think we should replace the tests for MinMaxF in VectorReductionFloatingMinMax with tests in [2]?

[1] https://bugs.openjdk.java.net/browse/JDK-8078563
[2] VectorReduction2FMinMax.java.txt

I don't think the real problem is only the tests, it's that common cases don't get vectorized.
Can we fix this code so that it works with Math.abs() ?
Are there any examples of plausible Java code that benefit from this optimization?

According to the results of JMH perfasm, Math.max(max, Math.abs(floatsA[i] - floatsB[i])) is vectorized when COUNT=8 on a X86 platform.
While on aarch64, floatsB[i] = Math.abs(floatsA[i]) is not vectorized when COUNT = 10 and we can not match VAbs2F neither.
I am going to investigate the failed vectorization and see if we can have Max2F matched. Thanks.

Hi,

I made a mistake to say that the code is not vectorized with COUNT < 12, seems that the percentages of vectorized code is too small to be catched by JMH perfasm.
To observed if Min/MaxReductionVNode are created or not, I added a explicit print in ReductionNode::make, like:

--- a/src/hotspot/share/opto/vectornode.cpp +++ b/src/hotspot/share/opto/vectornode.cpp @@ -961,7 +961,9 @@ ReductionNode* ReductionNode::make(int opc, Node *ctrl, Node* n1, Node* n2, Basi case Op_MinReductionV: return new MinReductionVNode(ctrl, n1, n2); - case Op_MaxReductionV: return new MaxReductionVNode(ctrl, n1, n2); + case Op_MaxReductionV: + warning("in ReductionNode::make, making a MaxReductionVNode, length %d", n2->bottom_type()->is_vect()->length()); + return new MaxReductionVNode(ctrl, n1, n2); case Op_AndReductionV: return new AndReductionVNode(ctrl, n1, n2);

In my observation, we have Max4F when COUNT >= 4, it is resonable to create Max4F other than Max2F.
The Max2F is created with COUNT == 3 and -XX:-SuperWordLoopUnrollAnalysis.
But I did not find any noticeable improvements with such a small percentage.

The JMH has been updated, the performance results are:

Benchmark (COUNT_DOUBLE) (COUNT_FLOAT) (seed) Mode Cnt Score Error Units # Kunpeng 916, default VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 677.778 ± 0.694 ns/op VectorReductionFloatingMinMax.maxRedF 512 3 0 avgt 10 21.016 ± 0.097 ns/op VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 677.633 ± 0.664 ns/op VectorReductionFloatingMinMax.minRedF 512 3 0 avgt 10 21.001 ± 0.019 ns/op # Kunpeng 916, fmaxp/fminp VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 425.776 ± 0.785 ns/op VectorReductionFloatingMinMax.maxRedF 512 3 0 avgt 10 20.883 ± 0.033 ns/op VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 426.177 ± 3.258 ns/op VectorReductionFloatingMinMax.minRedF 512 3 0 avgt 10 20.871 ± 0.044 ns/op

Did you try math.abs() for doubles?

The Math.abs(doublesA[i] - doublesB[i]) has ~36% improvements.
I updated the tests for doubles with Math.abs(), it looks more consistent. Thanks.
The JMH results of doubles with Math.abs():

Benchmark (COUNT_DOUBLE) (COUNT_FLOAT) (seed) Mode Cnt Score Error Units # Kunpeng 916, default VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 681.319 ± 0.658 ns/op VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 682.596 ± 4.322 ns/op # Kunpeng 916, fmaxp/fminp VectorReductionFloatingMinMax.maxRedD 512 3 0 avgt 10 439.130 ± 0.450 ns/op => 35.54% VectorReductionFloatingMinMax.minRedD 512 3 0 avgt 10 439.105 ± 0.435 ns/op => 35.67%

For single-precision floating-point operands, as the experiments showed, we can have Max2F match only with COUNT == 3.
With such a small loop under superword framework, it is diffcult to tell how much improvements of fmaxp/fminp over fmaxv+ins.

Although it sounds unreasonable for an application to use Float64Vector rather than Float128Vecotor,
the optimization does being useful for VectorAPI Float64Vector.reduceLanes(VectorOperators.MAX) as mentioned previously.

Do you think we should remove single-precision floating-point parts in this patch?

OK, I guess we'll keep both. Even though the acceleration for single-precision float is disappointing on these cores, it might well be useful for some future processor, and I do care about the Vector API.

openjdk · 2021-01-12T10:27:15Z

@dgbo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 157 new commits pushed to the master branch:

563b268: 8257709: C1: Double assignment in InstructionPrinter::print_stack
400dc76: 8252015: [macos11] java.awt.TrayIcon requires updates for template images
ac2dee5: 8259539: JDK-8255711 broke trap messages
4697cfa: 8259576: Misplaced curly brace in Matcher::find_shared_post_visit
a6ab9e4: 8258576: Try to get zerobased CCS if heap is above 32 and CDS is disabled
a3561ae: 8258243: C2: assert failed ("Bad derived pointer") with -XX:+VerifyRegisterAllocator
4663704: 8259583: Remove unused decode_env::_codeBuffer
9d4e84f: 8259565: Zero: compiler/runtime/criticalnatives fails because CriticalJNINatives is not supported
98ccfbf: 8255710: Opensource unit/regression tests for CMM
61c5b95: 7194219: java/awt/Component/UpdatingBootTime/UpdatingBootTime.html fails on Linux
... and 147 more: https://git.openjdk.java.net/jdk/compare/57217b58bfd28c7ccbc0c3602e878c085dd865b9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

dgbo · 2021-01-12T12:56:03Z

/integrate

@theRealAph Thanks for the review.

@pfustc @DamonFool Thank you for looking into this.
I see some aarch64 files created recently include Oracle copyright, e.g. aarch64_sve.ad.
And when I wrote the test code, I did take some files with Oracle copyright as references.
I think the Oracle copyright statements in testcode should be kept there.

openjdk · 2021-01-12T12:56:36Z

@dgbo
Your change (at version 4d6d32e) is now ready to be sponsored by a Committer.

RealFYang · 2021-01-12T13:27:22Z

/sponsor

openjdk · 2021-01-12T13:28:02Z

@RealFYang @dgbo Since your change was applied there have been 158 commits pushed to the master branch:

4c75d14: 8259374: Make ThreadInVMfromNative have ResetNoHandleMark
563b268: 8257709: C1: Double assignment in InstructionPrinter::print_stack
400dc76: 8252015: [macos11] java.awt.TrayIcon requires updates for template images
ac2dee5: 8259539: JDK-8255711 broke trap messages
4697cfa: 8259576: Misplaced curly brace in Matcher::find_shared_post_visit
a6ab9e4: 8258576: Try to get zerobased CCS if heap is above 32 and CDS is disabled
a3561ae: 8258243: C2: assert failed ("Bad derived pointer") with -XX:+VerifyRegisterAllocator
4663704: 8259583: Remove unused decode_env::_codeBuffer
9d4e84f: 8259565: Zero: compiler/runtime/criticalnatives fails because CriticalJNINatives is not supported
98ccfbf: 8255710: Opensource unit/regression tests for CMM
... and 148 more: https://git.openjdk.java.net/jdk/compare/57217b58bfd28c7ccbc0c3602e878c085dd865b9...master

Your commit was automatically rebased without conflicts.

Pushed as commit ccac7aa.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

kimbarrett · 2021-01-12T14:11:07Z

All linux-aarch64 builds have started failing in Oracle's CI testing after this change was pushed.
See https://bugs.openjdk.java.net/browse/JDK-8259629

dgbo · 2021-01-12T16:25:06Z

All linux-aarch64 builds have started failing in Oracle's CI testing after this change was pushed.
See https://bugs.openjdk.java.net/browse/JDK-8259629

I'm really sorry for producing a serious BUG here.
I have raise a PR [1] to fix this. Would you mind having a look? Thanks.

[1] #2052

dgbo added 2 commits December 25, 2020 10:48

8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp…

e5ca842

…/fmaxp

Update copyright

71048b8

openjdk bot added the rfr Pull request is ready for review label Jan 4, 2021

openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 4, 2021

openjdk bot added the hotspot hotspot-dev@openjdk.org label Jan 5, 2021

rebase the .m4 entry

876939b

theRealAph reviewed Jan 5, 2021

View reviewed changes

put faddp/fmaxp/fminp together in a group

c0c6f51

pfustc reviewed Jan 7, 2021

View reviewed changes

include Oracle copyright in VectorReductionFloatingMinMax.java

96dc92f

theRealAph suggested changes Jan 7, 2021

View reviewed changes

dgbo added 2 commits January 11, 2021 13:09

modify test code to match Max2F

f244ed0

test improvements for doubles with Math.abs()

4d6d32e

theRealAph approved these changes Jan 12, 2021

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Jan 12, 2021

openjdk bot added the sponsor Pull request is ready to be sponsored label Jan 12, 2021

openjdk bot closed this Jan 12, 2021

openjdk bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated labels Jan 12, 2021

openjdk bot removed the rfr Pull request is ready for review label Jan 12, 2021

dgbo deleted the aarch64_optimize_reduce_fmax branch January 12, 2021 13:28

		@@ -0,0 +1,97 @@
		/*
		* Copyright (c) 2021, Huawei Technologies Co., Ltd. All rights reserved.

Conversation

dgbo commented Jan 4, 2021 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Download

Uh oh!

bridgekeeper bot commented Jan 4, 2021

Uh oh!

openjdk bot commented Jan 4, 2021

Uh oh!

mlbridge bot commented Jan 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

dgbo commented Jan 5, 2021

Uh oh!

openjdk bot commented Jan 5, 2021

Uh oh!

theRealAph left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dgbo commented Jan 7, 2021

Uh oh!

pfustc Jan 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dgbo Jan 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dgbo Jan 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openjdk bot commented Jan 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dgbo commented Jan 12, 2021

Uh oh!

openjdk bot commented Jan 12, 2021

Uh oh!

RealFYang commented Jan 12, 2021

Uh oh!

openjdk bot commented Jan 12, 2021

Uh oh!

kimbarrett commented Jan 12, 2021

Uh oh!

dgbo commented Jan 12, 2021

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

dgbo commented Jan 4, 2021 •

edited by openjdk bot

Loading

mlbridge bot commented Jan 4, 2021 •

edited

Loading

theRealAph left a comment •

edited

Loading

pfustc Jan 7, 2021 •

edited

Loading

dgbo Jan 9, 2021 •

edited

Loading

dgbo Jan 11, 2021 •

edited

Loading

openjdk bot commented Jan 12, 2021 •

edited

Loading