Skip to content

8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp#1925

Closed
dgbo wants to merge 7 commits intoopenjdk:masterfrom
dgbo:aarch64_optimize_reduce_fmax
Closed

8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp#1925
dgbo wants to merge 7 commits intoopenjdk:masterfrom
dgbo:aarch64_optimize_reduce_fmax

Conversation

@dgbo
Copy link
Member

@dgbo dgbo commented Jan 4, 2021

This patch optimizes vectorial Min/Max reduction of two floating-point numbers on aarch64 with NEON instructions fmaxp and fminp.

Passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
Tests under test/jdk/jdk/incubator/vector/ runned specially for the correctness and passed.

Introduced a new JMH micro test/micro/org/openjdk/bench/vm/compiler/VectorReductionFloatingMinMax.java for performance test.
Witnessed abount 37% performance improvements on Kunpeng916. The JMH Results:

Benchmark                (COUNT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReduction.maxRedD      512       0  avgt   10  678.126 ± 0.815  ns/op
VectorReduction.maxRedF      512       0  avgt   10  242.958 ± 0.212  ns/op
VectorReduction.minRedD      512       0  avgt   10  678.554 ± 0.824  ns/op
VectorReduction.minRedF      512       0  avgt   10  243.368 ± 0.205  ns/op

# Kunpeng 916, with fmaxp/fminp
VectorReduction.maxRedD      512       0  avgt   10  430.201 ± 0.353  ns/op => 36.56%
VectorReduction.maxRedF      512       0  avgt   10  243.404 ± 0.297  ns/op
VectorReduction.minRedD      512       0  avgt   10  427.805 ± 0.528  ns/op => 36.89%
VectorReduction.minRedF      512       0  avgt   10  242.963 ± 0.210  ns/op

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1925/head:pull/1925
$ git checkout pull/1925

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 4, 2021

👋 Welcome back dongbo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 4, 2021
@openjdk
Copy link

openjdk bot commented Jan 4, 2021

@dgbo The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 4, 2021
@mlbridge
Copy link

mlbridge bot commented Jan 4, 2021

Webrevs

@dgbo
Copy link
Member Author

dgbo commented Jan 5, 2021

/label add hotspot-dev

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jan 5, 2021
@openjdk
Copy link

openjdk bot commented Jan 5, 2021

@dgbo
The hotspot label was successfully added.

Copy link
Contributor

@theRealAph theRealAph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks reasonable. With the assembler change I think we should be good to go.


#undef INSN

#define INSN(NAME, opc) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in an "AdvSIMD scalar pairwise" instruction group with faddp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The instructions faddp, fmax and fmin are put together in one group now.
Re-run the test/jdk/jdk/incubator/vector/ tests and passed.

@dgbo
Copy link
Member Author

dgbo commented Jan 7, 2021

@theRealAph Hi, is there any further suggestions? :)

@@ -0,0 +1,97 @@
/*
* Copyright (c) 2021, Huawei Technologies Co., Ltd. All rights reserved.
Copy link
Member

@pfustc pfustc Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should each new source file also include Oracle copyright?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thank you for mentioning this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should each new source file also include Oracle copyright?

I'm not sure.
But there are already some files which don't contain Oracle copyright.

@dholmes-ora , what do you think of this question?
Thanks.

float max = 0.0f;
for (int i = 0; i < COUNT; i++) {
max = Math.max(max, floatsA[i] - floatsB[i]);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test code looks a bit contrived. If you're looking for the smallest delta it'd be

Math.max(max, Math.abs(floatsA[i] - floatsB[i]));

and if you're looking for the smallest value it'd probably be

Math.max(max, floatsA[i]);

Do we gain any advantage with these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

As the experiment shows, we do not gain improvements with these.

For Math.max(max, Math.abs(floatsA[i] - floatsB[i])), the code is not vectorized if COUNT < 12.
When COUNT == 12, the node we have is not Max2F but Max4F.
The Math.max(max, floatsA[i] - floatsB[i]) suffers same problem that it does not match Max2F with small COUNT.

For Math.max(max, floatsA[i]), it is not auto-vectorized even with COUNT = 512.
I think the auto-vectorized optimization for this is disabled by JDK-8078563 [1].

One of the advantages of Max2F with fmaxp can gain for VectorAPI, the test code is available in [2].
We witnessed about 12% improvements for reduceLanes(VectorOperators.MAX) of FloatVector.SPECIES_64:

Benchmark                         (COUNT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReduction2FMinMax.maxRed2F      512       0  avgt   10  667.173 ± 0.576  ns/op
VectorReduction2FMinMax.minRed2F      512       0  avgt   10  667.172 ± 0.649  ns/op
# Kunpeng 916, with fmaxp/fminp
VectorReduction2FMinMax.maxRed2F      512       0  avgt   10  592.404 ± 0.885  ns/op
VectorReduction2FMinMax.minRed2F      512       0  avgt   10  592.293 ± 0.607  ns/op

I agree the testcode for floats in VectorReductionFloatingMinMax.java is contrived.
Do you think we should replace the tests for MinMaxF in VectorReductionFloatingMinMax with tests in [2]?

[1] https://bugs.openjdk.java.net/browse/JDK-8078563
[2] VectorReduction2FMinMax.java.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the real problem is only the tests, it's that common cases don't get vectorized.
Can we fix this code so that it works with Math.abs() ?
Are there any examples of plausible Java code that benefit from this optimization?

Copy link
Member Author

@dgbo dgbo Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the results of JMH perfasm, Math.max(max, Math.abs(floatsA[i] - floatsB[i])) is vectorized when COUNT=8 on a X86 platform.
While on aarch64, floatsB[i] = Math.abs(floatsA[i]) is not vectorized when COUNT = 10 and we can not match VAbs2F neither.
I am going to investigate the failed vectorization and see if we can have Max2F matched. Thanks.

Copy link
Member Author

@dgbo dgbo Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I made a mistake to say that the code is not vectorized with COUNT < 12, seems that the percentages of vectorized code is too small to be catched by JMH perfasm.
To observed if Min/MaxReductionVNode are created or not, I added a explicit print in ReductionNode::make, like:

--- a/src/hotspot/share/opto/vectornode.cpp
+++ b/src/hotspot/share/opto/vectornode.cpp
@@ -961,7 +961,9 @@ ReductionNode* ReductionNode::make(int opc, Node *ctrl, Node* n1, Node* n2, Basi
   case Op_MinReductionV:  return new MinReductionVNode(ctrl, n1, n2);
-  case Op_MaxReductionV:  return new MaxReductionVNode(ctrl, n1, n2);
+  case Op_MaxReductionV:
+    warning("in ReductionNode::make, making a MaxReductionVNode, length %d", n2->bottom_type()->is_vect()->length());
+    return new MaxReductionVNode(ctrl, n1, n2);
   case Op_AndReductionV:  return new AndReductionVNode(ctrl, n1, n2);

In my observation, we have Max4F when COUNT >= 4, it is resonable to create Max4F other than Max2F.
The Max2F is created with COUNT == 3 and -XX:-SuperWordLoopUnrollAnalysis.
But I did not find any noticeable improvements with such a small percentage.

The JMH has been updated, the performance results are:

Benchmark                              (COUNT_DOUBLE)  (COUNT_FLOAT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  677.778 ± 0.694  ns/op
VectorReductionFloatingMinMax.maxRedF             512              3       0  avgt   10   21.016 ± 0.097  ns/op
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  677.633 ± 0.664  ns/op
VectorReductionFloatingMinMax.minRedF             512              3       0  avgt   10   21.001 ± 0.019  ns/op
# Kunpeng 916, fmaxp/fminp
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  425.776 ± 0.785  ns/op
VectorReductionFloatingMinMax.maxRedF             512              3       0  avgt   10   20.883 ± 0.033  ns/op
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  426.177 ± 3.258  ns/op
VectorReductionFloatingMinMax.minRedF             512              3       0  avgt   10   20.871 ± 0.044  ns/op

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try math.abs() for doubles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Math.abs(doublesA[i] - doublesB[i]) has ~36% improvements.
I updated the tests for doubles with Math.abs(), it looks more consistent. Thanks.
The JMH results of doubles with Math.abs():

Benchmark                              (COUNT_DOUBLE)  (COUNT_FLOAT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  681.319 ± 0.658  ns/op
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  682.596 ± 4.322  ns/op
# Kunpeng 916, fmaxp/fminp
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  439.130 ± 0.450  ns/op => 35.54%
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  439.105 ± 0.435  ns/op => 35.67%

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For single-precision floating-point operands, as the experiments showed, we can have Max2F match only with COUNT == 3.
With such a small loop under superword framework, it is diffcult to tell how much improvements of fmaxp/fminp over fmaxv+ins.

Although it sounds unreasonable for an application to use Float64Vector rather than Float128Vecotor,
the optimization does being useful for VectorAPI Float64Vector.reduceLanes(VectorOperators.MAX) as mentioned previously.

Do you think we should remove single-precision floating-point parts in this patch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I guess we'll keep both. Even though the acceleration for single-precision float is disappointing on these cores, it might well be useful for some future processor, and I do care about the Vector API.

@openjdk
Copy link

openjdk bot commented Jan 12, 2021

@dgbo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 157 new commits pushed to the master branch:

  • 563b268: 8257709: C1: Double assignment in InstructionPrinter::print_stack
  • 400dc76: 8252015: [macos11] java.awt.TrayIcon requires updates for template images
  • ac2dee5: 8259539: JDK-8255711 broke trap messages
  • 4697cfa: 8259576: Misplaced curly brace in Matcher::find_shared_post_visit
  • a6ab9e4: 8258576: Try to get zerobased CCS if heap is above 32 and CDS is disabled
  • a3561ae: 8258243: C2: assert failed ("Bad derived pointer") with -XX:+VerifyRegisterAllocator
  • 4663704: 8259583: Remove unused decode_env::_codeBuffer
  • 9d4e84f: 8259565: Zero: compiler/runtime/criticalnatives fails because CriticalJNINatives is not supported
  • 98ccfbf: 8255710: Opensource unit/regression tests for CMM
  • 61c5b95: 7194219: java/awt/Component/UpdatingBootTime/UpdatingBootTime.html fails on Linux
  • ... and 147 more: https://git.openjdk.java.net/jdk/compare/57217b58bfd28c7ccbc0c3602e878c085dd865b9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 12, 2021
@dgbo
Copy link
Member Author

dgbo commented Jan 12, 2021

/integrate

@theRealAph Thanks for the review.

@pfustc @DamonFool Thank you for looking into this.
I see some aarch64 files created recently include Oracle copyright, e.g. aarch64_sve.ad.
And when I wrote the test code, I did take some files with Oracle copyright as references.
I think the Oracle copyright statements in testcode should be kept there.

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jan 12, 2021
@openjdk
Copy link

openjdk bot commented Jan 12, 2021

@dgbo
Your change (at version 4d6d32e) is now ready to be sponsored by a Committer.

@RealFYang
Copy link
Member

/sponsor

@openjdk openjdk bot closed this Jan 12, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated labels Jan 12, 2021
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jan 12, 2021
@openjdk
Copy link

openjdk bot commented Jan 12, 2021

@RealFYang @dgbo Since your change was applied there have been 158 commits pushed to the master branch:

  • 4c75d14: 8259374: Make ThreadInVMfromNative have ResetNoHandleMark
  • 563b268: 8257709: C1: Double assignment in InstructionPrinter::print_stack
  • 400dc76: 8252015: [macos11] java.awt.TrayIcon requires updates for template images
  • ac2dee5: 8259539: JDK-8255711 broke trap messages
  • 4697cfa: 8259576: Misplaced curly brace in Matcher::find_shared_post_visit
  • a6ab9e4: 8258576: Try to get zerobased CCS if heap is above 32 and CDS is disabled
  • a3561ae: 8258243: C2: assert failed ("Bad derived pointer") with -XX:+VerifyRegisterAllocator
  • 4663704: 8259583: Remove unused decode_env::_codeBuffer
  • 9d4e84f: 8259565: Zero: compiler/runtime/criticalnatives fails because CriticalJNINatives is not supported
  • 98ccfbf: 8255710: Opensource unit/regression tests for CMM
  • ... and 148 more: https://git.openjdk.java.net/jdk/compare/57217b58bfd28c7ccbc0c3602e878c085dd865b9...master

Your commit was automatically rebased without conflicts.

Pushed as commit ccac7aa.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dgbo dgbo deleted the aarch64_optimize_reduce_fmax branch January 12, 2021 13:28
@kimbarrett
Copy link

All linux-aarch64 builds have started failing in Oracle's CI testing after this change was pushed.
See https://bugs.openjdk.java.net/browse/JDK-8259629

@dgbo
Copy link
Member Author

dgbo commented Jan 12, 2021

All linux-aarch64 builds have started failing in Oracle's CI testing after this change was pushed.
See https://bugs.openjdk.java.net/browse/JDK-8259629

I'm really sorry for producing a serious BUG here.
I have raise a PR [1] to fix this. Would you mind having a look? Thanks.

[1] #2052

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

6 participants