Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8308340: C2: Idealize Fma nodes #14576

Closed
wants to merge 8 commits into from
Closed

Conversation

fg1417
Copy link

@fg1417 fg1417 commented Jun 21, 2023

Some platforms, like aarch64, ppc, and riscv, support fusing Math.fma(-a, b, c) or Math.fma(a, -b, c) by generating partially symmetric match rules like:

  match(Set dst (FmaF src3 (Binary (NegF src1) src2)));
  match(Set dst (FmaF src3 (Binary src1 (NegF src2))));

Since Fma is partially commutative, the patch is to convert Math.fma(-a, b, c) to Math.fma(b, -a, c) in gvn phase, making node patterns canonical. Then we can remove redundant rules.

Also, we should guarantee that C2 generates Fma nodes only on platforms supporting Fma instructions before matcher, so we can remove all predicate(UseFMA) for all Fma rules.

After the patch, the code size of libjvm.so on aarch64 platform decreased by 63.4k.

The patch passed all tier 1 - 3 on aarch64 and x86 platforms.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8308340: C2: Idealize Fma nodes (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14576/head:pull/14576
$ git checkout pull/14576

Update a local copy of the PR:
$ git checkout pull/14576
$ git pull https://git.openjdk.org/jdk.git pull/14576/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14576

View PR using the GUI difftool:
$ git pr show -t 14576

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14576.diff

Webrev

Link to Webrev Comment

Some platforms, like aarch64, ppc, and riscv, support fusing
`Math.fma(-a, b, c)` or `Math.fma(a, -b, c)` by generating
partially symmetric match rules like:

```
  match(Set dst (FmaF src3 (Binary (NegF src1) src2)));
  match(Set dst (FmaF src3 (Binary src1 (NegF src2))));
```

Since `Fma` is partially communitive, the patch is to convert
`Math.fma(-a, b, c)` to `Math.fma(b, -a, c)` in gvn phase,
making node patterns canonical. Then we can remove redundant
rules.

Also, we should guarantee that C2 generates `Fma` nodes only on
platforms supporting `Fma` instructions before matcher, so we
can remove all `predicate(UseFMA)` for all `Fma` rules.

After the patch, the code size of libjvm.so on aarch64 platform
decreased by 63.4k.

The patch passed all tier 1 - 3 on aarch64 and x86 platforms.
@bridgekeeper
Copy link

bridgekeeper bot commented Jun 21, 2023

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 21, 2023
@openjdk
Copy link

openjdk bot commented Jun 21, 2023

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jun 21, 2023
@mlbridge
Copy link

mlbridge bot commented Jun 21, 2023

Webrevs

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for UseFMA should be moved from c2compiler.cpp to Matcher::match_rule_supported in .ad files.
I see we have such check for Fma vectors in x86.ad but not for scalars.
Similar issue exist for other platforms.

@fg1417
Copy link
Author

fg1417 commented Jun 27, 2023

Check for UseFMA should be moved from c2compiler.cpp to Matcher::match_rule_supported in .ad files. I see we have such check for Fma vectors in x86.ad but not for scalars. Similar issue exist for other platforms.

@vnkozlov Thanks for your review! I updated it in the latest commit.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
You need second review.

@openjdk
Copy link

openjdk bot commented Jun 27, 2023

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308340: C2: Idealize Fma nodes

Reviewed-by: kvn, epeter

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 11 new commits pushed to the master branch:

  • 207bd00: 8313756: [BACKOUT] 8308682: Enhance AES performance
  • 823f5b9: 8308850: Change JVM options with small ranges that get -Wconversion warnings to 32 bits
  • 5bfb82e: 8314119: G1: Fix -Wconversion warnings in G1CardSetInlinePtr::card_pos_for
  • 06aa3c5: 8314118: Update JMH devkit to 1.37
  • 4164693: 8313372: [JVMCI] Export vmIntrinsics::is_intrinsic_available results to JVMCI compilers.
  • 049b55f: 8314019: Add gc logging to jdk/jfr/event/gc/detailed/TestZAllocationStallEvent.java
  • a39ed10: 8314116: C2: assert(false) failed: malformed control flow after JDK-8305636
  • 1de5bf1: 8314106: C2: assert(is_valid()) failed: must be valid after JDK-8305636
  • 5c91622: 8314117: RISC-V: Incorrect VMReg encoding in RISCV64Frame.java
  • 6bbcef5: 8313948: Remove unnecessary static fields defaultUpper/defaultLower in sun.net.PortConfig
  • ... and 1 more: https://git.openjdk.org/jdk/compare/ec0cc6300a02dd92b25d9072b8b3859dab583bbd...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 27, 2023
@fg1417
Copy link
Author

fg1417 commented Jun 28, 2023

Thanks for your review @vnkozlov .

I would appreciate it very much if some expert on ppc or riscv could help review it! Perhaps @RealFYang @reinrich

@fg1417
Copy link
Author

fg1417 commented Jul 5, 2023

Can I get a second review, please? Thanks.

@@ -1875,6 +1875,17 @@ Node* VectorLongToMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) {
return nullptr;
}

Node* FmaVNode::Ideal(PhaseGVN* phase, bool can_reshape) {
// We canonicalize the node by converting "(-a)*b+c" into "b*(-a)+c"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a little bit more please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review!

For vectorapi masked operations, like av.neg().lanewise(VectorOperators.FMA, bv, cv, mask), the inactive lanes of the output should save the first input of the node, so the inactive lanes of the output should be equal to lane values in av.neg(). If we exchange the inputs, the inactive lanes will be equal to bv, which is incorrect. So we shouldn't swap edges for masked nodes. The newly added testcases in jdk/test/hotspot/jtreg/compiler/vectorapi/VectorFusedMultiplyAddSubTest.java can cover this. Fortunately, there is no such constraint for non-masked vector nodes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I think I understood it to some degree.
What happens with the subgraphs that are not canonicalized? They will have extra vector operations, right?

Copy link
Author

@fg1417 fg1417 Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. For av.neg().lanewise(VectorOperators.FMA, bv, cv, mask), the subgraph is like:
match (Set dst (FmaV (Binary (NegV src1) src2) (Binary src3 pg)));, almost no platform supports fusing it directly, so it should be split into two vector operations: NegV + FmaV. I suppose the NegV is what you called as "the extra vector operation", right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I meant. Thanks. Now on PPC, my understanding would be that with the symmetrical match-rules (removed with this pr) the NegV wouldn't be generated. Is my understanding correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. How are FmaV nodes with mask handled then? Are they transformed into equivalent nodes without mask?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is no handling on FmaV nodes with mask in this patch, whether in the C2 mid-end or codegen backend. The gvn transformation just skips them. And I suppose FmaV nodes with mask can't be transformed into nodes without mask, except that C2 can guarantee that the mask is all true (this transformation has not been supported by current C2). Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I would have to dig deeper into the vector api implementation to really understand how it works.
Thanks for your patience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg1417 I only understood the comment with the help of your explanations in this thread. I think you should improve the comment. I would not mention the vectorapi. We may generate FmaV through an auto-vectorizer. Though I guess that is unlikely, since the scalar version Fma::Ideal would already reshape things.

Suggestion:

// We canonicalize the node by converting "(-a)*b+c" into "b*(-a)+c"
// This reduces the number of rules in the matcher, as we only need to check
// for negations on the second argument, and not the symmetric case where
// the first argument is negated.
// We cannot do this if he FmaV is masked. the inactive lanes have to return
// the first input (ie "-a"). If we were to swap the inputs, the inactive lanes would
// incorrectly return "b".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done.

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the patch to our nightly testing about 10 days ago and forgot about it (sorry).
So it passed several iterations of tier1-4 of hotspot and jdk, all of langtools and jaxp, renaissance benchmarks as functional tests. All testing with fastdebug and release builds on the main platforms and also on Linux/PPC64le.

PPC changes do look good to me. I'm not the greatest C2 expert though. So I'd suggest to get another review.

@fg1417
Copy link
Author

fg1417 commented Jul 18, 2023

I've added the patch to our nightly testing about 10 days ago and forgot about it (sorry). So it passed several iterations of tier1-4 of hotspot and jdk, all of langtools and jaxp, renaissance benchmarks as functional tests. All testing with fastdebug and release builds on the main platforms and also on Linux/PPC64le.

PPC changes do look good to me. I'm not the greatest C2 expert though. So I'd suggest to get another review.

Thanks a lot for your review and test work @reinrich!

Can I get a review from @TobiHartmann @chhagedorn @eme64 for C2 part? Thanks!

@RealFYang
Copy link
Member

RealFYang commented Jul 20, 2023

Thanks for your review @vnkozlov .

I would appreciate it very much if some expert on ppc or riscv could help review it! Perhaps @RealFYang @reinrich

Hello, the RISC-V part looks fine from what this PR is supposed to do. And this has passed tier1-3 tests on linux-riscv64 platform. Note that I didn't check the shared code changes.

@fg1417
Copy link
Author

fg1417 commented Jul 20, 2023

Hello, the RISC-V part looks fine from what this PR is supposed to do. And this has passed tier1-3 tests on linux-riscv64 platform. I didn't check the shared code changes.

@RealFYang Thanks for your review and test work!

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg1417 This looks like a reasonable refactoring.

We should probably do the verification that the canonicalization happened, if the normal fma matcher rule is chosen. We should add asserts that the first argument is not a negation (you could check the second argument also, just in case). What do you think?

@@ -60,7 +60,7 @@ public class VectorFusedMultiplyAddSubTest {
private static final VectorSpecies<Long> L_SPECIES = LongVector.SPECIES_MAX;
private static final VectorSpecies<Short> S_SPECIES = ShortVector.SPECIES_MAX;

private static int LENGTH = 1024;
private static int LENGTH = 128;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for the reduction? Speed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for speeding up.

@@ -1875,6 +1875,17 @@ Node* VectorLongToMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) {
return nullptr;
}

Node* FmaVNode::Ideal(PhaseGVN* phase, bool can_reshape) {
// We canonicalize the node by converting "(-a)*b+c" into "b*(-a)+c"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg1417 I only understood the comment with the help of your explanations in this thread. I think you should improve the comment. I would not mention the vectorapi. We may generate FmaV through an auto-vectorizer. Though I guess that is unlikely, since the scalar version Fma::Ideal would already reshape things.

Suggestion:

// We canonicalize the node by converting "(-a)*b+c" into "b*(-a)+c"
// This reduces the number of rules in the matcher, as we only need to check
// for negations on the second argument, and not the symmetric case where
// the first argument is negated.
// We cannot do this if he FmaV is masked. the inactive lanes have to return
// the first input (ie "-a"). If we were to swap the inputs, the inactive lanes would
// incorrectly return "b".

//=============================================================================
//------------------------------Ideal------------------------------------------
Node* FmaNode::Ideal(PhaseGVN* phase, bool can_reshape) {
// We canonicalize the node by converting "(-a)*b+c" into "b*(-a)+c"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add motivation to comment

// This reduces the number of rules in the matcher, as we only need to check
// for negations on the second argument, and not the symmetric case where
// the first argument is negated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done.

@@ -3972,7 +3973,6 @@ instruct fmaD_reg(regD a, regD b, regD c) %{

// a * b + c
instruct fmaF_reg(regF a, regF b, regF c) %{
predicate(UseFMA);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an assert to the encoding code. Just to ensure that we do not generate bad code, even if it is never executed during testing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for your suggestion! Updated in the new commit.

@fg1417
Copy link
Author

fg1417 commented Aug 14, 2023

We should probably do the verification that the canonicalization happened, if the normal fma matcher rule is chosen. We should add asserts that the first argument is not a negation (you could check the second argument also, just in case). What do you think?

Hi @eme64, thanks for your review!

The check may be more complex than expected. Matcher can fuse two instructions into one, only when there is no other use for the inputs. It means that we can do the fusion for the case like:

return Math.fma(-a, b, c);

But we can't fuse them for the case like:

float tmp = -a;
return Math.fma(tmp, b, c) + (-a);

For the second case, we still match normal neg + fma rules separately, instead of these combined rules. So, we can't simply guarantee that the first argument is not a negation when the normal fma matcher rule is chosen. If considering the def-use while doing the instruction selection, the check may be complex. WDYT? Thanks.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg1417 fair enough. the checks may be too complex. I'll approve it now :)

@fg1417
Copy link
Author

fg1417 commented Aug 14, 2023

Thanks a lot for all your kind reviews and test work, @RealFYang @vnkozlov @eme64 @reinrich.

I'll integrate it.

@fg1417
Copy link
Author

fg1417 commented Aug 15, 2023

/integrate

@openjdk
Copy link

openjdk bot commented Aug 15, 2023

Going to push as commit 37c6b23.
Since your change was applied there have been 23 commits pushed to the master branch:

  • 583cb75: 8313406: nep_invoker_blob can be simplified more
  • 0074b48: 8312597: Convert TraceTypeProfile to UL
  • 1f1c5c6: 8314241: Add test/jdk/sun/security/pkcs/pkcs7/SignerOrder.java to ProblemList
  • f142470: 8311981: Test gc/stringdedup/TestStringDeduplicationAgeThreshold.java#ZGenerational timed out
  • 595fdd3: 8314059: Remove PKCS7.verify()
  • 49b2984: 8313854: Some tests in serviceability area fail on localized Windows platform
  • c132176: 8114830: (fs) Files.copy fails due to interference from something else changing the file system
  • e56d3bc: 8313657: com.sun.jndi.ldap.Connection.cleanup does not close connections on SocketTimeoutErrors
  • 4b2703a: 8313678: SymbolTable can leak Symbols during cleanup
  • f41c267: 8314045: ArithmeticException in GaloisCounterMode
  • ... and 13 more: https://git.openjdk.org/jdk/compare/ec0cc6300a02dd92b25d9072b8b3859dab583bbd...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 15, 2023
@openjdk openjdk bot closed this Aug 15, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 15, 2023
@openjdk
Copy link

openjdk bot commented Aug 15, 2023

@fg1417 Pushed as commit 37c6b23.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

5 participants