8318723: RISC-V: C2 UDivL #16346

Closed
Hamlin-Li wants to merge 6 commits

Conversation

Hamlin-Li

@Hamlin-Li Hamlin-Li commented Oct 24, 2023

Hi,
Could you review this change, which adds intrinsics for UDivI and UDivL?
Thanks!

Tests

Functionality

Successfully ran the tests found via grep -nr test/jdk/ -we divideUnsigned and grep -nr test/hotspot/ -we divideUnsigned

Performance

( NOTE: there are 2 other related issues: https://bugs.openjdk.org/browse/JDK-8318225, https://bugs.openjdk.org/browse/JDK-8318226; their PRs will be sent out after this one is finished. )

Long

NOTE: a positive divisor is the common case; a negative divisor is a rare case

Before

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   10  19684.873 ± 21.882  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   10  28853.041 ±  6.425  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   10   6367.239 ± 16.011  ns/op

After

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   10  22622.133 ±  7.158  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   10  15957.272 ±  3.174  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   10  29499.721 ± 10.404  ns/op

Integer

Before

IntegerDivMod.testDivideUnsigned                    1024          mixed  avgt   10  23397.267 ± 36.980  ns/op
IntegerDivMod.testDivideUnsigned                    1024       positive  avgt   10  16792.414 ±  5.869  ns/op
IntegerDivMod.testDivideUnsigned                    1024       negative  avgt   10  30184.357 ± 55.464  ns/op

After

IntegerDivMod.testDivideUnsigned                    1024          mixed  avgt   10  23210.437 ±  4.463  ns/op
IntegerDivMod.testDivideUnsigned                    1024       positive  avgt   10  16622.342 ±  4.047  ns/op
IntegerDivMod.testDivideUnsigned                    1024       negative  avgt   10  30013.414 ± 48.695  ns/op

/************ following is just backup: quick path for negative divisor *************/

Long

Before

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   10  19704.317 ± 64.078  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   10  28856.859 ± 14.901  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   10   6364.974 ±  2.465  ns/op

After v1
(This is a simpler version, please check the diff from After v2 below)

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   10  22668.228 ± 74.161  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   10  15966.320 ± 14.985  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   10  29518.033 ± 49.056  ns/op

After v2
(This is the current patch, This version has a huge regression for negative values!!!)

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   10  11432.738 ±  95.785  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   10  15969.044 ±  19.492  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   10   6376.674 ±  16.869  ns/op
Diff of v1 from v2
diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
index b96f7611133..dfb40e171e7 100644
--- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
+++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
@@ -2432,16 +2432,7 @@ int MacroAssembler::corrected_idivq(Register result, Register rs1, Register rs2,
     if (is_signed) {
       div(result, rs1, rs2);
     } else {
-      Label Lltz, Ldone;
-      bltz(rs2, Lltz);
       divu(result, rs1, rs2);
-      j(Ldone);
-      bind(Lltz); // For the algorithm details, check j.l.Long::divideUnsigned
-      sub(result, rs1, rs2);
-      notr(result, result);
-      andr(result, result, rs1);
-      srli(result, result, 63);
-      bind(Ldone);
     }
   } else {
     rem(result, rs1, rs2); // result = rs1 % rs2;
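
For readers unfamiliar with the sequence removed in the diff above, the following C++ sketch models what the sub / notr / andr / srli instructions compute. It follows the expression documented in java.lang.Long::divideUnsigned for divisors with the top bit set. The function name udiv_negative_divisor is hypothetical; this is an illustration under that assumption, not HotSpot source.

```cpp
#include <cstdint>

// Hypothetical model of the quick path. It is only valid when rs2 has
// bit 63 set (negative when viewed as a signed long). In that case the
// unsigned quotient can only be 0 or 1.
inline uint64_t udiv_negative_divisor(uint64_t rs1, uint64_t rs2) {
  uint64_t result = rs1 - rs2;  // sub  result, rs1, rs2
  result = ~result;             // notr result, result
  result &= rs1;                // andr result, result, rs1
  return result >> 63;          // srli result, result, 63
}
```

The result is 1 exactly when rs1 is greater than or equal to rs2 as unsigned values, which is the only way the quotient can be non-zero for such a large divisor.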

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16346/head:pull/16346
$ git checkout pull/16346

Update a local copy of the PR:
$ git checkout pull/16346
$ git pull https://git.openjdk.org/jdk.git pull/16346/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16346

View PR using the GUI difftool:
$ git pr show -t 16346

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16346.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 24, 2023

👋 Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 24, 2023
@openjdk

openjdk bot commented Oct 24, 2023

@Hamlin-Li The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 24, 2023
@mlbridge

mlbridge bot commented Oct 24, 2023

Webrevs

div(result, rs1, rs2);
} else {
Label Lltz, Ldone;
bltz(rs2, Lltz);
Member

I am not quite sure what this bltz branch is for. Is this a minor performance tuning? If so, how would it make a difference? I didn't see much difference in the LongDivMod.testDivideUnsigned negative JMH test result.

Member

+1. It's also the only test case where there is a regression on the JMH numbers, or at least not a clear improvement (before: 6385.280, after: 6433.223)

On your JMH numbers, how many iterations have you run for each benchmark? I don't see the standard deviation which would be useful to better understand noise.

Author

@Hamlin-Li Hamlin-Li Oct 25, 2023

For the algorithm details, see j.l.Long::divideUnsigned in the JDK library source, which describes this algorithm; I also point to it in a comment in this patch.

It's not related to the difference between the negative and positive test cases; it's related to the cost of the divxx instructions. Compared with lines 2440 ~ 2443 in src/hotspot/cpu/riscv/macroAssembler_riscv.cpp, the cost of divu for a negative value is still very high.

int_def ALU_COST             (  100,  1 * DEFAULT_COST);
int_def BRANCH_COST          (  200,  2 * DEFAULT_COST);
int_def IDIVDI_COST          ( 6600, 66 * DEFAULT_COST);
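
To make the cost argument concrete, the arithmetic can be sketched as below. It uses only the three cost values quoted above. Treating the quick path as one branch plus four ALU instructions (sub, notr, andr, srli), and treating divu as costing IDIVDI_COST, are assumptions based on the instruction sequence in this patch; QUICK_PATH_COST is a hypothetical name. These are static cost-model numbers, not measured latencies.

```cpp
// Values copied from the int_def lines quoted from riscv.ad.
constexpr int ALU_COST    = 100;
constexpr int BRANCH_COST = 200;
constexpr int IDIVDI_COST = 6600;

// Assumed quick path: bltz + sub + notr + andr + srli.
constexpr int QUICK_PATH_COST = BRANCH_COST + 4 * ALU_COST;

// Under this model the quick path is estimated at 600 versus 6600 for divu.
static_assert(QUICK_PATH_COST == 600, "one branch plus four ALU ops");
static_assert(IDIVDI_COST / QUICK_PATH_COST == 11, "divu is modelled as 11x the quick path");
```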

I have also re-run the benchmark with more warmup (5) and iterations (10); please check the data in the PR description.
I have also attached the diff between the v1 and v2 intrinsics. v2 is this patch. v1 is a diff based on v2: it just uses the RISC-V divxx instructions directly, without the optimization for negative values brought by the algorithm (i.e. without the bltz and the related code).

Author

I don't know why the previous JMH data had no error column; maybe the error was too small to show.

Member

IIUC, with the branch, the results are 6376.674 ± 16.869 ns/op, and without the branch, they are 29518.033 ± 49.056 ns/op, correct? If so, the branch makes more sense, at least on the board you've tested.

Author

Yes, for a negative divisor, unsigned division can take a quick path which is much faster than the built-in instructions.
This is reflected in the div cost in riscv.ad, and is also verified by the benchmark runs on the board.

( I'm not sure whether the built-in div will become faster in the future; if it does, we will need to redefine the div cost in riscv.ad and revisit this intrinsic. )

Member

Ok, so let's keep the version with branch. You should add a comment at https://github.com/openjdk/jdk/pull/16346/files#diff-7a5c3ed05b6f3f06ed1c59f5fc2a14ec566a6a5bd1d09606115767daa99115bdR2435 explaining just that.

Contributor

But you're putting a conditional branch in the way of the common cases, and you're greatly increasing icache pressure, for the sake of a rare case. How does that make any sense?

Member

I agree with @theRealAph here, this seems like a micro optimisation for the sake of microbenchmark that will not be beneficial in general. And if a considerable portion of divisors does lie in this range, the optimisation can always be applied from the caller side. Furthermore, by doing so we will even have the benefit of branch profiling, which will help achieve better results. Another note is that I do not know any compiler that does this premature optimisation. Thanks.
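
As a rough illustration of applying the optimisation from the caller side, a caller that knows many of its divisors have the top bit set could test for that itself and fall back to plain division otherwise. The function name below is hypothetical, and the code is a C++ model of the idea rather than anything proposed in this PR.

```cpp
#include <cstdint>

// Hypothetical caller-side variant: the range check lives in user code,
// so the intrinsic itself can stay a single divu.
// The caller is assumed to have ruled out divisor == 0 already.
inline uint64_t divide_unsigned_with_caller_check(uint64_t dividend,
                                                  uint64_t divisor) {
  if (divisor >> 63) {
    // divisor >= 2^63, so the quotient can only be 0 or 1.
    return dividend >= divisor ? 1 : 0;
  }
  return dividend / divisor;
}
```

Written this way, the branch is only paid for by callers that opt in, and in Java code such a branch would be visible to the JIT's branch profiling, as the comment above notes.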

Author

@Hamlin-Li Hamlin-Li Oct 26, 2023

Thanks @theRealAph @merykitty for the comments, I agree with you.
I have used divu instead of introducing a conditional branch here. We will treat the regression in the negative (and mixed) test cases as a rare case, and just make sure the common case (positive divisors) gets optimized.

@@ -241,9 +241,9 @@ class MacroAssembler: public Assembler {

   // idiv variant which deals with MINLONG as dividend and -1 as divisor
   int corrected_idivl(Register result, Register rs1, Register rs2,
-                      bool want_remainder);
+                      bool want_remainder, bool is_signed = true);
Member

Could you drop the default value of is_signed, so that it is clear at each call site which case is meant?

Author

The reason I used a default value for is_signed is that both corrected_idivx functions are also used in cpu/riscv/c1_LIRAssembler_arith_riscv.cpp, which I don't want to touch in this PR.
But if you insist, I can remove the default.

Author

I have removed the default values.
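
The effect of requiring the flag can be seen in a small scalar model. This is a sketch only: idivq_model is a hypothetical stand-in for corrected_idivq that operates on plain integers instead of registers, and it assumes a zero divisor has already been rejected by the caller.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical scalar model. Both flags are required, so every call
// site has to state whether it wants signed or unsigned semantics.
inline int64_t idivq_model(int64_t rs1, int64_t rs2,
                           bool want_remainder, bool is_signed) {
  if (is_signed) {
    // MINLONG / -1 overflows: the quotient wraps to MINLONG, remainder is 0.
    if (rs1 == std::numeric_limits<int64_t>::min() && rs2 == -1) {
      return want_remainder ? 0 : rs1;
    }
    return want_remainder ? rs1 % rs2 : rs1 / rs2;
  }
  // Unsigned case: reinterpret the bits and divide as unsigned values.
  const uint64_t a = static_cast<uint64_t>(rs1);
  const uint64_t b = static_cast<uint64_t>(rs2);
  return static_cast<int64_t>(want_remainder ? a % b : a / b);
}
```

A call such as idivq_model(a, b, /* want_remainder */ false, /* is_signed */ false) then reads unambiguously, where a defaulted parameter would have silently selected the signed variant.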

@Hamlin-Li
Author

/solves JDK-8318224

@openjdk

openjdk bot commented Oct 25, 2023

@Hamlin-Li
Adding additional issue to solves list: 8318224: RISC-V: C2 UDivI.

@RealFYang
Member

So I tried this on a HiFive Unmatched board. Unfortunately, the JMH test shows some regression for the LongDivMod.testDivideUnsigned negative case.

Before:

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   15  24909.748 ±  17.915  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   15  36257.181 ±  33.615  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   15   6720.904 ±   8.522  ns/op        <====

After:

LongDivMod.testDivideUnsigned                    1024          mixed  avgt   15  13650.002 ±  52.788  ns/op
LongDivMod.testDivideUnsigned                    1024       positive  avgt   15  18784.942 ±  18.258  ns/op
LongDivMod.testDivideUnsigned                    1024       negative  avgt   15   7168.625 ±  17.019  ns/op        <====

@Hamlin-Li
Author

Thanks @RealFYang for testing. I have used divu instead of introducing the conditional branch, and will treat the negative case as rare, only making sure the positive case gets optimized.

@theRealAph
Contributor

So I tried this on a HiFive Unmatched board. Unfortunately, the JMH test shows some regression for the LongDivMod.testDivideUnsigned negative case.

But that case is going to be rare. The larger a number is, the less common it is. The uniform distribution of this benchmark, in which 0 is as common as 0xb43a61c853a2af20, is grossly unrepresentative of real-world divisors.

In practice, numbers follow some kind of log-normal distribution. Don't fall into the trap of optimizing for a benchmark.
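
The gap between the two distributions can be illustrated with a small simulation. This is a sketch under stated assumptions: the function names are hypothetical, and the log-normal parameters mu and sigma are arbitrary inputs chosen by the caller, not values measured from any real workload.

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Fraction of uniformly distributed 64-bit divisors with bit 63 set,
// i.e. divisors that would have taken the quick path.
inline double top_bit_fraction_uniform(unsigned seed, int samples) {
  std::mt19937_64 rng(seed);
  int hits = 0;
  for (int i = 0; i < samples; ++i) {
    if (rng() >> 63) ++hits;
  }
  return static_cast<double>(hits) / samples;
}

// Same fraction when divisors are drawn from a log-normal distribution
// with caller-supplied (assumed) parameters mu and sigma.
inline double top_bit_fraction_lognormal(unsigned seed, int samples,
                                         double mu, double sigma) {
  std::mt19937_64 rng(seed);
  std::lognormal_distribution<double> dist(mu, sigma);
  const double two_pow_63 = std::ldexp(1.0, 63);
  int hits = 0;
  for (int i = 0; i < samples; ++i) {
    if (dist(rng) >= two_pow_63) ++hits;
  }
  return static_cast<double>(hits) / samples;
}
```

With uniform sampling roughly half of the divisors land in the quick-path range, whereas with, say, mu = 10 and sigma = 3 a value of at least 2^63 lies more than ten standard deviations out in log space, so the quick path would essentially never be taken.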

@Hamlin-Li
Author

"Don't fall into the trap of optimizing for a benchmark." Agreed.

@Hamlin-Li
Author

I have updated the test results too; they differ little from the previous v1 implementation.

"not $dst, $dst\n\t"
"and $dst, $dst, $src1\n\t"
"srli $dst, $dst, 63\n\t"
"Ldone:\t#@UdivL"
Member

You might want to update this part to reflect the latest changes.

Author

Thanks for catching it. Done.

@openjdk

openjdk bot commented Oct 26, 2023

@Hamlin-Li This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8318723: RISC-V: C2 UDivL
8318224: RISC-V: C2 UDivI

Reviewed-by: fyang, luhenry, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 126 new commits pushed to the master branch:

  • 9864951: 8318447: Move NMT source code to own subdirectory
  • 744e089: 8318700: MacOS Zero cannot run gtests due to wrong JVM path
  • ec1bf23: 8318801: Parallel: Remove unused verify_all_young_refs_precise
  • 3cea892: 8318805: RISC-V: Wrong comments instructions cost in riscv.ad
  • bc1ba24: 8316437: JFR: assert(!tl->has_java_buffer()) failed: invariant
  • 970cd20: 8318788: java/net/Socks/SocksSocketProxySelectorTest.java fails on machines with no IPv6 link-local addresses
  • 37c40a1: 8318705: [macos] ProblemList java/rmi/registry/multipleRegistries/MultipleRegistries.java
  • 723db2d: 8305321: Remove unused exports in java.desktop
  • 811b436: 8318720: G1: Memory leak in G1CodeRootSet after JDK-8315503
  • a542f73: 8318843: ProblemList java/lang/management/MemoryMXBean/CollectionUsageThreshold.java in Xcomp
  • ... and 116 more: https://git.openjdk.org/jdk/compare/47bb1a1cefa242c39c22a8f2aa08d7d357c260b9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 26, 2023
Member

@RealFYang RealFYang left a comment

Updated change LGTM.

@Hamlin-Li
Author

/integrate
@RealFYang @theRealAph @luhenry @merykitty Thanks for your reviews.

@openjdk

openjdk bot commented Oct 26, 2023

Going to push as commit 40a3c35.
Since your change was applied there have been 127 commits pushed to the master branch:

  • 3885dc5: 8318737: Fallback linker passes bad JNI handle
  • 9864951: 8318447: Move NMT source code to own subdirectory
  • 744e089: 8318700: MacOS Zero cannot run gtests due to wrong JVM path
  • ec1bf23: 8318801: Parallel: Remove unused verify_all_young_refs_precise
  • 3cea892: 8318805: RISC-V: Wrong comments instructions cost in riscv.ad
  • bc1ba24: 8316437: JFR: assert(!tl->has_java_buffer()) failed: invariant
  • 970cd20: 8318788: java/net/Socks/SocksSocketProxySelectorTest.java fails on machines with no IPv6 link-local addresses
  • 37c40a1: 8318705: [macos] ProblemList java/rmi/registry/multipleRegistries/MultipleRegistries.java
  • 723db2d: 8305321: Remove unused exports in java.desktop
  • 811b436: 8318720: G1: Memory leak in G1CodeRootSet after JDK-8315503
  • ... and 117 more: https://git.openjdk.org/jdk/compare/47bb1a1cefa242c39c22a8f2aa08d7d357c260b9...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 26, 2023
@openjdk openjdk bot closed this Oct 26, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 26, 2023
@openjdk

openjdk bot commented Oct 26, 2023

@Hamlin-Li Pushed as commit 40a3c35.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@Hamlin-Li Hamlin-Li mentioned this pull request Oct 27, 2023
3 tasks
@Hamlin-Li Hamlin-Li deleted the divideUnsigned branch February 27, 2024 08:49
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated

5 participants