-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8318723: RISC-V: C2 UDivL #16346
8318723: RISC-V: C2 UDivL #16346
Conversation
👋 Welcome back mli! A progress list of the required criteria for merging this PR into |
@Hamlin-Li The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
div(result, rs1, rs2); | ||
} else { | ||
Label Lltz, Ldone; | ||
bltz(rs2, Lltz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what this bltz
branch is for. Is this a minor performance tunning here? And How would this make a difference then if that's true? I didn't see much difference from the LongDivMod.testDivideUnsigned negative
jmh test result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. It's also the only test case where there is a regression on the JMH numbers, or at least not a clear improvement (before: 6385.280, after: 6433.223)
On your JMH numbers, how many iterations have you run for each benchmark? I don't see the standard deviation which would be useful to better understand noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the algorithm details, check j.l.Long::divideUnsigned
in the jdk lib source, it mentions this algorithm, I also pointed to it in this patch.
It's not related to the difference between negative and positive test cases, it's related to the cost of divxx instructions, compared to the lines between 2440 ~ 2443 in src/hotspot/cpu/riscv/macroAssembler_riscv.cpp, the divu cost for negative value is still very high.
int_def ALU_COST ( 100, 1 * DEFAULT_COST);
int_def BRANCH_COST ( 200, 2 * DEFAULT_COST);
int_def IDIVDI_COST ( 6600, 66 * DEFAULT_COST);
I have also re-run the benchmark with more warmup (5) and iteration (10), please check the data in pr desc.
I also attach the diff between v1 and v2 intrinsic. v2 is this patch. v1 is diff based on v2, it just use riscv divxx directly without optimization for negative value brong by the algorithm (i.e. without the bltz and related other codes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why the previous jmh data has no error
part, maybe because it's too low to show.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, with the branch, the results are 6376.674 ± 16.869 ns/op
, and without the branch, they are 29518.033 ± 49.056
, correct? If so, the branch makes more sense, at least of the board you've tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for negtive divisor, unsigned div can go through a quick path which is must faster than built-in instructions.
And this is demonstrated in the div cost in riscv.ad, and also verified by benchmark tests run on the board.
( I'm not sure if in the future built-in div will be faster, if it turns out in the future, we should also need to redefine the div cost in riscv.ad, and re-visit this intrinsic. )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so let's keep the version with branch. You should add a comment at https://github.com/openjdk/jdk/pull/16346/files#diff-7a5c3ed05b6f3f06ed1c59f5fc2a14ec566a6a5bd1d09606115767daa99115bdR2435 explaining just that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you're putting a conditional branch in the way of the common cases, and you're greatly increasing icache pressure, for the sake of a rare case. How does that make any sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @theRealAph here, this seems like a micro optimisation for the sake of microbenchmark that will not be beneficial in general. And if a considerable portion of divisors does lie in this range, the optimisation can always be applied from the caller side. Furthermore, by doing so we will even have the benefit of branch profiling, which will help achieve better results. Another note is that I do not know any compiler that does this premature optimisation. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @theRealAph @merykitty for the comments, I agree with you.
Have used the divu instead of introducing a cond branch here. And we will consider the regression of negative (also mixed) test cases as rare case, so just make sure the common case (positve one) get optimized.
@@ -241,9 +241,9 @@ class MacroAssembler: public Assembler { | |||
|
|||
// idiv variant which deals with MINLONG as dividend and -1 as divisor | |||
int corrected_idivl(Register result, Register rs1, Register rs2, | |||
bool want_remainder); | |||
bool want_remainder, bool is_signed = true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you not set the default value of is_signed
to true
, to make it clear which case it is at the callsite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I use a default value for is_signed
is because both corrected_idivx are also used in cpu/riscv/c1_LIRAssembler_arith_riscv.cpp, which I dont' want to touch in this pr.
But if you insist, I can remove the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the default values.
/solves JDK-8318224 |
@Hamlin-Li |
So I tried this on Hifive Unmatched board. Unforunately, JMH test shows some regression for the LongDivMod.testDivideUnsigned Before:
After:
|
…instead of default value(true)
Thanks @RealFYang for testing, I have used the divu instead of introducing the cond branch, and will consider negative case as rare case, only make sure positive get optimized. |
But that case is going to be rare.The larger a number it is, the less common it is. The uniform distribution of this benchmark, in which 0 is as common as 0xb43a61c853a2af20, is grossly unrepresentative of real-world divisors. In practice, numbers follow some kind of log-normal distribution. Don't fall into the trap of optimizing for a benchmark. |
|
I have updated the test result too, not too much change as previous v1 implementation. |
src/hotspot/cpu/riscv/riscv.ad
Outdated
"not $dst, $dst\n\t" | ||
"and $dst, $dst, $src1\n\t" | ||
"srli $dst, $dst, 63\n\t" | ||
"Ldone:\t#@UdivL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to update this part to reflect the latest changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching it. Done.
@Hamlin-Li This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 126 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated change LGTM.
/integrate |
Going to push as commit 40a3c35.
Your commit was automatically rebased without conflicts. |
@Hamlin-Li Pushed as commit 40a3c35. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Hi,
Can you review the change to add intrinsic for UDivI and UDivL?
Thanks!
Tests
Functionality
Run tests successfully found via
grep -nr test/jdk/ -we divideUnsigned
andgrep -nr test/hotspot/ -we divideUnsigned
Performance
( NOTE: there are another 2 related issues: https://bugs.openjdk.org/browse/JDK-8318225, https://bugs.openjdk.org/browse/JDK-8318226, the pr of which will be subseqently sent out after this one finished. )
Long
NOTE: for positive divisor, it's the common case; for negative divisor, it's a rare case
Before
After
Integer
Before
After
/************ following is just backup: quick path for negative divisor *************/
Long
Before
After v1
(This is a simpler version, please check the diff from
After v2
below)After v2
(This is the current patch, This version has a huge regression for negative values!!!)
Diff of v1 from v2
Progress
Issues
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16346/head:pull/16346
$ git checkout pull/16346
Update a local copy of the PR:
$ git checkout pull/16346
$ git pull https://git.openjdk.org/jdk.git pull/16346/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16346
View PR using the GUI difftool:
$ git pr show -t 16346
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16346.diff
Webrev
Link to Webrev Comment