-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8331117: [PPC64] secondary_super_cache does not scale well #19368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8331117: [PPC64] secondary_super_cache does not scale well #19368
Conversation
|
👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into |
|
@TheRealMDoerr This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 332 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@TheRealMDoerr The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
@theRealAph: It would be great if you could take a look and see if you can spot any bug. Especially, I wonder why |
Why would it not be zero? Some classes don't have secondary super types. |
Run all of tier1 with |
I had to check for jdk/src/hotspot/cpu/x86/macroAssembler_x86.cpp Line 4967 in be1d374
|
Right, it's surprisingly slow regardless if the patch is applied or not. My x86 machine is about 8x faster. The PPC64 machine isn't optimized for single thread performance. It's configured to use SMT8 (8 threads per core). I guess s390 will achieve better single thread performance. |
No, because we already checked that there must be something to look at before calling the slow path. Invariant: array_length == popcount(bitmap)
If there's a bit set in the bitmap, then there must be a corresponding entry in the array. If we get to the slow path there must be at least two bits set in the bitmap. Therefore, at this point, array_length >= 2. |
|
Why is this a compare, not a bit test? |
|
Thank you so much for finding my bug! This explains why I got array_lenght 0. Fixed and added assertion (see 2nd commit). |
|
Performance seems to be not affected by that bug. Note that I have used #19427 to run TypePollution micro benchmarks. |
|
/label add hotspot-compiler |
|
@TheRealMDoerr |
That is extremely suspicious. |
|
|
||
| li(result, 1); // failure | ||
| // We test the MSB of r_array_index, i.e. its sign bit | ||
| bgt(CCR0, L_fallthrough); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. Should be greater or equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Fixed. Thank you!
|
@TheRealMDoerr I got one test failure on PPC with these changes: diff --git a/src/hotspot/share/runtime/globals.hpp b/src/hotspot/share/runtime/globals.hpp
index 6bfb260606b..70897a1066e 100644
--- a/src/hotspot/share/runtime/globals.hpp
+++ b/src/hotspot/share/runtime/globals.hpp
@@ -1988,13 +1988,13 @@ const int ObjectAlignmentInBytes = 8;
"rewriting/transformation independently of the JVMTI " \
"can_{retransform/redefine}_classes capabilities.") \
\
- product(bool, UseSecondarySupersCache, true, DIAGNOSTIC, \
+ product(bool, UseSecondarySupersCache, false, DIAGNOSTIC, \
"Use secondary supers cache during subtype checks.") \
\
- product(bool, UseSecondarySupersTable, false, DIAGNOSTIC, \
+ product(bool, UseSecondarySupersTable, true, DIAGNOSTIC, \
"Use hash table to lookup secondary supers.") \
\
- product(bool, VerifySecondarySupers, false, DIAGNOSTIC, \
+ product(bool, VerifySecondarySupers, true, DIAGNOSTIC, \
"Check that linear and hashed secondary lookups return the same result.") \
\
product(bool, StressSecondarySupers, false, DIAGNOSTIC, \But if I revert the changes I had done, then it passes. Same situation I'm facing on s390x. Is this expected ? failure log: |
|
That doesn't look like a platform specific thing. I'm getting the same result on x86_64. @theRealAph: Is that a known limitation or is it worth a new JBS issue? |
| // points to the length, we don't need to adjust it to point to the | ||
| // data. | ||
| assert(Array<Klass*>::base_offset_in_bytes() == wordSize, "Adjust this code"); | ||
| assert(Array<Klass*>::length_offset_in_bytes() == 0, "Adjust this code"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the assertion for Array<Klass*>::length_offset_in_bytes() is needed.
Isn't it sufficient to assert Array<Klass*>::base_offset_in_bytes() == wordSize?
What would break if Array<Klass*>::length_offset_in_bytes() == 4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion is pointless. I've removed it.
reinrich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Martin,
thanks for the port. It looks good. I've only got a few minor comments.
Cheers, Richard.
| u1 super_klass_slot) { | ||
| assert_different_registers(r_sub_klass, r_super_klass, temp1, temp2, temp3, temp4, result); | ||
|
|
||
| Label L_fallthrough; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L_done would be a better name.
| beq(CCR0, L_fallthrough); // (result != 0) | ||
|
|
||
| // Linear probe. Rotate the bitmap so that the next bit to test is | ||
| // in Bit 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's bit 2 that's tested next after the rotation, isn't it? See L2331 in lookup_secondary_supers_table_slow_path
| // in Bit 1. | |
| // in Bit 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's rather confusing language. In fact, the bit we just tested is in Bit 1.
|
|
||
| LOOKUP_SECONDARY_SUPERS_TABLE_REGISTERS; | ||
|
|
||
| Label L_fallthrough; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L_done would be a better name.
|
|
||
| ldx(result, r_array_base, r_array_index); | ||
| xor_(result, result, r_super_klass); | ||
| beq(CCR0, L_fallthrough); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might add a comment success (result == 0).
src/hotspot/cpu/ppc/ppc.ad
Outdated
| bool success = false; | ||
| u1 super_klass_slot = ((Klass*)$super_con$$constant)->hash_slot(); | ||
| if (InlineSecondarySupersTest) { | ||
| success = __ lookup_secondary_supers_table($sub$$Register, $super_reg$$Register, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
success is always true. Can it be removed?
| } while(0) | ||
|
|
||
| // Return true: we succeeded in generating this code | ||
| bool MacroAssembler::lookup_secondary_supers_table(Register r_sub_klass, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method always returns true. Should even return a value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's there to communicate failure, if there was any. Some ports can fail to generate code because of space exhaustion, and we need to communicate this to the caller.
|
Thanks for reviewing! Your suggestions are all good (see my update). |
reinrich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Thanks, Richard.
|
Thanks for the review! |
| assert(r_super_klass == R3_ARG1 && \ | ||
| r_array_base == R4_ARG2 && \ | ||
| r_array_length == R5_ARG3 && \ | ||
| (r_array_index == R6_ARG4 || r_array_index == noreg) && \ | ||
| (r_sub_klass == R7_ARG5 || r_sub_klass == noreg) && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can set r_super_klass = R5 and r_sub_klass =R7 to keep consistency in c1_Runtime1_ppc.cpp:
case slow_subtype_check_id:
{ // Support for uint StubRoutine::partial_subtype_check( Klass sub, Klass super );
const Register sub_klass = R5,
super_klass = R4,
temp1_reg = R6,
temp2_reg = R0;
__ check_klass_subtype_slow_path(sub_klass, super_klass, temp1_reg, temp2_reg); // returns with CR0.eq if successful
__ crandc(CCR0, Assembler::equal, CCR0, Assembler::equal); // failed: CR0.ne
__ blr();
}
break;I can see this being done for aarch64, x86 and risc-v as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, using the same registers for sub_klass and super_klass as C1 should do no harm. See latest commit.
|
I've added a check for Power7 because JDK-8331859 is not yet implemented. It should get removed by that one again. I think we should have this check in case this PR gets backported. |
I've never seen this. It must be a regression. I'll have a look. |
Ah, I see. The test is doing some IR node counts for Klass loads, and |
offamitkumar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used it as a base to implement s390x changes, so gone through it multiple times; Also ran tier1 test ( on power 8 machine) with -XX:+UseSecondarySupersTable -XX:+VerifySecondarySupers -XX:+StressSecondarySupers;
LGTM.
|
Thanks for all reviews, testing and comments! |
|
Going to push as commit 0d1080d.
Your commit was automatically rebased without conflicts. |
|
@TheRealMDoerr Pushed as commit 0d1080d. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Maybe file an issue and ask the IR test folks to take a look? I don't think this PR is a good place to discuss it. |
I think the |
The tests are not all expected to work with arbitrary combinations of -XX arguments. In many cases, they will surely fail. I'll be working on a followup patch that will remove all uses of |
| (result == R8_ARG6 || result == noreg), "registers must match ppc64.ad"); \ | ||
| } while(0) | ||
|
|
||
| // Return true: we succeeded in generating this code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to have removed the return value but not its comment. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. I think we can live with it :-)
Sorry for necro-posting, but I saw that there had never been a reply to this one. The IR tests that are faliing count the number of CMP nodes in a type check. When we disable the use of secondary_super_cache in C2, we reduce the number of CMP nodes, because we are no longer checking the secondary_super_cache. This is failure OK for now, because it never triggers without diagnostic VM options, but when we remove the secondary_super_cache altogether this test will have to be revised. |
PPC64 implementation of JDK-8180450. Please review!
I noticed that
r_array_lengthis sometimes 0 and I don't see code for that on x86. Any idea? (This has been addressed in the discussion.)How can we verify it? By comparing the performance using the micro benchmarks?
Micro benchmark results without patch (measured on Power10 with 2*8 hardware threads):
With patch:
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19368/head:pull/19368$ git checkout pull/19368Update a local copy of the PR:
$ git checkout pull/19368$ git pull https://git.openjdk.org/jdk.git pull/19368/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19368View PR using the GUI difftool:
$ git pr show -t 19368Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19368.diff
Webrev
Link to Webrev Comment