-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8277212: GC accidentally cleans valid megamorphic vtable inline caches #6450
8277212: GC accidentally cleans valid megamorphic vtable inline caches #6450
Conversation
/label hotspot |
👋 Welcome back stefank! A progress list of the required criteria for merging this PR into |
@stefank |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
@stefank This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 52 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job tracking this down! It does look like it was a merge error from the original code that's escaped notice until now. Well done!
@@ -478,6 +478,8 @@ bool CompiledMethod::clean_ic_if_metadata_is_dead(CompiledIC *ic) { | |||
} else { | |||
ShouldNotReachHere(); | |||
} | |||
} else { | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've given up pretending to understand this code, but could you add a one line comment why you're returning true here? ie. if ic_metadata is NULL, it's a megamorphic call or already clean and shouldn't be cleaned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment.
I've updated the patch with a comment. Note that we perform an is_clean() check at the top of the function, so we know that the IC is not "clean" at the new return line. |
@@ -478,6 +478,8 @@ bool CompiledMethod::clean_ic_if_metadata_is_dead(CompiledIC *ic) { | |||
} else { | |||
ShouldNotReachHere(); | |||
} | |||
} else { | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch and great summary!
Thanks all for reviewing! |
/integrate |
Going to push as commit 976c2bb.
Your commit was automatically rebased without conflicts. |
@stefank Unknown command |
We got a report on the zgc-dev list about a large performance issue affecting ZGC:
https://mail.openjdk.java.net/pipermail/zgc-dev/2021-November/001086.html
One of the issues that the reporter identified was that we could get extremely long class unloading / unlinking times:
and while this were happening we got a huge number of ICBufferFull safepoints.
It turns out that we have a 10-year-old bug in the inline cache cleaning code. This code came in with the permgen removal. See how the original code only calls set_to_clean when ic_oop is non-null:
5c58d27
The rewritten code put the set_to_clean call in a different scope, causing the CompiledIC to also be cleaned when ic_oop is NULL:
Note the weird indentation, which could be seen as a hint that this might be a dubious / accidental change.
To understand why this is causing the problems we are seeing it's good to start by reading:
https://wiki.openjdk.java.net/display/HotSpot/Overview+of+CompiledIC+and+CompiledStaticCall
When the GC hits this path and finds an ic_oop that is NULL, it means that it is dealing with an inline cache that is a megamorphic vtable call (or clean). Those should not be cleaned (at least that wasn't the intention of the old code).
But now we do clean them, and to do so we use an ICStub (to make a safe transition to the clean state), which uses up slots in the ICBuffer. When the ICBuffer is full, concurrent GCs have to stop and schedule an ICBufferFull safepoint stop-the-world operation, which removes the ICStub from the inline cache and completely frees up the ICBuffer. If the GC cleans a lot of these megamorphic vtable inline caches, then we'll create a large number of ICBufferFull safepoints.
But it is even worse than that. After the class unloading GCs have destroyed all megamorphic vtable inline caches, the Java threads will see these cleaned inline caches and correct them. Correcting the cleaned inline caches from the Java threads will also use ICStubs, and eventually the inline caches will transition back to be a megamorphic vtable calls. Because of this we can end up in a situation where the GC and Java threads change the inline cache back and forth between clean and megamorphic vtable calls. When this happen both GC and Java threads will continuously schedule ICBufferFull safepoints, and this can go on for many seconds, even minutes, if we are unlucky. For ZGC this has the effect that it blocks any further GC work, and eventually the Java threads will run out of memory and hit allocation stalls. The Java threads will then wait for the GC to "clean" all inline caches and exit the class unloading phase and proceed to the phase where memory is reclaimed. You can see in the GC logs that even though the problematic unlinking phase goes on for many seconds, the allocation stalls are "only" a few hundred milliseconds. This shows that when the Java threads stop fighting over the inline caches, the GC can finish the work relatively quickly.
G1 performs the inline cache cleaning while the Java threads are stopped, and therefore don't have to use ICStubs when the megamorphic vtables are accidentally cleaned. So, G1 (and other stop-the-world class unloading GCs) won't enter the situation where the GC and Java thread concurrently fight over the inline caches. It still causes the Java threads to have to take a slow path and fix the inline caches, which can result in unnecessary ICBufferFull safepoints.
I been able to reproduce the issue where ZGC and the Java threads fight over the ICStubs, causing minute long unloading times, by running one of the microbenchmarks from the Blackbird library used by the reporter of this issue. See description in:
https://mail.openjdk.java.net/pipermail/zgc-dev/2021-November/001096.html
I think this could be reproduced in other workloads as well. I've also been able to reproduce the excessive ICBufferFull safepoints with Kitchensink (an oracle-internal stress test).
I've verified that restoring the set_to_clean code to the right scope fixes the issue that I can reproduce with both Blackbird and Kitchensink. After the fix, the class unloading times go back to normal levels.
To identify this issue, it's good to run with -Xlog:gc*,safepoint and take note of the "Concurrent Process Non-Strong References" times and ICBufferFull safepoint lines.
Example logs from ZGC where concurrent cleaning causes ICBufferFull safepoints:
Example logs from G1 where the Java threads fixes the cleaned inline caches and run out of ICStubs:
I've tested the performance of the change with SPECjbb2015, SPECjvm2008, DaCapo, Renaissance.
I've tested run the patch through tier1-7.
Note that I've made patch as small as possible to make it easier to backport. Thanks @fisk for discussion and explanation of the inline caches code.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6450/head:pull/6450
$ git checkout pull/6450
Update a local copy of the PR:
$ git checkout pull/6450
$ git pull https://git.openjdk.java.net/jdk pull/6450/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 6450
View PR using the GUI difftool:
$ git pr show -t 6450
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6450.diff