-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8292697: ZGC: Hangs when almost out of metaspace memory #9985
8292697: ZGC: Hangs when almost out of metaspace memory #9985
Conversation
/label add hotspot |
👋 Welcome back stefank! A progress list of the required criteria for merging this PR into |
@stefank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Thanks Erik! |
@stefank This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 21 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems ok. The comments help describe this unexpected interaction.
@@ -145,6 +145,7 @@ private void go() { | |||
// step 2: try to load one more class | |||
// it should be impossible | |||
try { | |||
log("just a tiny mint"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted extra logging to see if we reached this statement (we didn't). And I though this line fitted the context, similar to the other rather silly statements in the test (eatALittleMemory, "washing hands before meal"). I can change it if it bothers you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/just a tiny mint/just a wafer thin mint/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the laugh for my day!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I should have looked up the exact quote. :)
|
||
if (is_first_unprocessed()) { | ||
// The first non-processed request takes ownership of triggering the GC | ||
// on behalf of itself, and all trailing requests in the list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a lambda? why can't it just be code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just style:
- I wanted to pull this logic out to a function/lambda
- I wanted to keep the code near its usage
- I didn't want all ceremony of having to declare the function / or creating a free function taking two extra parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did it need to be pulled out into a function? It's just a small amount of code that either is a lambda inline or just code inline? I'm a massive fan of lambdas but I don't see the purpose here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code gets easier to read if you separate out the two concerns:
- What is it you want to check
- What is it you want to do if that happens
I find that stye of code easier to read, and easier to add precise comments to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 Fine, it's a matter of taste then.
Thanks for the reviews! |
Going to push as commit 3844685.
Your commit was automatically rebased without conflicts. |
@@ -995,23 +995,30 @@ const char* Metaspace::metadata_type_name(Metaspace::MetadataType mdtype) { | |||
} | |||
} | |||
|
|||
void Metaspace::purge() { | |||
void Metaspace::purge(bool classes_unloaded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it not have been more logical to heave both MetaspaceCritical_lock
and MetaspaceCriticalAllocation::process();
out of this function? As it is, Metaspace::purge
, called with false, is now a no-op that only does things on behalf of the GC. It has nothing to do with the functioning of the allocator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you mean. On the other hand, if you take a look at Metaspace::allocate, it does quite a lot more than being an allocator, for example, it calls into the GC code (which performs the other side of the metaspace critical allocations).
I wouldn't be opposed if anyone wanted to restructure this.
HotSpot performs "critical metaspace allocations" when it's running out of metaspace. The failed allocations are registered in a queue, which the GC prioritizes when cleaning up the metaspace. There's a race in the code that handles these requests.
These requests are added to the queue, and the GC will "process" each request in turn when it runs the metaspace purge phase. The queue handling has an optimization that says that only the first request in the queue needs to trigger the GC, all subsequent requests in the queue will wait for that GC. When the GC gets to the purge phase it will mark in all requests that they have been processed. Note: that this doesn't mean that the request was satisfied, it could be that the result was NULL (and the thread will trigger a last-effort GC before it throws an OOME).
The bug is in the code that determines if a request is responsible for triggering a new GC. The current code just checks if the current request is first in the queue. This doesn't work if the code is called just after the GC has run purge, but before the old requests have been removed. The new request sees that there are already elements in the queue, so it doesn't trigger the GC. And at the same time, the old requests have been processed and they won't trigger the GC either. So, now the new request is waiting for a GC that will not be triggered by anyone.
Note: The reason why there's a delay between the GC processing a request, and the removal from the queue, is that the Java thread that added the request is also responsible for removing the request form the queue. The reason for this is that the mentioned last-effort GC, needs to be able to process the request a second time.
The fix-proposal is to let threads adding new requests check if the added request is the first non-processed request in the queue. If it is, that request/thread is responsible for triggering the GC for itself, and any subsequently added requests (until the GC runs the next round of request processing).
However, there's a snatch to this proposal. The request processing is done inside
Metaspace::purge()
, and that function is skipped if the GC didn't unload any classes. The proposed logic relies on that function to always be run when a GC is running. So, I've also changed so that the GC unconditionally calls the request processing. An alternative would be to always run the Metaspace::purge() code. That might even help return memory from temporarily allocated metaspace memory earlier, but I've left that exercise for a potential future improvement.I've also tweaked the test so that we get a bit more info if this test fails again.
Testing: I could reliably reproduce the original hang on my macbook laptop, within a few minutes. With this fix I can run the test in a loop for hours without reproducing the hang. I've tested this together with Generational ZGC code, running tier1-tier7 on Linux x64. I've started more extensive testing on openjdk/jdk.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9985/head:pull/9985
$ git checkout pull/9985
Update a local copy of the PR:
$ git checkout pull/9985
$ git pull https://git.openjdk.org/jdk pull/9985/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9985
View PR using the GUI difftool:
$ git pr show -t 9985
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9985.diff