Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8292697: ZGC: Hangs when almost out of metaspace memory #9985

Conversation

stefank
Copy link
Member

@stefank stefank commented Aug 23, 2022

HotSpot performs "critical metaspace allocations" when it's running out of metaspace. The failed allocations are registered in a queue, which the GC prioritizes when cleaning up the metaspace. There's a race in the code that handles these requests.

These requests are added to the queue, and the GC will "process" each request in turn when it runs the metaspace purge phase. The queue handling has an optimization that says that only the first request in the queue needs to trigger the GC, all subsequent requests in the queue will wait for that GC. When the GC gets to the purge phase it will mark in all requests that they have been processed. Note: that this doesn't mean that the request was satisfied, it could be that the result was NULL (and the thread will trigger a last-effort GC before it throws an OOME).

The bug is in the code that determines if a request is responsible for triggering a new GC. The current code just checks if the current request is first in the queue. This doesn't work if the code is called just after the GC has run purge, but before the old requests have been removed. The new request sees that there are already elements in the queue, so it doesn't trigger the GC. And at the same time, the old requests have been processed and they won't trigger the GC either. So, now the new request is waiting for a GC that will not be triggered by anyone.

Note: The reason why there's a delay between the GC processing a request, and the removal from the queue, is that the Java thread that added the request is also responsible for removing the request form the queue. The reason for this is that the mentioned last-effort GC, needs to be able to process the request a second time.

The fix-proposal is to let threads adding new requests check if the added request is the first non-processed request in the queue. If it is, that request/thread is responsible for triggering the GC for itself, and any subsequently added requests (until the GC runs the next round of request processing).

However, there's a snatch to this proposal. The request processing is done inside Metaspace::purge(), and that function is skipped if the GC didn't unload any classes. The proposed logic relies on that function to always be run when a GC is running. So, I've also changed so that the GC unconditionally calls the request processing. An alternative would be to always run the Metaspace::purge() code. That might even help return memory from temporarily allocated metaspace memory earlier, but I've left that exercise for a potential future improvement.

I've also tweaked the test so that we get a bit more info if this test fails again.

Testing: I could reliably reproduce the original hang on my macbook laptop, within a few minutes. With this fix I can run the test in a loop for hours without reproducing the hang. I've tested this together with Generational ZGC code, running tier1-tier7 on Linux x64. I've started more extensive testing on openjdk/jdk.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8292697: ZGC: Hangs when almost out of metaspace memory

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9985/head:pull/9985
$ git checkout pull/9985

Update a local copy of the PR:
$ git checkout pull/9985
$ git pull https://git.openjdk.org/jdk pull/9985/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9985

View PR using the GUI difftool:
$ git pr show -t 9985

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9985.diff

@stefank
Copy link
Member Author

stefank commented Aug 23, 2022

/label add hotspot

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 23, 2022

👋 Welcome back stefank! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Aug 23, 2022
@openjdk
Copy link

openjdk bot commented Aug 23, 2022

@stefank
The hotspot label was successfully added.

Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@stefank
Copy link
Member Author

stefank commented Aug 24, 2022

Thanks Erik!

@openjdk
Copy link

openjdk bot commented Aug 24, 2022

@stefank This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8292697: ZGC: Hangs when almost out of metaspace memory

Reviewed-by: eosterlund, coleenp

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 21 new commits pushed to the master branch:

  • 7b81a9c: 8289764: gc/lock tests failed with "OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects"
  • 76ee549: 8292329: Enable CDS shared heap for zero builds
  • 14623c6: 8292739: Invalid legacy entries may be returned by Provider.getServices() call
  • 568be58: 8290469: Add new positioning options to PassFailJFrame test framework
  • 69448f9: 8292679: Simplify thread creation in gtest and port 2 tests to new way
  • 3c2289d: 8215916: The failure reason of an optional JAAS LoginModule is not logged
  • 71ab5c9: 8292816: GPL Classpath exception missing from assemblyprefix.h
  • c062397: 8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses
  • a45a4b9: 8292194: G1 nmethod entry barrier disarm value wraps around too early
  • d3fed12: 8292864: Remove resourceHash.hpp from header files
  • ... and 11 more: https://git.openjdk.org/jdk/compare/cf0067741249cc3260b1d220769dac408b614f22...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 24, 2022
@mlbridge
Copy link

mlbridge bot commented Aug 24, 2022

Webrevs

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ok. The comments help describe this unexpected interaction.

@@ -145,6 +145,7 @@ private void go() {
// step 2: try to load one more class
// it should be impossible
try {
log("just a tiny mint");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted extra logging to see if we reached this statement (we didn't). And I though this line fitted the context, similar to the other rather silly statements in the test (eatALittleMemory, "washing hands before meal"). I can change it if it bothers you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, it's fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/just a tiny mint/just a wafer thin mint/

https://en.wikipedia.org/wiki/Mr_Creosote

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the laugh for my day!!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I should have looked up the exact quote. :)


if (is_first_unprocessed()) {
// The first non-processed request takes ownership of triggering the GC
// on behalf of itself, and all trailing requests in the list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a lambda? why can't it just be code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just style:

  1. I wanted to pull this logic out to a function/lambda
  2. I wanted to keep the code near its usage
  3. I didn't want all ceremony of having to declare the function / or creating a free function taking two extra parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did it need to be pulled out into a function? It's just a small amount of code that either is a lambda inline or just code inline? I'm a massive fan of lambdas but I don't see the purpose here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think code gets easier to read if you separate out the two concerns:

  1. What is it you want to check
  2. What is it you want to do if that happens

I find that stye of code easier to read, and easier to add precise comments to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 Fine, it's a matter of taste then.

@stefank
Copy link
Member Author

stefank commented Aug 26, 2022

Thanks for the reviews!
/integrate

@openjdk
Copy link

openjdk bot commented Aug 26, 2022

Going to push as commit 3844685.
Since your change was applied there have been 42 commits pushed to the master branch:

  • 2b80349: 8292669: IGV: Search not useable when in Overflow Toolbar
  • 4732abf: 8292910: Remove root to oop cast in HeapDumper
  • f694f8a: 8292083: Detected container memory limit may exceed physical machine memory
  • f91943c: 8292868: Explicitly pass a third temp register to MacroAssembler::store_heap_oop for aarch64
  • 6354a57: 8290711: assert(false) failed: infinite loop in PhaseIterGVN::optimize
  • 3e18773: 8292880: Improve debuggee logging for com/sun/jdi/ClassUnloadEventTest.java
  • 909e1ed: 8292919: Build failure due to UseJVMCICompiler was not declared when C2 is disabled after JDK-8292691
  • 55f5a83: 8282410: Remove SA ProcDebugger support
  • d83faea: 8292250: Create test for co-located JDI MethodEntry, Step, and Breakpoint events
  • e353b57: 8292890: Remove PrintTouchedMethodsAtExit and LogTouchedMethods
  • ... and 32 more: https://git.openjdk.org/jdk/compare/cf0067741249cc3260b1d220769dac408b614f22...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 26, 2022
@openjdk openjdk bot closed this Aug 26, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 26, 2022
@openjdk
Copy link

openjdk bot commented Aug 26, 2022

@stefank Pushed as commit 3844685.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@@ -995,23 +995,30 @@ const char* Metaspace::metadata_type_name(Metaspace::MetadataType mdtype) {
}
}

void Metaspace::purge() {
void Metaspace::purge(bool classes_unloaded) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not have been more logical to heave both MetaspaceCritical_lock and MetaspaceCriticalAllocation::process(); out of this function? As it is, Metaspace::purge, called with false, is now a no-op that only does things on behalf of the GC. It has nothing to do with the functioning of the allocator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you mean. On the other hand, if you take a look at Metaspace::allocate, it does quite a lot more than being an allocator, for example, it calls into the GC code (which performs the other side of the metaspace critical allocations).

I wouldn't be opposed if anyone wanted to restructure this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

5 participants