Skip to content

8307958: Metaspace verification is slow causing extreme class unloading times#14084

Closed
xmas92 wants to merge 4 commits intoopenjdk:masterfrom
xmas92:JDK-8307958-VerifyMetaspace-Sometimes
Closed

8307958: Metaspace verification is slow causing extreme class unloading times#14084
xmas92 wants to merge 4 commits intoopenjdk:masterfrom
xmas92:JDK-8307958-VerifyMetaspace-Sometimes

Conversation

@xmas92
Copy link
Member

@xmas92 xmas92 commented May 22, 2023

From JBS:

We see very long class unloading times with Generational ZGC in fastdebug builds. MacOS spindumps show that it is caused by Metaspace verification code.

I think the main problem is that many of the verification operations are performed at every operation (add, remove, split, merge) and the individual operation verifies much more than a constant number of chunks. This seems to give a quadratic complexity which blows up if you don't class unload often enough.

The approach here is to resolve this by putting the verification behind the SOMETIMES macro. It is then possible to turn it of completely with -XX:VerifyMetaspaceInterval=0 while still benefiting from some stochastic verification.

The goal here is to have the calls to Verify unchanged. Leaving all explicit gtest verify and Universe::verify calls unaffected.
The verification was also left unchanged when committing the memory.

All SOMETIMES uses the same frequency, VerifyMetaspaceInterval. I looked at the possibility of adding weight to some of the call site, but it did not seem to be worth the complexity.

Testing: GHA and Oracle CI (tier1-3 ongoing), various long running stress tests.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8307958: Metaspace verification is slow causing extreme class unloading times

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14084/head:pull/14084
$ git checkout pull/14084

Update a local copy of the PR:
$ git checkout pull/14084
$ git pull https://git.openjdk.org/jdk.git pull/14084/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14084

View PR using the GUI difftool:
$ git pr show -t 14084

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14084.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented May 22, 2023

👋 Welcome back aboldtch! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 22, 2023
@openjdk
Copy link

openjdk bot commented May 22, 2023

@xmas92 The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 22, 2023
@mlbridge
Copy link

mlbridge bot commented May 22, 2023

Webrevs

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable.

Could you please crank up the verification interval for metaspace gtests from 3 to 1? (See test/hotspot/jtreg/gtest/MetaspaceGtests.java).

@xmas92
Copy link
Member Author

xmas92 commented May 22, 2023

Seems reasonable.

Could you please crank up the verification interval for metaspace gtests from 3 to 1? (See test/hotspot/jtreg/gtest/MetaspaceGtests.java).

Yes, that is much better than changing the test helper.
Done.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. To be clear, was most of the time spent in the ChunkManager::get_chunk() and return_chunk(), and maybe purge() calls, and not much in the split and merge chunks? I agree with adding SOMETIMES to all for consistency, but I hope we're not doing more of these operations that I assume should be infrequent.

@openjdk
Copy link

openjdk bot commented May 22, 2023

@xmas92 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8307958: Metaspace verification is slow causing extreme class unloading times

Reviewed-by: stuefe, coleenp

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 147 new commits pushed to the master branch:

  • bacf652: 8297887: Update Siphash
  • 89f2d45: 8308237: add JDWP and JDI virtual thread support for ThreadReference.PopFrames
  • d10467e: 8307738: JFR: EventStream.openRepository() drops events
  • 736b90d: 8308310: HttpClient: Avoid logging or locking from within synchronized blocks
  • 7764f46: 8308761: New test TestHFA needs adaptation for JDK-8308276
  • d7245f7: 8305578: X11GraphicsDevice.pGetBounds() is slow in remote X11 sessions
  • 544978c: 8298066: java/util/concurrent/locks/Lock/OOMEInAQS.java timed out
  • 65c8dbe: 8171221: Remove -XX:+CheckMemoryInitialization
  • 2d4d850: 8308091: Remove unused iRegIHeapbase() matching operand
  • b7a84b0: 8308758: Problemlist compiler/c2/irTests/TestVectorConditionalMove.java
  • ... and 137 more: https://git.openjdk.org/jdk/compare/01892f9c6ed1ff6593f77b89fcc2c5da8d094abf...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 22, 2023
@tstuefe
Copy link
Member

tstuefe commented May 22, 2023

A regression test would be good, since Metaspace will evolve and so will verifications, and I would like to know what performance costs are acceptable.

@xmas92
Copy link
Member Author

xmas92 commented May 23, 2023

This seems fine. To be clear, was most of the time spent in the ChunkManager::get_chunk() and return_chunk(), and maybe purge() calls, and not much in the split and merge chunks? I agree with adding SOMETIMES to all for consistency, but I hope we're not doing more of these operations that I assume should be infrequent.

Maybe it is overly defensive to add it to the split and get_chunk paths. The observed slowdown was for purge() calls, and more specifically the verification downstream of ~MetaspaceArena(), which includes return_chunk and merge() as well as freelist contains() verification.

@xmas92
Copy link
Member Author

xmas92 commented May 23, 2023

A regression test would be good, since Metaspace will evolve and so will verifications, and I would like to know what performance costs are acceptable.

Oracles internal testing will most likely provoke regressions in this area. Similarly how this issue was brought to light. To see the really degenerate behaviour, required constant loading and unloading of many class loaders with very few classes over hours with infrequent class unloading (due to generational ZGC not performing many major collections, low old heap pressure).

While it is probably possible to reduce the variables and create a simpler test which solely stresses metaspace debug timings and little else, it seems like a rather complex task.

Any product performance regression is caught by our normal performance testing and any relevant and degenerate debug regression would be caught by Oracle internal tests. Hopefully this is enough for now?

@tstuefe
Copy link
Member

tstuefe commented May 23, 2023

A regression test would be good, since Metaspace will evolve and so will verifications, and I would like to know what performance costs are acceptable.

Oracles internal testing will most likely provoke regressions in this area. Similarly how this issue was brought to light. To see the really degenerate behaviour, required constant loading and unloading of many class loaders with very few classes over hours with infrequent class unloading (due to generational ZGC not performing many major collections, low old heap pressure).

While it is probably possible to reduce the variables and create a simpler test which solely stresses metaspace debug timings and little else, it seems like a rather complex task.

Any product performance regression is caught by our normal performance testing and any relevant and degenerate debug regression would be caught by Oracle internal tests. Hopefully this is enough for now?

Sure, if you guys are okay with defusing my future too-eager verifications :-) ?

I'd be happy with something simple and pragmatic. It does not have to be automatic. Tell me a text X and a time Y it should not overstep on reasonable hardware. Can be very fuzzy. Then I can test this myself instead of causing work for you.

@coleenp
Copy link
Contributor

coleenp commented May 23, 2023

Maybe it is overly defensive to add it to the split and get_chunk paths. The observed slowdown was for purge() calls, and more specifically the verification downstream of ~MetaspaceArena(), which includes return_chunk and merge() as well as freelist contains() verification.

Ok, thanks for the info. That makes sense.

I thought there were more GC tests for class unloading but I can't find them. If you create one test with loading/unloading a class 100 times with the SOMETIMES verification on, I think that would run fast enough and exercise the code paths that won't be exercised regularly. You can put it in runtime/ClassUnload with some other tests. Or if you can find tests already in metaspace that do this, that would be sufficient.

@xmas92
Copy link
Member Author

xmas92 commented May 24, 2023

Maybe it is overly defensive to add it to the split and get_chunk paths. The observed slowdown was for purge() calls, and more specifically the verification downstream of ~MetaspaceArena(), which includes return_chunk and merge() as well as freelist contains() verification.

Ok, thanks for the info. That makes sense.

I thought there were more GC tests for class unloading but I can't find them. If you create one test with loading/unloading a class 100 times with the SOMETIMES verification on, I think that would run fast enough and exercise the code paths that won't be exercised regularly. You can put it in runtime/ClassUnload with some other tests. Or if you can find tests already in metaspace that do this, that would be sufficient.

test/jdk/jdk/jfr/event/runtime/TestMetaspaceAllocationFailure.java and test/hotspot/jtreg/runtime/Metaspace/FragmentMetaspace.java seems like relevant candidates for running with -XX: VerifyMetaspaceInterval=1

I'll look at creating a more fuzzy stress test which uses jdk.test.lib.classloader.* to load and unload class loaders.

@coleenp
Copy link
Contributor

coleenp commented May 24, 2023

I'll look at creating a more fuzzy stress test which uses jdk.test.lib.classloader.* to load and unload class loaders.

There's a ClassUnloadCommon.java class for help with writing class loading/unloading tests.

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

@xmas92
Copy link
Member Author

xmas92 commented May 25, 2023

Thanks for the review. I will integrate this now to improve the CI testing for tests that may stumble on this.
Created JDK-8308848 for creating more tests as discussed.
/integrate

@openjdk
Copy link

openjdk bot commented May 25, 2023

Going to push as commit 8d8153e.
Since your change was applied there have been 160 commits pushed to the master branch:

  • d877134: 8144891: ToolBox should use java.nio.file.Path internally, instead of java.io.File
  • aaa6189: 8308506: Reduce testing time by removing combinations tested
  • 4500bb7: 8308727: Compiler should accept final unnamed variables in try-with-resources
  • 3272e25: 8305710: Line breaks in search tags cause invalid JSON in index file
  • 3f4abff: 8305958: Use links instead of buttons for auto-generated header links
  • bfcae68: 8308549: Classfile API should fail to generate over-sized Code attribute
  • 2a18e53: 8308656: RISC-V: vstring_compare doesnt manifest usage of all vector registers
  • a291f00: 8288619: Unexpected parsing for @see
  • 89b9914: 8308400: add ForceEarlyReturn support for virtual threads
  • 1451ac1: 8307652: sealed class hierarchy graph doesn't distinguish non-sealed classes
  • ... and 150 more: https://git.openjdk.org/jdk/compare/01892f9c6ed1ff6593f77b89fcc2c5da8d094abf...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 25, 2023
@openjdk openjdk bot closed this May 25, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 25, 2023
@openjdk
Copy link

openjdk bot commented May 25, 2023

@xmas92 Pushed as commit 8d8153e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@xmas92 xmas92 deleted the JDK-8307958-VerifyMetaspace-Sometimes branch January 5, 2026 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

3 participants