
walulyai
Member

@walulyai walulyai commented Sep 10, 2025

Please review this patch, which skips VM_GC_Collect_Operations initiated after the VM shutdown process has begun. We add an _is_shutting_down flag to CollectedHeap, which is set while holding the Heap_lock. This ensures mutual exclusion with VM_GC_Collect_Operations, which also require the Heap_lock.

Skipping a VM_GC_Collect_Operation would otherwise cause allocation requests to fail (resulting in an OutOfMemoryError) if the requesting daemon threads were allowed to continue. Instead, we block these threads on a monitor, where they remain stalled until they are terminated as part of the VM shutdown sequence.
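A minimal standalone model of the scheme described above, written as a sketch rather than HotSpot code: it assumes a std::mutex in place of Heap_lock and a plain bool in place of _is_shutting_down, and ModelHeap and its methods are hypothetical names. Stalling the requesting threads is not modeled. Because the flag is written and read under the same lock, a collection either completes before shutdown begins or observes the flag and is skipped.

```cpp
#include <mutex>

// Hypothetical model, not HotSpot code: heap_lock_ stands in for Heap_lock
// and shutting_down_ for the _is_shutting_down flag.
class ModelHeap {
 public:
  // Called from the shutdown sequence. Holding heap_lock_ means no
  // collection can be in progress while the flag is set.
  void begin_shutdown() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    shutting_down_ = true;
  }

  // Models a VM_GC_Collect_Operation: returns true if a GC ran and
  // false if it was skipped because shutdown had already begun.
  bool collect() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    if (shutting_down_) {
      return false;  // skip the GC
    }
    ++collections_;  // the actual collection work would happen here
    return true;
  }

  int collections() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    return collections_;
  }

 private:
  std::mutex heap_lock_;
  bool shutting_down_ = false;
  int collections_ = 0;
};
```

For example, calling collect() once, then begin_shutdown(), then collect() again leaves collections() at 1, since the second request is skipped.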

Testing: Tier 1-7


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8366865: Allocation GC Pauses Triggered after JVM has started shutdown (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27190/head:pull/27190
$ git checkout pull/27190

Update a local copy of the PR:
$ git checkout pull/27190
$ git pull https://git.openjdk.org/jdk.git pull/27190/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27190

View PR using the GUI difftool:
$ git pr show -t 27190

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27190.diff

Using Webrev

Link to Webrev Comment


bridgekeeper bot commented Sep 10, 2025

👋 Welcome back iwalulya! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Sep 10, 2025

@walulyai This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8366865: Allocation GC Pauses Triggered after JVM has started shutdown

Reviewed-by: ayang, tschatzl

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 11 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.


openjdk bot commented Sep 10, 2025

@walulyai The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 10, 2025
@walulyai walulyai marked this pull request as ready for review September 10, 2025 10:14
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 10, 2025

mlbridge bot commented Sep 10, 2025

@albertnetymk
Member

/cc hotspot-gc

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Sep 10, 2025

openjdk bot commented Sep 10, 2025

@albertnetymk
The hotspot-gc label was successfully added.

// If the VM is shutting down, we may have skipped VM_CollectForAllocation.
// To avoid returning nullptr (which could cause premature OOME), we stall
// allocation requests here until the VM shutdown is complete.
MonitorLocker ml(VMExit_lock);
Member

I wonder whether an always-zero Semaphore would work.

Member

@albertnetymk albertnetymk left a comment

The PR title shouldn't have the "G1:" prefix, as the change affects other GCs as well.


// Check invocations
- if (skip_operation()) {
+ if (skip_operation() || Universe::heap()->is_shutting_down()) {
Member

Can the new condition be inlined inside skip_operation()?
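One hedged reading of the suggestion: fold the shutdown test into skip_operation() itself and document it alongside the existing reason for skipping. ModelGCOperation is a hypothetical stand-in, and representing the existing check as a comparison of GC counts is an assumption about its general shape, not a copy of the real function.

```cpp
#include <atomic>

// Hypothetical model, not the HotSpot implementation: only a GC-count
// check and the new shutdown check are shown.
struct ModelGCOperation {
  unsigned gc_count_before;                        // GC count when requested
  const std::atomic<unsigned>* total_collections;  // current GC count
  const std::atomic<bool>* shutting_down;          // set once shutdown begins

  // Skip if another GC already ran since the request was made (it may
  // have freed enough memory), or if VM shutdown has begun.
  bool skip_operation() const {
    return gc_count_before != total_collections->load() ||
           shutting_down->load();
  }
};
```

Keeping both reasons in one predicate makes skip_operation() and its comment the single place that lists them; the explicit check at the call site trades that for visibility.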

Member Author

In the first iteration I had it inlined, but then decided to make it more explicit. I can change it back if you prefer.

Contributor

No particular preference, but the comments in skip_operation() need to be adjusted to also mention it as an additional condition.

Monitor* Terminator_lock = nullptr;
Monitor* InitCompleted_lock = nullptr;
Monitor* BeforeExit_lock = nullptr;
Monitor* VMExit_lock = nullptr;
Member

It is unfortunate that we need yet another exit-related mutex.

Comment on lines 614 to 616
// If the VM is shutting down, we may have skipped VM_CollectForAllocation.
// To avoid returning nullptr (which could cause premature OOME), we stall
// allocation requests here until the VM shutdown is complete.
Member

If I understand the shutdown sequence correctly, I don't think you can do this without risking a hang. The thread doing the shutdown can potentially execute code that requires allocation after is_shutting_down() returns true. This is due to the JVMTI events posted from before_exit:

if (JvmtiExport::should_post_thread_life()) {
  JvmtiExport::post_thread_end(thread);
}

// Always call even when there are not JVMTI environments yet, since environments
// may be attached late and JVMTI must track phases of VM execution
JvmtiExport::post_vm_death();

I think if you can't GC during shutdown then you have to simply let the allocation fail.

Member Author

Thanks, I completely overlooked the JVMTI callbacks. It’s probably better to stall with a timeout and then return an allocation failure.
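The timed-stall idea can be sketched with standard C++ primitives, assuming a std::condition_variable in place of a HotSpot monitor; ShutdownStall, wait_for_shutdown and mark_shutdown_complete are hypothetical names, and the timeout is just a parameter here.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical model, not HotSpot code: a thread whose allocation cannot
// be satisfied during shutdown waits on a monitor for a bounded time and
// then gives up, so the caller can report an allocation failure. The
// bound keeps a thread that must make progress (such as the one running
// the shutdown sequence) from waiting forever.
class ShutdownStall {
 public:
  // Returns true if shutdown completed within `timeout`; false means the
  // wait timed out and the caller should fail the allocation.
  bool wait_for_shutdown(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return shutdown_complete_; });
  }

  void mark_shutdown_complete() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      shutdown_complete_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool shutdown_complete_ = false;
};
```

A caller that gets false back returns an allocation failure; the timeout means that even the thread performing shutdown (for example while posting JVMTI events) cannot block indefinitely on this path.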

Member Author

I moved the Universe::before_exit() call and also added a timed wait.
@dholmes-ora do you have any concerns with this approach?

Member

Yes, I do have concerns, sorry. Any change to the shutdown sequence needs very careful analysis. You have now changed the circumstances under which the JVMTI events get posted. Maybe it won't matter, maybe it will; the issue is that it is very hard to determine the impact of such a change until you get notified that someone's code is now broken.

Member Author

@walulyai walulyai Sep 16, 2025

I have undone that part of the change. We can revisit it separately; that way it is easier to back out if it proves problematic.

Member Author

Created JDK-8367902

Contributor

@tschatzl tschatzl left a comment

I think this is mostly good, just some minor nitpicks.


@walulyai walulyai changed the title 8366865: G1: Allocation GC Pauses Triggered after JVM has started shutdown 8366865: Allocation GC Pauses Triggered after JVM has started shutdown Sep 15, 2025
}

void Universe::before_exit() {
log_cpu_time();
Contributor

Why did you move log_cpu_time()? During the review of the CPUTimeUsage refactor (#26621) we discussed this choice. Given that it still covers more than just GC, I think it should stay in Universe. Also, the PR title does not reflect that it includes a refactor of CPUTimeUsage.

Member Author

The main reason was to have log_cpu_time and AtomicAccess::release_store(&_is_shutting_down, true) in the same critical section. Otherwise, we have no guarantee that GCs do not continue after log_cpu_time.
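To illustrate why the shared critical section matters, here is a single-threaded sketch in which interleavings are simulated by call order. It assumes a std::mutex in place of Heap_lock and a counter in place of the GC CPU time that log_cpu_time reports; CpuTimeHeap and its methods are hypothetical.

```cpp
#include <mutex>

// Hypothetical model, not HotSpot code: gc_cpu_time_ stands in for the
// GC CPU time reported by log_cpu_time().
class CpuTimeHeap {
 public:
  // A GC request: it is skipped once shutdown has begun.
  void collect(long cpu_cost) {
    std::lock_guard<std::mutex> guard(heap_lock_);
    if (!shutting_down_) gc_cpu_time_ += cpu_cost;
  }

  long log_cpu_time() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    return gc_cpu_time_;
  }

  void begin_shutdown() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    shutting_down_ = true;
  }

  // Logging and setting the flag in one critical section: no GC can run
  // in between, and every later GC is skipped, so the logged value is final.
  long log_cpu_time_and_begin_shutdown() {
    std::lock_guard<std::mutex> guard(heap_lock_);
    shutting_down_ = true;
    return gc_cpu_time_;
  }

  long gc_cpu_time() { return log_cpu_time(); }

 private:
  std::mutex heap_lock_;
  bool shutting_down_ = false;
  long gc_cpu_time_ = 0;
};
```

With collect(5), log_cpu_time(), collect(7), begin_shutdown(), the logged value is 5 while the total becomes 12. With collect(5), log_cpu_time_and_begin_shutdown(), collect(7), both stay at 5.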

Contributor

I had put log_cpu_time right before calling stop(). stop() is the method that terminates the GC threads, so no synchronization should be needed, if I'm not mistaken.

Please correct me if you think I got it wrong here.

Nevertheless, any user of gc_threads_do might still iterate over terminated GC worker threads. Could we consider adding a check or assert in that method?

Member Author

GCs can still happen between log_cpu_time and stop(). Calling log_cpu_time under the same lock that sets _is_shutting_down reduces the chance of that happening.

> Nevertheless, any user of gc_threads_do might still iterate over terminated GC worker threads. Could we consider adding a check or assert in that method?

Yes, we can add the assert in gc_threads_do; I thought this was going to be done as a follow-up.

Member Author

I moved _is_shutting_down to Universe, which allowed me to restore log_cpu_time to its original place. I also added the assert before reading os::thread_cpu_time.

Contributor

JonasNorlinder commented Sep 16, 2025

I think we want gc_threads_do to check is_shutting_down() so that we can't query terminated GC workers, although we might want to avoid that check for performance reasons.
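A sketch of the proposed guard, assuming a std::atomic<bool> flag and a vector of placeholder Worker records; WorkerSet and Worker are hypothetical, and HotSpot's gc_threads_do takes a thread closure rather than a std::function. The check is cheap, but on its own it does not stop shutdown from beginning while an iteration is already in progress.

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Placeholder for a GC worker thread record (hypothetical).
struct Worker {
  int id;
};

// Hypothetical model, not HotSpot code.
class WorkerSet {
 public:
  explicit WorkerSet(int n) {
    for (int i = 0; i < n; ++i) workers_.push_back(Worker{i});
  }

  void begin_shutdown() { shutting_down_.store(true, std::memory_order_release); }

  // Visits every worker unless shutdown has begun, in which case the
  // workers may already be terminated and none are touched.
  // Returns the number of workers visited.
  int gc_threads_do(const std::function<void(const Worker&)>& f) const {
    if (shutting_down_.load(std::memory_order_acquire)) {
      return 0;
    }
    for (const Worker& w : workers_) f(w);
    return static_cast<int>(workers_.size());
  }

 private:
  std::vector<Worker> workers_;
  std::atomic<bool> shutting_down_{false};
};
```

With three workers (ids 0, 1, 2), the first call visits all of them and returns 3; after begin_shutdown() the same call returns 0 without invoking the callback.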


openjdk bot commented Sep 17, 2025

@walulyai this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout shutting_down_gcs
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Sep 17, 2025
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Sep 17, 2025
@albertnetymk
Member

Should CollectedHeap::satisfy_failed_metadata_allocation also call stall_for_vm_shutdown? (Looking through all subclasses of VM_GC_Collect_Operation.)

Member

@dholmes-ora dholmes-ora left a comment

As this doesn't seem to perturb the general shutdown process I have no further comment and will leave it to GC folk to approve.

Though I still wonder if this aspect of shutdown should be part of "Heap" (or Universe) rather than G1 specific.

@albertnetymk
Member

> Though I still wonder if this aspect of shutdown should be part of "Heap" (or Universe) rather than G1 specific.

Why is it G1 specific? The new bool _is_shutting_down lives inside CollectedHeap -- Serial/Parallel/G1 are all updated to use it.

static bool _fully_initialized; // true after universe_init and initialize_vtables called

// Shutdown
static volatile bool _is_shutting_down;
Contributor

If _is_shutting_down is in Universe, should ZGC/ZAbort hook into it too? It could be confusing if ZAbort reports that we are shutting down but the Universe field does not. I would expect both to report true.

@dholmes-ora
Member

>> Though I still wonder if this aspect of shutdown should be part of "Heap" (or Universe) rather than G1 specific.

> Why is it G1 specific? The new bool _is_shutting_down lives inside CollectedHeap -- Serial/Parallel/G1 are all updated to use it.

My mistake; I didn't track the changes closely enough and was only looking at the G1 files involved.

@openjdk openjdk bot removed the hotspot-gc hotspot-gc-dev@openjdk.org label Sep 19, 2025
Comment on lines 625 to 626
MonitorLocker ml(VMExit_lock);
ml.wait(2 * MILLIUNITS);
Member

I think one can use ThreadBlockInVM + sleep to block the current thread. Then there is no need for a new lock, as there is no critical region anyway.
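A lock-free version of the stall, as a minimal sketch: it assumes a std::atomic<bool> shutdown flag and uses std::this_thread::sleep_for in place of the HotSpot sleep; stall_if_shutting_down is a hypothetical name. Nothing shared is protected while the thread waits, which is why no monitor is needed.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical model, not HotSpot code. A thread that finds shutdown in
// progress sleeps for a bounded period, and the caller then fails the
// allocation. In HotSpot the sleeping thread would also need to be in a
// blocked state (what ThreadBlockInVM arranges) so it does not hold up
// safepoints; that part is not modeled.
// Returns true if the caller stalled and should fail the allocation.
bool stall_if_shutting_down(const std::atomic<bool>& shutting_down,
                            std::chrono::milliseconds pause) {
  if (!shutting_down.load(std::memory_order_acquire)) {
    return false;  // normal path: proceed with the GC and retry
  }
  std::this_thread::sleep_for(pause);  // blocks only the current thread
  return true;
}
```

Compared with a timed wait on a dedicated monitor, a sleep cannot be woken early, but the stalled threads are expected to be terminated by the shutdown sequence anyway, so an early wake-up buys little.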

Member Author

Changed to just using sleep.

// triggers a GC.
- MonitorLocker ml(VMExit_lock);
- ml.wait(2 * MILLIUNITS);
+ JavaThread::current()->sleep(2 * MILLIUNITS);
Member

I prefer this to the Monitor::wait too. Thanks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 22, 2025
@walulyai
Member Author

Thanks for the reviews!

/integrate


openjdk bot commented Sep 23, 2025

Going to push as commit 3e5094e.
Since your change was applied there have been 35 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 23, 2025
@openjdk openjdk bot closed this Sep 23, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 23, 2025

openjdk bot commented Sep 23, 2025

@walulyai Pushed as commit 3e5094e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
5 participants