Conversation

@shipilev
Member

@shipilev shipilev commented Jul 14, 2025

See the bug for more analysis.

The short summary is that CompileQueue::delete_all() walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then the actual waiters have to delete the task, lest we delete the task from under the waiter's feet. Full deletion and blocking waits coordinate through the waiting_for_completion_count counter. This mechanism -- added by JDK-8343938 in JDK 25 to solve a similar problem -- almost works. Almost.

There is a subtle race window, where a blocking waiter could have already unparked, dropped waiting_for_completion_count to 0 and proceeded to delete the task, see CompileBroker::wait_for_completion(). Then the queue deletion code could assume there are no actual waiters on the blocking task, and proceed to delete the task again. Before JDK-8357473 this race was fairly innocuous, as the second attempt at insertion into the free list was benign. But now, CompileTask-s are delete-d, and the second attempt leads to a double free.
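
To make the interleaving concrete, here is a minimal single-threaded replay of the race as described above. It is a toy model, not HotSpot code: `ToyTask`, `waiter_unpark`, `waiter_delete`, `queue_delete_all_old` and `replay_race` are hypothetical names, and "deleting" only bumps a counter so that the double free can be observed without undefined behavior.

```cpp
#include <cassert>

// Toy stand-in for a blocking CompileTask (assumption: not the real class).
struct ToyTask {
  int waiting_for_completion_count;  // mirrors the counter named in the PR
  int delete_attempts;               // counts "delete" calls instead of freeing
};

// Waiter side, split into the two steps between which the window opens.
void waiter_unpark(ToyTask& t) { t.waiting_for_completion_count--; }
void waiter_delete(ToyTask& t) { t.delete_attempts++; }

// Queue side, modelling the pre-fix assumption: zero registered waiters
// means nobody else is going to delete the task.
void queue_delete_all_old(ToyTask& t) {
  if (t.waiting_for_completion_count == 0) {
    t.delete_attempts++;
  }
}

// Replays the problematic ordering and returns the number of deletions.
int replay_race() {
  ToyTask t{1, 0};          // one waiter is blocked on the task
  waiter_unpark(t);         // waiter wakes up, counter drops to 0
  queue_delete_all_old(t);  // queue sees no waiters and deletes the task
  waiter_delete(t);         // waiter proceeds to delete the same task
  return t.delete_attempts;
}
```

In this model `replay_race()` reports two deletions of one task, which corresponds to the double free once the tasks are really `delete`-d.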

I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing that CompileQueue::delete_all() is basically only called from the compiler shutdown code (things are already bad), and that it looks completely opportunistic (it does not delete the whole compiler threads, so skipping synchronous deletes on a few compile tasks is not a big deal), we should strive to simplify it.

This PR summarily delegates all blocking task deletes to waiters. I think it stands to reason (and can be seen in the CompileBroker code) that if a blocking task is in the queue, then there is a waiter that would call CompileBroker::wait_for_completion() on it.
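
The delegation rule can be sketched in the same toy style. This is an illustration under stated assumptions, not the patch itself: `ToyTask`, `queue_delete_all`, `waiter_delete` and `each_task_deleted_once` are hypothetical names, and deletion is again modelled as a counter.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a CompileTask (assumption: not the real class).
struct ToyTask {
  bool blocking;
  int delete_attempts;
  explicit ToyTask(bool b) : blocking(b), delete_attempts(0) {}
};

// Queue side: only non-blocking tasks are deleted here. Blocking tasks
// are left alone; the real code would then notify the waiters.
void queue_delete_all(std::vector<ToyTask>& queue) {
  for (ToyTask& t : queue) {
    if (!t.blocking) {
      t.delete_attempts++;
    }
  }
}

// Waiter side: the only party allowed to delete a blocking task.
void waiter_delete(ToyTask& t) {
  assert(t.blocking);
  t.delete_attempts++;
}

// Checks that every task ends up deleted exactly once.
bool each_task_deleted_once() {
  std::vector<ToyTask> queue;
  queue.push_back(ToyTask(false));
  queue.push_back(ToyTask(true));
  queue.push_back(ToyTask(false));
  queue_delete_all(queue);
  waiter_delete(queue[1]);  // the waiter of the single blocking task
  for (const ToyTask& t : queue) {
    if (t.delete_attempts != 1) return false;
  }
  return true;
}
```

Because each task has exactly one party responsible for deleting it, no counter is needed to decide who deletes what.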

Additional testing:

  • Linux AArch64 server fastdebug, tier1
  • Linux x86_64 server fastdebug, all

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8361752: Double free in CompileQueue::delete_all after JDK-8357473 (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26294/head:pull/26294
$ git checkout pull/26294

Update a local copy of the PR:
$ git checkout pull/26294
$ git pull https://git.openjdk.org/jdk.git pull/26294/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26294

View PR using the GUI difftool:
$ git pr show -t 26294

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26294.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 14, 2025

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 14, 2025

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8361752: Double free in CompileQueue::delete_all after JDK-8357473

Reviewed-by: kvn, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 86 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Jul 14, 2025

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jul 14, 2025
@shipilev shipilev marked this pull request as ready for review July 14, 2025 15:23
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 14, 2025
@mlbridge

mlbridge bot commented Jul 14, 2025

Webrevs

@shipilev
Member Author

I am pretty convinced this is it. But I still struggle to reproduce the failure locally. So I would appreciate it if @TobiHartmann or @dholmes-ora could give it a spin through the CI where this reproduces. Probably after JDK-8360048 lands, if that one is not a test-only bug?

@mhaessig
Contributor

I kicked off a CI run. I'll keep you posted on the results.

Comment on lines 387 to 394
    // Wake up all blocking task waiters to delete all remaining blocking
    // tasks. This is not a performance sensitive path, so we do this
    // unconditionally to simplify coding.
    {
      MonitorLocker ml(Thread::current(), CompileTaskWait_lock);
      ml.notify_all();
    }

Contributor

What about other compiler threads which are still in the process of compiling blocking tasks? They still need their CompileTask objects.
delete_all() is called by one compiler thread which has finished compilation, but other threads may not have.

I don't see any compiler thread checking the shut_down state to stop compilation.

Member Author

@shipilev shipilev Jul 14, 2025

AFAIU, that's the point of the existing protocol that forces waiters to delete the task: the blocking waiter would wait for the compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: only blocking waiters are allowed to delete the blocking task.

Member Author

@shipilev shipilev Jul 14, 2025

Ah, your question is what happens if we notify here, and compilations are still running? Well, I think the current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see wait_for_completion can exit when compilation is shut down:

    while (!task->is_complete() && !is_compilation_disabled_forever()) {
      ml.wait();
    }

This will proceed to delete the task while the compiler thread is still running. Grrr. Looks to be another hole in this protocol.
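
A small model of that hole, assuming only what the quoted loop shows. `ToyTask`, its `in_use_by_compiler` flag and both helper functions are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Toy stand-in for a blocking CompileTask (assumption: not the real class).
struct ToyTask {
  bool complete;            // set once the compiler thread finishes the task
  bool in_use_by_compiler;  // a compiler thread is still working on it
};

// Mirrors the quoted loop condition: keep waiting only while the task is
// incomplete and compilation has not been disabled forever.
bool waiter_keeps_waiting(const ToyTask& t, bool compilation_disabled_forever) {
  return !t.complete && !compilation_disabled_forever;
}

// True when the waiter leaves the loop, and would therefore go on to delete
// the task, while a compiler thread is still using it.
bool waiter_deletes_live_task(const ToyTask& t, bool compilation_disabled_forever) {
  return !waiter_keeps_waiting(t, compilation_disabled_forever) &&
         t.in_use_by_compiler;
}
```

With a task that is still being compiled, the model reports the unsafe deletion only once compilation is disabled, which is the shutdown case discussed here.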

Contributor

@vnkozlov vnkozlov Jul 14, 2025

Can the compiler thread delete its own blocking task when it has finished? And let the Java thread resume execution when compilation is disabled, as it does now, but do nothing about the task in that case?

Member Author

I don't think that works. There is no "own" blocking task, there are nearly always two threads involved: the compiler thread and the waiter (Java) thread. The waiter is checking the task status under the lock. Logically, the last user should delete the task, that is, the waiter.

But I think we can handle this hole by ignoring the blocking task deletion during compiler shutdown. For the same reason described in the PR body: we already leave cruft behind in that case, and it costs us quite a bit of complexity to deal with every corner case during shutdown. So it seems simpler to just drop the tasks on the floor in that corner case.

I did a variant of this in a new commit, and it seems to still work well under stress testing. More testing is running now...
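
One possible shape of that decision, as a hedged sketch rather than the code in the commit: after leaving the wait loop, the waiter deletes the task only if it is complete, and otherwise abandons it. `ToyTask`, `WaiterAction` and `after_wait` are hypothetical names.

```cpp
#include <cassert>

// What the waiter does with its blocking task after the wait loop exits.
enum class WaiterAction { Delete, Abandon };

// Toy stand-in for a blocking CompileTask (assumption: not the real class).
struct ToyTask {
  bool complete;
};

// Hypothetical decision taken by the waiter after leaving the wait loop.
WaiterAction after_wait(const ToyTask& t) {
  // A complete task has no other user left, so the waiter may delete it.
  // An incomplete task means the loop exited because of shutdown; a compiler
  // thread may still hold the task, so it is leaked instead of deleted.
  return t.complete ? WaiterAction::Delete : WaiterAction::Abandon;
}
```

The trade-off is a small leak on an already degraded path in exchange for not needing any further coordination between waiters and compiler threads.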

@mhaessig
Contributor

mhaessig commented Jul 15, 2025

I kicked off a CI run.

FWIW, tier1-tier3, and 100 repeats of TestStressBailout.java on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed.

Let me know when I should kick off another round.

@shipilev
Member Author

FWIW, tier1-tier3, and 100 repeats of TestStressBailout.java on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed.

Let me know when I should kick off another round.

Thank you, that is good to know!

The new version handles an even more obscure corner case that I doubt would show up easily :) My Linux x86_64 server fastdebug make test TEST=all run just completed without problems, so we can test this version more broadly as well.

@mhaessig
Contributor

[...] we can test this version more broadly as well.

tier1 - tier3 and 100 repeats of TestStressBailout.java on Linux x64 & aarch64, Windows x64, and Mac x64 & aarch64 all passed.

Contributor

@vnkozlov vnkozlov left a comment

Good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 16, 2025
@shipilev
Member Author

Thanks! I think I need another Review.

@vnkozlov
Contributor

@iwanowww please look?

Contributor

@iwanowww iwanowww left a comment

Looks good.

@shipilev
Member Author

Thank you! I re-tested locally after local merge with current master, and it still works. Here goes.

/integrate

@openjdk

openjdk bot commented Jul 21, 2025

Going to push as commit 9609f57.
Since your change was applied there have been 91 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 21, 2025
@openjdk openjdk bot closed this Jul 21, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 21, 2025
@openjdk

openjdk bot commented Jul 21, 2025

@shipilev Pushed as commit 9609f57.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

4 participants