New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8271348: Add stronger sanity check of thread state when polling for safepoint/handshakes #4936
Conversation
…afepoint/handshakes
/label add hotspot-runtime |
|
@dcubed-ojdk |
Webrevs
|
Just to be clear: While I tested this fix with JDK17 bits, this fix is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Dan,
Not sure about part of this - see below.
Thanks,
David
@@ -705,45 +705,34 @@ void SafepointSynchronize::block(JavaThread *thread) { | |||
} | |||
|
|||
JavaThreadState state = thread->thread_state(); | |||
assert(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you shouldn't need to say SafepointSynchronize:: here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix it.
OrderAccess::storestore(); | ||
// Load in wait barrier should not float up | ||
thread->set_thread_state_fence(_thread_blocked); | ||
// Load dependent store, it must not pass loading of safepoint_id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing: I struggle to understand what this comment means - we are storing the value of safepoint_id so I don't see how the loading of safepoint_id can be reordered. ???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to check to see who added that comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was added here:
commit bec8431683a36ad552a15cd7c4d5ca48058249a7
Author: Robbin Ehn <rehn@openjdk.org>
Date: Fri Feb 15 14:15:10 2019 +0100
8203469: Faster safepoints
Reviewed-by: dcubed, pchilanomate, dholmes, acorn, coleenp, eosterlund
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the code block:
uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();
// Check that we have a valid thread_state at this point
switch(state) {
case _thread_in_vm_trans:
case _thread_in_Java: // From compiled code
case _thread_in_native_trans:
case _thread_blocked_trans:
case _thread_new_trans:
// We have no idea where the VMThread is, it might even be at next safepoint.
// So we can miss this poll, but stop at next.
// Load dependent store, it must not pass loading of safepoint_id.
thread->safepoint_state()->set_safepoint_id(safepoint_id); // Release store
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think @robehn was trying to separate these two lines of code:
uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();
and
thread->safepoint_state()->set_safepoint_id(safepoint_id);
I think the same situation applies in the updated code, but it is harder
to see in this GitHub view. Especially now that I've added all these
comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and apologies for the digression. I'm assuming the intended required order is:
- load safepoint counter into safepoint_id
- load thread safepoint state
- store safepoint_id into safepoint_state
But the store is a release_store, so it is effectively preceded by a LoadStore|SoreStore barrier. So both loads must come before the store. The loads themselves could be reordered AFAICS but with no affect on correctness. So I remain unclear about the "load dependent store" comment actually relates to.
Oh well, not a problem for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so we're done with this one for this PR.
JavaThreadState state = thread->thread_state(); | ||
guarantee(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be outside the loop? Though we are doubling up given the assert in block().
Do we really want guarantee rather than assert? Doesn't a failure here indicate an internal programming error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the thread's state can change during this loop so having the check in the
loop will catch if it changes to something bad.
The assert() in block()
is to catch any future new callers to block()
that aren't
calling from the right thread_state.
Yes, I specifically wanted a guarantee()
here to catch this condition in 'release' bits.
The original internal programming bug was racy and I want to make sure we have the
best chance to catch any future racy uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect any state change to restore the original state before we get back into this loop. So this seems a little paranoid, but okay I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No on further thought I'm not sure about this. If we take the path to SS::block() then this guarantee must hold. But what if we don't take that path? What if this is called due to a local poll and the thread is executing code that precludes the possibility of a global poll (e.g. holds Threads_lock) - what are the potential valid states in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your last comment is exactly why I want this guarantee()
here. If we run into
that set of conditions I want us to crash and investigate what the heck is going
on here. It's also the reason why @pchilano and I agreed that this change should
be targeted to jdk/jdk (JDK18) instead of JDK17. We want time for this change
to percolate in the system.
While I've done a lot of Mach5 test runs for this change (plus the fix for JDK-8271251),
there is no substitute for letting this change bake for a couple of months...
@pchilano - Since this is your version of the fix with one small change from |
/contributor add @pchilano |
@dcubed-ojdk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks,
Patricio
@pchilano - Thanks for the review! |
Since @pchilano is listed as the Contributor, I need another (R)eviewer. |
@dcubed-ojdk This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 18 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
|
@dholmes-ora - Thanks for the re-review. |
/integrate |
Going to push as commit db950ca.
Your commit was automatically rebased without conflicts. |
@dcubed-ojdk Pushed as commit db950ca. |
Mailing list message from David Holmes on hotspot-runtime-dev: On 31/07/2021 2:14 am, Daniel D.Daugherty wrote:
Okay but I want to be sure we revisit this before 18 ships. That Thanks, |
Mailing list message from David Holmes on hotspot-runtime-dev: On 1/08/2021 11:43 pm, Daniel D.Daugherty wrote:
Done. :) Thanks, |
Mailing list message from daniel.daugherty at oracle.com on hotspot-runtime-dev: On 8/1/21 10:38 PM, David Holmes wrote:
For some reason, this email didn't post in the PR... We'll definitely be keeping on this situation during JDK18 testing. Dan
|
A trivial follow-up to:
JDK-8271251 JavaThread::java_suspend() fails with "fatal error: Illegal threadstate encountered: 6"
that adds a stronger sanity check of thread state when polling for safepoint/handshakes.
This fix was used to test @pchilano's fix for JDK-8271251 in my JDK17 Mach5
Tier[1-8] runs for JDK-8271251. It has also been tested with Mach5 Tier[1-3]
for jdk/jdk (JDK18).
Progress
Issue
Reviewers
Contributors
<pchilanomate@openjdk.org>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4936/head:pull/4936
$ git checkout pull/4936
Update a local copy of the PR:
$ git checkout pull/4936
$ git pull https://git.openjdk.java.net/jdk pull/4936/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 4936
View PR using the GUI difftool:
$ git pr show -t 4936
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4936.diff