8271348: Add stronger sanity check of thread state when polling for safepoint/handshakes #4936

Closed
wants to merge 2 commits into from

Conversation

Member

@dcubed-ojdk dcubed-ojdk commented Jul 29, 2021

A trivial follow-up to:

JDK-8271251 JavaThread::java_suspend() fails with "fatal error: Illegal threadstate encountered: 6"

that adds a stronger sanity check of thread state when polling for safepoint/handshakes.

This fix was used to test @pchilano's fix for JDK-8271251 in my JDK17 Mach5
Tier[1-8] runs for JDK-8271251. It has also been tested with Mach5 Tier[1-3]
for jdk/jdk (JDK18).


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8271348: Add stronger sanity check of thread state when polling for safepoint/handshakes

Reviewers

Contributors

  • Patricio Chilano Mateo <pchilanomate@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4936/head:pull/4936
$ git checkout pull/4936

Update a local copy of the PR:
$ git checkout pull/4936
$ git pull https://git.openjdk.java.net/jdk pull/4936/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4936

View PR using the GUI difftool:
$ git pr show -t 4936

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4936.diff

Member Author

@dcubed-ojdk dcubed-ojdk commented Jul 29, 2021

/label add hotspot-runtime

@dcubed-ojdk dcubed-ojdk marked this pull request as ready for review Jul 29, 2021

@bridgekeeper bridgekeeper bot commented Jul 29, 2021

👋 Welcome back dcubed! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added rfr hotspot-runtime labels Jul 29, 2021

@openjdk openjdk bot commented Jul 29, 2021

@dcubed-ojdk
The hotspot-runtime label was successfully added.


@mlbridge mlbridge bot commented Jul 29, 2021

Webrevs

Member Author

@dcubed-ojdk dcubed-ojdk commented Jul 29, 2021

Just to be clear: While I tested this fix with JDK17 bits, this fix is
targeted to jdk/jdk (JDK18).

Member

@dholmes-ora dholmes-ora left a comment

Hi Dan,

Not sure about part of this - see below.

Thanks,
David

@@ -705,45 +705,34 @@ void SafepointSynchronize::block(JavaThread *thread) {
}

JavaThreadState state = thread->thread_state();
assert(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state);

Member

@dholmes-ora dholmes-ora Jul 29, 2021

Nit: you shouldn't need to say SafepointSynchronize:: here.

Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

Agreed. Will fix it.

OrderAccess::storestore();
// Load in wait barrier should not float up
thread->set_thread_state_fence(_thread_blocked);
// Load dependent store, it must not pass loading of safepoint_id.

Member

@dholmes-ora dholmes-ora Jul 29, 2021

Existing: I struggle to understand what this comment means - we are storing the value of safepoint_id so I don't see how the loading of safepoint_id can be reordered. ???

Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

I'll have to check to see who added that comment.

Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

It was added here:

commit bec8431683a36ad552a15cd7c4d5ca48058249a7
Author: Robbin Ehn <rehn@openjdk.org>
Date:   Fri Feb 15 14:15:10 2019 +0100

    8203469: Faster safepoints
    
    Reviewed-by: dcubed, pchilanomate, dholmes, acorn, coleenp, eosterlund

Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Here's the code block:

  uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();
  // Check that we have a valid thread_state at this point
  switch(state) {
    case _thread_in_vm_trans:
    case _thread_in_Java:        // From compiled code
    case _thread_in_native_trans:
    case _thread_blocked_trans:
    case _thread_new_trans:

      // We have no idea where the VMThread is, it might even be at next safepoint.
      // So we can miss this poll, but stop at next.

      // Load dependent store, it must not pass loading of safepoint_id.
      thread->safepoint_state()->set_safepoint_id(safepoint_id); // Release store

Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

So I think @robehn was trying to separate these two lines of code:

uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();

and

thread->safepoint_state()->set_safepoint_id(safepoint_id);

I think the same situation applies in the updated code, but it is harder
to see in this GitHub view. Especially now that I've added all these
comments.

Member

@dholmes-ora dholmes-ora Jul 30, 2021

Thanks and apologies for the digression. I'm assuming the intended required order is:

  • load safepoint counter into safepoint_id
  • load thread safepoint state
  • store safepoint_id into safepoint_state

But the store is a release_store, so it is effectively preceded by a LoadStore|StoreStore barrier. So both loads must come before the store. The loads themselves could be reordered AFAICS but with no effect on correctness. So I remain unclear about what the "load dependent store" comment actually relates to.
Oh well, not a problem for this PR.
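
The ordering described above can be sketched with C++ `std::atomic` in place of HotSpot's own primitives. This is only an illustration under stated assumptions: `sketch_safepoint_counter`, `SketchSafepointState`, and `sketch_record_safepoint_id` are hypothetical names, not HotSpot code. It shows a counter load followed by a release store, which keeps earlier loads and stores in the same thread from being reordered after the store.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the global safepoint counter.
std::atomic<uint64_t> sketch_safepoint_counter{0};

// Hypothetical stand-in for the per-thread safepoint state.
struct SketchSafepointState {
  std::atomic<uint64_t> safepoint_id{0};
};

// Load the counter, then publish it with a release store.
// The release ordering means the load above (and any other earlier
// load or store in this thread) cannot be moved after the store.
uint64_t sketch_record_safepoint_id(SketchSafepointState& state) {
  uint64_t id = sketch_safepoint_counter.load(std::memory_order_relaxed);
  state.safepoint_id.store(id, std::memory_order_release);
  return id;
}
```

In this sketch the two loads before the store are free to be reordered relative to each other, which matches the observation that only the store's position matters.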

Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Okay so we're done with this one for this PR.

JavaThreadState state = thread->thread_state();
guarantee(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state);

Member

@dholmes-ora dholmes-ora Jul 29, 2021

Shouldn't this be outside the loop? Though we are doubling up given the assert in block().
Do we really want guarantee rather than assert? Doesn't a failure here indicate an internal programming error?

Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

I think the thread's state can change during this loop so having the check in the
loop will catch if it changes to something bad.

The assert() in block() is to catch any future new callers to block() that aren't
calling from the right thread_state.

Yes, I specifically wanted a guarantee() here to catch this condition in 'release' bits.
The original internal programming bug was racy and I want to make sure we have the
best chance to catch any future racy uses.
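
The assert-versus-guarantee distinction can be sketched outside HotSpot as follows. Everything here is a hypothetical stand-in: the `SKETCH_ASSERT` and `SKETCH_GUARANTEE` macros, the `SketchThreadState` enum, and the failure counter are not HotSpot code, and the set of accepted states simply mirrors the case labels quoted earlier in this thread, so the real predicate may accept a different set. The sketch counts failures instead of aborting so that it stays testable.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins for thread states; values are arbitrary.
enum SketchThreadState {
  sketch_in_native,        // used below as an example of a rejected state
  sketch_in_vm_trans,
  sketch_in_Java,
  sketch_in_native_trans,
  sketch_blocked_trans,
  sketch_new_trans
};

// Assumption: accepted states copied from the switch quoted in this thread.
bool sketch_is_block_safe(SketchThreadState s) {
  switch (s) {
    case sketch_in_vm_trans:
    case sketch_in_Java:
    case sketch_in_native_trans:
    case sketch_blocked_trans:
    case sketch_new_trans:
      return true;
    default:
      return false;
  }
}

// Counts failed checks; a real VM would report a fatal error instead.
int sketch_failures = 0;

void sketch_report(const char* kind, int state) {
  std::fprintf(stderr, "%s failed: Illegal threadstate encountered: %d\n", kind, state);
  ++sketch_failures;
}

// Always compiled in, so it also fires in release builds.
#define SKETCH_GUARANTEE(cond, state) \
  do { if (!(cond)) sketch_report("guarantee", (state)); } while (0)

// Compiled in only when SKETCH_DEBUG is defined, like a debug-only assert.
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(cond, state) \
  do { if (!(cond)) sketch_report("assert", (state)); } while (0)
#else
#define SKETCH_ASSERT(cond, state) do { } while (0)
#endif

void sketch_check(SketchThreadState s) {
  SKETCH_ASSERT(sketch_is_block_safe(s), s);     // vanishes without SKETCH_DEBUG
  SKETCH_GUARANTEE(sketch_is_block_safe(s), s);  // present in every build
}
```

Built without `SKETCH_DEBUG`, only the guarantee-style check can fire, which is the property wanted here for catching a racy bug in release bits.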

Member

@dholmes-ora dholmes-ora Jul 30, 2021

I would expect any state change to restore the original state before we get back into this loop. So this seems a little paranoid, but okay I guess.
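
A rough sketch of what checking inside the loop buys, assuming a simplified model rather than the actual polling code: `SketchPollThread`, `SketchPollState`, and `sketch_poll_loop` are hypothetical names. Each pending operation may change the thread's state temporarily, and the check at the top of every iteration notices an operation that failed to restore it.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical, simplified thread states.
enum class SketchPollState { in_java, in_vm, blocked };

struct SketchPollThread {
  SketchPollState state = SketchPollState::in_java;
};

using SketchOp = std::function<void(SketchPollThread&)>;

// Assumption for this sketch: only in_java is acceptable at the top of the loop.
bool sketch_is_poll_safe(SketchPollState s) {
  return s == SketchPollState::in_java;
}

// Runs each pending operation and returns how many iterations started
// in an unexpected state. Real code would stop at the first one.
int sketch_poll_loop(SketchPollThread& thread, const std::vector<SketchOp>& pending) {
  int violations = 0;
  for (const SketchOp& op : pending) {
    if (!sketch_is_poll_safe(thread.state)) {
      ++violations;  // state was not restored by an earlier operation
    }
    op(thread);
  }
  return violations;
}
```

A check placed only before the loop would see the initial state and miss an operation that leaves the state changed; the per-iteration check catches it on the next pass, at the cost of repeating a check that should normally succeed.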

Member

@dholmes-ora dholmes-ora Jul 30, 2021

No, on further thought I'm not sure about this. If we take the path to SS::block() then this guarantee must hold. But what if we don't take that path? What if this is called due to a local poll and the thread is executing code that precludes the possibility of a global poll (e.g. holds Threads_lock)? What are the potential valid states in that case?

Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Your last comment is exactly why I want this guarantee() here. If we run into
that set of conditions I want us to crash and investigate what the heck is going
on here. It's also the reason why @pchilano and I agreed that this change should
be targeted to jdk/jdk (JDK18) instead of JDK17. We want time for this change
to percolate in the system.

While I've done a lot of Mach5 test runs for this change (plus the fix for JDK-8271251),
there is no substitute for letting this change bake for a couple of months...

Member Author

@dcubed-ojdk dcubed-ojdk commented Jul 30, 2021

@pchilano - Since this is your version of the fix with one small change from
me, it would be good if you could review here to make sure that I have the
changes all correct in the jdk/jdk (JDK18) context.

Member Author

@dcubed-ojdk dcubed-ojdk commented Jul 30, 2021

/contributor add @pchilano


@openjdk openjdk bot commented Jul 30, 2021

@dcubed-ojdk
Contributor Patricio Chilano Mateo <pchilanomate@openjdk.org> successfully added.

Contributor

@pchilano pchilano left a comment

LGTM!

Thanks,
Patricio

Member Author

@dcubed-ojdk dcubed-ojdk commented Jul 30, 2021

@pchilano - Thanks for the review!

Member Author

@dcubed-ojdk dcubed-ojdk commented Aug 1, 2021

Since @pchilano is listed as the Contributor, I need another (R)eviewer.
I can't be listed as a (R)eviewer because I created the PR. @dholmes-ora,
if you are okay with the latest version (and my replies), that would work.


@openjdk openjdk bot commented Aug 2, 2021

@dcubed-ojdk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8271348: Add stronger sanity check of thread state when polling for safepoint/handshakes

Co-authored-by: Patricio Chilano Mateo <pchilanomate@openjdk.org>
Reviewed-by: dholmes, pchilanomate

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 18 new commits pushed to the master branch:

  • 7cc1eb3: Merge
  • e351de3: 8271272: C2: assert(!had_error) failed: bad dominance
  • 6180cf1: 8271512: ProblemList serviceability/sa/sadebugd/DebugdConnectTest.java due to 8270326
  • a1b5b81: 8271507: ProblemList SA tests that are failing with ZGC due to JDK-8248912
  • 4bc9b04: 8263059: security/infra/java/security/cert/CertPathValidator/certification/ComodoCA.java fails due to revoked cert
  • d6bb846: 8248899: security/infra/java/security/cert/CertPathValidator/certification/QuoVadisCA.java fails, Certificate has been revoked
  • 71ca0c0: 8270848: Redundant unsafe opmask register allocation in some instruction patterns.
  • 6c68ce2: 8270947: AArch64: C1: use zero_words to initialize all objects
  • cd7e30e: 8271242: Add Arena regression tests
  • 5b3c418: 8270321: Startup regressions in 18-b5 caused by JDK-8266310
  • ... and 8 more: https://git.openjdk.java.net/jdk/compare/d09b028407ff9d0e8c2dfd9cc5d0dca19c4497e3...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Aug 2, 2021
Member Author

@dcubed-ojdk dcubed-ojdk commented Aug 2, 2021

@dholmes-ora - Thanks for the re-review.

Member Author

@dcubed-ojdk dcubed-ojdk commented Aug 2, 2021

/integrate


@openjdk openjdk bot commented Aug 2, 2021

Going to push as commit db950ca.
Since your change was applied there have been 26 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Aug 2, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Aug 2, 2021

@openjdk openjdk bot commented Aug 2, 2021

@dcubed-ojdk Pushed as commit db950ca.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dcubed-ojdk dcubed-ojdk deleted the JDK-8271348 branch Aug 2, 2021

@mlbridge mlbridge bot commented Sep 1, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 31/07/2021 2:14 am, Daniel D.Daugherty wrote:

On Fri, 30 Jul 2021 01:23:24 GMT, David Holmes <dholmes at openjdk.org> wrote:

I would expect any state change to restore the original state before we get back into this loop. So this seems a little paranoid, but okay I guess.

No, on further thought I'm not sure about this. If we take the path to SS::block() then this guarantee must hold. But what if we don't take that path? What if this is called due to a local poll and the thread is executing code that precludes the possibility of a global poll (e.g. holds Threads_lock)? What are the potential valid states in that case?

Your last comment is exactly why I want this `guarantee()` here. If we run into
that set of conditions I want us to crash and investigate what the heck is going
on here. It's also the reason why @pchilano and I agreed that this change should
be targeted to jdk/jdk (JDK18) instead of JDK17. We want time for this change
to percolate in the system.

While I've done a lot of Mach5 test runs for this change (plus the fix for JDK-8271251),
there is no substitute for letting this change bake for a couple of months...

Okay but I want to be sure we revisit this before 18 ships. That
guarantee seems potentially stronger than required - hopefully we will
catch that during testing.

Thanks,
David


@mlbridge mlbridge bot commented Sep 1, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 1/08/2021 11:43 pm, Daniel D.Daugherty wrote:

On Fri, 30 Jul 2021 16:46:06 GMT, Patricio Chilano Mateo <pchilanomate at openjdk.org> wrote:

Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:

dholmes CR

LGTM!

Thanks,
Patricio

Since @pchilano is listed as the Contributor, I need another (R)eviewer.
I can't be listed as a (R)eviewer because I created the PR. @dholmes-ora,
if you are okay with the latest version (and my replies), that would work.

Done. :)

Thanks,
David


@mlbridge mlbridge bot commented Sep 1, 2021

Mailing list message from daniel.daugherty at oracle.com on hotspot-runtime-dev:

On 8/1/21 10:38 PM, David Holmes wrote:

On 31/07/2021 2:14 am, Daniel D.Daugherty wrote:

On Fri, 30 Jul 2021 01:23:24 GMT, David Holmes <dholmes at openjdk.org>
wrote:

I would expect any state change to restore the original state
before we get back into this loop. So this seems a little paranoid,
but okay I guess.

No, on further thought I'm not sure about this. If we take the path
to SS::block() then this guarantee must hold. But what if we don't
take that path? What if this is called due to a local poll and the
thread is executing code that precludes the possibility of a global
poll (e.g. holds Threads_lock)? What are the potential valid states
in that case?

Your last comment is exactly why I want this `guarantee()` here. If we run into
that set of conditions I want us to crash and investigate what the heck is going
on here. It's also the reason why @pchilano and I agreed that this change should
be targeted to jdk/jdk (JDK18) instead of JDK17. We want time for this change
to percolate in the system.

While I've done a lot of Mach5 test runs for this change (plus the fix for JDK-8271251),
there is no substitute for letting this change bake for a couple of months...

Okay but I want to be sure we revisit this before 18 ships. That
guarantee seems potentially stronger than required - hopefully we will
catch that during testing.

For some reason, this email didn't post in the PR...

We'll definitely be keeping an eye on this situation during JDK18 testing.

Dan

Thanks,
David

Labels
hotspot-runtime integrated
3 participants