Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8271348: Add stronger sanity check of thread state when polling for safepoint/handshakes #4936

Closed
wants to merge 2 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
@@ -705,45 +705,34 @@ void SafepointSynchronize::block(JavaThread *thread) {
}

JavaThreadState state = thread->thread_state();
assert(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state);
Copy link
Member

@dholmes-ora dholmes-ora Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you shouldn't need to say SafepointSynchronize:: here.

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will fix it.

thread->frame_anchor()->make_walkable(thread);

uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();
// Check that we have a valid thread_state at this point
switch(state) {
case _thread_in_vm_trans:
case _thread_in_Java: // From compiled code
case _thread_in_native_trans:
case _thread_blocked_trans:
case _thread_new_trans:

// We have no idea where the VMThread is, it might even be at next safepoint.
// So we can miss this poll, but stop at next.

// Load dependent store, it must not pass loading of safepoint_id.
thread->safepoint_state()->set_safepoint_id(safepoint_id); // Release store
// We have no idea where the VMThread is, it might even be at next safepoint.
// So we can miss this poll, but stop at next.

// This part we can skip if we notice we miss or are in a future safepoint.
OrderAccess::storestore();
// Load in wait barrier should not float up
thread->set_thread_state_fence(_thread_blocked);
// Load dependent store, it must not pass loading of safepoint_id.
Copy link
Member

@dholmes-ora dholmes-ora Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing: I struggle to understand what this comment means - we are storing the value of safepoint_id so I don't see how the loading of safepoint_id can be reordered. ???

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to check to see who added that comment.

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added here:

commit bec8431683a36ad552a15cd7c4d5ca48058249a7
Author: Robbin Ehn <rehn@openjdk.org>
Date:   Fri Feb 15 14:15:10 2019 +0100

    8203469: Faster safepoints
    
    Reviewed-by: dcubed, pchilanomate, dholmes, acorn, coleenp, eosterlund

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the code block:

  uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();
  // Check that we have a valid thread_state at this point
  switch(state) {
    case _thread_in_vm_trans:
    case _thread_in_Java:        // From compiled code
    case _thread_in_native_trans:
    case _thread_blocked_trans:
    case _thread_new_trans:

      // We have no idea where the VMThread is, it might even be at next safepoint.
      // So we can miss this poll, but stop at next.

      // Load dependent store, it must not pass loading of safepoint_id.
      thread->safepoint_state()->set_safepoint_id(safepoint_id); // Release store

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think @robehn was trying to separate these two lines of code:

uint64_t safepoint_id = SafepointSynchronize::safepoint_counter();

and

thread->safepoint_state()->set_safepoint_id(safepoint_id);

I think the same situation applies in the updated code, but it is harder
to see in this GitHub view. Especially now that I've added all these
comments.

Copy link
Member

@dholmes-ora dholmes-ora Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and apologies for the digression. I'm assuming the intended required order is:

  • load safepoint counter into safepoint_id
  • load thread safepoint state
  • store safepoint_id into safepoint_state

But the store is a release_store, so it is effectively preceded by a LoadStore|SoreStore barrier. So both loads must come before the store. The loads themselves could be reordered AFAICS but with no affect on correctness. So I remain unclear about the "load dependent store" comment actually relates to.
Oh well, not a problem for this PR.

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so we're done with this one for this PR.

thread->safepoint_state()->set_safepoint_id(safepoint_id); // Release store

_wait_barrier->wait(static_cast<int>(safepoint_id));
assert(_state != _synchronized, "Can't be");
// This part we can skip if we notice we miss or are in a future safepoint.
OrderAccess::storestore();
// Load in wait barrier should not float up
thread->set_thread_state_fence(_thread_blocked);

// If barrier is disarmed stop store from floating above loads in barrier.
OrderAccess::loadstore();
thread->set_thread_state(state);
_wait_barrier->wait(static_cast<int>(safepoint_id));
assert(_state != _synchronized, "Can't be");

// Then we reset the safepoint id to inactive.
thread->safepoint_state()->reset_safepoint_id(); // Release store
// If barrier is disarmed stop store from floating above loads in barrier.
OrderAccess::loadstore();
thread->set_thread_state(state);

OrderAccess::fence();
// Then we reset the safepoint id to inactive.
thread->safepoint_state()->reset_safepoint_id(); // Release store

break;
OrderAccess::fence();

default:
fatal("Illegal threadstate encountered: %d", state);
}
guarantee(thread->safepoint_state()->get_safepoint_id() == InactiveSafepointCounter,
"The safepoint id should be set only in block path");

@@ -126,6 +126,19 @@ class SafepointSynchronize : AllStatic {
JavaThread *thread,
uint64_t safepoint_count);

static bool is_a_block_safe_state(JavaThreadState state) {
// Check that we have a valid thread_state before blocking for safepoints
switch(state) {
case _thread_in_vm_trans:
case _thread_in_Java: // From compiled code
case _thread_in_native_trans:
case _thread_blocked_trans:
case _thread_new_trans:
return true;
default:
return false;
}
}
// Called when a thread voluntarily blocks
static void block(JavaThread *thread);

@@ -118,6 +118,8 @@ void SafepointMechanism::process(JavaThread *thread, bool allow_suspend) {
// local poll already checked, if used.
bool need_rechecking;
do {
JavaThreadState state = thread->thread_state();
guarantee(SafepointSynchronize::is_a_block_safe_state(state), "Illegal threadstate encountered: %d", state);
Copy link
Member

@dholmes-ora dholmes-ora Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be outside the loop? Though we are doubling up given the assert in block().
Do we really want guarantee rather than assert? Doesn't a failure here indicate an internal programming error?

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the thread's state can change during this loop so having the check in the
loop will catch if it changes to something bad.

The assert() in block() is to catch any future new callers to block() that aren't
calling from the right thread_state.

Yes, I specifically wanted a guarantee() here to catch this condition in 'release' bits.
The original internal programming bug was racy and I want to make sure we have the
best chance to catch any future racy uses.

Copy link
Member

@dholmes-ora dholmes-ora Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect any state change to restore the original state before we get back into this loop. So this seems a little paranoid, but okay I guess.

Copy link
Member

@dholmes-ora dholmes-ora Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No on further thought I'm not sure about this. If we take the path to SS::block() then this guarantee must hold. But what if we don't take that path? What if this is called due to a local poll and the thread is executing code that precludes the possibility of a global poll (e.g. holds Threads_lock) - what are the potential valid states in that case?

Copy link
Member Author

@dcubed-ojdk dcubed-ojdk Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your last comment is exactly why I want this guarantee() here. If we run into
that set of conditions I want us to crash and investigate what the heck is going
on here. It's also the reason why @pchilano and I agreed that this change should
be targeted to jdk/jdk (JDK18) instead of JDK17. We want time for this change
to percolate in the system.

While I've done a lot of Mach5 test runs for this change (plus the fix for JDK-8271251),
there is no substitute for letting this change bake for a couple of months...

if (global_poll()) {
// Any load in ::block() must not pass the global poll load.
// Otherwise we might load an old safepoint counter (for example).