Skip to content
This repository has been archived by the owner on Sep 19, 2023. It is now read-only.
/ jdk19 Public archive

8288497: add support for JavaThread::is_oop_safe() #20

Closed
wants to merge 4 commits into from
Closed

8288497: add support for JavaThread::is_oop_safe() #20

wants to merge 4 commits into from

Conversation

dcubed-ojdk
Copy link
Member

@dcubed-ojdk dcubed-ojdk commented Jun 15, 2022

A fix to add support for JavaThread::is_gc_barrier_detached() which allows
us to add checks to detect failures like:

JDK-8288139 JavaThread touches oop after GC barrier is detached
https://bugs.openjdk.org/browse/JDK-8288139

This fix along with the fix for JDK-8288139 has been tested in Mach5 Tier[1-8].
There are no related failures in Mach5 Tier[1-7]; Mach5 Tier8 is still running.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8288497: add support for JavaThread::is_oop_safe()

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk19 pull/20/head:pull/20
$ git checkout pull/20

Update a local copy of the PR:
$ git checkout pull/20
$ git pull https://git.openjdk.org/jdk19 pull/20/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20

View PR using the GUI difftool:
$ git pr show -t 20

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk19/pull/20.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 15, 2022

👋 Welcome back dcubed! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jun 15, 2022

@dcubed-ojdk The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Jun 15, 2022
@dcubed-ojdk dcubed-ojdk marked this pull request as ready for review June 15, 2022 15:55
// JavaThread::exit() is seen more quickly.
// JavaThread::set_terminated() is seen more quickly.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should have been updated when the code in
JavaThread::exit() was refactored into JavaThread::set_terminated().
I'm doing it as part of this fix for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bogus comment - acquire semantics has nothing to do with seeing things more quickly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that I wrote the original comment and you were one of my
reviewers way, way back then.

When we were trying to match up the load-acquires in the accessors with
the in-line release-store code in JavaThread::exit(), the comment was needed
(ignoring the part about "seeing things more quickly"). Now that we have
load-acquires in the accessors and a release-store in the setter, i.e., no more
in-line code in JavaThread::exit(), I no longer think we need a comment to
coordinate.

I'm in favor of dropping all instances of this comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. FTR this came in with the Thread-SMR changes which were reviewed over a period of many months with some "vigorous" discussion. :) I think by the time this particular code appeared the focus was on bigger issues.

Comment on lines 271 to 281
// JavaThread::exit() is seen more quickly.
// JavaThread::set_terminated() is seen more quickly.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should have been updated when the code in
JavaThread::exit() was refactored into JavaThread::set_terminated().
I'm doing it as part of this fix for clarity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above discussion about this comment.
I'm in favor of dropping this comment.

Comment on lines -166 to +167
// JavaThread::exit() skipped calling current_thread_exiting()
// We did not get here via JavaThread::exit() so current_thread_exiting()
// was not called, e.g., JavaThread::cleanup_failed_attach_current_thread().
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDK-8286830 ~HandshakeState should not touch oops
changed JavaThread::exit() so we no longer had a path that called current_thread_exiting()
and another path that did not call current_thread_exiting() so this comment should have
been clarified with that fix. I'm doing it as part of this fix for clarity.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 15, 2022
@mlbridge
Copy link

mlbridge bot commented Jun 15, 2022

Webrevs

@dcubed-ojdk
Copy link
Member Author

dcubed-ojdk commented Jun 15, 2022

@dholmes-ora, @pchilano and @robehn - Pinging you guys since we were all
involved with JDK-8286830 ~HandshakeState should not touch oops

@fisk - Pinging you since you are interested in GC barriers.

Copy link

@pchilano pchilano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would just use the _thread_terminated state rather than introducing a new _thread_gc_barrier_detached state just to detect failures. So the only extra things covered by _thread_gc_barrier_detached and not by _thread_terminated would be ThreadService::remove_thread() which just updates some counters, and ThreadsSMRSupport::remove_thread() which we overlooked before but now by inspection I don't see any other place where we touch oops there.

@dcubed-ojdk
Copy link
Member Author

@pchilano - Thanks for the review!

Since we've been having issues with oop usage after the GC barrier is detached
in several bugs, I was going for the more precise control of adding the new
TerminatedTypes value. There is quite a bit of code from:

old L3600: BarrierSet::barrier_set()->on_thread_detach(p);

to

old L3623: p->set_terminated(JavaThread::_thread_terminated);

so rather than have to reason about the code over and over again,
I would rather do these sanity checks with targeted code. To paraphrase
@robehn: Why do with code inspection what you can do with code?

@dholmes-ora
Copy link
Member

@dcubed-ojdk I must admit I'm also not a fan of adding this very specialized interim state into the thread termination states. I would have expected the barrier code itself to be able to detect when on_thread_detach had been called, instead of the JavaThread having to track that for itself.

@dcubed-ojdk
Copy link
Member Author

If the folks that own the barrier code add detection checks there, then this code
can be removed in the future. For now, I would like to get this code in so that I
can continue to search for additional instances of this type of failure.

As for the addition of the _thread_gc_barrier_detached TerminatedTypes value
and the location of when that state goes active, it is that new state and it's
placement that verified that the code in ThreadsSMRSupport::remove_thread()
was in the wrong place (fixed in JDK-8288139). Depending on the
_thread_terminated TerminatedTypes value did not and would not have verified
that problem.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Dan,

After looking at the barrier code I'm warming up to the proposal you have made, but with a couple of suggestions below.

Thanks.

@@ -3598,6 +3598,12 @@ void Threads::remove(JavaThread* p, bool is_daemon) {
// StackWatermarkSet::on_safepoint(), which performs GC processing,
// requiring the GC state to be alive.
BarrierSet::barrier_set()->on_thread_detach(p);
if (p->is_exiting()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this needs to be conditional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "normal" transitions for a JavaThread's _terminated field are:

_not_terminated => _thread_exiting => _thread_gc_barrier_detached => _thread_terminated

The transitions for a JavaThread that failed to attach are:

_not_terminated => _thread_terminated

because that JavaThread does not calls JavaThread::exit(). If we try to take
the attach failing JavaThread thru _thread_gc_barrier_detached, then some
of the counter code will be unhappy and we'll see assertion failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just dug up my testing notes:

gc/g1/ihop/TestIHOPStatic.java fails in fastdebug and slowdebug:

#  Internal Error (/System/Volumes/Data/work/shared/bug_hunt/8288139_for_jdk19.git/open/src/hotspot/share/services/threadService.cpp:177), pid=3359, tid=5891
#  assert(_live_threads_count->get_value() > count) failed: thread count mismatch 6 : 6 

So the purpose of the condition is so that an attach failing JavaThread has the
same transitions that it always had.

bool is_exiting() const;
// thread's GC barrier is detached or thread is terminated
bool is_gc_barrier_detached() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name this based on what it allows (access to oops) rather than what it relates to deep down? e.g. can_access_oops_safely or is_oop_safe ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely rename, but naming is hard... :-)

Those names you proposed are inversions of the current name and would
require that the logic of the function be inverted.

The other names I came up before I settled on is_gc_barrier_detached():

    can_no_longer_access_oops()
    oops_are_no_longer_safe()
    cannot_safely_access_oops()

The closest match between your proposed names and mine are:

can_access_oops_safely and cannot_safely_access_oops

Other opinions are welcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the style guide for naming is to use positive names to avoid double negatives when you have to negate the function - so is_safe_for_x rather than is_not_safe_for_x. I like thread->is_oop_safe() - short and clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do one more rename of the bug, the PR and the function name...

Comment on lines +1182 to +1186
// against the three non-terminated values so that a freed JavaThread
// will also be considered terminated.
bool check_is_terminated(TerminatedTypes l_terminated) const {
return l_terminated != _not_terminated && l_terminated != _thread_exiting;
return l_terminated != _not_terminated && l_terminated != _thread_exiting &&
l_terminated != _thread_gc_barrier_detached;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing: I don't think I buy this argument that we need to check all the states we don't want to see rather than the one we do. If the JavaThread has been freed we could see anything - and we should never be executing this code in the first place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic style of this function is a carry over from before we had Thread-SMR
and we tried to detect a freed JavaThread. With Thread-SMR, we don't need to
do the check this way at all.

Do we really want to change that as part of this fix for JDK19?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it can be cleaned up later. This fix just highlighted the badness of that piece of code. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New RFE filed:

JDK-8289003 JavaThread::check_is_terminated() implementation should rely on Thread-SMR

// JavaThread::exit() is seen more quickly.
// JavaThread::set_terminated() is seen more quickly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bogus comment - acquire semantics has nothing to do with seeing things more quickly.

@fisk
Copy link

fisk commented Jun 16, 2022

I have also been bitten before by oops being touched at the boundaries of the thread life cycle, so I understand why we would want to be more precise about modelling this in the thread life cycle. The only thing I don't understand with the patch is why there are two alternative state machines now. One that goes through the new state and one that does not go through the new state. There might be a good reason for that, but I would like to understand why that is. And probably have that written in a comment where the two state machines are described.

@robehn
Copy link
Contributor

robehn commented Jun 16, 2022

Hi,

Except for the few lines of code between:
BarrierSet::barrier_set()->on_thread_detach(p) and ThreadsSMRSupport::remove();

This new state is identical to the result of "ThreadsSMRSupport::get_java_thread_list()->includes(p)".
The JavaThread is attached to the barrier set when we add it to the ThreadsList and detached when we remove it.

If we have a safepoint poll between (done with Threads_lock held):
BarrierSet::barrier_set()->on_thread_detach(p);
->
p->set_terminated(JavaThread::_thread_terminated);

We have a bug. This means that this new state is not even possible to observe from another thread (with target safe).
So for any other thread there is no difference, which means we have new global state just for a few lines of code while we are terminating with Threads_lock held.

So I don't think this one place needs a new global state.

Note that for a safe target thread the terminated flag gives the same result as:
"ThreadsSMRSupport::get_java_thread_list()->includes(p)"

So jt->is_terminated() can be implemented with "ThreadsSMRSupport::get_java_thread_list()->includes(p)". (again assuming we are not doing some racy, and only asking for threads that are safe/stable)

@zhengyu123
Copy link
Contributor

Just a suggestion: I believe adding thread state check in Thread::gc_data() method helps to catch this problem in the future, as the gc_data is usually accessed by GC barrier ...

@dcubed-ojdk
Copy link
Member Author

Wow! Okay I'm no longer considering this to be a trivial fix. I'll update the description
in a second...

@dholmes-ora, @fisk, @robehn and @zhengyu123 - Thanks for chiming in on this
review. Obviously it will take me a bit of time to navigate this variety of comments
and figure out a way forward.

On the plus side, My Mach5 Tier8 is down to 11 of the usual 48 24H tasks and so
far there are no failures. So the testing on this fix (JDK-8288497) and JDK-8288139
is look just dandy!

@dcubed-ojdk
Copy link
Member Author

dcubed-ojdk commented Jun 16, 2022

@fisk's comment from above:

The only thing I don't understand with the patch is why there are two alternative
state machines now. One that goes through the new state and one that does not
go through the new state. There might be a good reason for that, but I would like
to understand why that is. And probably have that written in a comment where
the two state machines are described.

There used to be (at least) three alternative state machines. There were two in
JavaThread::exit() and the failing JNI attach is the third. There might be more,
but I haven't found them yet.

JDK-8286830 ~HandshakeState should not touch oops
changed JavaThread::exit() so we no longer had a path that called current_thread_exiting()
and another path that did not call current_thread_exiting().

Currently we only document the one state machine. I should probably fix that.

@dcubed-ojdk
Copy link
Member Author

@zhengyu123's comment from above:

Just a suggestion: I believe adding thread state check in Thread::gc_data()
method helps to catch this problem in the future, as the gc_data is usually
accessed by GC barrier ...

I think we (Runtime) are hoping that the GC folks add something specific that
is intended to catch these kind of problems. If that happens, then this hack
that I'm doing now can be retired. Adding a "thread state" check to
Thread::gc_data() might be the solution, but I would feel more comfortable if
that change was made by GC folks that know this barrier stuff.

@dcubed-ojdk
Copy link
Member Author

@robehn's comment from above:

Except for the few lines of code between:
BarrierSet::barrier_set()->on_thread_detach(p) and ThreadsSMRSupport::remove();

This new state is identical to the result of "ThreadsSMRSupport::get_java_thread_list()->includes(p)".

No it's not equivalent. If I replace is_gc_barrier_attached()'s implementation
with your check, then it would NOT catch the code that's in the wrong place
and is being fixed by JDK-8288139. Setting the new _terminated field state
exactly where I do so in this fix is what allowed me to prove that the code
that I moved in JDK-8288139 is in the wrong place.

This means that this new state is not even possible to observe from another thread (with target safe).

This new state is NOT for another thread to use. This new state is for the calling
thread to use to make sure that it doesn't use oops after the GC barrier is
detached.

So I don't think this one place needs a new global state.

Not for the long term. @dholmes-ora and I are hoping that the GC folks eventually
add some logic in the GC subsystem that catches this condition. If that is done, then
the this is_gc_barrier_detached() support code can all be removed.

Note that for a safe target thread the terminated flag gives the same result as:
"ThreadsSMRSupport::get_java_thread_list()->includes(p)"

So jt->is_terminated() can be implemented with "ThreadsSMRSupport::get_java_thread_list()->includes(p)".
(again assuming we are not doing some racy, and only asking for threads that are safe/stable)

I don't think we want to retire the _terminated field in this bug fix. And the
replacement code is much more expensive as the ThreadsList grows due to
the search time.

@dcubed-ojdk dcubed-ojdk changed the title 8288497: add support for JavaThread::is_gc_barrier_detached() 8288497: add support for JavaThread::cannot_access_oops_safely() Jun 16, 2022
@dcubed-ojdk
Copy link
Member Author

/issue JDK-8288497

@openjdk
Copy link

openjdk bot commented Jun 16, 2022

@dcubed-ojdk This issue is referenced in the PR title - it will now be updated.

@robehn
Copy link
Contributor

robehn commented Jun 17, 2022

@robehn's comment from above:

Except for the few lines of code between:
BarrierSet::barrier_set()->on_thread_detach(p) and ThreadsSMRSupport::remove();
This new state is identical to the result of "ThreadsSMRSupport::get_java_thread_list()->includes(p)".

No it's not equivalent. If I replace is_gc_barrier_attached()'s implementation with your check, then it would NOT catch the code that's in the wrong place and is being fixed by JDK-8288139. Setting the new _terminated field state exactly where I do so in this fix is what allowed me to prove that the code that I moved in JDK-8288139 is in the wrong place.

That's why I said except between these lines :)

If you say this is the correct way to handle it for now, then it is.
Looks fine, thanks!

@openjdk
Copy link

openjdk bot commented Jun 17, 2022

@dcubed-ojdk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8288497: add support for JavaThread::is_oop_safe()

Reviewed-by: pchilanomate, dholmes, rehn, eosterlund

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 6 new commits pushed to the master branch:

  • ae030bc: 8288397: AArch64: Fix register issues in SVE backend match rules
  • f12d044: 8288692: jdk/javadoc/doclet/testTagMisuse/TestTagMisuse.java fails after JDK-8288545
  • 97544be: 8268398: 15% increase in JFR footprint in Noop-Base
  • 983f75c: 8288545: Missing space in error message
  • 53bf1bf: 8286176: Add JNI_VERSION_19 to jni.h and JNI spec
  • c254c9d: 8287401: jpackage tests failing on Windows due to powershell issue

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 17, 2022
Copy link

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation, @dcubed-ojdk. This looks good to me.

@dcubed-ojdk
Copy link
Member Author

@dholmes-ora, @robehn and @fisk - Thanks for the re-reviews!
I'm going to do the rename that @dholmes-ora suggested and then
retest again...

@dcubed-ojdk
Copy link
Member Author

/issue JDK-8288497

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jun 17, 2022
@openjdk openjdk bot changed the title 8288497: add support for JavaThread::cannot_access_oops_safely() 8288497: add support for JavaThread::is_oop_safe() Jun 17, 2022
@openjdk
Copy link

openjdk bot commented Jun 17, 2022

@dcubed-ojdk This issue is referenced in the PR title - it will now be updated.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 17, 2022
Copy link

@pchilano pchilano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Dan, looks good.

@dcubed-ojdk
Copy link
Member Author

@pchilano - Thanks for the re-review!

@dcubed-ojdk
Copy link
Member Author

The latest version's Mach5 Tier[1-7] looks good. Mach5 Tier8 was just started.

@dholmes-ora - please re-review when you get the chance.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates Dan and your patience.

One minor query/suggestion below, but this is good as-is.

Thanks.

check_is_terminated(l_terminated);
}

inline bool JavaThread::is_oop_safe() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to document, and assert, that this is only intended for use by the current thread on itself? I can't imagine any scenario where we would need to ask whether an arbitrary thread is oop-safe. ??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dholmes-ora - Thanks very much for the re-review.

I'll accept your "good as-is" and proceed with integration.

I think I've tweaked this fix enough at this point. Especially if we end up replacing
it with support in the GC subsystems instead.

@dcubed-ojdk
Copy link
Member Author

The Mach5 Tier8 run had no failures. All testing looks great.

/integrate

@openjdk
Copy link

openjdk bot commented Jun 21, 2022

Going to push as commit e26d3b3.
Since your change was applied there have been 15 commits pushed to the master branch:

  • c74a923: 8288531: Empty spans in mobile navigation markup
  • af05139: 8288467: remove memory_operand assert for spilled instructions
  • b9c3966: 8288671: Problematic fix for font boosting
  • 453e8be: 8288527: broken link in java.base/java/util/zip/package-summary.html
  • 33d0363: 8288741: JFR: Change package name of snippet files
  • 0408f9c: 8288663: JFR: Disabling the JfrThreadSampler commits only a partially disabled state
  • 1cf83a4: 8287800: JFR: Incorrect error message when starting recording with missing .jfc file
  • 09da87c: 8288485: jni/nullCaller/NullCallerTest.java failing (ppc64)
  • ed714af: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114
  • ae030bc: 8288397: AArch64: Fix register issues in SVE backend match rules
  • ... and 5 more: https://git.openjdk.org/jdk19/compare/ff3db52044f69754b1ccf54961f453d4afbfba3a...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 21, 2022
@openjdk openjdk bot closed this Jun 21, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 21, 2022
@openjdk
Copy link

openjdk bot commented Jun 21, 2022

@dcubed-ojdk Pushed as commit e26d3b3.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dcubed-ojdk dcubed-ojdk deleted the JDK-8288497 branch June 21, 2022 20:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
6 participants