-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8293252: Shenandoah: ThreadMXBean synchronizer tests crash with aggressive heuristics #10268
Conversation
…gressive mode Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
👋 Welcome back ashu-mehra! A progress list of the required criteria for merging this PR into |
@ashu-mehra The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
@@ -175,6 +175,8 @@ class ShenandoahBarrierSet: public BarrierSet { | |||
template <typename T> | |||
static oop oop_atomic_xchg_not_in_heap(T* addr, oop new_value); | |||
|
|||
template <typename T> | |||
static void oop_store_common(T* addr, oop value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation detail and a public part of the barrier set. We wouldn't expect users - other than the two existing callers - to ever use this api.
Can it be refactored out of this header and kept entirely within shenandoahBarrierSet.inline.hpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made oop_store_common
a private method. It needs to be part of ShenandoahBarrierSet::AccessBarrier
to be able to access RawAccessBarrier::oop_store()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DanHeidinga are you okay with the updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help me understand why VM_ThreadDump
would be modifying oops?
shenandoah_assert_marked_if(NULL, value, !CompressedOops::is_null(value) && ShenandoahHeap::heap()->is_evacuation_in_progress()); | ||
shenandoah_assert_not_in_cset_if(addr, value, value != NULL && !ShenandoahHeap::heap()->cancelled_gc()); | ||
shenandoah_assert_not_in_cset_if(addr, value, value != NULL && !ShenandoahHeap::heap()->cancelled_gc() && ShenandoahHeap::heap()->is_concurrent_mark_in_progress()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The collection set is chosen during final mark and should always be empty during concurrent mark, so restricting this assertion to run only when concurrent mark is in progress effectively disables it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting we are better off removing this assertion completely, because the assertion that the value
is not in collection set is also not correct as is evident from this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried the assertion is correct and the thread dump is doing something wrong. Why is it writing an oop? Should it go through the load reference barrier first and evacuate the object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my first impression as well - that we are missing load reference barrier somewhere. But then I realized we do hit the LRB when the JVM tires to load oop from the OopHandle using OopHandle::resolve
, as is done here [0]
That prompted me to think there is nothing wrong in VM_ThreadDump
and the problem lies in the assert conditions, more so because oop_store_in_heap()
has that additional condition of is_concurrent_mark_in_progress()
.
With the additional condition I added in this PR, I am currently hitting another assert with the same test case in the same code area. The sequence of events leading to the new assertion failure are:
T1: GC cycle started and marked object A
T2: "Concurrent strong root" is completed which evacuates and updates the non-heap strong roots
T3: Concurrent evacuation started
T4: VMThread running VM_ThreadDump creates a new non-heap strong root R to object A (by creating OopHandle)
T5: VMThread completes VM_ThreadDump operation
T4: Concurrent evacuation moves object A
T5: GC cycle completes
T6: New GC cycle starts (in my case it was because the test was running with "aggressive" heuristics)
T7: Concurrent root marking comes across R and crashes trying to access A which has been evacuated
Initially I was thinking of resolving it by moving "Concurrent strong root" phase after "Concurrent evacuation" so that any new non-heap strong roots which may have been created during evacuation are also properly updated.
But now I am skeptic. It seems the underlying problem for both these asserts is the same - the object was not evacuated when it was handed over to VM_ThreadDump.
VM_ThreadDump
calls HeapInspection::find_instances_at_safepoint
to find instances of a particular class at safepoint which uses GC API object_iterate()
. In case of Shenandoah GC core of this API is in ObjectIterateScanRootClosure::do_oop_work()
[1] which does not have the code to evacuate the object. So probably the right way to fix these issues is to add a call to _heap->evacuate_object()
in ObjectIterateScanRootClosure::do_oop_work
. what do you think?
[0]
jdk/src/hotspot/share/services/management.cpp
Line 1288 in 6beeb84
synchronizers_array->obj_at_put(k, locks->at(k).resolve()); |
[1]
void do_oop_work(T* p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, OopHandle::resolve
should have gone through the LRB, which should have evacuated the object. If the object is in the cset
and it didn't get evacuated by the LRB, then it should be true that the gc is cancelled (out-of-memory during evac). Did this happen under low memory conditions? I wonder if the VM thread did try to evacuate it and fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks related: https://bugs.openjdk.org/browse/JDK-8235324. The CollectedHeap::keep_alive
API doesn't support the case where the given oop would need to be evacuated and return the new location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out that bug. It explains the need for the keep_alive API. But that API is solving a different issue than what we are facing here. I think we are on right track with respect to adding LRB during object iteration.
I am going to try changing ObjectIterateScanRootClosure
to use HeapAccess<AS_NO_KEEPALIVE>
instead of RawAccess
.
I have been deliberating over the need for if (_heap->is_concurrent_weak_root_in_progress() && !_marking_context->is_marked(obj))
check in ObjectIterateScanRootClosure
after we switch to HeapAccess. The same check is executed by LRB [1] if ON_PHANTOM_OOP_REF
decorator is also set, which makes me think we should probably use HeapAccess<AS_NO_KEEPALIVE | ON_PHANTOM_OOP_REF>
and remove that check from ObjectIterateScanRootClosure
. wdyt?
[1]
jdk/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp
Lines 104 to 108 in 6beeb84
if ((decorators & ON_PHANTOM_OOP_REF) != 0 && | |
_heap->is_concurrent_weak_root_in_progress() && | |
!_heap->marking_context()->is_marked(obj)) { | |
return NULL; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added commit that replaces the RawAccess
with HeapAccess<AS_NO_KEEPALIVE>
. I think its better not to avoid the check if (_heap->is_concurrent_weak_root_in_progress() && !_marking_context->is_marked(obj))
by using ON_PHANTOM_OOP_REF
decorator as it relies on the assumption that weak oops in the VM are similar to phantomly reachable references (which is true and may remain so, but is not as explicit as the condition itself). I have therefore kept the check for is_concurrent_weak_root_in_progress
as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be simpler to only change the one line in these object iterators:
obj = ShenandoahBarrierSet::resolve_forwarded_not_null(obj);
to
obj = ShenandoahBarrierSet::load_reference_barrier(obj);
This will evacuate objects in the cset after checking for unmarked concurrent weak roots. resolve_forwarded_not_null
presumes the object has already been evacuated, but that may not be the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to just call ShenandoahBarrierSet::load_reference_barrier
directly. Thanks for pointing that out.
The relevant stack trace for the VMThread (at the point of assertion failure) is:
The VMThread is creating a new OopHandle and as part of that it creates a new oop
|
…tion Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
T o = RawAccess<>::oop_load(p); | ||
if (!CompressedOops::is_null(o)) { | ||
oop obj = CompressedOops::decode_not_null(o); | ||
oop obj = HeapAccess<AS_NO_KEEPALIVE>::oop_load(p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking a closer look at the LRB code, I think we should or
in ON_WEAK_OOP_REF
to the access decorators. With only the AS_NO_KEEPALIVE
decorator, the LRB will fall through all the cases here and end up evacuating a doomed weak referent (i.e., keeping the object alive).
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shenandoah/shenandoahBarrierSet.inline.hpp#L119
I'm worried that this access here could defeat the logic which is meant to keep weak references from being resurrected. It also looks like if we add in the ON_WEAK_OOP_REF
decorator, the LRB will handle this case for the object iterator too:
if (_heap->is_concurrent_weak_root_in_progress() && !_marking_context->is_marked(obj)) {
// There may be dead oops in weak roots in concurrent root phase, do not touch them.
return;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed the fact that the object referred by week root would have already been evacuated by the time we hit is_concurrent_weak_root_in_progress condition.
objects referred by weak roots. Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
@earthling-amzn as you suggested I have added another commit to call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Thank you for working through it with me!
@earthling-amzn thank you for reviewing it and guiding the code changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, too. Thanks, Ashu!
/Roman
@ashu-mehra This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 377 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@rkennke, @phohensee) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@rkennke thank you for reviewing it. |
/integrate |
@ashu-mehra |
@rkennke can you please sponsor it as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm. I'll sponsor.
/sponsor |
Going to push as commit 3675f4c.
Your commit was automatically rebased without conflicts. |
@phohensee @ashu-mehra Pushed as commit 3675f4c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
@phohensee thank you! |
@rkennke Unknown command |
Please review the fix for the assertion failure when running ThreadMXBean synchronizer code in fastdebug mode.
Please note that the assertion failure happens in regular (satb) mode as well, not just in iu mode.
At the point of assertion failure, the JVM is at a safepoint as the VMThread is executing VM_ThreadDump operation and the GC worker threads are paused after the mark phase.
Assertion happened when the VMThread performs
NativeAccess<>::oop_store()
(as part of creating anOopHandle
) on an object which has been marked and is in the collection set. But the failing assertion inoop_store_not_in_heap()
expects the oop is not in the collection set.The assertion is conditional on the statement
value != NULL && !ShenandoahHeap::heap()->cancelled_gc()
But it looks like it is missing another important condition - the GC should be in marking phase.
i.e. the assertion is valid only if it is protected by
ShenandoahHeap::heap()->is_concurrent_mark_in_progress()
.So the correct assertion should be:
In fact this assertion is already present in one of its caller (`oop_store_in_heap()1) in its negative form:
Also, it reads strange that
oop_store_in_heap()
would calloop_store_not_in_heap()
.So I have moved the code to a separate method
oop_store_common()
that gets called by bothoop_store_in_heap()
andoop_store_not_in_heap()
.Tested it by running following tests in fastdebug mode:
Signed-off-by: Ashutosh Mehra asmehra@redhat.com
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10268/head:pull/10268
$ git checkout pull/10268
Update a local copy of the PR:
$ git checkout pull/10268
$ git pull https://git.openjdk.org/jdk pull/10268/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10268
View PR using the GUI difftool:
$ git pr show -t 10268
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10268.diff