Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8277434: tests fail with "assert(is_forwarded()) failed: only decode when actually forwarded" #6582

Closed
wants to merge 3 commits into from

Conversation

kimbarrett
Copy link

@kimbarrett kimbarrett commented Nov 28, 2021

Please review this change which fixes a rare assert failure in
G1ParScanThreadState::write_ref_field_post. The problem is that the assert is
referencing values from the forwardee, and that isn't safe. When an object is
copied and forwarded to the copy, the forwarding is performed with a relaxed
cmpxchg. This means the data being copied into the forwardee might not be
visible to a different thread that has obtained the forwardee. All uses of a
forwardee must take care to avoid reading from it without performing
additional synchronization. The assert in question violates that restriction.

The incorrect assertion is

assert(dest_attr.is_in_cset() == (obj->is_forwarded() && obj->forwardee() == obj), "...");

obj is the forwardee of a different object.

It's okay here to examine the contents of obj when it's in the cset. In
that case it's not a copy; it's the original object, self-forwarded due to
evacuation failure.

But it's not okay to examine the contents of obj when it's not in the cset,
e.g. when it's a forwardee copy of the original object. Doing so violates
the above restriction.

So that tells us this code is wrong, but it's still not obvious this is the
cause of the failure in question. The failure we're seeing appears to be
that obj->is_forwarded() returns true, but the assert of the same thing in
forwardee returns false. But is_forwarded() should never be true for a
copied forwardee. But it turns out a copied forwardee might temporarily
appear to be forwarded, due to the unordered copy + forward sequence.
Consider the following scenario:

A thread is evacuating an object and allocates a candidate forwardee, copies
into it, and finally discovers the forwarding race was lost even before the
copy. The candidate forwardee appears to be marked as forwarded. That
candidate is deallocated and the real forwardee is used by the thread.

That same thread evacuates a second object, reallocating some of the same
memory as from the previous candidate. It copies the second object into the
new forwardee, successfully performs the forwarding, and uses that newly
created forwardee.

A second thread attempts to evacuate that same second object, and finds it
already forwarded. It goes into write_ref_field_post, reaches the assert,
and attempts to use values from the forwardee. Because the copy and the
forwarding by the first thread were unordered, this second thread might see
the old copy that appears to be forwarded, rather than the up to date copy.
So the first call to is_forwarded() returns true. The forwardee() assert of
the same thing might instead get the up to date contents, resulting in the
reported failure. Alternatively, it might still get old data, in which case
nothing gets noticed because the self-forward check will fail and the
write_ref_field_post assert will (accidentally) succeed.

So there's a very small window in which the reported failure can occur. But
it's all caused by the write_ref_field_post assert performing invalid
accesses. Stop doing that and the problem should go away.

I think this failure couldn't have happened before JDK-8271579, but that's
kind of accidental and doesn't change that the write_ref_field_post assert
was performing invalid accesses. (Those accesses were introduced later.)

Testing:
mach5 tier1-5.

Ran applications/jcstress/acqrel.java 100 times without failure.
Without the change I had a 25-30% failure rate for that test.
I don't know why that test was such a good canary, when this failure seems to
otherwise be pretty rare, but glad that it was.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8277434: tests fail with "assert(is_forwarded()) failed: only decode when actually forwarded"

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6582/head:pull/6582
$ git checkout pull/6582

Update a local copy of the PR:
$ git checkout pull/6582
$ git pull https://git.openjdk.java.net/jdk pull/6582/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6582

View PR using the GUI difftool:
$ git pr show -t 6582

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6582.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Nov 28, 2021

👋 Welcome back kbarrett! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Nov 28, 2021

@kimbarrett The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc label Nov 28, 2021
@openjdk openjdk bot added the rfr label Nov 28, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Nov 28, 2021

Webrevs

Copy link
Member

@albertnetymk albertnetymk left a comment

Thank you for the detailed analysis; this makes perfect sense.

Copy link
Contributor

@shipilev shipilev left a comment

Ooof, so a relaxed forwardee installation comes with an odd invariant like this. I linked JDK-8204524 to this issue, to track where the forwardee installation was relaxed.

The fix for this particular issue looks fine to me, thanks.

A few follow-up questions about the general issue here. How is it not affecting the other places where G1 accesses obj->forwardee(), like in G1ScanClosureBase::prefetch_and_push or G1ParScanThreadState::do_partial_array? Also, it is not theoretically clear how would "All uses of a forwardee must take care to avoid reading from it without performing additional synchronization", given that the trouble had already happened on writer side, and there is little a reader can do to recover?

@openjdk
Copy link

@openjdk openjdk bot commented Nov 29, 2021

@kimbarrett This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8277434: tests fail with "assert(is_forwarded()) failed: only decode when actually forwarded"

Reviewed-by: ayang, shade, tschatzl

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Nov 29, 2021
@tschatzl
Copy link
Contributor

@tschatzl tschatzl commented Nov 29, 2021

Ooof, so a relaxed forwardee installation comes with an odd invariant like this. I linked JDK-8204524 to this issue, to track where the forwardee installation was relaxed.

The fix for this particular issue looks fine to me, thanks.

A few follow-up questions about the general issue here. How is it not affecting the other places where G1 accesses obj->forwardee(), like in G1ScanClosureBase::prefetch_and_push or G1ParScanThreadState::do_partial_array? Also, it is not theoretically clear how would "All uses of a forwardee must take care to avoid reading from it without performing additional synchronization", given that the trouble had already happened on writer side, and there is little a reader can do to recover?

Both cases examine the original object in the collection set only; for the first example there is an explicit check on is_in_cset before the call (G1ScanEvacuatedObjClosure::do_oop_work is also only ever called from the thread that copied the object too; but all G1ScanClosureBase::prefetch_and_push are guarded by an is_in_cset check). from_obj in G1ParScanThreadState::do_partial_array is always in the collection set, i.e.

assert(_g1h->is_in_reserved(from_obj), "must be in heap.");

could be rephrased as g1h->is_in_cset(from_obj).

Copy link
Contributor

@tschatzl tschatzl left a comment

lgtm

@mlbridge
Copy link

@mlbridge mlbridge bot commented Nov 29, 2021

Mailing list message from Kim Barrett on hotspot-gc-dev:

On Nov 29, 2021, at 3:35 AM, Thomas Schatzl <tschatzl at openjdk.java.net> wrote:

On Mon, 29 Nov 2021 08:07:34 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

Ooof, so a relaxed forwardee installation comes with an odd invariant like this. I linked JDK-8204524 to this issue, to track where the forwardee installation was relaxed.

The fix for this particular issue looks fine to me, thanks.

A few follow-up questions about the general issue here. How is it not affecting the other places where G1 accesses `obj->forwardee()`, like in `G1ScanClosureBase::prefetch_and_push` or `G1ParScanThreadState::do_partial_array`? Also, it is not theoretically clear how would "All uses of a forwardee must take care to avoid reading from it without performing additional synchronization", given that the trouble had already happened on writer side, and there is little a reader can do to recover?

Both cases examine the original object in the collection set only; for the first example there is an explicit check on `is_in_cset` before the call (`G1ScanEvacuatedObjClosure::do_oop_work` is also only ever called from the thread that copied the object too; but all `G1ScanClosureBase::prefetch_and_push` are guarded by an `is_in_cset` check). `from_obj` in `G1ParScanThreadState::do_partial_array` is always in the collection set, i.e.

assert(_g1h->is_in_reserved(from_obj), "must be in heap.");

could be rephrased as `g1h->is_in_cset(from_obj)`.

The key point in the G1ScanEvacuatedObjClosure::do_oop_work case is that the
thread that copied the object is the thread doing the prefetch_and_push. It's
buried in the obj->oop_iterate_backwards call in do_copy_to_survivor_space.

For do_partial_array, while the from_obj is in the collection set, its
forwardee is not, and might also not be up to date if not for the additional
synchronization provided by being distributed through the taskqueue.
start_partial_objarray pushes tasks onto the taskqueue, which involves a
release-store. And the thread that obtains the task does a pop-local or
pop-global from the taskqueue, which involves acquire or fence. So the
contents of the forwardee in the object in the partial-array task are
up-to-date when processing the task. That's an example of "additional
synchronization" I was referring to.

@shipilev
Copy link
Contributor

@shipilev shipilev commented Nov 29, 2021

All right then, thanks for explanations.

@kimbarrett
Copy link
Author

@kimbarrett kimbarrett commented Nov 30, 2021

Thanks for reviews @albertnetymk , @shipilev , and @tschatzl .

@kimbarrett
Copy link
Author

@kimbarrett kimbarrett commented Nov 30, 2021

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Nov 30, 2021

Going to push as commit 15a6806.

@openjdk openjdk bot closed this Nov 30, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Nov 30, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 30, 2021

@kimbarrett Pushed as commit 15a6806.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@kimbarrett kimbarrett deleted the fix_forwarding_assert branch Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-gc integrated
4 participants