Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8288129: Shenandoah: Skynet test crashed with iu + aggressive #9982

Closed
wants to merge 2 commits into from

Conversation

ashu-mehra
Copy link
Contributor

@ashu-mehra ashu-mehra commented Aug 23, 2022

Please review this patch to fix the crash when running Loom with Shenandoah in iu+aggressive mode.

When running with -XX:+ShenandoahVerify, the issue manifests as assertion at:

   1 #
   2 # A fatal error has been detected by the Java Runtime Environment:
   3 #
   4 #  Internal Error (/home/asmehra/data/ashu-mehra/loom/src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp:92), pid=920144, tid=920156
   5 #  Error: Before Evacuation, Reachable; Must be marked in complete bitmap, except j.l.r.Reference referents
   6 
   7 Referenced from:
   8   interior location: 0x00007f0a40000090
   9   0x00007f0a40000060 - klass 0x00007f0a0d89a290 jdk.internal.vm.StackChunk
  10         allocated after mark start
  11     not after update watermark
  12         marked strong
  13         marked weak
  14     not in collection set
  15   mark: mark(is_neutral no_hash age=0)
  16   region: |    0|R  |BTE 7f0a40000000, 7f0a401ff9b0, 7f0a40200000|TAMS 7f0a40000000|UWM 7f0a401ff9b0|U  2046K|T  2046K|G     0B|S     0B|L  2046K|CP   0
  17 
  18 Object:
  19   0x00007f0c2f848a40 - klass 0x00007f0a0d89a290 jdk.internal.vm.StackChunk
  20     not allocated after mark start
  21     not after update watermark
  22     not marked strong
  23     not marked weak
  24         in collection set
  25   mark: mark(is_neutral no_hash age=0)
  26   region: | 3964|CS |BTE 7f0c2f800000, 7f0c2fa00000, 7f0c2fa00000|TAMS 7f0c2fa00000|UWM 7f0c2fa00000|U  2048K|T     0B|G  2048K|S     0B|L  1612K|CP   0
  27 
  28 Forwardee:
  29   (the object itself)

The StackChunk object 0x00007f0c2f848a40 is not marked which happens to be referenced from the parent field of the newly allocated StackChunk object 0x00007f0a40000060.
The sequence for setting StackChunk::parent is as follows. At some point during the process of freezing the continuation, the jvm does:

	continuationWrapper::_tail = stackChunk1
	stackChunk2 = allocate new StackChunk
	stackChunk2::parent = continuationWrapper::_tail
	continuationWrapper::_tail = stackChunk2

At the end of the sequence stackChunk1 is only reachable from stackChunk2. If stackChunk2 happens to be allocated after concurrent mark has started and if the shenandoahgc is using IU mode, then the stackChunk2 would would not be scanned. This results in gc missing the marking of stackChunk1 which triggers the the assertion during heap verification.

This is similar to the sequence described by @fisk here

There is code in FreezeBase::finish_freeze() to call stackChunkOopDesc::do_barriers() which triggers the gc barriers for the newly allocated StackChunk object. But it has two problems:

  1. The call to stackChunkOopDesc::do_barriers() is guarded by a flag [1] which is false for StackChunk objects allocated after marking has started [2]
  2. stackChunkOopDesc::do_barriers() currently triggers the gc barriers for the oops in the stack represented by the newly allocated StackChunk, but it ignores the oops in the StackChunk header

To fix these two issues, we need the following changes:

  1. Always enable barrier for Shenandoah IU mode (this change is same as 8288129: Shenandoah: Skynet test crashed with iu + aggressive 8288129: Shenandoah: Skynet test crashed with iu + aggressive #9494).
  2. Add the code in stackChunkOopDesc::do_barriers to run the barriers on the oops present in the stack chunk header.

[1]

if (UNLIKELY(_barriers)) {

[2]
bool ShenandoahHeap::requires_barriers(stackChunkOop obj) const {

Signed-off-by: Ashutosh Mehra asmehra@redhat.com


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8288129: Shenandoah: Skynet test crashed with iu + aggressive

Contributors

  • Zhengyu Gu <zgu@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9982/head:pull/9982
$ git checkout pull/9982

Update a local copy of the PR:
$ git checkout pull/9982
$ git pull https://git.openjdk.org/jdk pull/9982/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9982

View PR using the GUI difftool:
$ git pr show -t 9982

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9982.diff

Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
@bridgekeeper
Copy link

bridgekeeper bot commented Aug 23, 2022

👋 Welcome back ashu-mehra! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 23, 2022
@openjdk
Copy link

openjdk bot commented Aug 23, 2022

@ashu-mehra The following labels will be automatically applied to this pull request:

  • hotspot
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Aug 23, 2022
@ashu-mehra
Copy link
Contributor Author

/contributor add zhengyu123

@openjdk
Copy link

openjdk bot commented Aug 23, 2022

@ashu-mehra [zhengyu123](https://github.com/zhengyu123) was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@mlbridge
Copy link

mlbridge bot commented Aug 23, 2022

Webrevs

@ashu-mehra
Copy link
Contributor Author

/contributor add @zhengyu123

@openjdk
Copy link

openjdk bot commented Aug 23, 2022

@ashu-mehra
Contributor Zhengyu Gu <zgu@openjdk.org> successfully added.

@ashu-mehra
Copy link
Contributor Author

@stefank @fisk I see this code [1] which indicates ZGC does not require barriers at all. I would appreciate if either of you can provide inputs on how does ZGC handles the scenario where the previously reachable object O1 is now reachable only from another object O2 allocated after marking has started. How would O1 be marked by ZGC in such case?
This seems to be the scenario we are hitting in the context of loom code during the freeze process and shenandoah IU mode fails to handle it.

[1]

_barriers = !UseZGC && chunk->requires_barriers();

Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
@stefank
Copy link
Member

stefank commented Aug 29, 2022

@stefank @fisk I see this code [1] which indicates ZGC does not require barriers at all. I would appreciate if either of you can provide inputs on how does ZGC handles the scenario where the previously reachable object O1 is now reachable only from another object O2 allocated after marking has started. How would O1 be marked by ZGC in such case? This seems to be the scenario we are hitting in the context of loom code during the freeze process and shenandoah IU mode fails to handle it.

[1]

_barriers = !UseZGC && chunk->requires_barriers();

This is our thinking w.r.t ZGC:

  1. ZGC only returns memory from what we call "allocating" ZPages. Such pages are never part of an on-going marking cycle, and therefore don't need any barriers.
  2. There's no safepoint after the memory has been allocated via chunk_oop = allocator.allocate(); // can safepoint

So, when we reach the line you linked to, we are guaranteed that we don't need to perform any GC barriers on that object.

Now, for Shenandoah IU mode, AFAIU (drawing parallels to CMS), the raw initializing stores are problematic, because you don't register those oops anywhere:

  // fields are uninitialized
  chunk->set_parent_raw<typename ConfigT::OopT>(_cont.last_nonempty_chunk());
  chunk->set_cont_raw<typename ConfigT::OopT>(_cont.continuation());

You can also see how that comment about the fields are uninitialized, refers to SATB marking, and most-likely uses that as a rationale for skipping store barriers.

In Generational ZGC, we also introduced store barriers. It's not similar to IU, but you can see how we had to changed from a raw store, to a store with a barrier:
https://github.com/openjdk/zgc/blob/f0b25d9339104a80f903d889a7939dd623c76867/src/hotspot/share/runtime/continuationFreezeThaw.cpp

  // Allocate the chunk.
  //
  // This might safepoint while allocating, but all safepointing due to
  // instrumentation have been deferred. This property is important for
  // some GCs, as this ensures that the allocated object is in the young
  // generation / newly allocated memory.
  StackChunkAllocator allocator(klass, size_in_words, current, stack_size, _cont, _jvmti_event_collector);
  stackChunkOop chunk = allocator.allocate();

  if (chunk == nullptr) {
    return nullptr; // OOME
  }
...
#if INCLUDE_ZGC
  if (UseZGC) {
    // fields are uninitialized
    chunk->set_parent_access<IS_DEST_UNINITIALIZED>(_cont.last_nonempty_chunk());
    chunk->set_cont_access<IS_DEST_UNINITIALIZED>(_cont.continuation());
    ZStackChunkGCData::initialize(chunk);

    assert(!chunk->requires_barriers(), "ZGC always allocates in the young generation");
    _barriers = false;
  } else
#endif

@fisk
Copy link
Contributor

fisk commented Aug 29, 2022

@stefank @fisk I see this code [1] which indicates ZGC does not require barriers at all. I would appreciate if either of you can provide inputs on how does ZGC handles the scenario where the previously reachable object O1 is now reachable only from another object O2 allocated after marking has started. How would O1 be marked by ZGC in such case? This seems to be the scenario we are hitting in the context of loom code during the freeze process and shenandoah IU mode fails to handle it.

[1]

_barriers = !UseZGC && chunk->requires_barriers();

This is our thinking w.r.t ZGC:

  1. ZGC only returns memory from what we call "allocating" ZPages. Such pages are never part of an on-going marking cycle, and therefore don't need any barriers.

  2. There's no safepoint after the memory has been allocated via chunk_oop = allocator.allocate(); // can safepoint

So, when we reach the line you linked to, we are guaranteed that we don't need to perform any GC barriers on that object.

Now, for Shenandoah IU mode, AFAIU (drawing parallels to CMS), the raw initializing stores are problematic, because you don't register those oops anywhere:


  // fields are uninitialized

  chunk->set_parent_raw<typename ConfigT::OopT>(_cont.last_nonempty_chunk());

  chunk->set_cont_raw<typename ConfigT::OopT>(_cont.continuation());

You can also see how that comment about the fields are uninitialized, refers to SATB marking, and most-likely uses that as a rationale for skipping store barriers.

In Generational ZGC, we also introduced store barriers. It's not similar to IU, but you can see how we had to changed from a raw store, to a store with a barrier:

https://github.com/openjdk/zgc/blob/f0b25d9339104a80f903d889a7939dd623c76867/src/hotspot/share/runtime/continuationFreezeThaw.cpp


  // Allocate the chunk.

  //

  // This might safepoint while allocating, but all safepointing due to

  // instrumentation have been deferred. This property is important for

  // some GCs, as this ensures that the allocated object is in the young

  // generation / newly allocated memory.

  StackChunkAllocator allocator(klass, size_in_words, current, stack_size, _cont, _jvmti_event_collector);

  stackChunkOop chunk = allocator.allocate();



  if (chunk == nullptr) {

    return nullptr; // OOME

  }

...

#if INCLUDE_ZGC

  if (UseZGC) {

    // fields are uninitialized

    chunk->set_parent_access<IS_DEST_UNINITIALIZED>(_cont.last_nonempty_chunk());

    chunk->set_cont_access<IS_DEST_UNINITIALIZED>(_cont.continuation());

    ZStackChunkGCData::initialize(chunk);



    assert(!chunk->requires_barriers(), "ZGC always allocates in the young generation");

    _barriers = false;

  } else

#endif

The neat thing with using the right Access API stores instead of an ad hoc barrier, is that it works for both shenandoah IU mode, and generational ZGC, while the ad hoc barrier does not.

@ashu-mehra
Copy link
Contributor Author

The neat thing with using the right Access API stores instead of an ad hoc barrier, is that it works for both shenandoah IU mode, and generational ZGC, while the ad hoc barrier does not.

@fisk are you suggesting that changing the Access API from set_parent_raw to set_parent_access is the right approach? This introduces gc policy specific code which does not look clean.
It appears that the combination of requires_barriers() and do_barriers() API is suppose to encapsulate the gc policy specific requirements. requires_barriers() would handle any gc specific requirement for using barriers and do_barriers() would then take the necessary action. This makes sense as it keeps the code clean. wdyt?

@fisk
Copy link
Contributor

fisk commented Aug 30, 2022

The neat thing with using the right Access API stores instead of an ad hoc barrier, is that it works for both shenandoah IU mode, and generational ZGC, while the ad hoc barrier does not.

@fisk are you suggesting that changing the Access API from set_parent_raw to set_parent_access is the right approach? This introduces gc policy specific code which does not look clean. It appears that the combination of requires_barriers() and do_barriers() API is suppose to encapsulate the gc policy specific requirements. requires_barriers() would handle any gc specific requirement for using barriers and do_barriers() would then take the necessary action. This makes sense as it keeps the code clean. wdyt?

That is precisely what I am suggesting. The situation is the opposite from what you think. The Access API is the public interface to the GC, that will do the right thing for all GCs, while the requires_barriers and do_barriers stuff is less general, as I just discussed.

@ashu-mehra
Copy link
Contributor Author

@fisk you are right, I see now that set_parent_access<IS_DEST_UNINITIALIZED> is a no-op for satb barriers. So we can just change set_parent_raw to set_parent_access<IS_DEST_UNINITIALIZED> for all gc policies, right?
btw set_parent_access<decorators> is missing in the mainline, so I would have to add this API (and the corresponding API oop::obj_field_put_access) as well. Does that make sense?

@fisk
Copy link
Contributor

fisk commented Aug 30, 2022

@fisk you are right, I see now that set_parent_access<IS_DEST_UNINITIALIZED> is a no-op for satb barriers. So we can just change set_parent_raw to set_parent_access<IS_DEST_UNINITIALIZED> for all gc policies, right?

btw set_parent_access<decorators> is missing in the mainline, so I would have to add this API (and the corresponding API oop::obj_field_put_access) as well. Does that make sense?

That sounds great! :)

@ashu-mehra
Copy link
Contributor Author

Closing it in favor of #10089

@ashu-mehra ashu-mehra closed this Sep 7, 2022
@ashu-mehra ashu-mehra deleted the JDK-8288129 branch February 7, 2023 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review shenandoah shenandoah-dev@openjdk.org
3 participants