New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8256020: Shenandoah: Don't resurrect objects during evacuation on AS_NO_KEEPALIVE #1113
Conversation
|
Webrevs
|
So your theory is that someone calls Dependencies::DepStream::recorded_oop_at on an nmethod, after marking terminated, leaking out a dead object. For your theory to be true, you would have acquired a is_unloading() nmethod from somewhere, and called Dependencies::DepStream::recorded_oop_at on it. That immediately excludes e.g. all on-stack nmethods, all nmethods handed out through dependency contexts, and all nmethods handed out through only_alive_and_not_unloading CodeCache iterators, which is almost all of them. There are very few code cache iterations that expose is_unloading() nmethods, and what they have in common is that they are not poking around at oops. So I suppose I really don't understand what path you could possibly track this to happen, where you have an is_unloading() nmethod, and start poking around at its oops. Would you mind elaborating a bit more, from what context you think Dependencies::DepStream::recorded_oop_at() is being called on an is_unloading() nmethod? |
I am digging a little deeper. Planting an assert in relevant place barrier shows that we are exposing an unmarked object in this code path in nmethod::flush_dependencies():
In other words, it gets the unmarked call_site, then does access it afterwards. TBH, I am not totally sure that we aren't doing something wrong somewhere. ? |
This is all very intentional. This is called by the class unloading code, and is the very reason why it is strong. The class unloading code, concurrent or STW doesn't matter, peeks and walks a chain of dead objects to find and clean up native dependency contexts of CallSites. This means that we have to expose the dead objects to the class unloading code, so that we can walk the dead objects, find the dependency context, and clean it. So if we would perform your requested change, we would probably crash here with SIGSEGV, trying to dereference NULL, as it is expected that the call_site is not NULL. |
Right. No, it doesn't crash there. It only ever fails our verification. |
Okay.
That is the right interpretation.
Sounds like the old MutableCallSite escaped the snapshot at the beginning somehow. Maybe it is related to your new reference processor? |
Hmm I think it is more subtle than that. |
Any non-raw access load should expose only the to-space object. But that is completely orthogonal to whether it should be marked or not. And obviously, having completely different semantics for accesses depending on whether the access is performed on a Java thread or not, is not a good idea. Sounds like the barrier code needs fixing. |
Why do these native-accesses be not-raw anyway? The trouble in Shenandoah is that if an object is unreachable, and we evacuate it, then it becomes 'live' at least in the sense that it is now beyond top-at-mark-start (TAMS), e.g. implicitely-live. I believe this is what the verifier is ultimately complaining about. This is not a marking issue. I have a fix that I'll push shortly after I did more testing. It does return the naked (from-space) oop when encountering AS_NO_KEEPALIVE on an unmarked object. That seems to fix this particular testcase as seems to be in the spirit of AS_NO_KEEPALIVE, assuming that AS_NO_KEEPALIVE access does not do anything nasty like storing the oop elsewhere. |
…ng evacuation on AS_NO_KEEPALIVE
// Prevent resurrection of unreachable objects that are visited during | ||
// concurrent class-unloading. | ||
if (HasDecorator<decorators, AS_NO_KEEPALIVE>::value && obj != NULL && | ||
_heap->has_forwarded_objects() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has_forwarded_objects()
check here? It seems to me that without forwarded objects, the load_reference_barrier
below would return the same obj
anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly because we check if the object is (un-)marked and that would be unreliable outside of has_forwarded. Also, secondarily, because this block is intended to prevent resurrection-by-evacuation, which would not happen outside of has_forwarded. We could, infact, narrow it to is_evacuation_in_progress() I guess.
@rkennke This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 4 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the
|
/integrate |
@rkennke Since your change was applied there have been 4 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit 3c3469b. |
In Shenandoah-testing, we noticed that compiler/jsr292/CallSiteDepContextTest.java fails with the following error:
In other words, a reachable (marked) MutableCallSite references an unreachable DirectMethodHandle. That reference would subsequently become dangling and lead to crashes if accessed.
I narrowed it down to the access in Dependencies::DepStream::recorded_oop_at(int i) which is done as 'strong', which means that it would return the reference even if it is unreachable, e.g. during concurrent class-unloading. This resurrection of the unreachable DMH is potentially fatal: eventually the reference will become dangling (after GC) and lead to crashes when accessed. I believe that access should be 'phantom' instead which causes GCs like Shenandoah and ZGC to return NULL when encountering unreachable objects.
(Notice that the bug only manifested after JDK-8255691, we accidentally applied the resurrection-preventing weak-LRB on strong access too)
Testing:
Progress
Testing
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/1113/head:pull/1113
$ git checkout pull/1113