Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8256811: Delayed/missed jdwp class unloading events #9168

Conversation

zhengyu123
Copy link
Contributor

@zhengyu123 zhengyu123 commented Jun 15, 2022

Currently, jdi only check and process class unloading event when it detects a new GC cycle.

After JDK-8212879, posting class events can overlap with GC finish event, that results, sometimes, it only captures partial or even empty unloaded class list. The pending list usually can be flushed out at next GC cycle. But for the classes unloaded during the last GC cycle, the class unloading events may lost forever.

This patch checks and processes class unloading events unconditionally, suggested by @kbarrett, the last pending unloaded class list can be flushed by other events, such as VM_DEATH.

It also performs commonRef_compact() only when there are classes unloaded.

New test failed about 20% without patch, none with patch.

Update
There are significant changes from early patch.

The new approach:
No longer removing dead objects and post events on VM thread. I believe it was implemented this way to workaround the following issues:

  • JDI event handler uses JVMTI raw monitor, it requires thread in _in_native state
  • The thread can not hold lock, which is needed to protect JvmtiTagMap while walking, when transition to _in_native state

The new solution breaks up into two steps:

  • Collect all dead object tags with lock
  • Transition to _in_native state and post object free events in one batch

This way, JDI event handler can process object free events upon arrivals without delay.

Update 2
There is a comment for JvmtiTagMap::check_hashmap() that states ObjectFree events are posted before heap walks.

// This checks for posting and rehashing before operations that
// this tagmap table.  The calls from a JavaThread only rehash, posting is
// only done before heap walks.
void JvmtiTagMap::check_hashmap(bool post_events) {

Now, the events are actually posted after heap walks, but I don't think it makes any material material difference.
Even the events are posted earlier in old code, but they are only processed after next GC cycle.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8256811: Delayed/missed jdwp class unloading events

Reviewers

Contributors

  • Chris Plummer <cjplummer@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9168/head:pull/9168
$ git checkout pull/9168

Update a local copy of the PR:
$ git checkout pull/9168
$ git pull https://git.openjdk.org/jdk pull/9168/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9168

View PR using the GUI difftool:
$ git pr show -t 9168

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9168.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 15, 2022

👋 Welcome back zgu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 15, 2022
@openjdk
Copy link

openjdk bot commented Jun 15, 2022

@zhengyu123 The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Jun 15, 2022
@mlbridge
Copy link

mlbridge bot commented Jun 15, 2022

Copy link
Contributor

@plummercj plummercj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue I have with this fix is that is doesn't fix the "delay" issue. It helps it, and ensures the CLASS_UNLOAD events are sent before the debuggee exits, but it doesn't guarantee that they are sent in a time manner.

Your test is somewhat contrived in that the debuggee only runs for as long as it takes to look up all the test classes. Then it exits, generating a VM_DEATH, and guaranteeing that by that point all CLASS_UNLOAD events have been sent (and without delay). However, we also have the issue of a JDK-8257705, which is a test that times out waiting for a CLASS_UNLOAD event, because there isn't another event to trigger sending it. Since the debuggee does not exit until after the test is done (and has timed out), the VM_DEATH comes too late.

JDK-8282459 offers other suggestions for the fix, albeit more complex than your. Please consider the one mentioned in the first comment, which is basically adding a watchdog thread.

@zhengyu123
Copy link
Contributor Author

One issue I have with this fix is that is doesn't fix the "delay" issue. It helps it, and ensures the CLASS_UNLOAD events are sent before the debuggee exits, but it doesn't guarantee that they are sent in a time manner.

Yes, it improves timing in some cases, but not for all.

Your test is somewhat contrived in that the debuggee only runs for as long as it takes to look up all the test classes. Then it exits, generating a VM_DEATH, and guaranteeing that by that point all CLASS_UNLOAD events have been sent (and without delay). However, we also have the issue of a JDK-8257705, which is a test that times out waiting for a CLASS_UNLOAD event, because there isn't another event to trigger sending it. Since the debuggee does not exit until after the test is done (and has timed out), the VM_DEATH comes too late.

JDK-8256811 offers other suggestions for the fix, albeit more complex than your. Please consider the one mentioned in the first comment, which is basically adding a watchdog thread.

I could not seem to find the other solution in JDK-8256811 comments.

@plummercj
Copy link
Contributor

I could not seem to find the other solution in JDK-8256811 comments.

Sorry, I meant JDK-8282459. I fixed my comment above.

@openjdk
Copy link

openjdk bot commented Jun 23, 2022

⚠️ @zhengyu123 This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@zhengyu123
Copy link
Contributor Author

/label add hotspot-runtime

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Jun 23, 2022
@sspitsyn
Copy link
Contributor

Thanks.
I see my comment as pending but have no idea why it was not posted.
Let me just repeat it again.

src/hotspot/share/prims/jvmtiTagMap.cpp#diff
@@ -1330,7 +1352,7 @@ jvmtiError JvmtiTagMap::get_objects_with_tags(const jlong* tags,
    entry_iterate(&collector);
  }
  if (collector.some_dead_found() && env()->is_enabled(JVMTI_EVENT_OBJECT_FREE)) {
 -   post_dead_objects_on_vm_thread();
 +  remove_and_post_dead_objects();

The function remove_and_post_dead_objects() is also called from the flush_object_free_events(). This function is using some sync protocol. It is waiting under protection of the monitor lock() for _posting_events to become false and then sets _posting_events to true. It is also notifying the other such waiting threads after clearing the _posting_events. Should the function get_objects_with_tags() follow the same protocol? Do I understand it right that the goal was to serialize the remove_and_post_dead_objects() calls? If not then why do we need this sync protocol in function flush_object_free_events()?

@plummercj
Copy link
Contributor

I see my comment as pending but have no idea why it was not posted.

Your comment is pending because you clicked on "Start a review" instead of "Add single comment", and then never finished the review by clicking on "Review changes".

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 20, 2022

I see my comment as pending but have no idea why it was not posted.

Your comment is pending because you clicked on "Start a review" instead of "Add single comment", and then never finished the review by clicking on "Review changes".

I never click "Start a review". I tried to inline my comment related to the source fragment.
But you can be right. As I see a mark "started a review".

@zhengyu123
Copy link
Contributor Author

zhengyu123 commented Jul 20, 2022

get_objects_with_tags

@sspitsyn Good catch!

I believe it should be synchronized if JvmtiEnv::GetObjectsWithTags() call can overlap disabling/enabling ObjectFree event.

But it raises a question: why it needs remove_and_post_dead_objects() here? JvmtiEnv::GetObjectsWithTags() returns live objects, posting ObjectFree event does not have immediate effect due to queuing in old implementation. Could we have ServiceThread to deal with it instead?

Thanks.

@coleenp

if (collector.some_dead_found() && env()->is_enabled(JVMTI_EVENT_OBJECT_FREE)) {
   post_dead_objects_on_vm_thread();

came in from JDK-8212879, could you comment why it is needed?

Thanks!

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 21, 2022

Actually I'm going to amend my previous comment about concerns with synchronization of the test. I think once the SetEventNotificationMode() call returns (after passing JVMTI_DISABLE), the caller should be guaranteed that any call to the event handler that was in progress has completed. So the caller should be able to safely free the raw monitor that is used in the callback. I'm not sure if JVMTI makes this guarantee however.

The following RN is related to this question:
https://bugs.openjdk.org/browse/JDK-8256558
It explicitly says: SetNotificationMode can be used to explicitly flush ObjectFree events, if needed.
This RN is related to the bug pointed out by @zhengyu123 above: JDK-8212879
However, I feel that we still need to test it heavily to make sure there are no surprises.

@plummercj
Copy link
Contributor

The following RN is related to this question:
https://bugs.openjdk.org/browse/JDK-8256558
It explicitly says: SetNotificationMode can be used to explicitly flush ObjectFree events, if needed.

That wasn't quite what I was getting at. My comment / question was much more general. If you call SetEventNotificationMode() to disable a specific event, once it returns are you guaranteed that no more events will be delivered. This means that if JVMTI was in the middle of calling the event handler at the same time that SetEventNotificationMode() is called, then before returning, SetEventNotificationMode() will block until the event handler returns.

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 21, 2022

The following RN is related to this question:
https://bugs.openjdk.org/browse/JDK-8256558
It explicitly says: SetNotificationMode can be used to explicitly flush ObjectFree events, if needed.

That wasn't quite what I was getting at. My comment / question was much more general. If you call SetEventNotificationMode() to disable a specific event, once it returns are you guaranteed that no more events will be delivered. This means that if JVMTI was in the middle of calling the event handler at the same time that SetEventNotificationMode() is called, then before returning, SetEventNotificationMode() will block until the event handler returns.

There is no such general guarantee with calls to the SetEventNotificationMode. It is not in the JVM TI spec. My view is that it is racy by the JVM TI design. I spent some time in the past to fix several problems related to the JDWP agent shutdown as the WRONG_PHASE error code was appearing in the middle of an event handler execution. The WRONG_PHASE problem is similar kind of sync problem. JVM TI does not guarantee that all event callbacks are completed before the JVM TI phase is switched to JVMTI_PHASE_DEAD.

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 21, 2022

get_objects_with_tags

@sspitsyn Good catch!

I believe it should be synchronized if JvmtiEnv::GetObjectsWithTags() call can overlap disable/enable ObjectFree event.

But it raises a question: why it needs remove_and_post_dead_objects() here? JvmtiEnv::GetObjectsWithTags() returns live objects, posting ObjectFree event does not have immediate effect due to queuing in old implementation. Could we have ServiceThread to deal with it instead?

Thanks.

@coleenp

if (collector.some_dead_found() && env()->is_enabled(JVMTI_EVENT_OBJECT_FREE)) {
   post_dead_objects_on_vm_thread();

came in from JDK-8212879, could you comment why it is needed?

Thanks!

This is a reasonable question. I do not remember the exact motivation to add this call.
There is no CSR filed for this behavioral change. The GetObjectsWithTags spec says nothing about it.
It could be added because some of the tests started failing.
Another quess is that the ObjectFree events can be lost otherwise (it is not easy to prove if it is possible).
Let's wait for Coleen's answer.

@coleenp
Copy link
Contributor

coleenp commented Jul 21, 2022

Hi, it was added because some tests were failing. With the concurrent posting, we were missing events and flushing them made the tests pass.

@zhengyu123
Copy link
Contributor Author

zhengyu123 commented Jul 21, 2022

Hi, it was added because some tests were failing. With the concurrent posting, we were missing events and flushing them made the tests pass.

Thanks, @coleenp
Can you recall what were those tests?

I removed the block and passed vmTestbase_nsk_jdi, vmTestbase_nsk_jdwp and hotspot_serviceability on Linux x86_64, I know Oracle has more tests.

@sspitsyn
Copy link
Contributor

Hi, it was added because some tests were failing. With the concurrent posting, we were missing events and flushing them made the tests pass.

Thanks, @coleenp Can you recall what were those tests?

I removed the block and passed vmTestbase_nsk_jdi, vmTestbase_nsk_jdwp and hotspot_serviceability on Linux x86_64, I know Oracle has more tests.

I'd suggest to run additionally: vmTestbase_nsk_jvmti, vmTestbase_nsk_jdb and jdk_jdi (jdk/com/sun/jdi).

@coleenp
Copy link
Contributor

coleenp commented Jul 21, 2022

@zhengyu123 try the tests with stress shendandoah flags if you have them. I think I had a lot of failures with ZGC and ZCG ZFragmentationLimit=0.1 or something like that.

@plummercj
Copy link
Contributor

@zhengyu123 try the tests with stress shendandoah flags if you have them. I think I had a lot of failures with ZGC and ZCG ZFragmentationLimit=0.1 or something like that.

JDK-8255987?

@plummercj
Copy link
Contributor

plummercj commented Jul 21, 2022

That wasn't quite what I was getting at. My comment / question was much more general. If you call SetEventNotificationMode() to disable a specific event, once it returns are you guaranteed that no more events will be delivered. This means that if JVMTI was in the middle of calling the event handler at the same time that SetEventNotificationMode() is called, then before returning, SetEventNotificationMode() will block until the event handler returns.

There is no such general guarantee with calls to the SetEventNotificationMode. It is not in the JVM TI spec. My view is that it is racy by the JVM TI design. I spent some time in the past to fix several problems related to the JDWP agent shutdown as the WRONG_PHASE error code was appearing in the middle of an event handler execution. The WRONG_PHASE problem is similar kind of sync problem. JVM TI does not guarantee that all event callbacks are completed before the JVM TI phase is switched to JVMTI_PHASE_DEAD.

This becomes a problem when the test tries to free a raw monitor on exit. It first disable events, then it frees the raw monitor that the even handler synchronizes on. However, it's possible that an event can be in flight when doing this (an event that was triggered before disabling events, but not fully delivered yet). So now you have a chicken-and-egg problem because you need a 2nd raw monitor to synchronize around freeing the 1st raw monitor. So the end result is you can't always safely cleanup on test exit the way we currently are. I think we are getting away with it because the set of events we expect is fixed or predictable. So once the test decides it is done, it won't get more events, even if they are not disabled. Thus we can free the raw monitor after disabling events without worry of some spurious event causing an untimely entry into the event handler. However, I thought there was one test you fixed where we had this exact problem, and you fixed it by not freeing the raw monitor on exit. ThreadStart and ThreadDeath events come to mind, since these are now notoriously unpredictable.

@zhengyu123
Copy link
Contributor Author

zhengyu123 commented Jul 21, 2022

@zhengyu123 try the tests with stress shendandoah flags if you have them. I think I had a lot of failures with ZGC and ZCG ZFragmentationLimit=0.1 or something like that.

JDK-8255987?

Tested on Linux x86_64 (fastdebug)
Test: vmTestbase_nsk_jdi, vmTestbase_nsk_jdwp, vmTestbase_nsk_jvmti vmTestbase_nsk_jdb' and jdk_jdi`

  • default, no failure
  • ShenandoahGC + aggressive, one timeout [1]
  • ZGC with -Xmx2g -XX:ZCollectionInterval=0.01, one timeout
  • ZGC with -XX:ZFragmentationLimit=0.1, no failure

[1] vmTestbase/nsk/jvmti/scenarios/allocation/AP04/ap04t001/TestDescription.java

I pushed update that removed the block, could you please run it through Oracle's internal tests?

Thanks!

@sspitsyn
Copy link
Contributor

This becomes a problem when the test tries to free a raw monitor on exit. It first disable events, then it frees the raw monitor that the even handler synchronizes on. However, it's possible that an event can be in flight when doing this (an event that was triggered before disabling events, but not fully delivered yet). So now you have a chicken-and-egg problem because you need a 2nd raw monitor to synchronize around freeing the 1st raw monitor. So the end result is you can't always safely cleanup on test exit the way we currently are. I think we are getting away with it because the set of events we expect is fixed or predictable. So once the test decides it is done, it won't get more events, even if they are not disabled. Thus we can free the raw monitor after disabling events without worry of some spurious event causing an untimely entry into the event handler. However, I thought there was one test you fixed where we had this exact problem, and you fixed it by not freeing the raw monitor on exit. ThreadStart and ThreadDeath events come to mind, since these are now notoriously unpredictable.

I fixed it this way for simplicity. However, it is still possible to correctly sync if necessary.

@sspitsyn
Copy link
Contributor

I pushed update that removed the block, could you please run it through Oracle's internal tests?

I think, you can run all the needed tests by yourself.
But I'll also submit mach5 jobs to be safe.

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 22, 2022

My mach5 debug test results looks good. Sorry for initial posting closed mach5 links.

Tested without extra flags: nsk.jvmti, nsk.jdi, nsk.jdb, jdk_jdi

With flags -XX:+UseZGC -XX:ZFragmentationLimit=0.1: nsk.jvmti, nsk.jdi
With flags -XX:+UseZGC -XX:ZCollectionInterval=0.1: nsk.jvmti, nsk.jdi
Some test runs on Window got HARNESS ERROR.
I'm not sure how to reproduce the issues Coleen observed when fixed the JDK-8255987 .

Approved the fix.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I believe you. Thanks for all the testing.

@zhengyu123
Copy link
Contributor Author

Thanks, @plummercj @sspitsyn and @coleenp

@zhengyu123
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Jul 22, 2022

Going to push as commit 54854d9.
Since your change was applied there have been 181 commits pushed to the master branch:

  • 75c1e99: 8289572: InputStream wrapping with BufferedInputStream is redundant in HttpTimestamper
  • 7841e5c: 8290464: Optimize ResourceArea zapping on ResourceMark release
  • e9f97b2: 8290489: Initial nroff manpage generation for JDK 20
  • 59d85ba: 8290687: serviceability/sa/TestClassDump.java could leave files owned by root on macOS
  • 66f59c2: 8290731: Clean up CDS handling of LambdaForm Species classes
  • 0dda3c1: 8289275: Remove incorrect __declspec(dllimport) attributes from pointers in jdk.crypto.cryptoki
  • 620c8a0: 8289643: File descriptor leak with ProcessBuilder.startPipeline
  • 7ec0132: 8286844: com/sun/jdi/RedefineCrossEvent.java failed with 1 threads completed while VM suspended
  • 80bd8c3: 8290504: Close streams returned by ModuleReader::list
  • 15f4b30: 8290115: ArrayCopyObject JMH has wrong package
  • ... and 171 more: https://git.openjdk.org/jdk/compare/0c37008917789e7b631b5c18e6f54454b1bfe038...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 22, 2022
@openjdk openjdk bot closed this Jul 22, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 22, 2022
@openjdk
Copy link

openjdk bot commented Jul 22, 2022

@zhengyu123 Pushed as commit 54854d9.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@sspitsyn
Copy link
Contributor

sspitsyn commented Jul 22, 2022

Ok. I believe you. Thanks for all the testing.

I'm afraid that the potential problem can be highly intermittent and did not show up in my testing.
To catch it we need to know the exact test and run it many times.
But if we discover such an issue then we can file a bug and address it separately.
Also, it can be that the change in SetEventNotificationMode (see https://bugs.openjdk.org/browse/JDK-8256558) is a reason why the issues are not reproducible anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

4 participants