Skip to content

8274687: JDWP can deadlock VM if Java thread waits in blockOnDebuggerSuspend #5805

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

reinrich
Copy link
Member

@reinrich reinrich commented Oct 4, 2021

This change fixes the deadlock described in the JBS-bug by:

  • Releasing handlerLock before waiting on threadLock in blockOnDebuggerSuspend()

  • Notifying on threadLock after resuming all threads in threadControl_reset()

The PR has 3 commits:

The first commit delays the cleanup actions after leaving the loop in
debugLoop_run() because of a Dispose command. The delay allows to reproduce the deadlock
running the dispose003 test with the command

make run-test TEST=test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003

The second commit contains the fix described above. With it the deadlock
cannot be reproduced anymore.

The third commit removes the diagnostic code introduced with the first commit again.

The fix passed our nightly regression testing: JCK and JTREG, also in Xcomp mode with fastdebug and release builds on all platforms.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8274687: JDWP can deadlock VM if Java thread waits in blockOnDebuggerSuspend

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5805/head:pull/5805
$ git checkout pull/5805

Update a local copy of the PR:
$ git checkout pull/5805
$ git pull https://git.openjdk.java.net/jdk pull/5805/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5805

View PR using the GUI difftool:
$ git pr show -t 5805

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5805.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 4, 2021

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 4, 2021

@reinrich The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Oct 4, 2021
@reinrich reinrich marked this pull request as ready for review October 4, 2021 14:20
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 4, 2021
@mlbridge
Copy link

mlbridge bot commented Oct 4, 2021

Webrevs

@@ -746,7 +746,16 @@ blockOnDebuggerSuspend(jthread thread)
node = findThread(NULL, thread);
if (node != NULL) {
while (node && node->suspendCount > 0) {
/* Resume requires the event handlerLock so we have to release it */
eventHandler_unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure this is always safe. It might be fine in the context of resetting the connection, but not during normal debug agent operations. It allows for another event to be processed when the lock is suppose to keep event processing serialized. What happens for example if we hit the Thread.resume() breakpoint again on a different thread?

It might be best to limit doing this only when threadControl_reset() is currently executing (you could set a flag there), although it seems more like a band aid than a proper fix. I could imagine there might still be scenarios were releasing the lock during reset might be problematic, although probably extremely unlikely to ever be noticed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure this is always safe. It might be fine in the context of
resetting the connection, but not during normal debug agent operations. It
allows for another event to be processed when the lock is suppose to keep
event processing serialized. What happens for example if we hit the
Thread.resume() breakpoint again on a different thread?

I agree that this is probably not safe. E.g. when releasing handlerLock the
handler chain for EI_BREAKPOINT could be modified which is being iterated in
the caller function event_callback().

node = getHandlerChain(evinfo->ei)->first;
classname = getClassname(evinfo->clazz);
while (node != NULL) {
/* save next so handlers can remove themselves */
HandlerNode *next = NEXT(node);
jboolean shouldDelete;
if (eventFilterRestricted_passesFilter(env, classname,
evinfo, node,
&shouldDelete)) {
HandlerFunction func;
func = HANDLER_FUNCTION(node);
if ( func == NULL ) {
EXIT_ERROR(AGENT_ERROR_INTERNAL,"handler function NULL");
}
(*func)(env, evinfo, node, eventBag);
}
if (shouldDelete) {
/* We can safely free the node now that we are done
* using it.
*/
(void)freeHandler(node);
}
node = next;
}

I should have checked before.

It might be best to limit doing this only when threadControl_reset() is
currently executing (you could set a flag there), although it seems more like
a band aid than a proper fix. I could imagine there might still be scenarios
were releasing the lock during reset might be problematic, although probably
extremely unlikely to ever be noticed.

I looked at threadControl_resumeThread(). It appears to me that resuming a
thread is not possible while some thread is waiting in blockOnDebuggerSuspend()
because threadControl_resumeThread() locks handlerLock.

eventHandler_lock(); /* for proper lock order */

Am I missing something?

Maybe the code in handleAppResumeBreakpoint() could moved to doPendingTasks()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding threadControl_resumeThread() it does appear that it would block, as would threadControl_resumeAll(), which seems problematic in that blockOnDebuggerSuspend() won't exit until the suspendCount == 0. So it's unclear to me how this is suppose to work if a resume() can't be done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems if we had a test case where the debugger did a ThreadReference.suspend(), then the debuggee did a Thread.resume() on the same thread, and then debugger did a ThreadReference.resume(), it would block, and with no way to unblock it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can implement such a test. Also I have I prototype that defers the actions of handleAppResumeBreakpoint() until doPendingTasks() is running. This seems to work in the various jdi/jdwp tests which likely do not cover the functionality very well though.

@reinrich
Copy link
Member Author

reinrich commented Oct 6, 2021

I've implemented a test that shows the jdwp agent deadlocks also trying to resume one/all threads if a thread is waiting in blockOnDebuggerSuspend. I will open a new pr that will include the test and a new attempt to fix the issues.

@reinrich reinrich closed this Oct 6, 2021
@reinrich reinrich deleted the 8274687_jdwp_deadlock_in_blockOnDebuggerSuspend branch October 20, 2022 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfr Pull request is ready for review serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

2 participants