8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent #20088
plummercj wants to merge 5 commits into openjdk:master
kevinjwalls
left a comment
Read through and counted suspend/resumes on my fingers, seems good.
    break label1;
}
// We need to resume the main thread because thread2 might be blocked on it,
This does not look correct to me.
This is the last test scenario: thread2.resume() should resume the thread while the VM is suspended.
thread2 should not be blocked on the main thread.
Looking at the debuggee, I suppose the blocking is possible during logging. I'd suggest updating the debuggee and removing any logging between breakpoints 2 and 3.
It looks like the debuggee gets as far as the following:
public void runt2() {
    log("method 'runt2' enter 1");
    i1++;
    i2--;
    log("method 'run2t' exit");
    return;
}
It prints the first log and hits a breakpoint set up on the 2nd line. The debugger resumes thread2 after this, but we never see the 2nd log. Whenever we see this failure, the following logs from the mainThread are always delayed (by a lot):
debugee.stderr> **> mainThread: mainThread is out of: synchronized (lockingObject) {
debugee.stderr> **> mainThread: waiting for an instruction from the debugger ...
I think this delay results in the mainThread being in the middle of doing one of these logs when the vm.suspend() is done. This leaves mainThread suspended while holding a lock needed for logging (logging is just a simple System.err.println()). I'm trying to prove this by getting a debuggee thread dump when waiting for the 3rd Breakpoint event times out, but for some reason once I added this code I could no longer reproduce the problem (still trying though).
I don't like the idea of avoiding this issue by getting rid of certain problematic logging. It seems error prone. Someone could add some new logging in the future. I'll see if there is a better solution than the vm.resume(). Perhaps I could just resume mainThread. However, I think with virtual threads I/O can be dependent on other threads like an "unparker" thread.
Another solution might be to have the debugger and debuggee do an additional handshake so we can guarantee that mainThread is done with these two log statements. Currently when we get to the 2nd log statement, that just means the debuggee is waiting for a "quit" command from the debugger. We could at this point have the debuggee send a command to the debugger, and have the debugger wait for this command before it does the vm.suspend().
I was finally able to reproduce the issue with the stack dumping support in place. mainThread is in the middle of printing the 1st of the two logs mentioned above. mainThread is suspended and is holding a println-related lock, and thread2 is stuck on the 2nd log call in runt2, waiting for the same lock.
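The failure mode confirmed here can be shown in miniature without JDI. The sketch below is only an illustration under stated assumptions: a ReentrantLock named printLock stands in for the lock that println takes internally, a latch stands in for the debugger suspending and later resuming mainThread, and the calling thread plays the role of thread2. None of these names come from the test itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class SuspendedLockHolderDemo {
    // Assumption: stands in for the lock taken inside System.err.println().
    static final ReentrantLock printLock = new ReentrantLock();

    /** Returns { acquired while holder paused, acquired after holder resumed }. */
    static boolean[] run() {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch resumeHolder = new CountDownLatch(1);

        Thread mainThread = new Thread(() -> {
            printLock.lock();
            try {
                holderHasLock.countDown();
                resumeHolder.await(); // models being suspended mid-println
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                printLock.unlock();
            }
        }, "mainThread");
        mainThread.start();

        try {
            holderHasLock.await();
            // The caller plays thread2 attempting its second log call.
            boolean whilePaused = printLock.tryLock(200, TimeUnit.MILLISECONDS);
            if (whilePaused) printLock.unlock();

            resumeHolder.countDown(); // models resuming mainThread
            mainThread.join();

            boolean afterResume = printLock.tryLock(200, TimeUnit.MILLISECONDS);
            if (afterResume) printLock.unlock();
            return new boolean[] { whilePaused, afterResume };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("acquired while mainThread paused:  " + r[0]);
        System.out.println("acquired after mainThread resumed: " + r[1]);
    }
}
```

In this model the first attempt times out and the second succeeds, which mirrors thread2 never reaching its next breakpoint until mainThread is allowed to run again.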
I was able to add synchronization between the debugger and debuggee to fix this issue, but I don't really like the solution. It just adds more complexity and makes the test even harder to follow. What I'd like to do is just put a short sleep in before the vm.suspend(). Let me know if you think this is ok and I'll update the PR with the changes.
Thank you for confirming the reason for the timeout.
To be more clear about my point:
The test has 3 scenarios (see the test description):
ThreadReference.resume() resumes the thread suspended:
- with thread2.suspend()
- at a breakpoint
- with VirtualMachine.suspend()
So for the 3rd scenario we should not call vm.resume() (as it converts the 3rd scenario into the 1st scenario).
The test can be fixed in different ways; to me, removing the logging between breakpoint2 and breakpoint3 is the simplest.
Note that breakpoint2 is at "runt2(), line 2" and breakpoint3 is at "runt1(), line 7", and there are 2 log statements between them. We can move breakpoint 3 to "runt2(), line 3" (I don't see much sense in having breakpoint 3 so far from breakpoint2; we just need to ensure the thread was resumed).
Which logging should be removed? Alex suggests the logging between breakpoints 2 and 3, but even that is not enough. There is logging after breakpoint 3, and that happens before the vm.resume() is done. I'm not saying this can't be done right, but I think pruning out logging like this in order to fix the problem not only removes some valuable logging from the test, but still leaves us open to this type of issue.
I think the safer thing to do is to make sure mainThread is no longer logging (and will not attempt to log) when the vm.suspend() is done. This could be done by pruning some of its logging, but that has the same negatives as removing some thread2 logging. My sleep suggestion is by far the simplest. The "purist" fix would probably be the checkpoint approach (don't do the vm.suspend() until mainThread has reached a stable point). That ended up getting a bit uglier than I had hoped, but I can finish up the work so you can have a look at it if you'd like.
Sorry I'm unclear on the different threads involved here. IIUC the vm.suspend comes from the debugger, and the mainthread and thread-2 are both threads in the debuggee, being suspended at different times?
Yes. thread2 is suspended via breakpoint (multiple times). mainThread is suspended by the one place in the test that does a vm.suspend(), which is near the end of the test. This is the problematic suspend because sometimes it is done while mainThread is in the middle of a println. A bit later thread2 is resumed and ends up blocking on a println due to mainThread holding the needed lock.
I've updated the implementation so now it does a sync point after mainThread is done with printlns.
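The idea of a sync point can be sketched as follows. This is not the code from the PR: the BlockingQueue is a hypothetical stand-in for the debugger/debuggee command channel, "logging_done" is an invented command name, and the events list only records the order of steps in place of real printlns and a real vm.suspend().

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class SyncPointDemo {
    /** Returns the order in which the modeled steps happened. */
    static List<String> run() {
        List<String> events = new CopyOnWriteArrayList<>();
        // Hypothetical stand-in for the debugger/debuggee command channel.
        BlockingQueue<String> pipe = new LinkedBlockingQueue<>();

        Thread debuggeeMain = new Thread(() -> {
            events.add("log 1");      // models the first println
            events.add("log 2");      // models the second println
            pipe.add("logging_done"); // hypothetical command: no more logging
        }, "mainThread");
        debuggeeMain.start();

        try {
            // Debugger side: wait for the checkpoint before suspending the VM.
            String cmd = pipe.poll(5, TimeUnit.SECONDS);
            if (!"logging_done".equals(cmd)) {
                throw new IllegalStateException("checkpoint not reached");
            }
            events.add("vm.suspend()"); // models the suspend; nothing is suspended here
            debuggeeMain.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the debugger side blocks until the checkpoint message arrives, the modeled suspend can never land between the two log calls, which is the property the sync point is meant to guarantee.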
Which logging should be removed? Alex suggests the logging between breakpoints 2 and 3, but even that is not enough. There is logging after breakpoint 3, and that happens before the vm.resume() is done.
After breakpoint 3 the debugger calls thread2.resume() and vm.resume(), so there is no issue there (all threads are resumed).
…read is no longer doing a println()
I'm ok with the new solution.
vm.resume();
vm.resume(); // for case error when both VirtualMachine and the thread2 were suspended
Pre-existing but I don't understand the comment. Why would you need 2 vm.resume() here? If thread2 was suspended directly don't you need a thread2.resume()?
First, just to clarify a general JDI feature about thread suspending and resuming: you can undo a ThreadReference.suspend(), or a suspension that resulted from an event, by doing a vm.resume(). This is documented in the JDI API spec, which talks about suspendCounts and how various APIs and event delivery affect them.
I was tempted to clean up these vm.resume() calls but opted not to. The point being made in the comment is that, worst case, thread2 was suspended by a breakpoint or thread2.suspend() and all threads were suspended by a vm.suspend() (meaning thread2 could have a suspendCount of 2). Two vm.resume() calls are done to make sure thread2 gets resumed in this situation. However, one of the vm.resume() calls could instead be a thread2.resume(). Doing two vm.resume() calls was probably just laziness by the original authors. It works though.
However, by my accounting at any failure point thread2 never has a suspendCount > 1, so really just one vm.resume() would be enough.
The original code did these two vm.resume() calls unconditionally, but they are not needed if there was no error.
The original code had 2 vm.resume() calls: one of them to match vm.suspend() and a 2nd one to allow the debuggee to continue on error.
Now we have 3 vm.resume() - one is to match vm.suspend() (line 377) and 2 conditional (on error).
Theoretically we can get an error when both vm and thread2 are suspended, so 2 vm.resume() looks reasonable.
Anyway, resume() is a no-op if the thread is not suspended.
After reaching the 2nd breakpoint, which suspends thread2, we do a vm.suspend(), which bumps the thread2 suspendCount to 2. However, we do an eventSet.resume() after this, lowering the suspendCount to 1, and there is no error bailout point while the suspendCount is 2. Thus only 1 vm.resume() is needed in the error handling.
I think all this discussion about the number of vm.resume() calls that are needed
or not needed and the fact that one of those vm.resume() calls could be replaced
by a thread2.resume() call perfectly illustrates just how complicated this test is.
Thanks for going thru the effort to get rid of the sleep() call. I appreciate it.
dholmes-ora
left a comment
FWIW I think the explicit sync with the mainthread seems reasonable too.
I agree with Chris that this test is over-complicated.
Is there any chance that all the
There was only one and I fixed it already.
     * Print information about all threads in debuggee VM
     */
-    protected void printThreadsInfo(VirtualMachine vm) {
+    public void printThreadsInfo(VirtualMachine vm) {
…nce it is now general purpose and not just used when killing debuggee.
Thanks for the reviews from Kevin, Alex, and Serguei and input from David, Dan, and Andrey.
/integrate
Going to push as commit c8a95a7. Your commit was automatically rebased without conflicts.
@plummercj Pushed as commit c8a95a7.
The test hits a breakpoint on thread2 with SUSPEND_EVENT_THREAD policy, so only thread2 is suspended. It then does a vm.suspend(), which suspends all threads and bumps the suspendCount of thread2 up to 2. It then does an eventSet.resume(), which decrements the thread2 suspendCount to 1, so now all threads are suspended with a suspendCount of 1. thread2 is then resumed and we expect to hit the next thread2 breakpoint. The problem is that thread2 can't hit the breakpoint until the main thread has proceeded far enough, and if the vm.suspend() that suspended the main thread happens too quickly, it won't have proceeded far enough, so thread2 never hits the breakpoint.
Essentially we need a vm.resume() to allow the main thread to run, but we need to do it in a way that does not nullify part of what the test is testing for. So in order to allow the vm.resume() but not subvert the intent of the test, we first do a thread2.suspend() so the vm.resume() won't resume thread2.
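The effect of the extra thread2.suspend() can be seen by tracking the two suspend counts by hand. The sketch below is a simplified model rather than JDI code: the class, the array layout, and the helper names are invented for illustration, and the starting state of 1 and 1 is taken from the description above.

```java
import java.util.Arrays;

class ExtraSuspendSketch {
    // Suspend counts indexed by thread: [MAIN] = mainThread, [THREAD2] = thread2.
    static final int MAIN = 0, THREAD2 = 1;

    // Models vm.resume(): decrement every count, never below 0.
    static int[] vmResume(int[] c) {
        return new int[] { Math.max(0, c[MAIN] - 1), Math.max(0, c[THREAD2] - 1) };
    }

    // Models a suspend() on one thread: increment only that thread's count.
    static int[] threadSuspend(int[] c, int thread) {
        int[] next = c.clone();
        next[thread]++;
        return next;
    }

    // vm.resume() alone: both threads run, so thread2.resume() is never exercised.
    static int[] naive() {
        return vmResume(new int[] { 1, 1 });
    }

    // thread2.suspend() first: mainThread runs, thread2 stays suspended.
    static int[] guarded() {
        return vmResume(threadSuspend(new int[] { 1, 1 }, THREAD2));
    }

    public static void main(String[] args) {
        System.out.println("naive:   " + Arrays.toString(naive()));
        System.out.println("guarded: " + Arrays.toString(guarded()));
    }
}
```

In the guarded sequence thread2 is left with a count of 1, so it only continues once thread2.resume() is called, which keeps the scenario under test intact.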
Testing in progress: tier1 and tier5 svc testing
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20088/head:pull/20088
$ git checkout pull/20088
Update a local copy of the PR:
$ git checkout pull/20088
$ git pull https://git.openjdk.org/jdk.git pull/20088/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20088
View PR using the GUI difftool:
$ git pr show -t 20088
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20088.diff