Skip to content

8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent#20088

Closed
plummercj wants to merge 5 commits intoopenjdk:masterfrom
plummercj:8072701_resume01
Closed

8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent#20088
plummercj wants to merge 5 commits intoopenjdk:masterfrom
plummercj:8072701_resume01

Conversation

@plummercj
Copy link
Contributor

@plummercj plummercj commented Jul 9, 2024

The test hits a breakpoint on thread2 with SUSPEND_EVENT_THREAD policy, so only thread2 is suspended. It then does a vm.suspend(), which suspends all threads and bumps the suspendCount of thread2 up to 2. It then does an eventSet.resume(), which decrements the thread2 suspendCount to 1, so now all threads are suspended with a suspendCount of 1. thread2 is then resumed and we expect to hit the next thread2 breakpoint. The problem is that thread2 can't hit the breakpoint until the main thread has proceeded far enough, and if the vm.suspend() that suspended the main thread happens too quickly, it won't have proceeded far enough, so thread2 never hits the breakpoint.

Essentially we need a vm.resume() to allow the main thread to run, but we need to do it in a way that does nullify part of what the test is testing for. So in order to allow the vm.resume() but not subvert the intent of the test, we first do a thread2.suspend() so the vm.resume() won't resume thread2.

Testing in progress: tier1 and tier5 svc testing


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20088/head:pull/20088
$ git checkout pull/20088

Update a local copy of the PR:
$ git checkout pull/20088
$ git pull https://git.openjdk.org/jdk.git pull/20088/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20088

View PR using the GUI difftool:
$ git pr show -t 20088

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20088.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 9, 2024

👋 Welcome back cjplummer! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 9, 2024

@plummercj This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent

Reviewed-by: amenkov, kevinw, sspitsyn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 205 new commits pushed to the master branch:

  • 388fcf0: 8336349: Fix more simple -Wzero-as-null-pointer-constant warnings in C2 code
  • ab27aca: 8336297: C2: Fix -Wzero-as-null-pointer-constant warnings in derived Node ctors
  • 9dfcd75: 8334121: Anonymous class capturing two enclosing instances fails to compile
  • 000de30: 8335269: [Graal] occasional timeout in java/lang/StringBuffer/TestSynchronization.java with loom
  • 4635531: 8335159: Move method reference to lambda desugaring before Lower
  • a253e0f: 8335642: Hide Transform implementation for Class-File API
  • 2b0adfc: 8335817: javac AssertionError addLocalVar checkNull
  • a96de6d: 8336256: memcpy short value to int local is incorrect in VtableStubs::unsafe_hash
  • 3f2636d: 8335820: java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java fails due to IllegalArgumentException: hash must be nonzero
  • a9f5e76: 8335905: CompoundElement API cleanup
  • ... and 195 more: https://git.openjdk.org/jdk/compare/b5d589623c174757e946011495f771718318f1cc...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title 8072701 8072701: resume001 failed due to ERROR: timeout for waiting for a BreakpintEvent Jul 9, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 9, 2024
@openjdk
Copy link

openjdk bot commented Jul 9, 2024

@plummercj The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Jul 9, 2024
@mlbridge
Copy link

mlbridge bot commented Jul 9, 2024

Webrevs

Copy link
Contributor

@kevinjwalls kevinjwalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read through and counted suspend/resumes on my fingers, seems good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 9, 2024
break label1;
}

// We need to resume the main thread because thread2 might be blocked on it,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look correct to me.
This is the last test scenario - thread2.resume should resumes the thread while vm is suspended.
thread2 should not be blocked on main thread.
Looking at the debuggee I suppose the blocking is possible during logging. I'd suggest to update the debugee and remove any logging between breakpoints 2 and 3

Copy link
Contributor Author

@plummercj plummercj Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look correct to me.
This is the last test scenario - thread2.resume should resumes the thread while vm is suspended.
thread2 should not be blocked on main thread.
Looking at the debuggee I suppose the blocking is possible during logging. I'd suggest to update the debugee and remove any logging between breakpoints 2 and 3

It looks like the debuggee gets as far as the following:

    public void runt2() {
        log("method 'runt2' enter 1");
        i1++;
        i2--;
        log("method 'run2t' exit");
        return;
    }

It prints the first log and hits a breakpoint setup on the 2nd line. The debugger resumes thread2 after this, but we never see the 2nd log. Whenever we see this failure, the following logs from the mainThread are always delayed (by a lot):

debugee.stderr> **> mainThread: mainThread is out of: synchronized (lockingObject) {
debugee.stderr> **> mainThread: waiting for an instruction from the debugger ...

I think this delay is resulting in the the mainThread being in the middle of doing one of these logs when the vm.suspend() is done. This leaves mainThread suspended while holding a lock needed for doing logging (logging is just a simple System.err.prinln()). I'm trying to prove this by getting a debuggee thread dump when getting the 3rd Breakpoint event times out, but for some reason once I added this code I could no longer reproduce the problem (still trying though).

I don't like the idea of avoiding this issue by getting rid of certain problematic logging. It seems error prone. Someone could add some new logging in the future. I'll see if there is a better solution than the vm.resume(). Perhaps I could just resume mainThread. However, I think with virtual threads I/O can be dependent on other threads like an "unparker" thread.

Another solution might be to have the debugger and debuggee do an additional handshake so we can guarantee that mainThread is done with these two log statements. Currently when we get to the 2nd log statement, that just means the debuggee is waiting for a "quit" command from the debugger. We could at this point have the debuggee send a command to the debugger, and have the debugger wait for this command before it does the vm.suspend().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was finally able to reproduce the issue with the stack dumping support in place. mainThread is in the middle of printing the 1st of the two logs mention above. mainThead is suspended and is holding a println related lock, and thread2 is stuck on the 2nd log call in runt2 awaiting the same lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to add synchronization between the debugger and debuggee to fix this issue, but I don't really like the solution. It just adds more complexity and makes the test even harder to follow. What I'd like to do is just put a short sleep in before the vm.suspend(). Let me know if you think this is ok and I'll update the PR with the changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the confirming the reason of the timeout.

To be more clear about my point:
The test has 3 scenarios (see the test description):
ThreadReference.resume() resumes the thread suspended with:

    • with thread2.suspend()
    • at a breakpoint
    • with VirtualMachine.suspend()

So for 3rd scenario we should not call vm.resume() (as it converts 3rd scenario to 1st scenario).
The test can be fixed by different ways, to me remove logging between breakpoint2 and breakpoint3 is the simplest way.
Note that breakpoint2 is "runt2(), line 2" and breakpoint3 is "runt1(), line 7", there are 2 log statements. We can move breakpoint 3 to "runt2(), line 3" (I don't see much sense to have breakpoint 3 so far from breakpoint2 - we just need to ensure the thread was resumed )

Copy link
Contributor Author

@plummercj plummercj Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which logging should be removed? Alex suggest the logging between breakpoints 2 and 3, but even that is not enough. There is logging after breakpoint 3, and that happens before the vm.resume() is done. I'm not saying this can't be done right, but I think pruning out logging like this in order to fix the problem not only removes some valuable logging from the test, but still leaves us open to this type of issue.

I think the safer thing to do is to make sure mainThread is no longer logging (or will attempt to log) when the vmsuspend is done. This could be done by pruning some of its logging, but that has the same negatives as removing some thread2 logging. My sleep suggestion is by far the simplest. The "purist" fix would probably be the checkpoint approach (don't do the vm.suspend until mainThread has reached a stable point). That ended up getting a bit uglier than I had hoped, but I can finish up the work so you can have a look at it if you'd like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm unclear on the different threads involved here. IIUC the vm.suspend comes from the debugger, and the mainthread and thread-2 are both threads in the debuggee, being suspended at different times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. thread2 is suspended via breakpoint (multiple times). mainThread is suspended by the one place in the test that does a vm.suspend(), which is near the end of the test. This is the problematic suspend because sometimes it is done while mainThread is in the middle of a println. A bit later thread2 is resumed and ends up blocking on a println due to mainThread holding the needed lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the implementation so now it does a sync point after mainThread is done with printlns.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which logging should be removed? Alex suggest the logging between breakpoints 2 and 3, but even that is not enough. There is logging after breakpoint 3, and that happens before the vm.resume() is done.

After breakpoint 3 debugger calls thread2.resume() and vm.resume(), so there is no issue there (all threads are resumed).

@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 11, 2024
@alexmenkov
Copy link

I'm ok with the new solution.
Need to update copyright year in resume001a.java and fix whitespace issue in resume001.java

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 11, 2024
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 12, 2024
Comment on lines +381 to +382
vm.resume();
vm.resume(); // for case error when both VirtualMachine and the thread2 were suspended
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing but I don't understand the comment. Why would you need 2 vm.resume() here? If thread2 was suspended directly don't you need a thread2.resume()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First just to clarify a general JDI feature about thread suspending and resuming. You can undo a ThreadReference.suspend() or a thread suspended as the result of an event by dong a vm.resume(). This is documented in the JDI API spec, which talks about suspendCounts and how various APIs and event delivery affect them.

I was tempted to clean up these vm.resume() calls but opted not to. The point being made in the comment is that worse case thread2 was suspended by a breakpoint or thread2.suspend() and all threads were suspended by a vm.resume() (meaning thread2 could have a suspendCount of 2). Two vm.resumes() take are done to make sure thread2 gets resumed under this situation. However, one of the vm.resume calls could instead be a thread2.resume(). Doing two vm.resume() calls was probably just laziness by the original authors. It works though.

However, by my accounting at any failure point thread2 never has a suspendCount > 1, so really just one vm.resume() would be enough.

The original code did these two vm.resume() calls unconditionally, but they are not needed if there was no error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code had 2 vm.resume() - one on them to match vm.suspend() and 2nd one to allow debugee to continue on error.
Now we have 3 vm.resume() - one is to match vm.suspend() (line 377) and 2 conditional (on error).
Theoretically we can get an error when both vm and thread2 are suspended, so 2 vm.resume() looks reasonable.
Anyway resume() is a nop if the thread is not suspended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reaching the 2nd breakpoint, which suspends thread2, we do a vm.suspend(), which bumps the thread2 suspendCount to 2. However, we do a eventSet.resume() after this, lowering the suspendCount to 1, and there is no error bailout point while the suspendCount is 2. Thus only 1 vm.resume() is needed in the error handling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all this discussion about the number of vm.resume() calls that are needed
or not needed and the fact that one of those vm.resume() calls could be replaced
by a thread2.resume() call perfectly illustrates just how complicated this test is.

Thanks for going thru the effort to get rid of the sleep() call. I appreciate it.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think the explicit sync with the mainthread seems reasonable too.

@sspitsyn
Copy link
Contributor

I agree with Chris that this test is over-complicated.
The introduction of the explicit sync makes it easier to understand.
I've got also confused by the two pre-existed sub-sequential vm.resume().

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@dcubed-ojdk
Copy link
Member

Is there any chance that all the Breakpint typos can be fixed?

@plummercj
Copy link
Contributor Author

Is there any chance that all the Breakpint typos can be fixed?

There was only one and I fixed it already.

* Print information about all threads in debuggee VM
*/
protected void printThreadsInfo(VirtualMachine vm) {
public void printThreadsInfo(VirtualMachine vm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void printThreadsInfo(VirtualMachine vm) {
public void printThreadsInfo(VirtualMachine vm) {

…nce it is now general purpose and not just used when killing debuggee.
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2024
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2024
@plummercj
Copy link
Contributor Author

Thanks for the reviews from Kevin, Alex, and Serguei and input from David, Dan, and Andrey.

/integrate

@openjdk
Copy link

openjdk bot commented Jul 15, 2024

Going to push as commit c8a95a7.
Since your change was applied there have been 205 commits pushed to the master branch:

  • 388fcf0: 8336349: Fix more simple -Wzero-as-null-pointer-constant warnings in C2 code
  • ab27aca: 8336297: C2: Fix -Wzero-as-null-pointer-constant warnings in derived Node ctors
  • 9dfcd75: 8334121: Anonymous class capturing two enclosing instances fails to compile
  • 000de30: 8335269: [Graal] occasional timeout in java/lang/StringBuffer/TestSynchronization.java with loom
  • 4635531: 8335159: Move method reference to lambda desugaring before Lower
  • a253e0f: 8335642: Hide Transform implementation for Class-File API
  • 2b0adfc: 8335817: javac AssertionError addLocalVar checkNull
  • a96de6d: 8336256: memcpy short value to int local is incorrect in VtableStubs::unsafe_hash
  • 3f2636d: 8335820: java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java fails due to IllegalArgumentException: hash must be nonzero
  • a9f5e76: 8335905: CompoundElement API cleanup
  • ... and 195 more: https://git.openjdk.org/jdk/compare/b5d589623c174757e946011495f771718318f1cc...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 15, 2024
@openjdk openjdk bot closed this Jul 15, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 15, 2024
@openjdk
Copy link

openjdk bot commented Jul 15, 2024

@plummercj Pushed as commit c8a95a7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

7 participants