8305209: JDWP exit error AGENT_ERROR_INVALID_THREAD(203): missing entry in running thread table #13246

Closed
plummercj wants to merge 4 commits into openjdk:master from plummercj:8305209_missing_thread

@plummercj plummercj commented Mar 30, 2023

The real purpose of this PR is to add virtual thread support to ThreadMemoryLeakTest.java, but this exposed bugs in both the debug agent and in TestScaffold, so those are being fixed also (and the debug agent bug is the CR being used).

The debug agent bug is due to a race condition during VM exit. The VM is in the process of shutting down. The debug agent has already disabled JVMTI callbacks and has sent the VMDeathEvent. At this point there are also threads exiting that the debug agent knows about, but it will not get a ThreadEndEvent for them because the callbacks are disabled. Thus these threads remain in the debug agent's list of known threads, even though they have exited. The debugger receives the VMDeathEvent and does a VM.resume(). During the debug agent's handling of the VM.Resume command, it iterates over all known threads and needs to map each to its ThreadNode so it can be resumed, and this mapping requires accessing the JVMTI TLS for the thread. The problem is that some of the threads may have exited already, and therefore no longer have TLS. This results in the assert in the debug agent. This debug agent issue was already addressed for platform threads, but not for virtual threads, which is why we started seeing this issue when this test was modified. The fix is to replicate for virtual threads what is already done for platform threads.
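
As an illustration of the lookup fallback described above, here is a minimal Java model. It is only a sketch: the real debug agent is written in C, and the class name, the `tls` map, and the string thread keys are hypothetical stand-ins. Only the names `nonTlsSearch`, `runningThreads`, and `runningVThreads` come from the code context quoted in this review.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the lookup fallback; NOT the debug agent's actual C code. */
public class ThreadLookupSketch {
    /** Hypothetical stand-in for the agent's per-thread bookkeeping record. */
    static final class ThreadNode {
        final String threadName;
        ThreadNode(String threadName) { this.threadName = threadName; }
    }

    // Stand-in for JVMTI thread-local storage: the entry is gone once a thread exits.
    static final Map<String, ThreadNode> tls = new HashMap<>();
    // Stand-ins for the agent's lists of known platform and virtual threads.
    static final List<ThreadNode> runningThreads = new ArrayList<>();
    static final List<ThreadNode> runningVThreads = new ArrayList<>();

    /** Linear search of a list, used when the TLS lookup comes back empty. */
    static ThreadNode nonTlsSearch(List<ThreadNode> list, String threadName) {
        for (ThreadNode node : list) {
            if (node.threadName.equals(threadName)) {
                return node;
            }
        }
        return null;
    }

    /** Try TLS first; fall back to both lists so exited threads are still found. */
    static ThreadNode findThread(String threadName) {
        ThreadNode node = tls.get(threadName);
        if (node == null) {
            node = nonTlsSearch(runningThreads, threadName);
            if (node == null) {
                // Assumed shape of the fix: also search the virtual thread list.
                node = nonTlsSearch(runningVThreads, threadName);
            }
        }
        return node;
    }
}
```

In this model, a virtual thread that has exited without a ThreadEndEvent has no `tls` entry but is still in `runningVThreads`, so `findThread` returns its node instead of failing.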

The TestScaffold bug is that if the debuggee crashes/asserts, this is likely to go unnoticed, especially if it happens during VM exit (and the test essentially has already completed). Because of this TestScaffold bug, the debug agent bug above did not result in a test failure. After fixing TestScaffold to check the exitCode of the debuggee process, the test started to appropriately fail until the debug agent was fixed.
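
A minimal sketch of the kind of exit code check described above. The class and method names are hypothetical and this is not the actual TestScaffold code; only the exception message matches the snippet quoted later in this review.

```java
/** Sketch of failing a test when the debuggee process exits abnormally. */
public class DebuggeeExitCheck {
    /** Throws if the debuggee's exit value indicates a crash or assert. */
    static void checkExitValue(int res) {
        if (res != 0) {
            throw new RuntimeException("Non-zero debuggee exitValue: " + res);
        }
    }

    /** Blocks until the debuggee process has exited, then checks its exit value. */
    static void waitForDebuggee(Process p) {
        try {
            checkExitValue(p.waitFor());
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With a check like this, a debuggee that asserts during VM exit turns into a test failure even though the test logic itself has already completed.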

One other thing to point out is the OOME issue I started getting frequently when testing with virtual threads. Since virtual threads are created at a much higher rate than platform threads, their creation started to overwhelm the debugger (actually the JDI implementation). There is already a mechanism in place to do a VM.HoldEvents if JDI has queued up 10,000 events. The problem is that events are coming in so fast that even after doing the VM.HoldEvents, the number of queued events continues to go up for a while, and sometimes reaches 30,000 or more. This raises the peak memory usage of the test quite a bit. Since the test purposely uses a small heap so a memory leak is quickly and reliably detected, the large queue often results in an OOME. Because of this I make virtual threads sleep for 100ms instead of 50ms to slow down their creation, and this resolved the issue.

I tested by running all of test/jdk/com/sun/jdi 25 times on each platform with and without virtual thread testing enabled.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8305209: JDWP exit error AGENT_ERROR_INVALID_THREAD(203): missing entry in running thread table

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13246/head:pull/13246
$ git checkout pull/13246

Update a local copy of the PR:
$ git checkout pull/13246
$ git pull https://git.openjdk.org/jdk.git pull/13246/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13246

View PR using the GUI difftool:
$ git pr show -t 13246

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13246.diff

Webrev

Link to Webrev Comment

bridgekeeper bot commented Mar 30, 2023

👋 Welcome back cjplummer! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title 8305209 8305209: JDWP exit error AGENT_ERROR_INVALID_THREAD(203): missing entry in running thread table Mar 30, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 30, 2023

openjdk bot commented Mar 30, 2023

@plummercj The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Mar 30, 2023

mlbridge bot commented Mar 30, 2023

Webrevs

@plummercj
Contributor Author

Ping!

throw new RuntimeException("Non-zero debuggee exitValue: " + res);
}
}

Contributor

Good fix. It looks like this code is important to add in general.

// leading to high memory use for all the unprocessed events
// that get queued up, so we need to slow it down a bit more
// than we do for platform threads to avoid getting OOME.
Thread.sleep(100);

Contributor

I wonder if this sleep time might still not be enough.

Contributor Author

I did a lot of testing on all platforms, including with product builds, but yes, it is possible that on some platforms with some flags it might not be enough. I guess more testing will tell. Adjustments might be necessary. It is important not to slow things down too much, or it's possible that if there is a memory leak, the test won't catch it because the leak is not fast enough. With the current sleep values, throughput for virtual threads is still about 2x what it is for platform threads, so right now I'm not worried about it having been slowed down too much.

Member

small nit. shorter to use:
long timeToSleep = "Virtual".equals(mainWrapper) ? 100 : 50;

Contributor

Okay.
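
For context, the suggested one-liner could sit in a small helper like the sketch below. The class and method names are hypothetical, and how `mainWrapper` is obtained in the real test is not shown in this thread, so it is passed in as a parameter here.

```java
/** Sketch of choosing the per-thread sleep time based on the thread wrapper mode. */
public class SleepTimeSketch {
    /**
     * Virtual threads are created much faster than platform threads, so they
     * sleep longer (100 ms instead of 50 ms) to keep the event queue smaller.
     */
    static long timeToSleep(String mainWrapper) {
        return "Virtual".equals(mainWrapper) ? 100 : 50;
    }

    /** Sleeps for the chosen time; mainWrapper may be null for the default mode. */
    static void pause(String mainWrapper) throws InterruptedException {
        Thread.sleep(timeToSleep(mainWrapper));
    }
}
```

Writing the comparison as `"Virtual".equals(mainWrapper)` keeps it null-safe, so an unset wrapper value falls through to the 50 ms platform thread case.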

node = nonTlsSearch(getEnv(), &runningVThreads, thread);
}
}
}

Contributor

The fix here is reasonable.

* thread has terminated, but we never got the THREAD_END event.
* Search the runningThreads and runningVThreads lists. The TLS lookup may have
* failed because the thread has terminated, but we never got the THREAD_END event.
* The big comment immediately above explains why this can happen.

Contributor

Nit: I'd suggest getting rid of the word "immediately" here. :)

openjdk bot commented Apr 4, 2023

@plummercj This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8305209: JDWP exit error AGENT_ERROR_INVALID_THREAD(203): missing entry in running thread table

Reviewed-by: sspitsyn, lmesnik

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 102 new commits pushed to the master branch:

  • a6a3cf4: 8305480: test/hotspot/jtreg/runtime/NMT/VirtualAllocCommitMerge.java failing on 32 bit arm
  • 35d2293: 8305607: Remove some unused test parameters in com/sun/jdi tests
  • 5764119: 8303563: GetCurrentThreadCpuTime and GetThreadCpuTime need further clarification for virtual threads
  • 3127025: 8305600: java/lang/invoke/lambda/LogGeneratedClassesTest.java fails after JDK-8304846 and JDK-8202110
  • 35cb303: 8305425: Thread.isAlive0 doesn't need to call into the VM
  • b5d204c: 8305678: ProblemList serviceability/sa/ClhsdbInspect.java on windows-x64 in Xcomp
  • 507c49a: 8305664: [BACKOUT] (fs) Remove FileSystem support for resolving against a default directory (chdir configuration)
  • 39f12a8: 8305596: (fc) Two java/nio/channels tests fail after JDK-8303260
  • 44f33ad: 8304982: Emit warning for removal of COMPAT provider
  • ee30233: 8305107: Emoji related binary properties in RegEx
  • ... and 92 more: https://git.openjdk.org/jdk/compare/b3ff8d1c89b0f968b7b5ec2105502778524e4e4a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 4, 2023
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
if (p.isAlive()) {

Member

I think it is better to just use waitFor(). If the debuggee hangs, it would be better to time out and give the timeout handler a chance to dump all stack traces.

Contributor Author

Ok. I hadn't thought about the timeout handler.

@plummercj
Contributor Author

/integrate

openjdk bot commented Apr 6, 2023

Going to push as commit 1d517af.
Since your change was applied there have been 109 commits pushed to the master branch:

  • 08fbb7b: 8272119: Typo in JDK documentation (a -> an)
  • 536ad9d: 8305461: [vectorapi] Add VectorMask::xor
  • ddd50d0: 8305608: Change VMConnection to use "test.class.path"instead of "test.classes"
  • ce10460: 8274166: Some CDS tests ignore -Dtest.cds.runtime.options
  • e52a2ae: 8304745: Lazily initialize byte[] in java.io.BufferedInputStream
  • 6580c4e: 8267140: Support closing the HttpClient by making it auto-closable
  • b5ea140: 8269843: typo in LinkedHashMap::removeEldestEntry spec
  • a6a3cf4: 8305480: test/hotspot/jtreg/runtime/NMT/VirtualAllocCommitMerge.java failing on 32 bit arm
  • 35d2293: 8305607: Remove some unused test parameters in com/sun/jdi tests
  • 5764119: 8303563: GetCurrentThreadCpuTime and GetThreadCpuTime need further clarification for virtual threads
  • ... and 99 more: https://git.openjdk.org/jdk/compare/b3ff8d1c89b0f968b7b5ec2105502778524e4e4a...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 6, 2023
@openjdk openjdk bot closed this Apr 6, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 6, 2023

openjdk bot commented Apr 6, 2023

@plummercj Pushed as commit 1d517af.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

3 participants