Skip to content

JDK-8305416: runtime/Thread/TestAlwaysPreTouchStacks.java failed with "Did not find expected NMT output" #13295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Apr 3, 2023

Fix test. Several issues caused the test to fail intermittently (only on Windows for some reason).

The test starts a a child process with -XX:+AlwaysPreTouchStacks. In this child process, it creates n threads with given stack sizes, then expects them to come up successfully, then checks the final NMT output for the expected stack committed size (which should ideally be very close to the stack reserved size, in contrast to a VM started without -AlwaysPreTouchStacks).

The test had several errors:

  • the test threads exited, and they were non-daemon threads - therefore, the final NMT report printed during JVM shutdown did not account for the test thread stacks.
  • The test threads create log output (via Xlog). If that output gets mixed into the final NMT report, the report parser gets confused. This is what happened on Windows, and this made the error intermittent.

The new, fixed test:

  • creates the test threads as daemon threads and makes them loop endlessly, so they stick around till VM death and show up in the final NMT report. We also create more test threads with noticeably larger stacks for them to have a visible impact on NMT-reported stack sizes.
  • removed the Xlog output
  • The test main thread waits until all test threads have successfully started.
  • I removed the testing of different stack sizes since it was not useful
  • I made the test stricter:
    • The total committed thread stack size should be more than 3/4ths of the total committed reserved stack size for the test to succeed. This, in combination with large test thread stacks, is enough to prove that PreTouch is working
    • I also added a santity test for the total reserved stack size to be larger - but not insanely so - than the combined test thread stack sizes to prove that what we see is caused by our test threads.

Tests: manually ran tests on Windows x64 and Linux x64.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8305416: runtime/Thread/TestAlwaysPreTouchStacks.java failed with "Did not find expected NMT output"

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13295/head:pull/13295
$ git checkout pull/13295

Update a local copy of the PR:
$ git checkout pull/13295
$ git pull https://git.openjdk.org/jdk.git pull/13295/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13295

View PR using the GUI difftool:
$ git pr show -t 13295

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13295.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Apr 3, 2023

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Apr 3, 2023

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Apr 3, 2023
@tstuefe tstuefe force-pushed the JDK-8305416-runtime-Thread-TestAlwaysPreTouchStacks-fails branch 3 times, most recently from 892a127 to ce2a7f2 Compare April 3, 2023 13:14
@tstuefe tstuefe force-pushed the JDK-8305416-runtime-Thread-TestAlwaysPreTouchStacks-fails branch from ce2a7f2 to ae23ee2 Compare April 3, 2023 13:26
@tstuefe tstuefe marked this pull request as ready for review April 3, 2023 13:27
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 3, 2023
@mlbridge
Copy link

mlbridge bot commented Apr 3, 2023

Webrevs

@dcubed-ojdk
Copy link
Member

@tstuefe - At this point, the failure has also been seen on linux-aarch64
and linux-x64. Hopefully that meshes with what you think is wrong with
this test...

@tstuefe
Copy link
Member Author

tstuefe commented Apr 3, 2023

@tstuefe - At this point, the failure has also been seen on linux-aarch64 and linux-x64. Hopefully that meshes with what you think is wrong with this test...

@dcubed-ojdk Okay, I problemlisted the test to give us some time. My patch should work, but no need to put pressure on the review. Could you give me a quick thumbs up please for #13301 ?

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the issue of daemon versus non-daemon as that only affects when the VM can exit, not how long individual threads "stick around" ???

@tstuefe
Copy link
Member Author

tstuefe commented Apr 4, 2023

I don't understand the issue of daemon versus non-daemon as that only affects when the VM can exit, not how long individual threads "stick around" ???

You are right, that's badly phrased. The old version of this test fired up threads that printed something, then finished. The JVM waited for them to exit, then exited itself, printing the NMT report in the process. That report did not contain the thread stacks for the already exited test threads.

The new test makes sure the threads never exit. They have to be daemonized in order to not hold up the JVM exit. Now, the NMT reported stack sized contain the test thread stack sizes.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments and nits. I'm a little concerned about reliability aspects of this test across a range of platforms. I will run some tests here.

You need to update the ProblemList too.

Thanks.

try {
// report aliveness, then sleep until VM death
gate.await();
for(;;) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: space after for

for (int s: stackSizes) {
threads.add(createTestThread(s));
// Add a bunch of large-stacked threads to make a significant imprint on combined thread stack size
for (int i = 0; i < numThreads; i ++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no space before ++

@@ -81,12 +101,10 @@ public static void main(String[] args) throws InterruptedException, IOException

output.shouldHaveExitValue(0);

for (int s: stackSizes) {
output.shouldContain("Alive: " + Integer.toString(s));
for (int i = 0; i < numThreads; i ++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no space before ++

Comment on lines +131 to +132
// of their reserved size. Requiring them to be committed for over 3/4th shows that pretouch is
// really working.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many platforms have you tested this on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it on all our platforms and it passed okay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually tested on Windows x64, Linux x64, x86, aarch64 and arm. Arm with the patch based on an older version because HEAD is broken on Arm. For the rest I relied on GHAs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it on all our platforms and it passed okay.

Thank you for that!

Comment on lines +138 to +139
long max_reserved = memoryCeilingMB * 3 * MB;
long min_reserved = memoryCeilingMB * MB;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How were these boundaries determined? What variance do you see in the reported value when the test runs?

Copy link
Member Author

@tstuefe tstuefe Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How were these boundaries determined?

There is a vagueness about reserved total stack size of course - it contains other thread's stack sizes too, and any overhead that the VM may add atop of user-specified thread stack sizes, mainly guard pages.

Min is the logical lower end and the actual number should be greater.
Max is chosen very generously to be much larger. Maybe too large, but I did not want to play whack-the-mole for false positives on platforms I have not tested manually (e.g. larger paged ones), while still excluding obviously insane cases.

On the platforms I tested on variance was not large at all; my 128MB reserved stack size for the test threads gives me:

linux x86: 133MB
linux aarch64 (4k pages): 154MB
linux x64: 144MB

(The difference between aarch64 and x64 is interesting but I did not dive deeper into that; I assume number of hotspot threads may differ for threads started based on number of cores, but my x64 machine is way beefier than my aarch64-box so the ratio should be reversed)..

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all my concerns have been addressed. Thanks.

@openjdk
Copy link

openjdk bot commented Apr 5, 2023

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8305416: runtime/Thread/TestAlwaysPreTouchStacks.java failed with "Did not find expected NMT output"

Reviewed-by: dholmes, gziemski

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 5, 2023
@tstuefe
Copy link
Member Author

tstuefe commented Apr 10, 2023

Friendly ping

1 similar comment
@tstuefe
Copy link
Member Author

tstuefe commented Apr 19, 2023

Friendly ping

// report, so their stacks should dominate the NMT-reported total stack size.
long max_reserved = memoryCeilingMB * 3 * MB;
long min_reserved = memoryCeilingMB * MB;
if (reserved >= max_reserved || reserved < min_reserved) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is reserved >= max_reserved a concern? Would we really want/need to look into this if it did happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

If we start x Threads, with stack size y, I expect to see not much more than x*y+(very generous fudge factor owed to unknown threads running). If I see more than that, NMT may be showing the wrong number for thread stacks, or scale the number wrong, see #13665.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to rewrite this as a gtest and allocate using VM os::alloc() once then check that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean? This test tests that if I start a VM with -XX:+AlwaysPreTouchThreadStacks the VM will touch all - or most - of the thread stacks upon thread start. It exploits that NMT actually probes the thread stacks for committed pages, and checks what NMT says about thread stacks.

I would not know how to do this in gtest, espectially since forking is involved.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean? This test tests that if I start a VM with -XX:+AlwaysPreTouchThreadStacks the VM will touch all - or most - of the thread stacks upon thread start. It exploits that NMT actually probes the thread stacks for committed pages, and checks what NMT says about thread stacks.

I would not know how to do this in gtest, espectially since forking is involved.

Sorry, I saw AlwaysPreTouchThreadStacks, but read AlwaysPreTouch.

Looks good to me.

@openjdk
Copy link

openjdk bot commented Apr 25, 2023

@tstuefe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8305416-runtime-Thread-TestAlwaysPreTouchStacks-fails
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed ready Pull request is ready to be integrated labels Apr 25, 2023
@openjdk openjdk bot added ready Pull request is ready to be integrated and removed merge-conflict Pull request has merge conflict with target branch labels Apr 27, 2023
@tstuefe
Copy link
Member Author

tstuefe commented May 11, 2023

/integrate

@openjdk
Copy link

openjdk bot commented May 11, 2023

Going to push as commit 8ac7186.
Since your change was applied there have been 144 commits pushed to the master branch:

  • 984fbbb: 8307810: Consistently use LockingMode instead of UseHeavyMonitors
  • 39f4e4d: 8307758: RISC-V: Improve bit test code introduced by JDK-8291555
  • 3cb606e: 8306965: osThread allocation failures should not abort the VM
  • 4795c39: 8281103: Give example for Locale that is English and follows the ISO standards
  • 958d4a7: 8307563: make most fields final in JavacTrees
  • f250ab2: 8307377: Remove use of tagName from TagletWriterImpl.linkSeeReferenceOutput
  • cc9f7ad: 8307765: DynamicArchiveHeader contents are missing in CDS mapfile
  • edc4adb: 8307860: [BACKOUT] JDK-8307194 Add make target for optionally building a complete set of all JDK and hotspot libjvm static libraries
  • df90df2: 8307857: validate-source fails after JDK-8306758
  • 2688364: 8306758: com/sun/jdi/ConnectedVMs.java fails with "Non-zero debuggee exitValue: 143"
  • ... and 134 more: https://git.openjdk.org/jdk/compare/ea9201f42085028f6f16738214f058efb680ac09...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 11, 2023
@openjdk openjdk bot closed this May 11, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 11, 2023
@openjdk
Copy link

openjdk bot commented May 11, 2023

@tstuefe Pushed as commit 8ac7186.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants