JDK-8319633: runtime/posixSig/TestPosixSig.java intermittent timeouts on UNIX #16797

Closed
wants to merge 5 commits into from

@JoKern65 JoKern65 commented Nov 23, 2023

Every 1-2 weeks we run into timeouts when running the jtreg test runtime/posixSig/TestPosixSig.java on UNIX.
The thread stack shows that we are in line 54 of TestPosixSig.java.

The reason is the following: the test registers a new dummy signal handler for SIGILL without delegating to the previous handler in the chain. If the VM then calls a Java method marked as not-entrant, a SIGILL is raised, at least on PPC64. Because the registered handler does not handle it, the SIGILL is raised again and again in an endless loop.
One solution would be to add a delegation to the HotSpot signal handler, which is the previous handler in the chain.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8319633: runtime/posixSig/TestPosixSig.java intermittent timeouts on UNIX (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16797/head:pull/16797
$ git checkout pull/16797

Update a local copy of the PR:
$ git checkout pull/16797
$ git pull https://git.openjdk.org/jdk.git pull/16797/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16797

View PR using the GUI difftool:
$ git pr show -t 16797

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16797.diff

bridgekeeper bot commented Nov 23, 2023

👋 Welcome back jkern! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 23, 2023

openjdk bot commented Nov 23, 2023

@JoKern65 The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Nov 23, 2023

mlbridge bot commented Nov 23, 2023

Webrevs

tstuefe commented Nov 23, 2023

Good catch. I am surprised that this does not happen more often.

This is not an AIX issue. Please change the JBS ticket (title, os, and description) and PR title to make it a general issue on all *nixes.

The issue is that this test wants to check that the periodic JNI checker catches modified signal handlers (if a native app uses signals, it must use the signal interposition library). This is racy - there is a time window between setting the handler and the VM noticing it; any signal we receive during that time will not be processed by the VM.

This issue highlights an inherent problem with the JNI checker. Maybe we should increase its frequency, but with 10ms it's already quite high.
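To make the race concrete, here is a minimal sketch of one polling step of a handler check. It is not HotSpot's JNI checker: `runtime_handler`, `foreign_handler`, `install_handler` and `handler_was_replaced` are hypothetical names, and SIGUSR1 is assumed as a harmless stand-in for SIGILL. A signal that arrives after the replacement but before the next poll would reach only the foreign handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the handler the runtime installed. */
static void runtime_handler(int sig) { (void)sig; }

/* Hypothetical stand-in for a handler that native code installs later. */
static void foreign_handler(int sig) { (void)sig; }

/* Install fn as a plain (non-SA_SIGINFO) handler; returns 0 on success. */
static int install_handler(int sig, void (*fn)(int)) {
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = fn;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    return sigaction(sig, &act, NULL);
}

/* One polling step: query the current disposition without changing it
 * (act == NULL) and compare it with the handler the runtime expects. */
static bool handler_was_replaced(int sig, void (*expected)(int)) {
    struct sigaction cur;
    if (sigaction(sig, NULL, &cur) != 0) {
        return false; /* query failed; this sketch reports nothing */
    }
    return cur.sa_handler != expected;
}
```

Calling `handler_was_replaced(SIGUSR1, runtime_handler)` periodically detects the change only at the next poll, which is the window discussed above.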

My solution would have been just to use a signal that is monitored by the JNI checker, but not used in operations; that is only expected to be triggered from outside. I would have replaced SIGILL with SIGQUIT, which is used to trigger a thread dump.

@JoKern65 JoKern65 changed the title from "JDK-8319633: runtime/posixSig/TestPosixSig.java intermittent timeouts on AIX" to "JDK-8319633: runtime/posixSig/TestPosixSig.java intermittent timeouts on UNIX" Nov 23, 2023

@dholmes-ora dholmes-ora left a comment

Does the failure only happen when running with -Xcomp? If so we could just require it to not run then. Otherwise the chaining seems reasonable, but needs a change I think.

Thanks

- act.sa_flags = 0;
- int retval = sigaction(val, &act, 0);
+ act.sa_flags = SA_SIGINFO;
+ int retval = sigaction(val, &act, &oact);
Member

This will still leave a window where the new handler is installed but not yet chained. Better to use one sigaction to read the old handler, then install the new one.
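A minimal sketch of that ordering, under stated assumptions: the names `previous_action`, `previous_handler`, `chaining_handler` and `install_chaining_handler` are hypothetical and not taken from the test's native code, and SIGUSR1 stands in for SIGILL so the example is harmless to run. The old disposition is read first, so the new handler never runs with an unset chain target.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Previous disposition, saved BEFORE the new handler becomes active. */
static struct sigaction previous_action;

/* Counters for demonstration only. */
static volatile sig_atomic_t chained_calls = 0;
static volatile sig_atomic_t previous_calls = 0;

/* Hypothetical stand-in for the handler that was installed first. */
static void previous_handler(int sig) {
    (void)sig;
    previous_calls++;
}

/* New handler: does its own work, then delegates to the previous one. */
static void chaining_handler(int sig, siginfo_t *info, void *ucontext) {
    chained_calls++;
    if (previous_action.sa_flags & SA_SIGINFO) {
        previous_action.sa_sigaction(sig, info, ucontext);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(sig);
    }
    /* SIG_DFL and SIG_IGN are not callable; this sketch just returns. */
}

/* Step 1: read the old disposition. Step 2: install the new handler. */
static int install_chaining_handler(int sig) {
    struct sigaction act;
    if (sigaction(sig, NULL, &previous_action) != 0) {
        return -1;
    }
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = chaining_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_SIGINFO;
    return sigaction(sig, &act, NULL);
}
```

With `previous_handler` installed for SIGUSR1 and `install_chaining_handler(SIGUSR1)` called afterwards, a `raise(SIGUSR1)` runs both handlers once. Another thread could still change the disposition between the two `sigaction` calls, so this narrows the window rather than making the swap atomic.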

tstuefe commented Nov 24, 2023

> Does the failure only happen when running with -Xcomp? If so we could just require it to not run then. Otherwise the chaining seems reasonable, but needs a change I think.
>
> Thanks

Wouldn't just using a signal which we know is not generated by HotSpot itself but still monitored by the JNI checker, e.g. SIGQUIT, be simpler?

P.S. I'm not insisting on that; if @JoKern65 wants to do the chaining, that is fine.

@JoKern65
Contributor Author

> Does the failure only happen when running with -Xcomp? If so we could just require it to not run then. Otherwise the chaining seems reasonable, but needs a change I think.
>
> Thanks

The failure does not happen (as far as I could see) when adding -Xint.

@dholmes-ora
Member

Okay so three options here:

  1. Use -Xint and avoid the JIT generated SIGILL that causes the problem
  2. Switch to a signal other than SIGILL that won't be generated during execution
  3. Do the chaining as proposed, but safely.

I think 1 or 2 is simpler.

tstuefe commented Nov 25, 2023

> Okay so three options here:
>
> 1. Use -Xint and avoid the JIT generated SIGILL that causes the problem
> 2. Switch to a signal other than SIGILL that won't be generated during execution
> 3. Do the chaining as proposed, but safely.
>
> I think 1 or 2 is simpler.

Of the three, I opt for 1, with a clear comment.

@JoKern65
Contributor Author

I switched to the -Xint solution as proposed. Thank you for your help.

@tstuefe tstuefe left a comment

Okay.

openjdk bot commented Nov 27, 2023

@JoKern65 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8319633: runtime/posixSig/TestPosixSig.java intermittent timeouts on UNIX

Reviewed-by: dholmes, stuefe, mdoerr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 37 new commits pushed to the master branch:

  • efc3922: 8319048: Monitor deflation unlink phase prolongs time to safepoint
  • debf0ec: 8313355: javax/management/remote/mandatory/notif/ListenerScaleTest.java failed with "Exception: Failed: ratio=792.2791601423487"
  • 20aae3c: 8320533: Adjust capstone integration for v6 changes
  • 0678253: 8320602: Lock contention in SchemaDVFactory.getInstance()
  • f1a24f6: 8318599: HttpURLConnection cache issues leading to crashes in JGSS w/ native GSS introduced by 8303809
  • 7848ed7: 8301856: Generated .spec file for RPM installers uninstalls desktop launcher on update
  • 726f854: 8320706: RuntimePackageTest.testUsrInstallDir test fails on Linux
  • 1bb250c: 8261837: SIGSEGV in ciVirtualCallTypeData::translate_from
  • 5f7f2c4: 8320249: tools/jpackage/share/AddLauncherTest.java#id1 fails intermittently on Windows in verifyDescription
  • 6871a2f: 8320803: Update SourceVersion.RELEASE_22 description for language changes
  • ... and 27 more: https://git.openjdk.org/jdk/compare/3787ff8d1d8dbcaaebb9616c5bc543e2fe21a90c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @tstuefe, @TheRealMDoerr) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 27, 2023
@dholmes-ora dholmes-ora left a comment

Seems fine. Thanks

@@ -44,8 +44,16 @@ public static void main(String[] args) throws Throwable {
if (args.length == 0) {

// Create a new java process for the TestPsig Java/JNI test.
// We run the VM in interpreted mode, because the JIT might mark
// a Java method as not-entrant, which means turning the first opcode
// of the compiled method to NULL. Calling such a method after establishing
Contributor

This comment is PPC64 specific. Please make it more general. That the instruction "0" generates SIGILL is specified by the PPC64 ISA and may be wrong for other platforms. Better would be "which means turning the first instruction into an illegal one".

Member

Also would be good to mention that the problem is one of time: "If a SIGILL arrives after we redirected the signal handler but before the JNI checker noted the signal handler modification, the JVM may crash or hang; since SIGILLs may be generated by compiled code, we run interpreted".

// below, but before the JNI checker noted and reacted on this signal handler
// modification, the JVM may crash or hang in an endless loop, where the
// illegal instruction will be continously executed, raising SIGILL, and
// the signal handler will return to the illegal insruction again...
Contributor

typo: insruction
Otherwise, LGTM.

Contributor Author

typo fixed

@tstuefe tstuefe left a comment

good

@JoKern65
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Nov 28, 2023

openjdk bot commented Nov 28, 2023

@JoKern65
Your change (at version aa01f8c) is now ready to be sponsored by a Committer.

@TheRealMDoerr
Contributor

/sponsor

openjdk bot commented Nov 28, 2023

Going to push as commit 464dc3d.
Since your change was applied there have been 37 commits pushed to the master branch:

  • efc3922: 8319048: Monitor deflation unlink phase prolongs time to safepoint
  • debf0ec: 8313355: javax/management/remote/mandatory/notif/ListenerScaleTest.java failed with "Exception: Failed: ratio=792.2791601423487"
  • 20aae3c: 8320533: Adjust capstone integration for v6 changes
  • 0678253: 8320602: Lock contention in SchemaDVFactory.getInstance()
  • f1a24f6: 8318599: HttpURLConnection cache issues leading to crashes in JGSS w/ native GSS introduced by 8303809
  • 7848ed7: 8301856: Generated .spec file for RPM installers uninstalls desktop launcher on update
  • 726f854: 8320706: RuntimePackageTest.testUsrInstallDir test fails on Linux
  • 1bb250c: 8261837: SIGSEGV in ciVirtualCallTypeData::translate_from
  • 5f7f2c4: 8320249: tools/jpackage/share/AddLauncherTest.java#id1 fails intermittently on Windows in verifyDescription
  • 6871a2f: 8320803: Update SourceVersion.RELEASE_22 description for language changes
  • ... and 27 more: https://git.openjdk.org/jdk/compare/3787ff8d1d8dbcaaebb9616c5bc543e2fe21a90c...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 28, 2023
@openjdk openjdk bot closed this Nov 28, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Nov 28, 2023

openjdk bot commented Nov 28, 2023

@TheRealMDoerr @JoKern65 Pushed as commit 464dc3d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
4 participants