Skip to content

8293166: jdk/jfr/jvm/TestDumpOnCrash.java fails on Linux ppc64le and Linux aarch64 #10943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

schmelter-sap
Copy link
Contributor

@schmelter-sap schmelter-sap commented Nov 2, 2022

Disabling tiered compilation avoids the sporadic failures of the test.

On ppc64 and aarch64 a trap-based mechanism is used to switch from tier 1 to higher tiers. In the test a crash is provoked and a secondary error handler is installed at the start of error reporting, which doesn't handle these traps anymore and just stops the thread. But since the thread state is 'in Java', this prevents any safepoint to be executed. And this causes the JFR emergency dump to hang in a native to VM transition, so the dump is not written and the test fails (see the bug report for more details).


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8293166: jdk/jfr/jvm/TestDumpOnCrash.java fails on Linux ppc64le and Linux aarch64

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10943/head:pull/10943
$ git checkout pull/10943

Update a local copy of the PR:
$ git checkout pull/10943
$ git pull https://git.openjdk.org/jdk pull/10943/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10943

View PR using the GUI difftool:
$ git pr show -t 10943

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10943.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 2, 2022

👋 Welcome back rschmelter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 2, 2022
@openjdk
Copy link

openjdk bot commented Nov 2, 2022

@schmelter-sap The following label will be automatically applied to this pull request:

  • hotspot-jfr

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-jfr hotspot-jfr-dev@openjdk.org label Nov 2, 2022
@mlbridge
Copy link

mlbridge bot commented Nov 2, 2022

Webrevs

Copy link
Member

@MBaesken MBaesken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, please adjust as well the Copyright header of the changed file.

Copy link
Contributor

@theRealAph theRealAph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this a fix? Surely this test reveals a real bug.

@openjdk
Copy link

openjdk bot commented Nov 2, 2022

@schmelter-sap This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8293166: jdk/jfr/jvm/TestDumpOnCrash.java fails on Linux ppc64le and Linux aarch64

Reviewed-by: mbaesken, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 92 new commits pushed to the master branch:

  • d4376f8: 8296406: ChainingConstructors jtreg test reduce code duplication
  • 4a0093c: 8294362: UL: Replace the internal usage of manual buffers with stringStream in LogSelection
  • fef68bb: 8296515: RISC-V: Small refactoring for MaxReductionV/MinReductionV/AddReductionV node implementation
  • 82cbfb5: 8296140: Drop unused field java.util.Calendar.DATE_MASK
  • fd83764: 8296239: ISO 4217 Amendment 174 Update
  • d9b25e8: 8296426: x86: Narrow UseAVX and UseSSE flags
  • 8146e1a: 8296347: Memory leak from ClassPathDirEntry::_dir
  • 671f84b: 8296143: CertAttrSet's set/get mechanism is not type-safe
  • d04d656: 8296433: Encountered null CLD while loading shared lambda proxy class
  • 74f2b16: 8295303: cleanup debug agent's confusing use of EI_GC_FINISH
  • ... and 82 more: https://git.openjdk.org/jdk/compare/50d91a31d495adf8e189d0188918f4ff22f93876...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 2, 2022
@MBaesken
Copy link
Member

MBaesken commented Nov 2, 2022

How is this a fix? Surely this test reveals a real bug.

I think it was seen as a pragmatic test adjustment after the slowdowns caused by 8242181. See https://bugs.openjdk.org/browse/JDK-8293166 for details
"Disabling tiered compilation for the test avoids the traps in the code and lets the test work (at least on ppc, where I tested it). While certainly not bulletproof it seems to be good enough from a practical standpoint, given that the test has run reliably before."

@schmelter-sap
Copy link
Contributor Author

How is this a fix? Surely this test reveals a real bug.

The feature to dump the JFR file during a crash is a best effort feature. It cannot be guaranteed to work ever time. This is why the test case already used 3 attempts. But the timing of the test has been changed , because we parse the dwarf information now. And this almost guarantees that we run into a case where the dump will not work on these platforms, rendering this test almost useless.

@tstuefe
Copy link
Member

tstuefe commented Nov 9, 2022

Just to make sure I understand.

Java Thread A runs into a fatal error. Secondary signal handler is installed. Error reporting starts, hs-err file is written. Then, JFR emergency dump is attempted. JFR emergency dump tries to enter native VM state. That causes SIGTRAP to fire in a different thread B? But we don't handle SIGTRAP anymore, so now we get "Thread B also had an error"? And the safepoint times out?

@schmelter-sap
Copy link
Contributor Author

Just to make sure I understand.

JFR emergency dump tries to enter native VM state. That causes SIGTRAP to fire in a different thread B?

No, there is a JFR Java thread which periodically runs and this one traps because the code switches from tier 1 to a higher tier. Since this thread is "In Java" it prevents a guaranteed safepoint to finish. And the JFR code then waits forever at the native->in VM transition.

@tstuefe
Copy link
Member

tstuefe commented Nov 9, 2022

After discussing this offline with Ralf, I understand the problem better, and I think the workaround makes sense.

IIUC the problem is that JFR dumper attempts a SafePoint while in fatal error reporting. Another thread (the JFR sampler thread) happens to be in Java, attempts to enter the safepoint, but gets stuck because we switched the signal handler and nobody is there to handle SIGTRAP. It gets stalled into an endless loop in VMError::report_and_die ("Thread ABC also had an error"). Reporting thread then times out, no JFR dump.

The underlying problem is that we don't handle SafePoint faults in the crash handler. We could add SafePoint handling to the secondary crash handler - we already handle SafeFetch faults the same way. But do we want to do this? We are in a fatal error situation. The fact that all threads stop cold once they enter SafePoints could even be seen as a feature: if the VM is in fatal error mode and does error reporting, we want as little code running concurrently as possible. To not interfere with error reporting and to get unspoiled cores.


Some details I don't understand yet. Why does it only happen on aarch64 and PPC? Should this not happen on all platforms? The SIGSEGV mechanism should not work there either.

And this is more of a @TheRealMDoerr question: I'm confused about how ppc implements SafePoints. Why do we use SIGTRAP? Its handling seems so complex. On all other platforms, we just "SIGSEGV + crash address in polling page -> goto safepoint stub". On PPC, if SIGTRAP, we need to check instruction (TD), and then we search the code blob. We do three linear searches, twice for the code heap and once for the blob in the code heap (CodeCache::contains and CodeCache::find_blob). So much more work than what the other platforms do. I must be missing something obvious. Why is SIGTRAP better?

Cheers, Thomas

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@schmelter-sap
Copy link
Contributor Author

IIUC the problem is that JFR dumper attempts a SafePoint while in fatal error reporting.

It doesn't attempt a safepoint, it is stopped at the native to "in VM" transition because a safepoint is pending. And the safepoint will never begin, since the sampler thread has crashed, leaving the thread in the "in VM" state, which is not safepoint safe.

Another thread (the JFR sampler thread) happens to be in Java, attempts to enter the safepoint, but gets stuck because we switched the signal handler and nobody is there to handle SIGTRAP.

The sampler thread crashes because it calls a tier 1 compiled method, for which we already have created a higher tier one. And to make the switch to the higher tier method, a trap instruction is written to the verified entry point of the tier 1 method on aarch64 and ppc64 (technically on ppc64 a trap is used in debug VMs and when the jump offset is too large and on aarch64 only if the jump offset is too large [128 M in a product build and only 2 M in a debug build]).

The underlying problem is that we don't handle SafePoint faults in the crash handler.

And we don't handle implicit null checks and similar things where we use the signal handler.

Cheers,
Ralf

@schmelter-sap
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Nov 15, 2022

Going to push as commit 5551cb6.
Since your change was applied there have been 155 commits pushed to the master branch:

  • 8a9eabb: 8296786: Limit VM modes for com/sun/jdi/JdbLastErrorTest.java
  • 873eccd: 8296923: JFR: jfr --version should return System.getProperty("java version")
  • 93d6b1f: 8295711: Rename ZBarrierSetAssembler::load_at parameter name from "tmp_thread" to "tmp2"
  • 2f7dc5c: 8296089: Remove debug agent code for special handling of Thread.resume()
  • c71d87e: 8286624: Regression Test CoordinateTruncationBug.java fails on OL8.3
  • a7c2338: 8296900: CertificateValidity fields are not optional
  • 3eb789a: 8296171: Compiler incorrectly rejects code with variadic method references
  • 749335d: 8291911: java/io/File/GetXSpace.java fails with "53687091200 != 161051996160"
  • 95b8405: 8296431: PushbackInputStream should override transferTo
  • e269dc0: 8293681: ResponseAPDU getData() method javadoc
  • ... and 145 more: https://git.openjdk.org/jdk/compare/50d91a31d495adf8e189d0188918f4ff22f93876...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 15, 2022
@openjdk openjdk bot closed this Nov 15, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 15, 2022
@openjdk
Copy link

openjdk bot commented Nov 15, 2022

@schmelter-sap Pushed as commit 5551cb6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-jfr hotspot-jfr-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants