-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8293166: jdk/jfr/jvm/TestDumpOnCrash.java fails on Linux ppc64le and Linux aarch64 #10943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back rschmelter! A progress list of the required criteria for merging this PR into |
@schmelter-sap The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, please adjust as well the Copyright header of the changed file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this a fix? Surely this test reveals a real bug.
@schmelter-sap This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 92 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
I think it was seen as a pragmatic test adjustment after the slowdowns caused by 8242181. See https://bugs.openjdk.org/browse/JDK-8293166 for details |
The feature to dump the JFR file during a crash is a best effort feature. It cannot be guaranteed to work ever time. This is why the test case already used 3 attempts. But the timing of the test has been changed , because we parse the dwarf information now. And this almost guarantees that we run into a case where the dump will not work on these platforms, rendering this test almost useless. |
Just to make sure I understand. Java Thread A runs into a fatal error. Secondary signal handler is installed. Error reporting starts, hs-err file is written. Then, JFR emergency dump is attempted. JFR emergency dump tries to enter native VM state. That causes SIGTRAP to fire in a different thread B? But we don't handle SIGTRAP anymore, so now we get "Thread B also had an error"? And the safepoint times out? |
No, there is a JFR Java thread which periodically runs and this one traps because the code switches from tier 1 to a higher tier. Since this thread is "In Java" it prevents a guaranteed safepoint to finish. And the JFR code then waits forever at the native->in VM transition. |
After discussing this offline with Ralf, I understand the problem better, and I think the workaround makes sense. IIUC the problem is that JFR dumper attempts a SafePoint while in fatal error reporting. Another thread (the JFR sampler thread) happens to be in Java, attempts to enter the safepoint, but gets stuck because we switched the signal handler and nobody is there to handle SIGTRAP. It gets stalled into an endless loop in VMError::report_and_die ("Thread ABC also had an error"). Reporting thread then times out, no JFR dump. The underlying problem is that we don't handle SafePoint faults in the crash handler. We could add SafePoint handling to the secondary crash handler - we already handle SafeFetch faults the same way. But do we want to do this? We are in a fatal error situation. The fact that all threads stop cold once they enter SafePoints could even be seen as a feature: if the VM is in fatal error mode and does error reporting, we want as little code running concurrently as possible. To not interfere with error reporting and to get unspoiled cores. Some details I don't understand yet. Why does it only happen on aarch64 and PPC? Should this not happen on all platforms? The SIGSEGV mechanism should not work there either. And this is more of a @TheRealMDoerr question: I'm confused about how ppc implements SafePoints. Why do we use SIGTRAP? Its handling seems so complex. On all other platforms, we just "SIGSEGV + crash address in polling page -> goto safepoint stub". On PPC, if SIGTRAP, we need to check instruction (TD), and then we search the code blob. We do three linear searches, twice for the code heap and once for the blob in the code heap (CodeCache::contains and CodeCache::find_blob). So much more work than what the other platforms do. I must be missing something obvious. Why is SIGTRAP better? Cheers, Thomas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
It doesn't attempt a safepoint, it is stopped at the native to "in VM" transition because a safepoint is pending. And the safepoint will never begin, since the sampler thread has crashed, leaving the thread in the "in VM" state, which is not safepoint safe.
The sampler thread crashes because it calls a tier 1 compiled method, for which we already have created a higher tier one. And to make the switch to the higher tier method, a trap instruction is written to the verified entry point of the tier 1 method on aarch64 and ppc64 (technically on ppc64 a trap is used in debug VMs and when the jump offset is too large and on aarch64 only if the jump offset is too large [128 M in a product build and only 2 M in a debug build]).
And we don't handle implicit null checks and similar things where we use the signal handler. Cheers, |
/integrate |
Going to push as commit 5551cb6.
Your commit was automatically rebased without conflicts. |
@schmelter-sap Pushed as commit 5551cb6. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Disabling tiered compilation avoids the sporadic failures of the test.
On ppc64 and aarch64 a trap-based mechanism is used to switch from tier 1 to higher tiers. In the test a crash is provoked and a secondary error handler is installed at the start of error reporting, which doesn't handle these traps anymore and just stops the thread. But since the thread state is 'in Java', this prevents any safepoint to be executed. And this causes the JFR emergency dump to hang in a native to VM transition, so the dump is not written and the test fails (see the bug report for more details).
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10943/head:pull/10943
$ git checkout pull/10943
Update a local copy of the PR:
$ git checkout pull/10943
$ git pull https://git.openjdk.org/jdk pull/10943/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10943
View PR using the GUI difftool:
$ git pr show -t 10943
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10943.diff