-
Notifications
You must be signed in to change notification settings - Fork 6k
8250637: UseOSErrorReporting times out (on Mac and Linux) #813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… UseOSErrorReporting
👋 Welcome back gziemski! A progress list of the required criteria for merging this PR into |
@gerard-ziemski The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Gerard,
I have general concerns about the usefulness of this switch, see the comments in the JBS issue. Beyond that, some remarks below.
Cheers, Thomas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Gerard,
I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.
For non-Windows there is no pre-established alternative code path for report_and_die() returning.
In the bug report you write:
On Mac/Linux it would look more like this:
#1 catch signal in our handler
#2 generate hs_err log
#3 turn off our signal handler
#4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated
To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.
I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)
Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.
So my preferred approaches here would be:
- Make UseOSErrorReporting Windows only; or
- Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.
Thanks,
David
From: https://bugs.openjdk.java.net/browse/JDK-6227246 "Iimplemented Windows-only flag -XX+UseOSErrorReporting which allows us instead of running of our crash handler and dying, forward exception handling to the OS in case of actual crash. " but there were issues with the integration of the fix: "Some of the changes to this fix weren't integrated or were merged out by mistake." and we ended up with a shared flag. I can see a comment in the original putback: "Make UseOSErrorReporting platform independant so linux can use someday and because used from os independant code." which is why this ended up not being Windows-only even though it only worked in a meaningful way on Windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing review status to "Request changes".
So my preferred approaches here would be: I like (2). It is sure to preserve the stack of the crashing thread. Not perfect, but maybe its close to what Gerard likes to see on MacOS. Only remark, this gets very close to what we do already, since os::abort() calls ::abort() which raises SIGABORT... but according to Gerard abort() does not seem to get noticed by MacOS crash handling. So artificially triggering a fault may be better. ..Thomas
|
hi David, Many thanks for the review and finding the background info on the history of this issue. How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail. On Windows we forward the crash to the OS to handle it, but just because in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, does not mean it's not a meaningful way to forward it to OS, if that's how the OS wants it - please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggest the way to let the macOS handle the crash is to: "unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter." That's how Apple suggest we do it for Mac. I can limit the scope of this fix to just macOS here, like I was planning it for JDK-8237727, and for Linux simply disable the flag for now and leave any more sophisticated fix for a next issue. I do think, however, that on Linux anything better than 2 min hang would be better. |
Mailing list message from David Holmes on hotspot-dev: On 27/10/2020 1:35 am, Gerard Ziemski wrote:
No there is a semantic underpining as to what it means for there to be
That is a blog by an Apple developer giving some very general advice, "Finally, there?s the question of how to exit from your signal handler." The suggestion to "then return" hits UB for the synchronous error "This will cause the crashed process to continue execution, crash again, is a naive oversimplification. If you just seg-faulted doing a read from
Yes please limit to macOS only. We should look at how to remove the flag Thanks, |
But that will show up as a different crash and might be confusing.
I think our differences of opinion all hinges on what happens when code returns from its signal handler: #1 Does it resume and actually redoes the exact same instruction? (which this time may succeed?) You and Thomas seem to believe that it's #3 (or is that #1 ?), I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2. I will continue this investigation in JDK-8237727 Here I will not be as ambitious and I will simply fix the problem at hand: i.e. address the 2 minutes hang by disabling the option for macOS and Linux. |
…ash with UseOSErrorReporting" This reverts commit f634064.
@gerard-ziemski This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 172 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
No, not #3. #2 is an interesting thought, but I don't think so. Were it so, our polling page mechanism would not work: triggering a SEGV by accessing a poisened page, and in signal handling, unpoisening the page and returning, which then re-executes the same load, but since the page is now unpoisened no fault happens. Which, btw, is an excellent example of a case where returning from a signal handler does not re-raise the same signal. On purpose in this case, but our point is that the same thing may happen accidentally. I think what happens is that the register contents - so, the crash context - which had been active when the thread got the first fault gets reinstated after signal handler returns, and we resume processing with the same state. So, all registers are the same, including pc. We would attempt to reload the instruction from the same address and re-execute it. But since the underlying memory could have changed in the meantime (starting at: the point the pc points to had been invalid and is now valid, e.g. a bug in the JIT, to: the instruction was a mov/store and its destination had been invalid and is now valid, and so on) there are conceivable scenarios where we may not crash a second time.
This is reasonable, thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please do a small cleanup:
UseOSErrorReporting is defined as pd flag, with definitions in all os-dependent globals.. files. Unnecessarily, since the default value is always false. We could remove the pd definitions and just make this a normal flag in globals.hpp.
(Would be cleaner to move it to globals_windows.hpp but this would probably need a csr since its a product flag)
Mailing list message from David Holmes on hotspot-dev: <trimming> On 28/10/2020 2:08 am, Gerard Ziemski wrote:
My position was based purely on the POSIX specification that returning So I was looking for something more definitive from macOS that things "The call to the handler is arranged so that if the signal handling So as Thomas discusses the issue is not whether #1 or #2 is correct, as
Okay. Thanks, |
Mailing list message from David Holmes on hotspot-dev: On 28/10/2020 5:07 pm, Thomas Stuefe wrote:
Any behavioural change in the existing product flag also requires a CSR Thanks, |
Thank you Thomas and David, I'm learning a lot from your reviews! Can you please take a look at the current fix in the Webrevs section? I removed the flag from other platforms. This will require CSR approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI Gerard,
The patch is fine in its current form to me (including your last push). Whether or not to do a CSR I leave up to you and David.
As my final remark to our "return from signal handler" discussion: I'd probably be more chill if this were a simple application. Like vi :) But we do so many unusual things (including generating, then running our own code) and the VM is the base for such a large software stack that I rather be careful.
All my remaining remarks are nits. Take what you like, ignore the rest.
Thank you,
Thomas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me, but will require a trivial CSR request.
/csr needed |
@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request. |
Looks all still good to me. Thank you for doing this! |
Hi @gerard-dl, thanks for making a comment in an OpenJDK project! All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user gerard-dl for the summary. If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.
Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use. |
@gerard-dl Only the author (@gerard-ziemski) is allowed to issue the |
/integrate |
@gerard-ziemski Since your change was applied there have been 240 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit ba2ff3a. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
hi all,
Please review this simple fix for POSIX platforms, which addresses a time out that occurs while handling a crash with UseOSErrorReporting turned ON.
It appears that "UseOSErrorReporting" flag was only ever meant to be used on Windows platform and was mistakenly left available for other platforms. In this fix we make sure to only use the flag on Windows platform and make it a NOP for other platforms.
Note #1: A similar hang issue occurs today even on Windows, with the only difference being that before a process times out (takes 2 minutes) it runs out of stack space in about 250 loops, so that's the only reason it doesn't linger for that long. Windows issue is tracked separately by https://bugs.openjdk.java.net/browse/JDK-8250782
Note #2: Creating native crash log (on macOS) is a non-trivial, research wise effort, that is tracked by https://bugs.openjdk.java.net/browse/JDK-8237727
Note #3 Removal of the "UseOSErrorReporting" flag will be depended on whether we can do #2 and at that time we can decide whether to keep it and implement it for other platforms or whether to remove it, provided that #2 can not be done reliably.
Progress
Testing
Failed test tasks
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/813/head:pull/813
$ git checkout pull/813