8250637: UseOSErrorReporting times out (on Mac and Linux) #813

Closed
wants to merge 6 commits into from

gerard-ziemski

@gerard-ziemski gerard-ziemski commented Oct 22, 2020

hi all,

Please review this simple fix for POSIX platforms, which addresses a timeout that occurs while handling a crash with UseOSErrorReporting turned ON.

It appears that the "UseOSErrorReporting" flag was only ever meant to be used on Windows and was mistakenly left available on other platforms. This fix makes sure the flag is only honored on Windows and is a NOP on other platforms.

Note #1: A similar hang occurs today even on Windows. The only difference is that the process runs out of stack space after about 250 loops, before it would time out (which takes 2 minutes), and that is the only reason it doesn't linger that long. The Windows issue is tracked separately by https://bugs.openjdk.java.net/browse/JDK-8250782

Note #2: Creating a native crash log (on macOS) is a non-trivial effort that requires research, and is tracked by https://bugs.openjdk.java.net/browse/JDK-8237727

Note #3: Removal of the "UseOSErrorReporting" flag will depend on whether we can do #2. At that time we can decide whether to keep it and implement it for other platforms, or to remove it if #2 cannot be done reliably.
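As a rough illustration of what "only use the flag on Windows and make it a NOP for other platforms" could look like, here is a minimal C sketch. The helper name `effective_use_os_error_reporting` is hypothetical and is not HotSpot code; it only shows the requested value being honored on Windows and ignored elsewhere.

```c
#include <stdbool.h>

/* Hypothetical helper (an assumption for illustration, not HotSpot code):
   returns the value the VM would actually act on for a requested
   UseOSErrorReporting setting. */
static bool effective_use_os_error_reporting(bool requested) {
#ifdef _WIN32
    /* Windows: honor the request, so the exception can be passed on to the OS. */
    return requested;
#else
    /* Other platforms: the flag is accepted but has no effect. */
    (void)requested;
    return false;
#endif
}
```

With this shape, code that would otherwise return from the error reporter only does so where a follow-up code path exists.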


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

| | Linux x32 | Linux x64 | Windows x64 | macOS x64 |
|---|---|---|---|---|
| Build | ❌ (1/1 failed) | ✔️ (5/5 passed) | ✔️ (2/2 passed) | ✔️ (2/2 passed) |
| Test (tier1) | | ✔️ (9/9 passed) | ❌ (1/9 failed) | ✔️ (9/9 passed) |

Failed test tasks

Issue

  • JDK-8250637: UseOSErrorReporting times out (on Mac and Linux)

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/813/head:pull/813
$ git checkout pull/813

@bridgekeeper

bridgekeeper bot commented Oct 22, 2020

👋 Welcome back gziemski! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 22, 2020

@gerard-ziemski The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 22, 2020
@gerard-ziemski gerard-ziemski marked this pull request as ready for review October 22, 2020 17:27
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 22, 2020
@mlbridge

mlbridge bot commented Oct 22, 2020

Webrevs

Member

@tstuefe tstuefe left a comment

Hi Gerard,

I have general concerns about the usefulness of this switch, see the comments in the JBS issue. Beyond that, some remarks below.

Cheers, Thomas

Member

@dholmes-ora dholmes-ora left a comment

Hi Gerard,

I think we have a fundamental problem here that UseOSErrorReporting was only ever intended for use on Windows. It simply allows VMError::report_and_die to return instead of actually making the VM "die". For Windows this means we can continue to propagate the windows exception and thus allow Windows Error Reporting (WER) to take over. Whether this actually works correctly or not is a different matter.

For non-Windows there is no pre-established alternative code path for report_and_die() returning.

In the bug report you write:

On Mac/Linux it would look more like this:

#1 catch signal in our handler
#2 generate hs_err log
#3 turn off our signal handler
#4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated

To me you are now inventing what UseOSErrorReporting should mean on non-Windows, and I don't agree with it. I don't think it should mean that we re-crash using the "default" signal response and consider that as using "OS error reporting". To me that is just not valid, especially when we cannot return from a signal handling context in many cases without incurring undefined behaviour. To me #4 is not a valid expectation as we have no way to know what will happen next if the signal handler returns. It would also be wrong to just continue execution after an assertion or guarantee fails.

I'm assuming that the motivation here is that on macOS if we use the default signal handling modes then macOS will do its own error reporting? If so I would suggest that the right response may be to return from report_and_die (on macOS only) and then deliberately crash after restoring the default handler. Obviously that will change which "crash" the OS reports but that is likely to happen anyway as you cannot guarantee how you will crash after trying to continue (and this goes beyond our general "best effort" approaches in signal handling.)

Beyond that I share Thomas's concerns about making sweeping changes to installed signal handlers.

So my preferred approaches here would be:

  1. Make UseOSErrorReporting Windows only; or
  2. Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.
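Option 2 can be sketched in plain POSIX C, under several assumptions. The handler below is a hypothetical stand-in for the VM's error reporting, `raise(SIGFPE)` stands in for the original error, and `raise(SIGSEGV)` is used as the simplest possible "targeted crash". Whether an OS crash reporter treats a raised signal like a genuine fault is not established here. The scenario runs in a forked child so that the caller can observe which signal ended the process.

```c
#define _DEFAULT_SOURCE /* glibc: keep POSIX declarations visible under strict -std= modes */
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the VM's crash handler (not HotSpot code). */
static void report_then_targeted_crash(int sig) {
    static const char msg[] = "error report would be written here\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1); /* async-signal-safe */
    (void)ignored;
    (void)sig;

    /* Restore the default disposition, make sure the signal is deliverable,
       then crash at a point we control instead of returning. */
    signal(SIGSEGV, SIG_DFL);
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGSEGV);
    sigprocmask(SIG_UNBLOCK, &set, NULL);
    raise(SIGSEGV);
    _exit(127); /* not expected to be reached */
}

/* Runs the scenario in a child process and returns the signal that
   terminated it, or -1 if the child was not killed by a signal. */
int targeted_crash_term_signal(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core); /* keep the demo from leaving core files */

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = report_then_targeted_crash;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGFPE, &sa, NULL);

        raise(SIGFPE); /* stand-in for the original error */
        _exit(0);      /* not expected to be reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

In this sketch the parent sees the child terminated by SIGSEGV rather than by the original SIGFPE, which matches the remark that the OS would report a different "crash" than the one that actually happened.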

Thanks,
David

@dholmes-ora
Member

From:

https://bugs.openjdk.java.net/browse/JDK-6227246

"Iimplemented Windows-only flag -XX+UseOSErrorReporting which allows us instead of running of our crash handler and dying, forward exception handling to the OS in case of actual crash. "

but there were issues with the integration of the fix:

"Some of the changes to this fix weren't integrated or were merged out by mistake."

and we ended up with a shared flag. I can see a comment in the original putback:

"Make UseOSErrorReporting platform independant so linux can use someday and because used from os independant code."

which is why this ended up not being Windows-only even though it only worked in a meaningful way on Windows.

Member

@dholmes-ora dholmes-ora left a comment

Changing review status to "Request changes".

@tstuefe
Member

tstuefe commented Oct 26, 2020

So my preferred approaches here would be:
  1. Make UseOSErrorReporting Windows only; or
  2. Make UseOSErrorReporting Windows and macOS only. Then on macOS do a targeted crash after report_and_die() returns.

I like (2). It is sure to preserve the stack of the crashing thread. Not perfect, but maybe it's close to what Gerard would like to see on macOS.

Only remark: this gets very close to what we do already, since os::abort() calls ::abort(), which raises SIGABRT... but according to Gerard abort() does not seem to get noticed by macOS crash handling. So artificially triggering a fault may be better.

..Thomas

@gerard-ziemski
Author

gerard-ziemski commented Oct 26, 2020

hi David,

Many thanks for the review and finding the background info on the history of this issue.

How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.

On Windows we forward the crash to the OS to handle it. But the fact that in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again does not mean it's not a meaningful way to forward the crash to the OS, if that's how the OS wants it. Please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggests that the way to let macOS handle the crash is to:

"unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."

That's how Apple suggests we do it for Mac.
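The quoted recipe (reset to SIG_DFL, then return) can be tried in isolation with a small C sketch. Everything here is an illustrative assumption rather than HotSpot code: the handler is a hypothetical stand-in for the VM's reporting, and the fault is provoked by reading a page mapped with `PROT_NONE`. The experiment runs in a forked child and reports which signal terminated it. Depending on the platform, a protection fault may be delivered as SIGSEGV or SIGBUS, so both are handled.

```c
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANONYMOUS under strict -std= modes */
#include <signal.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the VM's handler: report, restore SIG_DFL, return. */
static void report_then_default(int sig) {
    static const char msg[] = "crash report would be written here\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1); /* async-signal-safe */
    (void)ignored;
    signal(sig, SIG_DFL);
    /* Returning resumes the thread at the instruction that faulted. */
}

/* Runs the experiment in a child process and returns the signal that
   terminated it, or -1 if the child was not killed by a signal. */
int recrash_term_signal(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core); /* keep the demo from leaving core files */
        signal(SIGSEGV, report_then_default);
        signal(SIGBUS, report_then_default);

        /* An inaccessible page gives a well-defined faulting address. */
        volatile char *page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) _exit(2);

        char c = page[0];      /* faults; re-executed after the handler returns */
        _exit(c == 0 ? 0 : 1); /* reached only if the re-executed load succeeded */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

In this controlled case nothing changes between the first and second execution of the load, so the second fault hits the default disposition and the child dies from the same signal. The sketch says nothing about cases where memory or handler state changes in between.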

I can limit the scope of this fix to just macOS here, as I was planning for JDK-8237727, and for Linux simply disable the flag for now, leaving any more sophisticated fix for a later issue. I do think, however, that on Linux anything would be better than a 2-minute hang.

@mlbridge

mlbridge bot commented Oct 26, 2020

Mailing list message from David Holmes on hotspot-dev:

On 27/10/2020 1:35 am, Gerard Ziemski wrote:

hi David,

Many thanks for the review and finding the background info on the history of this issue.

How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.

No, there is a semantic underpinning as to what it means for there to be
OS error reporting on a given platform. Windows has a nicely defined
model. Other platforms not so nice. On macOS they really don't want apps
to attempt any kind of crash handling on their own. :)

On Windows we forward the crash to the OS to handle it, but just because in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, does not mean it's not a meaningful way to forward it to OS, if that's how the OS wants it - please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggest the way to let the macOS handle the crash is to:

"unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."

That's how Apple suggest we do it for Mac.

That is a blog by an Apple developer giving some very general advice,
and IMO lacking in some necessary detail. That quote above is in the
context of answering:

"Finally, there's the question of how to exit from your signal handler."

The suggestion to "then return" hits UB for the synchronous error
signals - a fact not mentioned in the blog entry. The assertion that:

"This will cause the crashed process to continue execution, crash again,
... "

is a naive oversimplification. If you just seg-faulted doing a read from
memory how can you continue execution? What does that mean when the read
yielded no value? Will you just continue with a random value? Will the
system try to re-execute the read and so crash again? Maybe it will
crash again, maybe it won't. Maybe it will do something in the meantime
that leads to totally unexpected behaviour (as Thomas previously
described). Hence my suggestion that if you are going to attempt this
path for macOS then you need to introduce the second crash so we know
exactly what will happen. Returning from the original signal handler is
not an option IMO.

I can limit the scope of this fix to just macOS here, like I was planning it for JDK-8237727 and worry about Linux in a different issue.

Yes please limit to macOS only. We should look at how to remove the flag
from platforms where it has no well-defined meaning.

Thanks,
David
-----

@gerard-ziemski gerard-ziemski marked this pull request as draft October 27, 2020 15:44
@openjdk openjdk bot removed the rfr Pull request is ready for review label Oct 27, 2020
@gerard-ziemski
Author

gerard-ziemski commented Oct 27, 2020

Mailing list message from David Holmes on hotspot-dev:

On 27/10/2020 1:35 am, Gerard Ziemski wrote:

hi David,
Many thanks for the review and finding the background info on the history of this issue.
How we do things when a user turns ON the "UseOSErrorReporting" flag is just an implementation detail.

No, there is a semantic underpinning as to what it means for there to be
OS error reporting on a given platform. Windows has a nicely defined
model. Other platforms not so nice. On macOS they really don't want apps
to attempt any kind of crash handling on their own. :)

On Windows we forward the crash to the OS to handle it, but just because in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, does not mean it's not a meaningful way to forward it to OS, if that's how the OS wants it - please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggest the way to let the macOS handle the crash is to:
"unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
That's how Apple suggest we do it for Mac.

That is a blog by an Apple developer giving some very general advice,
and IMO lacking in some necessary detail. That quote above is in the
context of answering:

"Finally, there's the question of how to exit from your signal handler."

The suggestion to "then return" hits UB for the synchronous error
signals - a fact not mentioned in the blog entry. The assertion that:

"This will cause the crashed process to continue execution, crash again,
... "

is a naive oversimplification. If you just seg-faulted doing a read from
memory how can you continue execution? What does that mean when the read
yielded no value? Will you just continue with a random value? Will the
system try to re-execute the read and so crash again? Maybe it will
crash again, maybe it won't. Maybe it will do something in the meantime
that leads to totally unexpected behaviour (as Thomas previously
described). Hence my suggestion that if you are going to attempt this
path for macOS then you need to introduce the second crash so we know
exactly what will happen.

But that will show up as a different crash and might be confusing.

Returning from the original signal handler is
not an option IMO.

I think our differences of opinion all hinge on what happens when code returns from its signal handler:

#1 Does it resume and actually redo the exact same instruction? (which this time may succeed?)
#2 Does it resume and raise the exact same signal? (exhibiting the exact same behavior as the original?)
#3 Does it resume past the instruction that originally caused the exception?

You and Thomas seem to believe that it's #3 (or is that #1?), whereas I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2.

I will continue this investigation in JDK-8237727

Here I will not be as ambitious and will simply fix the problem at hand, i.e. address the 2-minute hang by disabling the option for macOS and Linux.

…ash with UseOSErrorReporting"

This reverts commit f634064.
@openjdk

openjdk bot commented Oct 27, 2020

@gerard-ziemski This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8250637: UseOSErrorReporting times out (on Mac and Linux)

Reviewed-by: stuefe, dholmes

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 172 new commits pushed to the master branch:

  • 50357d1: 8254723: add diagnostic command to write Linux perf map file
  • f97ec35: 8255785: X11 libraries should not be required by configure for headless only
  • 184db64: 8255732: OpenJDK fails to build if $A is set to a value with spaces
  • c774741: 8255695: Some JVMTI calls in the jdwp debug agent are using FUNC_PTR instead of JVMTI_FUNC_PTR
  • bee864f: 8255766: Fix linux+arm64 build after 8254072
  • ceba2f8: 8255696: JDWP debug agent's canSuspendResumeThreadLists() should be removed
  • a250716: 8255694: memory leak in JDWP debug agent after calling JVMTI GetAllThreads
  • acb5f65: 8211958: Broken links in java.desktop files
  • bc6085b: 8255578: [JVMCI] be more careful about reflective reads of Class.componentType.
  • 05bcd67: 8255529: Remove unused methods from java.util.zip.ZipFile
  • ... and 162 more: https://git.openjdk.java.net/jdk/compare/4634dbef6d554b6f091dd7893e266682b267757f...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed hotspot hotspot-dev@openjdk.org labels Oct 27, 2020
@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org and removed ready Pull request is ready to be integrated labels Oct 27, 2020
@gerard-ziemski gerard-ziemski marked this pull request as ready for review October 27, 2020 17:24
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 27, 2020
@tstuefe
Member

tstuefe commented Oct 28, 2020

I think our differences of opinion all hinge on what happens when code returns from its signal handler:

#1 Does it resume and actually redo the exact same instruction? (which this time may succeed?)
#2 Does it resume and raise the exact same signal? (exhibits the exact same behavior as original?)
#3 Does it resume past the instruction that originally caused the exception?

You and Thomas seem to believe that it's #3 (or is that #1 ?), I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2.

No, not #3.

#2 is an interesting thought, but I don't think so. Were it so, our polling page mechanism would not work: we trigger a SEGV by accessing a poisoned page and, in signal handling, unpoison the page and return, which then re-executes the same load; since the page is now unpoisoned, no fault happens. This, btw, is an excellent example of a case where returning from a signal handler does not re-raise the same signal. It is on purpose in this case, but our point is that the same thing may happen accidentally.

I think what happens is that the register contents (the crash context) that were active when the thread got the first fault get reinstated after the signal handler returns, and we resume processing with the same state. So all registers are the same, including the pc. We would attempt to reload the instruction from the same address and re-execute it. But since the underlying memory could have changed in the meantime, there are conceivable scenarios where we may not crash a second time. These range from the location the pc points to having been invalid and now being valid (e.g. a bug in the JIT), to the instruction being a mov/store whose destination had been invalid and is now valid, and so on.
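The polling-page behaviour described above can be reproduced in a few lines of C. This is a minimal sketch under stated assumptions, not the HotSpot mechanism: the names `poll_page`, `on_fault` and `read_poisoned_page_once` are made up for illustration, and `mprotect` is called from the signal handler even though POSIX does not list it as async-signal-safe. The page starts out inaccessible, the first load faults, the handler makes the page readable and returns, and the re-executed load then succeeds.

```c
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANONYMOUS under strict -std= modes */
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count = 0;
static void *poll_page = NULL;      /* hypothetical stand-in for a polling page */
static size_t poll_page_size = 0;

static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    char *addr = (char *)info->si_addr;
    char *base = (char *)poll_page;
    if (addr >= base && addr < base + poll_page_size) {
        /* Expected fault: "unpoison" the page and return, so the same load
           is re-executed and now succeeds. */
        fault_count++;
        mprotect(poll_page, poll_page_size, PROT_READ);
        return;
    }
    /* Unrelated fault: fall back to the default action on the next attempt. */
    signal(sig, SIG_DFL);
}

/* Reads one int from a poisoned page; returns the value read, or -1 on setup failure. */
int read_poisoned_page_once(void) {
    poll_page_size = (size_t)sysconf(_SC_PAGESIZE);
    poll_page = mmap(NULL, poll_page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (poll_page == MAP_FAILED) return -1;

    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus); /* some platforms report protection faults as SIGBUS */

    volatile int *p = (volatile int *)poll_page;
    int value = *p; /* faults once, then succeeds after the handler returns */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(poll_page, poll_page_size);
    return value;
}
```

Because the mapping is anonymous, the re-executed load reads a zero-filled page. The same mechanism is why a handler that returns cannot be assumed to crash again: it depends on whether the state the instruction relies on is still the same.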

I will continue this investigation in JDK-8237727

Here I will not be as ambitious and I will simply fix the problem at hand: i.e. address the 2 minutes hang by disabling the option for macOS and Linux.

This is reasonable, thank you.

Member

@tstuefe tstuefe left a comment

Could you please do a small cleanup:

UseOSErrorReporting is defined as a pd flag, with definitions in all OS-dependent globals.. files. This is unnecessary, since the default value is always false. We could remove the pd definitions and just make this a normal flag in globals.hpp.

(It would be cleaner to move it to globals_windows.hpp, but this would probably need a CSR since it's a product flag.)

@mlbridge

mlbridge bot commented Oct 28, 2020

Mailing list message from David Holmes on hotspot-dev:

<trimming>

On 28/10/2020 2:08 am, Gerard Ziemski wrote:

On Mon, 26 Oct 2020 15:32:49 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

On Windows we forward the crash to the OS to handle it, but just because in this fix we "just" turn off our signal handlers, reset them to SIG_DFL and return to let it crash again, does not mean it's not a meaningful way to forward it to OS, if that's how the OS wants it - please see this technical note from Apple https://developer.apple.com/forums/thread/113742 where Apple suggest the way to let the macOS handle the crash is to:
"unregister your signal handler (set it to SIG_DFL) and then return. This will cause the crashed process to continue execution, crash again, and generate a crash report via the Apple crash reporter."
That's how Apple suggest we do it for Mac.

That is a blog by an Apple developer giving some very general advice,
and IMO lacking in some necessary detail. That quote above is in the
context of answering:

"Finally, there's the question of how to exit from your signal handler."

The suggestion to "then return" hits UB for the synchronous error
signals - a fact not mentioned in the blog entry. The assertion that:

"This will cause the crashed process to continue execution, crash again,
... "

is a naive oversimplification. If you just seg-faulted doing a read from
memory how can you continue execution?

My understanding is that we would not continue execution past the seg-faulting instruction, but instead resume at it (with the same memory/register contents, unless our signal handler modified any of that), which would cause the same signal to be raised at the exact same frame, resulting in the exact same behavior. That's what my experimentation shows, and what I understood Apple's recommendation to be based on.

What does that mean when the read
yielded no value? Will you just continue with a random value? Will the
system try to re-execute the read and so crash again? Maybe it will
crash again, maybe it won't. Maybe it will do something in the meantime
that leads to totally unexpected behaviour (as Thomas previously
described). Hence my suggestion that if you are going to attempt this
path for macOS then you need to introduce the second crash so we know
exactly what will happen.

But that will show up as a different crash and might be confusing.

Returning from the original signal handler is
not an option IMO.

I think our differences of opinion all hinge on what happens when code returns from its signal handler:

#1 Does it resume and actually redo the exact same instruction? (which this time may succeed?)
#2 Does it resume and raise the exact same signal? (exhibits the exact same behavior as original?)

You and Thomas seem to believe that it's #1, I thought (based on https://developer.apple.com/forums/thread/113742 ) that it was more like #2.

My position was based purely on the POSIX specification that returning
from a signal handler, for specific signals, leads to undefined
behaviour. I had overlooked (thanks Thomas for flagging it!) the fact
that we already utilise returning normally from signal handlers for a
range of things - safepoint/handshake polls; implicit null pointer checks.

So I was looking for something more definitive from macOS that things
would work as you suggest. And the sigaction manpage does seem to
suggest that:

"The call to the handler is arranged so that if the signal handling
routine returns normally the process will resume execution in the
context from before the signal's delivery."

So as Thomas discusses the issue is not whether #1 or #2 is correct, as
they both are, it just depends on the exact context of the original
signal whether re-executing the failed instruction will fail again, or
whether it could succeed. While I can imagine general scenarios where
the instruction could now succeed, I don't know how realistic they are
in the JVM context.

I will continue this investigation in JDK-8237727

Here I will not be as ambitious and I will simply fix the problem at hand: i.e. address the 2 minutes hang by disabling the option for macOS and Linux.

Okay.

Thanks,
David
-----

@mlbridge

mlbridge bot commented Oct 28, 2020

Mailing list message from David Holmes on hotspot-dev:

On 28/10/2020 5:07 pm, Thomas Stuefe wrote:

On Tue, 27 Oct 2020 17:06:32 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

Gerard Ziemski has updated the pull request incrementally with two additional commits since the last revision:

- Only use UseOsErrorReporting on Windows
- Revert "reset signal handlers to their system defaults if handling crash with UseOSErrorReporting"

This reverts commit f6340643974f3e0cc3ab95fbbba51b23b8d9af31.

Could you please do a small cleanup:

UseOSErrorReporting is defined as a pd flag, with definitions in all os-dependent globals.. files. Unnecessarily, since the default value is always false. We could remove the pd definitions and just make this a normal flag in globals.hpp.

(Would be cleaner to move it to globals_windows.hpp but this would probably need a CSR since it's a product flag)

Any behavioural change in the existing product flag also requires a CSR
request so we may as well make this truly windows-only, then add in
macOS later.

Thanks,
David

@gerard-ziemski
Author

gerard-ziemski commented Oct 28, 2020

Thank you Thomas and David, I'm learning a lot from your reviews!

Can you please take a look at the current fix in the Webrevs section?

I removed the flag from other platforms. This will require CSR approval.

Member

@tstuefe tstuefe left a comment

Hi Gerard,

The patch is fine in its current form to me (including your last push). Whether or not to do a CSR I leave up to you and David.

As my final remark to our "return from signal handler" discussion: I'd probably be more chill if this were a simple application. Like vi :) But we do so many unusual things (including generating, then running our own code) and the VM is the base for such a large software stack that I'd rather be careful.

All my remaining remarks are nits. Take what you like, ignore the rest.

Thank you,

Thomas

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2020
Member

@dholmes-ora dholmes-ora left a comment

Looks fine to me, but will require a trivial CSR request.

@dholmes-ora
Member

/csr needed

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Oct 29, 2020
@openjdk

openjdk bot commented Oct 29, 2020

@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@gerard-ziemski please create a CSR request and add link to it in JDK-8250637. This pull request cannot be integrated until the CSR request is approved.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 29, 2020
@tstuefe
Member

tstuefe commented Oct 30, 2020

Looks all still good to me. Thank you for doing this!

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed csr Pull request needs approved CSR before integration labels Nov 3, 2020
@gerard-dl

gerard-dl commented Nov 5, 2020

Hi @gerard-dl, thanks for making a comment in an OpenJDK project!

All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user gerard-dl" for the summary.

If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.

Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use.

@openjdk

openjdk bot commented Nov 5, 2020

@gerard-dl Only the author (@gerard-ziemski) is allowed to issue the integrate command.

@gerard-ziemski
Author

/integrate

@openjdk openjdk bot closed this Nov 5, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 5, 2020
@openjdk

openjdk bot commented Nov 5, 2020

@gerard-ziemski Since your change was applied there have been 240 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

Pushed as commit ba2ff3a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk-notifier bot referenced this pull request Nov 5, 2020
@gerard-ziemski gerard-ziemski deleted the JDK-8250637 branch November 18, 2020 16:04
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated

4 participants