-
Notifications
You must be signed in to change notification settings - Fork 6k
JDK-8274320: os::fork_and_exec() should be using posix_spawn #5698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JDK-8274320: os::fork_and_exec() should be using posix_spawn #5698
Conversation
👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into |
103ddd6
to
859c893
Compare
859c893
to
42e4120
Compare
/label hotspot-runtime |
/label remove hotspot |
/label hotspot-runtime |
@tstuefe this pull request can not be integrated into git checkout JDK-8274320-os-fork-and-exec-should-be-using-posix-spawn
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
@tstuefe |
@tstuefe |
@tstuefe The |
Hi Thomas, There is a crucial difference between the requirements of os::fork_and_exec versus the Java Runtime.exec call - in the VM we need to use an async-signal-safe function where possible. fork() is async-signal-safe but posix_spawn is not. Cheers, |
Hi David, fork() is not async-signal safe either. Since fork() can cause the execution of at-fork handlers. https://man7.org/linux/man-pages/man7/signal-safety.7.html
Therefore I think we don't lose anything by moving to posix_spawn(). But we gain reliability in high-footprint scenarios. For me, this is an important point. Analysis options such as OnError should be very reliable. They are often used in support situations where time is short and customer patience is limited. Failing analysis causes more iterations or may make analysis even impossible. Cheers, Thomas |
Mailing list message from David Holmes on hotspot-runtime-dev: On 6/10/2021 4:52 pm, Thomas Stuefe wrote:
Sorry but I have to disagree. fork() is async-signal-safe, but if an David |
Hi David, I looked a bit closer, since I wanted to figure out whether the async-unsafeness of calls in error reporting actually matters. Because the problem is that these functions are not re-entrant, right? But we already have an intricate mechanism in place to guard against re-entrance errors, with our secondary signal handling. So the first time we enter error handling, we mark this thread as the reporting one and install the secondary signal handler. All subsequent invocations open a new frame, and we skip the error reporting steps that caused the last error. So for re-entrance problems like this:
we are covered: in VMError::report_and_die(), most steps are guarded against re-execution by the step-counting-logic inside (Note that this mechanism seems not well understood and bitrots: recent addition like the Wrt to
So it won't be re-executed if a secondary signal happens inside error handling itself. The only caveat here is that it does not guard us against problems if the non-reentrant function we call in signal handling is already atop of us on the stack:
But for this to happen, the signal must originate from posix_spawn itself, and be a synchronous error signal which causes us to invoke error handling. So, posix_spawn() needs to be crashy in the first place. I'd argue that the chances for this to happen are very slim, unless the libc itself is broken. fork() is async signal safe only if no atfork handlers are used. We don't know that since we share the process with other entities, including system libraries themselves. I even dimly remember reading that the glibc itself using atfork handlers for internal cleanup, but cannot come up with a prove. But using atfork handlers is a common technique used by libraries to close mutexes on fork. So the current fork() never has been completely async signal safe either. posix_spawn has the charm that it allows us to circumvent a very common problem with forking in low-memory situations. Like vfork(), but with less risk involed. We analyzed this ([1]) when @dmlloyd proposed to exchange vfork against posix_spawn in Runtime.exec(). He convinced me that this is a good idea. posix_spawn(), at least on glibc and muslc, uses So, we in exchange for a theoretical problem which I think is very narrow, we'd get reliability in common situations (VM with high footprint). I think that tradeoff makes sense. Cheers, Thomas [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-October/056158.html |
Mailing list message from David Holmes on hotspot-runtime-dev: Hi Thomas, On 7/10/2021 4:25 pm, Thomas Stuefe wrote:
It's not the reentrancy problem I'm concerned with but simply the My worry is that we may hit cases where posix_spawn just causes the VM I know we already take many risks in the error reporting code with I understand the desire to have a more reliable forking mechanism with I'm not aware of our existing use of fork() being flagged as causing I'm happy to hear other opinions on this. Cheers, |
Hi David, I see what you mean. Okay, hanging would be bad. Lets wait for other opinions. @fweimer do you have an opinion about whether posix_spawn is more async-unsafe than fork, at least for the glibc? ..Thomas |
When it comes to Linux/glibc, Linux kernel has Copy-On-Write feature, doesn't it? So this change can't improve Linux further. On the other side, Darwin can take benefit from it by changing from fork() to posix_spawn(). If so, why not just use ifdef APPLE posix_spwarn() on Darwin? VMError has already used the macro for the vfork issue. |
Hi Xin, thanks for giving this an eye.
Not completely true, since AFAIU you need to at least copy the page tables. posix_spawn (if it uses clone(VFORK) under the hood) should not need to copy the page tables.
My aim was to simplify the code, not make it more complex. If the argument is "posix_spawn is dangerous during error reporting" we should not use it at all. If that argument does not hold, we can use it on all platforms. Thanks, Thomas |
Mailing list message from Florian Weimer on hotspot-runtime-dev: * David Holmes:
Sorry, I didn't see the thread until now. The glibc implementation of fork is not async-signal-safe even if the Using the fork function in signal handlers The current implementation of posix_spawn in glibc is async-signal-safe, The glibc implementation of posix_spawn has changed substantially over Other functions in the posix_spawn corner (for maintaining file actions) When used carefully, vfork can be made async-signal-safe. But you VMError::report_and_die() seems to call fopen for the replay data file. Thanks, |
Mailing list message from David Holmes on hotspot-runtime-dev: Hi Florian, On 22/10/2021 5:34 am, Florian Weimer wrote:
I'm surprised then that we have not encountered any such reported https://sourceware.org/bugzilla/show_bug.cgi?id=4737 especially the report that fork() is no longer required to be Cheers,
|
Mailing list message from Thomas Stüfe on hotspot-runtime-dev: Hi Florian, David, @Florian: thanks a lot for digging this up! The oldest glibc release we need is difficult to estimate, it depends on --- So IUUC we could deadlock today with fork() too, if we crash inside malloc. The worst thing which can happen is that we hang. David is right, that As a future improvement, it may make sense to extend the Cheers, Thomas On Fri, Oct 22, 2021 at 9:19 AM David Holmes <david.holmes at oracle.com>
|
Mailing list message from Florian Weimer on hotspot-runtime-dev: * Thomas St?fe:
Red Hat Enterprise Linux 7 is based on glibc 2.17. Its posix_spawn Adoptium builds still target el7 and include OpenJDK 17. I don't know
There's one caveat: posix_spawn is not *guaranteed* to be Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks technically good to me, but I can not judge if we can rely on async-signal-safety of posix_spawn. I guess we have to wait until glibc folks have discussed that.
@tstuefe This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 432 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Thank you Martin. @dholmes-ora : do you have an idea how we could resolve this deadlock? Even with input from the glibc folks, a number of questions will remain unanswered (eg what about the other libcs). Still, I think my patch would improve the situation, which is murky UB now and will remain somewhat murky. But I also think this is not worth many more brain cycles from either one of us, so if you prefer I can withdraw this PR. I am fine either way. Cheers, Thomas |
Mailing list message from David Holmes on hotspot-runtime-dev: Hi Thomas, On 29/10/2021 8:47 pm, Thomas Stuefe wrote:
I've withdrawn my objection to this patch (see response to Florian). Cheers, |
Thank you David! Would you then approve it or should I look for a second Reviewer? |
Hi Thomas, I've approved it so we can start baking it. I hope it will get sufficient test coverage. Thanks, |
Thanks @dholmes-ora. We should extend the OnError tests somewhat. /integrate |
Going to push as commit 158831e.
Your commit was automatically rebased without conflicts. |
Hi, may I have reviews for this small patch please?
os::fork_and_exec()
, used in the hotspot to spawn child programs (scripts etc) in error situations, should be usingposix_spawn()
.ATM it uses either
fork()
orvfork()
.vfork()
got deprecated on MacOS and we get build errors (JDK-8274293) - even though in this case it would be completely fine to use. This leaves us withfork()
for MacOS, which has the known problems with large-footprint-parents. This matters here especially since we also use os::fork_and_exec to implement-XX:OnError
for OOM situations.We already use posix_spawn() as default for Runtime.exec() since JDK 15, and it is available on all our Unices. We also should use it here.
I kept the name of the function (fork_and_exec) since people know it, even though it's more incorrect now than before.
Tests:
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5698/head:pull/5698
$ git checkout pull/5698
Update a local copy of the PR:
$ git checkout pull/5698
$ git pull https://git.openjdk.java.net/jdk pull/5698/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 5698
View PR using the GUI difftool:
$ git pr show -t 5698
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5698.diff