Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First SIGCHLD handler (#5) #9

Closed
wants to merge 1 commit into from
Closed

First SIGCHLD handler (#5) #9

wants to merge 1 commit into from

Conversation

gaborcsardi
Copy link
Contributor

Minimalistic, but it already gets rid of the zombie processes.

Currently it interferes with parallel's fork clusters, because neither packages keep the other's signal handler. We would need to work out some cooperative solution, which would need to go in base R, to avoid this. E.g. storing a list of handlers for the various signals and add API to R to install/uninstall them.

You can merge this if you want, and then we can discuss improvements, especially about storing the exit status, so that one can query it from R.

Minimalistic, but it already gets rid of the zombie processes.
@gaborcsardi gaborcsardi changed the title First SIGCHLD handler First SIGCHLD handler (#5) Mar 10, 2017
@gaborcsardi
Copy link
Contributor Author

Forgot to refer to #5.

@jeroen
Copy link
Owner

jeroen commented Mar 11, 2017

Thanks. I'm not too experience with implementing signal handling. Some questions:

  • What if the encapsulating system already uses signal handlers? For example RApache or the CRAN check machinery use signal handlers to detect when R has crashed. Will overriding the signal handler in R affect the behavior of R's parent system?
  • AFAIK, Linux propagates each signal to the entire process group. So the process might receive signals from another process that is not it's child? Is it harmless to call waitpid() in this case?

@s-u

@gaborcsardi
Copy link
Contributor Author

gaborcsardi commented Mar 11, 2017

What if the encapsulating system already uses signal handlers? For example RApache or the CRAN check machinery use signal handlers to detect when R has crashed. Will overriding the signal handler in R affect the behavior of R's parent system?

Yes, this overwrites the other SIGCHLD signal handlers that were created in the same process. Rapache does not bother with SIGCHLD, apparently: https://github.com/jeffreyhorner/rapache/search?utf8=%E2%9C%93&q=SIGCHLD

It will interfere with parallel, though. Maybe we can work around that by keeping the signal handler installed by parallel. Maybe this is possible. But even if it is not possible, an interference with parallel's fork clusters is still better than the current zombie army.

AFAIK, Linux propagates each signal to the entire process group. So the process might receive signals from another process that is not it's child? Is it harmless to call waitpid() in this case?

I am not sure what other processes might be in the same process group. But anyway, yes, waitpid is harmless. The WNOHANG ensures that if it does not happen to find a child process that has just exited, then it will return immediately.

Actually, I just realized that I can implement all this in another package, it does not have to be in sys. So this is not a blocker for me. OTOH, to reliably get the exit status on windows requires modifications in sys I think, so it would be still better to have the whole thing in sys.

@jeroen
Copy link
Owner

jeroen commented Mar 11, 2017

RApache wraps apache2 prefork() mechanism for lanching R workers, which might use SIGCHLD?

Do you understand why the PR is failing unit tests? It seems to affect the error handling in R? https://travis-ci.org/jeroenooms/sys/builds/209904805

It would be nice to get a clean solution in sys. But the current solution does not allow you to get the exit status anymore, correct? I suppose sys_sigchld_callback() would need to store wstat in some global table that maps each pid to the wstat exit code, so that we check the process status and exit code from R?

@jeroen
Copy link
Owner

jeroen commented Mar 11, 2017

Ah I see why it fails now. Currently exec_wait uses waitpid() internally to get the exit code for the process, but you're already doing that in the SIGCHLD handler now. So after that we can no longer get the exit code manually.

@jeroen
Copy link
Owner

jeroen commented Mar 11, 2017

Also if the purpose is only to prevent zombies at this point, perhaps we can use SA_NOCLDWAIT?

@jeroen
Copy link
Owner

jeroen commented Mar 11, 2017

Merged via 28cb317. Still not 100% sure though this is final, seems to have a lot of side effects...

@jeroen jeroen closed this Mar 11, 2017
@gaborcsardi
Copy link
Contributor Author

What side effects?

@jeroen
Copy link
Owner

jeroen commented Mar 12, 2017

If you add a printf to the signal handler, you see that it gets triggered a lot for processes that were not launched by sys. Especially rstudio seems to launch lot of processes (git, pandoc, etc). Rstudio will not be able to get the exit code for these procs because our handler cleans them as soon as the child dies.

Also this excerpt from NEWS.rd makes me worried that we might be breaking other packages depend on a SIGCHLD.

  \R{} is now trying harder to not cleanup child processes
  that were not spawned by \code{mcparallel()} on platforms that
  provide information about the source process of the \code{SIGCHLD}
  signal. This allows 3rd party libraries to manage the exit status
  of children that they spawn without \R{} interfering.

Would there be some trick to mark a process when we launch it in sys, and only waitpid() on it when it was launched by sys?

@jeroen
Copy link
Owner

jeroen commented Mar 12, 2017

Moving this to a new issue: #10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants