stdin returns EAGAIN after system("mpiexec ...") #1782

Closed
mpichbot opened this issue Oct 14, 2016 · 4 comments · Fixed by #2755

mpichbot commented Oct 14, 2016

Originally by goodell on 2013-01-16 13:40:26 -0600


Originally reported on stackoverflow.com: http://stackoverflow.com/questions/14167620/stdin-seems-to-be-broken-after-call-to-system-invoking-mpiexec

Basically, a read syscall on fd 0 is returning EAGAIN for some reason after a parent process runs mpiexec in a subshell. For higher-level libraries like std::getline or libc's getline, this usually translates into something that looks like stdin has been closed.

I poked at it a while and couldn't figure out what's going on. The strace does not show mpiexec doing anything surprising. For example, we are not explicitly setting fd 0 to O_NONBLOCK AFAICS.

For debugging, I intentionally "broke" the proxy so that mpiexec could not exec it and the problem still occurs. The problem also still occurs whether or not the proxy is launched locally with fork+exec or with SSH. So whatever is inducing the problem lives entirely in the mpiexec executable.
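The symptom described above can be checked directly from the parent process. The following is a minimal sketch, not code from the original report: `fd_is_nonblocking` and `read_one_byte` are hypothetical helper names, and the only system calls used are `fcntl(F_GETFL)` and `read`. It reports whether a descriptor currently has `O_NONBLOCK` set and makes the `EAGAIN` case visible instead of letting it look like end-of-file.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if O_NONBLOCK is set on fd,
 * 0 if it is clear, and -1 if fcntl() itself fails. */
static int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}

/* Hypothetical helper: try to read a single byte from fd.
 * Returns 1 if a byte was read, 0 on end-of-file, and -1 on error.
 * On error errno is left as set by read(), so the caller can tell
 * EAGAIN (no data on a non-blocking descriptor) apart from a real EOF. */
static int read_one_byte(int fd, char *out)
{
    ssize_t n;
    do {
        n = read(fd, out, 1);
    } while (n == -1 && errno == EINTR); /* retry if interrupted by a signal */
    return (int)n;
}
```

Calling `fd_is_nonblocking(STDIN_FILENO)` before and after the `system("mpiexec ...")` call would show whether the flag changed across the call, and a `-1` from `read_one_byte(STDIN_FILENO, &c)` with `errno == EAGAIN` would match the behavior reported here.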

@mpichbot mpichbot self-assigned this Oct 14, 2016
@mpichbot mpichbot added this to the mpich-3.3 milestone Oct 14, 2016
@mpichbot

Originally by buntinas on 2013-01-16 13:48:10 -0600


May be related to #1622???
-d

@mpichbot mpichbot modified the milestones: mpich-3.1.1, mpich-3.3, mpich-3.2 Oct 14, 2016
@raffenet raffenet assigned raffenet and unassigned mpichbot Oct 17, 2016
pkgw commented May 3, 2017

My colleagues and I have run into this problem and have traced it down a bit.

MPICH's mpirun does indeed seem to have a bug where, in common modes, it sets the O_NONBLOCK flag on its own stdout/stderr file descriptors. These settings linger after mpirun exits. However, the bug does not generally surface because bash has code that, if necessary, turns off the non-blocking flag on its attached streams after each command that it runs.
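A note on why a flag set on stdout/stderr can show up on stdin: `O_NONBLOCK` is a file status flag that lives on the open file description, not on the individual descriptor. Descriptors created with `dup` or inherited across `fork` share it, and in an interactive session fds 0, 1 and 2 typically all refer to the same terminal open. The sketch below is an illustration under that assumption, with hypothetical helper names `set_nonblocking` and `clear_nonblocking`. The second helper is a caller-side workaround in the spirit of what bash is described as doing above; it is not bash's actual code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: turn O_NONBLOCK on for fd.
 * Returns 0 on success, -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: turn O_NONBLOCK off for fd if it is set.
 * Because the flag belongs to the open file description, this also
 * clears it for every descriptor that shares that description.
 * Returns 0 on success (including "already clear"), -1 on failure. */
static int clear_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if ((flags & O_NONBLOCK) == 0)
        return 0; /* nothing to do */
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1 ? -1 : 0;
}
```

A program affected by this issue could call `clear_nonblocking(STDIN_FILENO)` right after `system("mpiexec ...")` returns and before reading from stdin again.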

The code that's directly responsible in mpirun is in alloc_fwd_hash in src/pm/hydra/utils/sock.c. It's triggered by code in places like src/pm/hydra/tools/bootstrap/external/external_common_launch.c in the HYDT_bscd_common_launch_procs function, where a demuxer registration step installs the callback HYDT_bscu_stdio_cb, which eventually calls alloc_fwd_hash. A couple of other similar pieces of code also set up the HYDT_bscu_stdio_cb callback and, by my reading of the code, should trigger the same issue. I do not know enough about MPICH to say which piece is at fault, but I do believe that it is inappropriate for MPICH to set TTY file descriptors to nonblocking mode.
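For illustration only, one general way a launcher can use non-blocking I/O on inherited descriptors without leaking the setting is to record the original flags and put them back before exiting. The sketch below is hypothetical: `struct saved_fd_flags`, `save_and_set_nonblocking` and `restore_fd_flags` are made-up names, not Hydra functions, and this is not claimed to be what alloc_fwd_hash does or how the linked fix works.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical record of a descriptor's original file status flags. */
struct saved_fd_flags {
    int fd;
    int flags; /* value from F_GETFL, or -1 if it could not be read */
};

/* Remember the current flags of fd in *saved, then enable O_NONBLOCK.
 * Returns 0 on success, -1 on failure. */
static int save_and_set_nonblocking(int fd, struct saved_fd_flags *saved)
{
    int flags = fcntl(fd, F_GETFL);
    saved->fd = fd;
    saved->flags = flags;
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Put back the flags recorded by save_and_set_nonblocking().
 * Returns 0 on success, -1 if nothing was saved or fcntl() fails. */
static int restore_fd_flags(const struct saved_fd_flags *saved)
{
    if (saved->flags == -1)
        return -1;
    return fcntl(saved->fd, F_SETFL, saved->flags) == -1 ? -1 : 0;
}
```

The restore step would need to run on every exit path (for example from a cleanup routine), since the flag otherwise stays on the terminal's open file description after the process is gone.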

You can trigger this bug by running mpirun on OS X using the Homebrew build of bash. Homebrew's bash is version 4.4, which, for reasons not entirely clear to me, does not seem to activate the piece of code that clears the O_NONBLOCK flag. The default version of bash on OS X is version 3, which does clear the flag, preventing the bug from surfacing. The MacPorts build of bash 4 also surfaces the bug.

pkgw commented May 3, 2017

Also, from what I've seen, it seems as if the flag only gets turned on when the mpirun process exits, not at startup.

CC @guillochon

@raffenet

@pavanbalaji I am able to reproduce this bad behavior with bash 4.4 on OS X as suggested. I think we should consider #2755 for inclusion in 3.3 and 3.2.1.
