[openmpi]: Fix wait_mpi_ready() hangs when using OpenMPI 2.1.3 #773
mpiexec in Open MPI v2.1.x doesn't exit properly when it can't resolve the hostnames of workers. I didn't find any workaround to make mpiexec exit properly except for setting MPIEXEC_TIMEOUT.
/lgtm /approve
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jiezhang
/lgtm
/test kubeflow-presubmit
What is the problem
We mainly use Open MPI 2.1.3 because Open MPI 3.0.1 contains a bug in the GPU Direct feature. Unfortunately, mpiexec in Open MPI v2.1.3 doesn't exit properly when it can't resolve the hostnames of worker pods. Consequently, the wait_mpi_ready() loop in init.sh hangs forever.
How the PR fixes it
The only way I found to stop the hung mpiexec was to set the MPIEXEC_TIMEOUT environment variable, so I set it only inside the wait_mpi_ready() loop. I know it is a dirty hack, though.
ref: https://www.open-mpi.org/doc/v2.1/man1/mpiexec.1.php#sect4
I would be happy if someone knows another workaround for it 🙇
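For readers who have not opened init.sh, the idea can be sketched roughly as follows. This is not the code from the PR: the function arguments, the default values, the probe command (hostname), and the mpiexec flags chosen here are assumptions made for illustration. The only part taken from the description is that MPIEXEC_TIMEOUT is set for the readiness probe alone, so a probe that cannot resolve worker hostnames aborts after the timeout and the loop can retry.

```shell
#!/bin/sh
# Illustrative sketch only; argument names and defaults are hypothetical,
# not taken from the PR's init.sh.

wait_mpi_ready() {
  hostfile="$1"            # path to the worker hostfile (assumed argument)
  timeout_sec="${2:-10}"   # seconds before mpiexec aborts a probe
  max_tries="${3:-30}"     # give up after this many failed probes
  sleep_sec="${4:-1}"      # pause between probes
  tries=0

  # MPIEXEC_TIMEOUT is passed as a per-command assignment, so it applies
  # only to this probe and does not leak into the later training launch.
  # --allow-run-as-root is included on the assumption that the container
  # runs as root.
  until MPIEXEC_TIMEOUT="$timeout_sec" \
        mpiexec --allow-run-as-root --hostfile "$hostfile" hostname \
        >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "workers not ready after $tries attempts" >&2
      return 1
    fi
    sleep "$sleep_sec"
  done
}
```

A call such as `wait_mpi_ready /path/to/hostfile 10 30 1` would probe up to 30 times with a 10-second cap per probe. Scoping the variable to the probe means the timeout never applies to the actual job.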
How to Reproduce
Build the Docker image (Dockerfile). This was published as everpeace/kubeflow-openmpi-21x-test:latest, so you don't need to build it yourself. Please check what the image contains and proceed to the next step.
Run on 18b670dc3883ff733f840b25e3c65c6db6483637, which is the current master branch head.
Once the master pod reaches Running, you can see that mpiexec hangs.