What does "server did not get guid" mean? #8257
Looking (briefly) at horovod/horovod#2477: if all your VMs/containers/nodes/whatevers have the same IP address, that's going to be a problem. Open MPI relies on each peer having a different IP address. For example, it looks like Open MPI may have tried to open a connection to IP address X from IP address X, and therefore it connected to itself (and then things went downhill from there). If you can give each VM/container/node/whatever a unique alias IP address and then use that set of IP addresses for Open MPI, that might work better. You can tell Open MPI to exclude the everyone-has-the-same-172.28.0.2 IP addresses via:
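(The exact command was not captured in this copy of the thread; a sketch of what the exclusion might look like, where the hostfile and application names are illustrative:)

```shell
# Exclude loopback and the shared 172.28.0.0/16 addresses from both the
# TCP BTL and the runtime's out-of-band channel. The subnet comes from
# the 172.28.0.2 address mentioned in the thread; adjust for your setup.
mpirun --mca btl_tcp_if_exclude lo,172.28.0.0/16 \
       --mca oob_tcp_if_exclude lo,172.28.0.0/16 \
       -np 4 --hostfile hosts ./my_mpi_app
```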
Or even just tell Open MPI exactly which subnet to use (the "include" and "exclude" directives are mutually exclusive):
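(Again, the original command was stripped from this copy; a sketch, where 10.10.0.0/16 is an illustrative subnet and the hostfile/application names are hypothetical:)

```shell
# Name the subnet to use explicitly instead of excluding the bad one.
# The "include" and "exclude" MCA parameters are mutually exclusive:
# set one or the other, never both.
mpirun --mca btl_tcp_if_include 10.10.0.0/16 \
       --mca oob_tcp_if_include 10.10.0.0/16 \
       -np 4 --hostfile hosts ./my_mpi_app
```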
@jsquyres Thanks for the answer! Unfortunately, in this case I don't have permission to create aliases (RTNETLINK answers: Operation not permitted). So is there any way to force Open MPI to use the hostname as the identifier of a node?
Open MPI can and does use network names, but it always resolves them to IP addresses first, and then uses the resulting IP address as the unique network address of that peer. Sorry.
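That resolution step can be observed with standard tools. If both workers resolve to the same address, Open MPI cannot tell them apart (the hostnames below are illustrative):

```shell
# Resolve each peer's name the same way Open MPI effectively does.
# Behind the NAT described in this issue, both lines would print the
# same shared IP address, which is the root of the problem.
getent hosts worker1
getent hosts worker2
```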
I have an idea: I can build Open MPI from source with the host IP address used in that infrastructure hardcoded (separately for each node). That might work, because the same ssh aliases exist on all nodes. So where can I find something like the host address definition in the Open MPI code?
I don't know if I fully understand what you mean. You can certainly make a change like that. But even with that, I don't know if it will be enough. Open MPI uses the IP address as a unique identifier for the peer, to know that it has contacted a) the correct peer, and b) on the interface that it expected. Meaning: even if you write a per-host hardcoded address, that may not be sufficient. Generally speaking, the TCP BTL doesn't really handle the case where a peer's interface has a different public vs. private IP address. It is much more common for HPC environments to have shared IP address spaces.
I too ran into this bizarre error: "WARNING: Open MPI accepted a TCP connection from what appears to be another Open MPI process but cannot find a corresponding process". After days of googling and lots of back and forth with the vendor, I found the root cause. From https://www.mail-archive.com/users@lists.open-mpi.org/msg34182.html:

> That typically occurs when some nodes have multiple interfaces, and […]

There was a virbr0 network on captainmarvel03 and captainmarvel01 with the same IP address. Per https://forums.centos.org/viewtopic.php?t=61634, I disabled and stopped the libvirtd service and rebooted. Now everything works like before. The problem had appeared after I installed the GNOME desktop on captainmarvel03.
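A quick way to check for this kind of clash, assuming systemd and libvirt's default virbr0 naming (run on each node and compare):

```shell
# libvirt's default NAT network assigns every host the same
# 192.168.122.1 on virbr0, so two nodes can appear identical to
# Open MPI even though their "real" interfaces differ.
ip addr show virbr0

# If the virtual network is not needed, disable the libvirt daemon
# (requires root; verify nothing depends on it first), then reboot.
sudo systemctl disable --now libvirtd
```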
Hello!
This issue is just a question: what does "server did not get guid" mean in this error list? https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/tcp/help-mpi-btl-tcp.txt#L116 I got it when I tried to run `mpirun` from the master node with two workers (which have the same IP address on their eth0 interface). In my case the workers are behind NAT, and they connect to the master node with an `ssh -R` tunnel (the master reaches the workers over ssh by aliases).