
What "server did not get guid" means? #8257

Closed
egorgam opened this issue Nov 25, 2020 · 6 comments

egorgam commented Nov 25, 2020

Hello!

This issue is just a question: what does "server did not get guid" mean in this error list? https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/tcp/help-mpi-btl-tcp.txt#L116 I got it when trying to run mpirun from the master node with two workers (which have the same IP address on their eth0 interface). In my case the workers are behind NAT and are connected to the master node via an ssh -R tunnel (the master reaches the workers over ssh by aliases).

jsquyres (Member) commented

Looking (briefly) at horovod/horovod#2477, if all your VMs/containers/nodes/whatevers have the same IP address, that's going to be a problem. Open MPI relies on having different IP addresses -- e.g., it looks like Open MPI may have tried to open a connection to IP address X from IP address X, and therefore it connected to itself (and then things went downhill from there).

If you can provide each VM/container/node/whatever with a unique alias IP address and then use that set of IP addresses for Open MPI, that might work better. You can tell Open MPI to exclude the everyone-has-the-same-172.28.0.2 IP addresses via:

$ mpirun --mca btl_tcp_if_exclude 172.28.0.2/32 ...

Or even just tell Open MPI exactly which subnet to use (the "include" and "exclude" directives are mutually exclusive):

# Assuming you make unique IP aliases in the 10.1.2.0/24 range somewhere.
$ mpirun --mca btl_tcp_if_include 10.1.2.0/24 ...
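
As a side note, a quick way to check which addresses each node actually exposes, and how the TCP BTL's interface-selection parameters are currently set (the exact ompi_info flags below assume a reasonably recent Open MPI):

# List the IPv4 addresses configured on this node (Linux).
$ ip -4 addr show

# Show the TCP BTL's interface-selection parameters and their current values.
$ ompi_info --param btl tcp --level 9 | grep btl_tcp_if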

egorgam commented Nov 30, 2020

@jsquyres Thanks for the answer! Unfortunately, in this case I don't have permission to create aliases (RTNETLINK answers: Operation not permitted). So is there any way to force Open MPI to use the hostname as the node identifier?

jsquyres (Member) commented

Open MPI can and does use network names, but it always resolves them to IP addresses first, and then uses the resulting IP address as the unique network address of that peer. Sorry.

egorgam closed this as completed Nov 30, 2020

egorgam commented Nov 30, 2020

I have an idea: I could build Open MPI from source with the host IP address used in this infrastructure hardcoded for each node. That might work, because the same ssh aliases exist on all nodes. So where can I find something like the host address definition in the Open MPI code?

egorgam reopened this Nov 30, 2020
jsquyres (Member) commented

I don't know if I fully understand what you mean.

You can certainly make a /etc/hosts file with multiple different hostnames and different IP addresses, where the IP addresses correspond to the public IP address for remote peers (I'm assuming that each peer has the same private IP address but different public IP addresses, where "public" is loosely defined as "public within the scope of your horovod run"). You'll still need to plumb through various ports between the public and private IP interfaces, too (e.g., Open MPI makes TCP sockets on effectively random port numbers between peers -- you can control those socket port number ranges, if you want, but it's more setup to do).
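
Purely as an illustration (not something I've verified end-to-end), that setup could look roughly like this. The hostnames and addresses below are hypothetical, and the port-range knobs assume the TCP BTL's btl_tcp_port_min_v4 / btl_tcp_port_range_v4 MCA parameters:

# Hypothetical /etc/hosts entries on each host, mapping peer hostnames to
# their "public" (reachable) addresses instead of the shared private one.
203.0.113.11   worker1
203.0.113.12   worker2

# Pin the TCP BTL to a known port range so those ports can be plumbed
# between the public and private interfaces (e.g., forwarded through NAT).
$ mpirun --mca btl_tcp_port_min_v4 20000 --mca btl_tcp_port_range_v4 100 ...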

But even with that, I don't know if that will be enough.

Open MPI uses the IP address as a unique identifier for the peer to know that it has contacted a) the correct peer, and b) on the interface that it expected. Meaning: even if you write a per-host /etc/hosts file with the public IP addresses of the peers, Open MPI will still fail the IP address check during the connection handshake. This check was put in place because we had many cases where users had either erroneous or complicated networking environments which led to Open MPI either contacting the wrong peer or the wrong interface on the right peer, and then Open MPI would get confused and eventually the job would fail. We therefore put in the "make sure we've connected to the right peer and the right interface" check to prevent these kinds of problems from occurring.

Generally speaking, the TCP BTL doesn't really handle the case where a peer's interface has a different public vs. private IP address. It is much more common for HPC environments to have shared IP address spaces.

feacluster commented

I too ran into this bizarre error, "WARNING: Open MPI accepted a TCP connection from what appears to be another Open MPI process but cannot find a corresponding process".

After spending days of googling and lots of back and forth with the vendor, I found the root cause. From:

https://www.mail-archive.com/users@lists.open-mpi.org/msg34182.html

That typically occurs when some nodes have multiple interfaces, and
several nodes have a similar IP on a private/unused interface.

So there was a virbr0 network interface on captainmarvel03 and captainmarvel01 with the same IP address. Per:

https://forums.centos.org/viewtopic.php?t=61634

I disabled and stopped the libvirtd service and rebooted. Now everything works as before. The problem appeared after I had installed the GNOME desktop on captainmarvel03.
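
For anyone hitting the same symptom, here is a minimal sketch of the two usual workarounds (assuming systemd, and that the duplicate address lives on virbr0 as it did here):

# Option 1: what was done here -- stop and disable libvirtd, then reboot so
# the virbr0 bridge goes away.
$ sudo systemctl disable --now libvirtd

# Option 2: leave libvirt alone and tell Open MPI to ignore virbr0 instead.
# Overriding btl_tcp_if_exclude replaces the default exclude list, so the
# loopback interface has to be excluded again explicitly.
$ mpirun --mca btl_tcp_if_exclude lo,virbr0 ...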
