
What "server did not get guid" means? #8257

Closed
egorgam opened this issue Nov 25, 2020 · 6 comments

egorgam commented Nov 25, 2020

Hello!

This issue is just a question: what does "server did not get guid" mean in this error list? https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/tcp/help-mpi-btl-tcp.txt#L116 I got it when trying to run mpirun from the master node with two workers (which have the same IP address on their eth0 interface). In my case the workers are behind NAT and are connected to the master node via an ssh -R tunnel (the master reaches the workers over ssh by aliases).

jsquyres (Member) commented

Looking (briefly) at horovod/horovod#2477, if all your VMs/containers/nodes/whatevers have the same IP address, that's going to be a problem. Open MPI relies on having different IP addresses -- e.g., it looks like Open MPI may have tried to open a connection to IP address X from IP address X, and therefore it connected to itself (and then things went downhill from there).

If you can provide each VM/container/node/whatever with a unique alias IP address and then use that set of IP addresses for Open MPI, that might work better. You can tell Open MPI to exclude the everyone-has-the-same-172.28.0.2 IP addresses via:

$ mpirun --mca btl_tcp_if_exclude 172.28.0.2/32 ...

Or even just tell Open MPI exactly which subnet to use (the "include" and "exclude" directives are mutually exclusive):

# Assuming you make unique IP aliases in the 10.1.2.0/24 range somewhere.
$ mpirun --mca btl_tcp_if_include 10.1.2.0/24 ...
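
As a side note, a quick way to check which addresses each node actually exposes, and how the TCP BTL's interface-selection parameters are currently set (the exact ompi_info flags below assume a reasonably recent Open MPI):

# List the IPv4 addresses configured on this node (Linux).
$ ip -4 addr show

# Show the TCP BTL's interface-selection parameters and their current values.
$ ompi_info --param btl tcp --level 9 | grep btl_tcp_if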

egorgam commented Nov 30, 2020

@jsquyres Thanks for the answer! Unfortunately, in this case I don't have permission to create aliases (RTNETLINK answers: Operation not permitted). So is there any way to force Open MPI to use the hostname as the node identifier?

jsquyres (Member) commented

Open MPI can and does use network names, but it always resolves them to IP addresses first, and then uses the resulting IP address as the unique network address of that peer. Sorry.

egorgam closed this as completed Nov 30, 2020

egorgam commented Nov 30, 2020

I have an idea: I could build Open MPI from source with the host IP address used in this infrastructure hardcoded for each node. That might work, because the same ssh aliases exist on all nodes. So where can I find something like the host address definition in the Open MPI code?

egorgam reopened this Nov 30, 2020
jsquyres (Member) commented

I don't know if I fully understand what you mean.

You can certainly make a /etc/hosts file with multiple different hostnames and different IP addresses, where the IP addresses correspond to the public IP address for remote peers (I'm assuming that each peer has the same private IP address but different public IP addresses, where "public" is loosely defined as "public within the scope of your horovod run"). You'll still need to plumb through various ports between the public and private IP interfaces, too (e.g., Open MPI makes TCP sockets on effectively random port numbers between peers -- you can control those socket port number ranges, if you want, but it's more setup to do).
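
Purely as an illustration (not something I've verified end-to-end), that setup could look roughly like this. The hostnames and addresses below are hypothetical, and the port-range knobs assume the TCP BTL's btl_tcp_port_min_v4 / btl_tcp_port_range_v4 MCA parameters:

# Hypothetical /etc/hosts entries on each host, mapping peer hostnames to
# their "public" (reachable) addresses instead of the shared private one.
203.0.113.11   worker1
203.0.113.12   worker2

# Pin the TCP BTL to a known port range so those ports can be plumbed
# between the public and private interfaces (e.g., forwarded through NAT).
$ mpirun --mca btl_tcp_port_min_v4 20000 --mca btl_tcp_port_range_v4 100 ...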

But even with that, I don't know if that will be enough.

Open MPI uses the IP address as a unique identifier for the peer to know that it has contacted a) the correct peer, and b) on the interface that it expected. Meaning: even if you write a per-host /etc/hosts file with the public IP addresses of the peers, Open MPI will still fail the IP address check during the connection handshake. This check was put in place because we had many cases where users had either erroneous or complicated networking environments which led to Open MPI either contacting the wrong peer or the wrong interface on the right peer, and then Open MPI would get confused and eventually the job would fail. We therefore put in the "make sure we've connected to the right peer and the right interface" check to prevent these kinds of problems from occurring.

Generally speaking, the TCP BTL doesn't really handle the case where a peer's interface has a different public vs. private IP address. It is much more common for HPC environments to have shared IP address spaces.

feacluster commented

I too ran into this bizarre error, "WARNING: Open MPI accepted a TCP connection from what appears to be another Open MPI process but cannot find a corresponding process".

After spending days of googling and lots of back and forth with the vendor, I found the root cause. From:

https://www.mail-archive.com/users@lists.open-mpi.org/msg34182.html

That typically occurs when some nodes have multiple interfaces, and
several nodes have a similar IP on a private/unused interface.

So there was a virbr0 network interface on captainmarvel03 and captainmarvel01 with the same IP address. Per:

https://forums.centos.org/viewtopic.php?t=61634

I disabled and stopped the libvirtd service and rebooted. Now everything works as before. The problem appeared after I had installed the GNOME desktop on captainmarvel03.
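
For anyone hitting the same symptom, here is a minimal sketch of the two usual workarounds (assuming systemd, and that the duplicate address lives on virbr0 as it did here):

# Option 1: what was done here -- stop and disable libvirtd, then reboot so
# the virbr0 bridge goes away.
$ sudo systemctl disable --now libvirtd

# Option 2: leave libvirt alone and tell Open MPI to ignore virbr0 instead.
# Overriding btl_tcp_if_exclude replaces the default exclude list, so the
# loopback interface has to be excluded again explicitly.
$ mpirun --mca btl_tcp_if_exclude lo,virbr0 ...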
