
mpirun failure on 4 nodes #6107

Closed
regel opened this issue Nov 24, 2018 · 6 comments

regel commented Nov 24, 2018

Background information

Running horovod/open-mpi in a cluster with multiple nodes. All nodes are declared in /etc/hosts, and can properly SSH to each other.

mpirun succeeds with any three of the nodes but fails with all four, regardless of which nodes are chosen. All 4 nodes have identical hardware, OS version, and installed packages, and run in the same data center.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

rpm -qa | grep openmpi
openmpi-1.10.7-1.el7.x86_64

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

yum install

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Computer hardware: Google Cloud Compute
  • Kernel: Linux horovod-2 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Machine type: n1-standard-4 (4 vCPUs, 15 GB memory)
  • CPU platform: Intel Sandy Bridge
  • Zone: europe-west1-b
  • Network type: Ethernet

Details of the problem

mpirun succeeds with three nodes but fails with four, regardless of which nodes are chosen.

shell$ mpirun --host horovod-1.local,horovod-2.local,horovod-3.local hostname
horovod-1
horovod-2
horovod-3
shell$ mpirun --host horovod-1.local,horovod-2.local,horovod-4.local hostname
horovod-1
horovod-2
horovod-4
shell$ mpirun --host horovod-1.local,horovod-2.local,horovod-3.local,horovod-4.local hostname
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

rhc54 commented Nov 24, 2018

Your problem is right here:

Host key verification failed.

You need to set up the SSH keys on one of your nodes.
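Concretely, this means pushing your public key to every node and pre-seeding `known_hosts` so ssh never stops at an interactive prompt. A minimal sketch using the hostnames from this issue; it prints the commands for review rather than executing them, so you can inspect and then pipe the output to `sh`:

```shell
# Print the key-distribution commands for each node; pipe to `sh` to run.
# ssh-copy-id pushes your public key; ssh-keyscan pre-seeds known_hosts,
# which avoids the interactive "(yes/no)?" host-key prompt that makes a
# non-interactive ssh fail with "Host key verification failed."
for host in horovod-1.local horovod-2.local horovod-3.local horovod-4.local; do
  echo "ssh-copy-id $host"
  echo "ssh-keyscan -H $host >> ~/.ssh/known_hosts"
done
```

Note this must be done on every node, not just the one running mpirun, for the reason described below.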


regel commented Nov 25, 2018

Hi Ralph,

I verified that all hosts can SSH to each other:

[regel@horovod-1 ~]$ ssh horovod-1.local
Last login: Sun Nov 25 07:10:43 2018 from horovod-4.local
[regel@horovod-1 ~]$ logout
Connection to horovod-1.local closed.
[regel@horovod-1 ~]$ ssh horovod-2.local
Last login: Sun Nov 25 07:10:47 2018 from horovod-4.local
[regel@horovod-2 ~]$ logout
Connection to horovod-2.local closed.
[regel@horovod-1 ~]$ ssh horovod-3.local
Last login: Sat Nov 24 17:29:09 2018 from horovod-4.local
[regel@horovod-3 ~]$ logout
Connection to horovod-3.local closed.
[regel@horovod-1 ~]$ ssh horovod-4.local
Last login: Sat Nov 24 17:29:17 2018 from horovod-4.local

With a single node, or any combination of three nodes, mpirun is fine:

[regel@horovod-1 ~]$ mpirun --host horovod-1.local hostname
horovod-1
[regel@horovod-1 ~]$ mpirun --host horovod-2.local hostname
horovod-2
[regel@horovod-1 ~]$ mpirun --host horovod-3.local hostname
horovod-3
[regel@horovod-1 ~]$ mpirun --host horovod-4.local hostname

The issue reproduces only when I use all four nodes in the mpirun command.


rhc54 commented Nov 25, 2018

The problem is that you cannot ssh from one of those nodes to another node. mpirun uses a tree-like launch pattern. You need to be able to ssh from (for example) horovod-3 to horovod-4 (and the other combinations) as well.


jsquyres commented Dec 1, 2018

@regel Haven't heard back from you in a few days, so I'm going to assume Ralph's answer was the correct one. Feel free to ping back here if you need more help.

@jsquyres jsquyres closed this as completed Dec 1, 2018
@algorithmconquer

@regel Did you solve the problem?

@erolrecep

Yes, I solved the issue. You need to cross-check each and every node against every other node. For instance, suppose you have nodes rpi01, rpi02, rpi03, and rpi04. When you run:

pi@rpi01 $ ssh rpi01 (should login)
pi@rpi02 $ ssh rpi02 (should login)
pi@rpi03 $ ssh rpi03 (should login)
pi@rpi04 $ ssh rpi04 (should login)

also,

pi@rpi01 $ ssh rpi02 (should login)
pi@rpi02 $ ssh rpi04 (should login)
pi@rpi03 $ ssh rpi01 (should login)
pi@rpi04 $ ssh rpi03 (should login)

If any of these fails, or asks again with a ...(yes/no)? prompt, then the connections are not set up properly.
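The spot-checks above can be generated exhaustively rather than typed by hand. A sketch using the rpi* names from this comment (placeholders; substitute your own hosts) that emits one non-interactive check per ordered pair, which you can review and then pipe to `sh`:

```shell
# Emit one ssh reachability check per ordered (src, dst) node pair.
# BatchMode=yes makes ssh fail outright instead of prompting, so a
# missing key or unverified host key surfaces as an error, not a hang;
# ConnectTimeout bounds the wait on an unreachable host.
NODES="rpi01 rpi02 rpi03 rpi04"
for src in $NODES; do
  for dst in $NODES; do
    echo "ssh $src ssh -o BatchMode=yes -o ConnectTimeout=5 $dst true"
  done
done
```

Every emitted command should exit 0; any non-zero exit points at the exact (src, dst) hop that mpirun's tree launch will trip over.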

Good luck!
