
Horovod distributed run fails in K8s with Flannel network #6103

@zwqjoy

Description


Thank you for taking the time to submit an issue!

Background information

K8s 1.10 with Flannel network
image: uber/horovod:0.15.1-tf1.11.0-torch0.4.1-py3.5

Running locally (both ranks on one node) succeeds:
mpirun -np 2 -H 127.0.0.1:2 --mca btl ^tcp -mca btl_base_verbose 100 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH python keras_mnist_advanced.py

Running distributed (across two pods) fails:
mpirun -np 2 -H 127.0.0.1:1,172.17.91.2 --mca btl ^tcp -mca btl_base_verbose 100 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH python keras_mnist_advanced.py

[screenshot of the failing run's error output]
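
For context, when the two ranks run in different pods on an overlay network such as Flannel, Open MPI and NCCL usually have to be pointed at the container's network interface explicitly. Note also that the commands above exclude the tcp BTL (`--mca btl ^tcp`); on a plain IP overlay the TCP BTL is normally the only inter-node transport available. The command below is only a sketch of that kind of invocation, not a confirmed fix; the interface name eth0 is an assumption (check with `ip addr` inside the container), and the remote address 172.17.91.2 is taken from the failing command above.

```shell
# Sketch only, not a confirmed fix. Assumes the pods' overlay interface is
# named eth0 inside each container; adjust to whatever `ip addr` reports.
# Unlike the command above, the tcp BTL is left enabled here, since it is
# normally the inter-node transport on a plain IP network such as Flannel.
mpirun -np 2 -H 127.0.0.1:1,172.17.91.2:1 \
    --mca btl_tcp_if_include eth0 \
    --mca oob_tcp_if_include eth0 \
    --mca btl_base_verbose 100 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH \
    python keras_mnist_advanced.py
```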

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

root@hvd-deployment-5c99f79c66-fwvjh:/examples# mpirun --version
mpirun.real (OpenRTE) 3.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running

  • Operating system/version:
    image: Ubuntu 16.04
  • Computer hardware:
    K8S 1.10
  • Network type:
    Flannel network (a reachability check is sketched after this list)
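
Since the failing run spans two pods on the Flannel overlay, a quick way to confirm that 172.17.91.2 really is the remote pod's Flannel-assigned IP, and that it is reachable from the launching pod, is sketched below (assuming kubectl access to the cluster and standard networking tools inside the containers):

```shell
# From a machine with cluster access: list pod IPs assigned by the CNI
# (Flannel) to confirm the address passed to -H matches the remote pod.
kubectl get pods -o wide

# From inside the launching pod: inspect the container's own interface and
# check that the remote pod is reachable over the overlay network.
ip addr show
ping -c 3 172.17.91.2
```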

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -np 2 ./hello_world
