Horovod distributed run fails in k8s with kube-router #6447

@wangqiaoshi

Background information

Image: mpioperator/tensorflow-benchmarks:latest
Kubernetes: v1.8.2

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.2
root@ab8c7f1afca5:/tensorflow/benchmarks# mpirun --version
mpirun.real (OpenRTE) 3.1.2

Report bugs to http://www.open-mpi.org/community/help/

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.6.1810 (Core)
  • Computer hardware: NVIDIA P40
  • Network type: BGP (kube-router)

The situation is similar to #6103.
