How to deploy and run Horovod with a Mellanox network (RoCE and lossless queue) #654
Comments
Horovod uses MPI and NCCL for its networking, and RDMA RoCE traffic uses queue 5 by default. So as long as you configure both MPI and NCCL to use RoCE, that should solve your problem.
For Open MPI, this FAQ entry explains how to use RoCE: https://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
For NCCL, the related options are: -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_bond_0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_CUDA_SUPPORT=1
The full command line I used for Open MPI and NCCL is |
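The original command line is not preserved in this thread. As a rough sketch only, the four NCCL options quoted above could be passed to mpirun along the following lines. The host names, process count, and `train.py` are hypothetical placeholders, and the HCA name and GID index come from the commenter's cluster, so they will likely differ on other hardware. MPI-side RoCE settings from the Open MPI FAQ are not included here.

```python
import shlex

# NCCL settings quoted in the comment above. The HCA name and GID index
# are specific to that commenter's cluster and are assumptions elsewhere.
NCCL_ENV = {
    "NCCL_IB_DISABLE": "0",
    "NCCL_IB_HCA": "mlx5_bond_0",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_CUDA_SUPPORT": "1",
}


def build_mpirun_command(hosts, num_procs, program, env=NCCL_ENV):
    """Return an mpirun argv list that exports each env entry with -x."""
    cmd = ["mpirun", "-np", str(num_procs), "-H", ",".join(hosts)]
    for name, value in env.items():
        # -x NAME=value forwards the variable to every launched rank.
        cmd += ["-x", f"{name}={value}"]
    return cmd + list(program)


if __name__ == "__main__":
    # Hypothetical hosts and training script, for illustration only.
    argv = build_mpirun_command(["host1:4", "host2:4"], 8, ["python", "train.py"])
    print(shlex.join(argv))
```

Building the command as a list and printing it with `shlex.join` makes it easy to inspect the exact invocation before handing it to `subprocess.run` or pasting it into a shell.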
@LiweiPeng thanks |
When running in Docker, you should either run the container in privileged mode or mount the device driver into it.
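A minimal sketch of those two options, assuming the RDMA devices are exposed under `/dev/infiniband` (typical for Mellanox drivers, but an assumption here) and using a hypothetical image name. Host networking is included because the original question mentions it.

```python
def docker_run_args(image, privileged=True, device="/dev/infiniband"):
    """Return a docker run argv list for one of the two options above."""
    args = ["docker", "run", "--network", "host"]
    if privileged:
        # Option 1: give the container full access to host devices.
        args.append("--privileged")
    else:
        # Option 2: expose only the RDMA device path (assumed location).
        args += ["--device", device]
    return args + [image]


if __name__ == "__main__":
    # "horovod-image" is a placeholder, not a real image name.
    print(" ".join(docker_run_args("horovod-image", privileged=False)))
```

The non-privileged variant is narrower in what it grants, but it may need additional settings (for example, locked-memory limits) depending on the driver stack.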
@LiweiPeng Does Horovod only support RoCE? Is it possible to run it over other RDMA protocols?
I remove the |
@jackalcooper, let's discuss in #1097.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I have some servers with Mellanox devices. I use RoCE and have set up a lossless queue. The service runs in Docker with the host network mode. How do I use MPI to run a Horovod job so that the traffic goes into the lossless queue (queue 5)?