
How to deploy and run Horovod on a Mellanox network (RoCE and lossless queues) #654

Closed
kinglion811 opened this issue Nov 26, 2018 · 8 comments

Comments

@kinglion811

kinglion811 commented Nov 26, 2018

I have some servers with Mellanox devices. I use RoCE and have configured a lossless queue (queue 5). I run the service in Docker with host networking. How do I run a Horovod job with MPI so that its traffic enters the lossless queue (queue 5)?

@kinglion811 kinglion811 changed the title how run horovod how to Deploy and Run a Horovod framework with Mellanox Network(ROCE and Lossless queue) Nov 26, 2018
@kinglion811 kinglion811 reopened this Nov 26, 2018
@kinglion811
Author

@alsrgv

@LiweiPeng

Horovod uses MPI and NCCL for its networking. RDMA RoCE uses queue 5 by default, so as long as you can configure MPI and NCCL to use RDMA RoCE, it should solve your problem.
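For reference, a hedged sketch of the usual RoCE priority arithmetic (this describes common mlx5 defaults and is an assumption, not something stated in this thread): with the NIC in DSCP trust mode, the default mapping is priority = DSCP / 8, and the IB/RoCE Traffic Class field carries the DSCP in its upper six bits, which is the value NCCL's `NCCL_IB_TC` variable sets:

```shell
#!/bin/sh
# Assumption: NIC is in DSCP trust mode with the default dscp->priority map
# (priority = DSCP / 8). Any DSCP in 40..47 then lands in priority queue 5.
DSCP=40
PRIO=$((DSCP >> 3))   # default mlx5 mapping: priority = DSCP / 8
TC=$((DSCP << 2))     # IB Traffic Class field = DSCP in the upper 6 bits
echo "DSCP=$DSCP priority=$PRIO traffic_class=$TC"
# traffic_class is the value you would pass to NCCL as -x NCCL_IB_TC=$TC
```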

For Open MPI, here is the FAQ entry on using RoCE: https://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce

For NCCL, the related options are -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_bond_0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_CUDA_SUPPORT=1
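As a sketch, those variables can be exported for a quick single-node check (the HCA name mlx5_bond_0 comes from this comment and is site-specific; `NCCL_DEBUG=INFO` is a standard NCCL variable that makes NCCL log which transport it selects):

```shell
#!/bin/sh
# Export the NCCL settings from above; with NCCL_DEBUG=INFO the NCCL log
# shows whether the IB/RoCE transport (NET/IB) or plain sockets get used.
export NCCL_IB_DISABLE=0         # allow the IB/RoCE transport
export NCCL_IB_HCA=mlx5_bond_0   # example HCA name; check yours with ibv_devinfo
export NCCL_IB_GID_INDEX=3       # RoCE v2 GID entry (verify with show_gids)
export NCCL_IB_CUDA_SUPPORT=1    # enable GPUDirect RDMA if available
export NCCL_DEBUG=INFO
env | grep '^NCCL_' | sort
```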

The full command line I used for Open MPI and NCCL is:

mpirun --machinefile m.txt -bind-to none -map-by slot -np 24 \
  --mca btl_openib_if_include mlx5_bond_0 \
  -x HOROVOD_MPI_THREADS_DISABLE=1 \
  --mca mpi_warn_on_fork false \
  --mca btl openib,self,smcuda \
  --mca btl_openib_cpc_include rdmacm \
  --mca btl_openib_rroce_enable 1 \
  --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
  --mca btl_openib_cuda_async_recv false \
  -x NCCL_CHECKS_DISABLE=1 -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_bond_0 \
  -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_CUDA_SUPPORT=1

@kinglion811
Author

@LiweiPeng thanks

@kinglion811
Author

When running in Docker, you should either run the container with privileges or mount the device driver into it; see
https://community.mellanox.com/docs/DOC-2971#jive_content_id_Install_Docker_CE
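A hedged sketch of the two options (the image name horovod/horovod is an illustrative placeholder; `--privileged`, `--cap-add`, `--device`, and `--net=host` are standard `docker run` flags). The commands are printed rather than executed so the sketch runs anywhere:

```shell
#!/bin/sh
# Option 1: run the container privileged (simplest, broadest access).
# Option 2: grant only what RDMA needs: the /dev/infiniband device nodes,
# IPC_LOCK for pinning registered memory, and host networking.
cat <<'EOF'
docker run --privileged --net=host horovod/horovod

docker run --net=host --cap-add=IPC_LOCK \
    --device=/dev/infiniband horovod/horovod
EOF
```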

@jackalcooper

@LiweiPeng Does Horovod only support RoCE? Is it possible to run it over the plain RDMA (InfiniBand) protocol?

@jackalcooper

I removed the NCCL_IB_GID_INDEX variable from my script and it hangs after the NCCL log output.

@alsrgv
Member

alsrgv commented May 26, 2019

@jackalcooper, let's discuss in #1097.

@stale

stale bot commented Nov 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 7, 2020
@stale stale bot closed this as completed Nov 14, 2020