
Can Horovod run distributed training in multiple containers on the same node? #451

Closed
tingweiwu opened this issue Aug 17, 2018 · 12 comments

@tingweiwu

I mean that I run two containers on the same node.
Since the host network is used, the two containers' IPs are the same as the node IP:

horovod-1   1/1       Running    192.168.70.117  
horovod-2   1/1       Running    192.168.70.117

Can I run this in the first container:

mpirun -np 2 -H 192.168.70.117:1,192.168.70.117:1 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=eth0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tensorflow_mnist.py

and run this in the second container?

 bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
@alsrgv
Member

alsrgv commented Aug 17, 2018

@tingweiwu, you don't have to run multiple containers on the same host; you can simply run multiple processes within the same container:

mpirun -np 2 -H 192.168.70.117:2 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=eth0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tensorflow_mnist.py

@tingweiwu
Author

tingweiwu commented Aug 18, 2018

@alsrgv Thank you for your reply. Yes, I know I can run multiple processes within the same container.
What confuses me is the following case:
In a Kubernetes cluster, the scheduler places pods (which contain containers) across the cluster's hosts.
Since the pod is Kubernetes' scheduling unit, assume each host has 4 GPUs and I want to run a distributed Horovod example like tensorflow_mnist.py with -np 4 -H server1:2,server2:2, i.e. 2 pods that each need 2 GPUs. The Kubernetes master may then schedule the two pods onto the same host, and because the pods use network=host, server1 and server2 end up being the same.

Does Horovod support this case?
Looking forward to your reply.

@alsrgv
Member

alsrgv commented Aug 20, 2018

@tingweiwu, I see. You can achieve this by running the ssh server on a unique port in each container and creating a special ssh config in /root/.ssh/config, like this:

Host worker-0
    HostName 10.191.34.22
    Port 31022
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel quiet
Host worker-1
    HostName 10.191.34.26
    Port 31014
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel quiet
...

You'd need to orchestrate port allocation somehow.
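
(For illustration, a rough sketch of how the pieces could fit together; the ports and the worker-0/worker-1 aliases are just the examples from the config above, nothing Horovod requires:)

# in the container that should become worker-0, start sshd on its own port
/usr/sbin/sshd -p 31022; sleep infinity
# in the container that should become worker-1, use a different port
/usr/sbin/sshd -p 31014; sleep infinity

# then launch from one container using the aliases from /root/.ssh/config,
# which already carry the per-host ports, so no -p needs to be passed to ssh
mpirun -np 4 -H worker-0:2,worker-1:2 -bind-to none -map-by slot \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH \
    -mca btl_tcp_if_exclude docker0,tunl0,lo python tensorflow_mnist.py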

@tingweiwu
Author

tingweiwu commented Aug 20, 2018

@alsrgv Many thanks for your reply. I have tried the MPI Operator as the README.md suggested,
and I have run the tensorflow_mnist.py example successfully.

I wonder whether, in Kubernetes with the MPI Operator, Horovod's performance is the same as in the host-network container mode?


@alsrgv
Member

alsrgv commented Aug 20, 2018

In general, host networking can be quite a bit faster than the default virtualized networking if the underlying hardware is fast enough. I don't have data on the performance points where it starts to matter a lot, but you could try to run an experiment on your hardware and share results :-)

cc @rongou, does MPI operator support / plan to support host networking?

@tingweiwu
Author

tingweiwu commented Aug 20, 2018

@alsrgv OK, I will test the tensorflow_mnist.py example both in a host-network pod and with the MPI Operator later and share the results.
Can I use HOROVOD_TIMELINE, as mentioned in the doc below, to compare the performance of the two cases?

https://github.com/uber/horovod/blob/master/docs/timeline.md
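
(For what it's worth, a minimal sketch of how the timeline could be enabled, assuming a path writable by rank 0; the worker aliases here are illustrative:)

# HOROVOD_TIMELINE points at the JSON file written by rank 0;
# the result can be inspected in chrome://tracing
mpirun -np 4 -H worker-0:2,worker-1:2 -bind-to none -map-by slot \
    -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH \
    -x HOROVOD_TIMELINE=/tmp/timeline.json \
    python tensorflow_mnist.py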

Additionally, in a Kubernetes cluster we prefer not to use the host network, as it brings port conflicts, or we would need to orchestrate port allocation somehow, as you said.

@rongou
Contributor

rongou commented Aug 20, 2018

No current plan to support it specifically. You might be able to set hostNetwork: true in your podspec and see if that works.
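
(A minimal sketch of what that might look like in a worker pod template; whether the MPI Operator's CRD passes these fields through is an assumption to verify:)

# hypothetical pod template fragment; name and image are illustrative
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet  # commonly paired with hostNetwork
  containers:
    - name: horovod-worker
      image: horovod/horovod:latest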

@tingweiwu
Author

@rongou I have tried setting hostNetwork: true, and it works. But I prefer not to use the host network in Kubernetes because of port conflicts and other issues.
I can't do the performance test right now for some reasons. In my Kubernetes cluster I use Calico as the CNI plugin; from earlier research I found Calico's performance is close to the host network compared with other Docker virtualized networking options.
@alsrgv So I still have a question. The MPI Operator already provides MPI operation in Kubernetes, and I have run the tensorflow_mnist.py example with Horovod successfully. Apart from the network (host is generally faster than virtualized, as you said before), are there any other likely performance issues if I use the MPI Operator in Kubernetes rather than the host network, as your Horovod in Docker doc suggests?

@rongou
Contributor

rongou commented Aug 21, 2018

I think you are comparing apples and oranges. Docker is just a container engine; by itself it doesn't help with orchestration. Kubernetes and the MPI Operator together make it easier to launch multi-node training jobs.

As for networking, all these plugins are not that different. Of course, if you have an HPC setup with InfiniBand or RoCE, that's a different story.

@tingweiwu
Author

tingweiwu commented Aug 21, 2018

@rongou Maybe I didn't express that clearly.
What confuses me is that Horovod recommends the host network directly in its Horovod in Docker doc:
[screenshot of the Horovod in Docker documentation]

When I use Kubernetes and the MPI Operator together to launch multi-node training jobs, I think they use the virtualized network to communicate (if I misunderstood this, I hope you will correct me). I want to know how much this impacts Horovod's performance in that case, assuming the underlying hardware is fast enough, or, as you said, that we are using InfiniBand.

@rongou
Contributor

rongou commented Aug 21, 2018

Right, that document assumes you are using docker only, without the help of kubernetes, so you'd have to ssh into each machine and invoke docker run manually, or through some shell scripts. It gives you the flexibility of using host networking, but it doesn't really scale beyond a handful of users sitting in the same room (or you can use a spreadsheet to keep track of who's using what).

In a "production" environment where you have many users submitting lots of training jobs on a large cluster, you need some kind of cluster management or container orchestration tool. Kubernetes is one such tool. The MPI Operator just makes it slightly easier to run MPI jobs on multiple machines.

With the convenience of Kubernetes, the virtualized networking may introduce some overhead, but exactly how much depends on a lot of factors. For example, https://typhoon.psdn.io/topics/performance/#network-performance shows the difference can be a few percent to more than half.

In an ideal world, we'd like the power and convenience of Kubernetes and the performance of bare metal, but I don't think we are quite there yet. In the end it's a trade-off between many factors; only you can answer what makes the most sense for you.

@tingweiwu
Author

@rongou Thanks a lot for your reply. Now I think I've got it.
