Can Horovod run distributed in multiple containers on the same node? #451
Comments
@tingweiwu, you don't have to run multiple containers on the same host, you can simply run multiple processes within the same container:
|
@alsrgv Thank you for your reply. Yes, I know I can run multiple processes within the same container. Does Horovod support the multi-container case? |
@tingweiwu, I see. You can achieve this by running an ssh server on a unique port in each container, and making a special ssh config in
You'd need to orchestrate port allocation somehow. |
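As a rough illustration of the ssh config approach described above (not part of the original thread), the sketch below renders `Host` blocks that map one alias per container to the shared node address and that container's sshd port. The aliases, the address, and the port numbers are hypothetical, and the uniqueness check is only one simple way to catch a port clash; how ports get assigned in the first place is left to whatever orchestration you use.

```python
from typing import Dict, Tuple


def render_ssh_config(containers: Dict[str, Tuple[str, int]]) -> str:
    """Render ssh_config Host blocks from {alias: (address, sshd_port)}.

    Raises ValueError if two containers share the same address and port,
    since each container on a host needs its own sshd port.
    """
    seen = {}
    blocks = []
    for alias, (address, port) in containers.items():
        key = (address, port)
        if key in seen:
            raise ValueError(
                f"{alias} and {seen[key]} both use {address}:{port}"
            )
        seen[key] = alias
        # Host, HostName and Port are standard ssh_config keywords.
        blocks.append(
            f"Host {alias}\n"
            f"    HostName {address}\n"
            f"    Port {port}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical example: two containers on one node with host networking,
    # so both share the node's address and differ only in sshd port.
    example = {
        "worker-a": ("10.0.0.5", 12345),
        "worker-b": ("10.0.0.5", 12346),
    }
    print(render_ssh_config(example))
```

The output would be written to the ssh config file of the user that launches the job, so that the aliases can be used as host names when starting the run.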
@alsrgv Many thanks for your reply. I have tried to use I wonder that in Kubernetes with |
In general, host networking can be quite a bit faster than the default virtualized networking if the underlying hardware is fast enough. I don't have data on the performance points where it starts to matter a lot, but you could try to run an experiment on your hardware and share results :-) cc @rongou, does MPI operator support / plan to support host networking? |
@alsrgv OK, I will test the
Additionally, in a Kubernetes cluster, we prefer not to use |
No current plan to support it specifically. You might be able to set |
@rongou I have tried to set |
I think you are comparing apples and oranges. Docker is just a container engine; by itself it doesn't help with orchestration. Kubernetes and the MPI Operator together make it easier to launch multi-node training jobs. As for networking, all these plugins are not that different. Of course, if you have an HPC setup with InfiniBand or RoCE, that's a different story. |
@rongou Maybe I didn’t express that clearly. When I use Kubernetes and the MPI Operator together to launch multi-node training jobs, I think they use a virtualized network to communicate (if I have misunderstood this, I hope you will correct me). I want to know how much this affects Horovod's performance, assuming that the underlying hardware is fast enough or, as you said, that we are using InfiniBand. |
Right, that document assumes you are using docker only, without the help of kubernetes, so you'd have to ssh into each machine and invoke In a "production" environment where you have many users submitting lots of training jobs on a large cluster, you need some kind of cluster management or container orchestration tool. Kubernetes is one such tool. The MPI Operator just makes it slightly easier to run MPI jobs on multiple machines. With the convenience of Kubernetes, the virtualized networking may introduce some overhead, but exactly how much depends on a lot of factors. For example, https://typhoon.psdn.io/topics/performance/#network-performance shows the difference can be a few percent to more than half. In an ideal world, we'd like the power and convenience of Kubernetes, and the performance of bare metal, but I don't think we are quite there yet. In the end it's a trade off between many factors, only you can answer what makes the most sense to you. |
@rongou Thanks a lot for your reply. Now I think I have got it. |
I mean that I run two containers on the same node.
Since host networking is used, both containers have the same IP as the node.
Can I run this in the first container
and run this in the second container?