Why PyTorch distributed training on two servers is slower than training on one server #169
Comments
@Tomcli I ran the pytorch-launch-dist example on two servers with 3 learners (but only 2 processes were started): 1 learner on one server and 2 learners on the other server. The log only shows "node_rank=0" and "node_rank=1"; there is no "node_rank=2". This is the same as my issue above.
Hi @Eric-Zhang1990, something may have timed out while you were initializing your process group due to low bandwidth. Which protocol did you use to initialize the process group: TCP, GLOO, or NCCL?
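For context, the backend (gloo, nccl, or the older TCP backend) and the rendezvous timeout are both chosen when the process group is initialized. A minimal sketch, assuming the standard env:// rendezvous; the address and port below are placeholders, not FfDL's actual settings:

```python
import datetime
import os

import torch.distributed as dist

# Placeholder rendezvous settings; normally the launcher (or FfDL) exports these.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# A generous timeout makes a slow or misconfigured network fail loudly
# instead of appearing to hang during rendezvous.
dist.init_process_group(
    backend="gloo",                        # or "nccl" for GPU tensors
    init_method="env://",
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
    timeout=datetime.timedelta(minutes=30),
)
```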
@Tomcli I have tried both NCCL and GLOO; both give the same result as above.
@Tomcli We are now using 1000M bandwidth, which improves the training speed.
@Tomcli With the 1000M bandwidth, I ran a test comparing the speed of FfDL against the plain system environment (conda) on the same server. With the same settings, training in the system environment is faster than training with FfDL. Do you know why? Thank you.
There is a bandwidth cost to pay when you spread across two servers. What would be interesting is to check whether, with both GPUs on the same server, FfDL is faster than the system environment or not. Also, I would say the training has to be distributed across enough GPUs for the speedup to outweigh the bandwidth and extra communication overhead.
@animeshsingh I have run some tests comparing the speed of FfDL against the plain system environment (conda) on the same server; the comparison is as follows:
@Tomcli @animeshsingh I am running the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark) on 3 nodes with 16 GPUs (2 nodes have 4 GPUs each, 1 node has 8 GPUs). The original code uses "torch.distributed.reduce()", but since I am on multiple nodes and multiple GPUs, should I use "torch.distributed.reduce_multigpu", "torch.distributed.all_reduce_multigpu", or something else? And do they affect the training speed? My training speed is very slow when I use 3 nodes and 16 GPUs, just like the comparison above.
Thanks @Eric-Zhang1990 for testing this thoroughly. With the same number of GPUs, running directly on bare metal without the overhead of containers is going to be faster. The idea behind FfDL is to distribute training over multiple containers that can be spawned and killed on demand. This allows multiple users to share the same hardware backend and lets us provide capabilities like batch scheduling, job queuing, monitoring, etc., which we are working toward adding by integrating with kube-batch. Users don't need to log in to individual machines and set things up; they are offered this as a service. Also, the user journey remains the same whether they are using PyTorch, TensorFlow, etc.
It definitely doesn't seem normal, and we would like to reproduce it at our end and test more. In the first case, are the two GPUs on the same machine, and in the 4-GPU case, are they spread across two machines?
I would assume reduce should be faster, given that only the process with rank dst is going to receive the final result.
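To illustrate the difference being discussed (a sketch only, not the maskrcnn-benchmark code; the helper name average_loss is made up): dist.reduce leaves the summed result only on the destination rank, while dist.all_reduce leaves it on every rank, so reduce moves less data when only one rank needs the value.

```python
import torch.distributed as dist

def average_loss(loss, dst=0):
    """Average a scalar loss tensor across all processes."""
    loss = loss.clone()
    dist.reduce(loss, dst=dst)         # sum is placed on rank `dst` only
    # dist.all_reduce(loss)            # alternative: every rank receives the sum
    if dist.get_rank() == dst:
        loss /= dist.get_world_size()  # only rank `dst` holds the full sum
    return loss
```

The *_multigpu variants take a list of tensors, one per GPU, and are meant for a single process that drives several GPUs; with one process per GPU (the setup torch.distributed.launch creates), the plain reduce/all_reduce calls are the usual choice.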
Thanks @animeshsingh for the kind reply. I tested them on the same machine (it has 4 GPUs). I also think it is not normal, but I don't know what is causing it. As I described above, when I use 3 nodes and 16 GPUs (16 learners on 3 machines), it is much slower than 4 learners on the same machine.
@animeshsingh @Tomcli Re-reading the doc "FfDL/docs/gpu-guide.md", it says we should use "helm install --set lcm.device_plugin=false ." to deploy FfDL, but I did not set the parameter "lcm.device_plugin=false". Does that affect the training speed?
On the same machine, more GPUs should definitely be faster. When going across machines, it depends on having the right combination for your hardware as described here.
@animeshsingh I use the 'nccl' backend and the 'reduce' call that the original maskrcnn-benchmark provides. When I run maskrcnn-benchmark on 2 machines with plain PyTorch distributed training (not using FfDL), it runs correctly with the 'gloo' backend, but an error occurs with 'nccl'. I am still trying to find a solution.
@animeshsingh Have you run the original maskrcnn-benchmark on FfDL? How does the speed compare between multiple GPUs on one machine and multiple GPUs on two or more machines? Thank you.
Hi @Eric-Zhang1990, sorry for the late reply. Can you show us the commands and specs you used for running the maskrcnn-benchmark on FfDL? Is it similar to
And did you specify 4 GPUs and 3 learners? Thanks.
@Tomcli My manifest.yml is similar to 'FfDL/etc/examples/pytorch-launch-dist/manifest.yml',
@Tomcli I ran a test: the same code with FfDL and with PyTorch's distributed training run directly. PyTorch's distributed training is almost 2 times faster than FfDL, with both using the same machines and GPUs. Is it because of some microservices running on FfDL?
@Eric-Zhang1990 When you run it directly, is that on bare metal?
@animeshsingh Yes, I run maskrcnn-benchmark in a conda environment; the code and parameters are all the same. The command I use is the following (2 machines, each with 4 GPUs):
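For illustration only, a typical two-node launch with torch.distributed.launch looks roughly like the sketch below; the addresses, port, and script name are placeholders, not the actual values used in this run:

```python
# Node 0:
#   python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 \
#       --master_addr=10.0.0.1 --master_port=29500 train_net.py ...
# Node 1: the same command with --node_rank=1.
#
# The launcher passes --local_rank to each worker process, which then pins
# itself to one GPU:
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
```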
@Tomcli I can run PyTorch distributed training on multiple nodes, but I find that distributed training on two servers is slower than on one server. I don't know why; can you help me look into it more deeply? Thank you.
I also find that the number of "node_rank" values differs between one server and two servers: with one server the log shows 2 "node_rank" values, but with two servers it shows only 1 "node_rank".
This is the result of training on one server (2 GPUs, one per learner).
That is the result of training on two servers (2 GPUs, one per learner, but spread across two servers).
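One way to check whether every expected learner actually joined the group is to have each process print its rank and the world size right after initialization. A minimal sketch, assuming the launcher has already exported MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE:

```python
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")

# Every rank from 0 to world_size-1 should print exactly once; a rank that
# never prints means that learner never joined the rendezvous.
print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")
```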