Something unexpected in timeline #316
@roastduck, this is quite interesting. Could you share your timeline? I'm surprised to see the next negotiation start before the last iteration ends. Are you trying to do 1-late SGD or something along those lines?
Here's the timeline file. I'm doing data parallelism with strong consistency, not 1-late SGD, so please don't confuse the two.
Plus: I merged
@roastduck, do you get better performance with this padded allreduce than with the simple MPI allgather that is already supported? Since allgather is focused on data transmission rather than on computing a sum, I think there should be less benefit to running it on the GPU.
As described in
@roastduck, that's not entirely true: while MPI will not do GPU-to-GPU RDMA, it can still do GPU->CPU->RDMA->CPU->GPU, which still offloads the data transfer to the NIC. Which MPI & NIC are you using? Regardless, you are still likely to get a lot of benefit from using allgather because of its simplicity. Did you try comparing the performance of both approaches?
@alsrgv, padded allreduce does gain some performance, although it's not significant.
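For readers unfamiliar with the trick being compared here, a rough sketch of how an allgather can be emulated with a padded sum-allreduce: each rank zero-pads its tensor into its own slot of a buffer sized for all ranks, so a sum over the buffers concatenates the slots. This is a NumPy simulation under my own assumptions (no MPI/NCCL involved); `rank_tensors` is a hypothetical stand-in for per-rank data, not Horovod's actual implementation.

```python
import numpy as np

def padded_allreduce_as_allgather(rank_tensors):
    """Emulate allgather via zero-padding + sum-allreduce (sketch)."""
    n = len(rank_tensors)
    slot = max(t.size for t in rank_tensors)    # pad every tensor to the largest slot
    padded = []
    for r, t in enumerate(rank_tensors):
        buf = np.zeros(n * slot, dtype=t.dtype)
        buf[r * slot : r * slot + t.size] = t   # only this rank's slot is non-zero
        padded.append(buf)
    # The "allreduce" step: slots don't overlap, so the sum is a concatenation.
    return np.sum(padded, axis=0)

# Example: three "ranks", each contributing a 2-element tensor.
out = padded_allreduce_as_allgather(
    [np.array([1., 2.]), np.array([3., 4.]), np.array([5., 6.])])
print(out)  # [1. 2. 3. 4. 5. 6.]
```

The cost of the trick is visible in the sketch: the allreduce moves and sums `n * slot` elements per rank, most of which are zeros, whereas a native allgather only transmits each rank's own data.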
@alsrgv, I'm confused by the word
in
By the way, I think this padded allreduce is just a workaround; using native allgather in NCCL would be the best solution. I found snuspl's fork tried to add NCCL allgather support. I tested it, but it's very slow.
@roastduck, can you share numerically how much better padded NCCL allreduce over RDMA is than MPI allgather over TCP? 5%, 10%, 50%? The instructions we have for Open MPI do indeed disable RDMA for MPI. The reason is that the default OpenIB BTL doesn't work with multi-threading. I've just pushed a branch / PR that adds a flag allowing you to disable threading: #324. If you merge that PR into your working copy, you can use MPI RDMA by removing
@alsrgv, sorry for saying that padded allreduce gains some performance; that was wrong. The earlier result came from testing a complete DNN model, where too many factors influence performance. This time, I ran a dedicated test for allgather: I performed an allgather on 10^9 float32 items, using EDR IB. See below:
Padded allreduce is actually slower, and MPI RDMA is much faster. However, when training my own model, padded allreduce is genuinely faster (I tested it again) :(
@roastduck, how many allgathers did you run for this test? The first operation can incur setup overhead, especially in the case of NCCL. Can you quantify how much faster your own model trains with padded allreduce?
@alsrgv, the last result came from testing only one allgather. I have now run a new test that allgathers 10^8 float32 items 10 times, in the same environment.
As for my own model, I cannot get exact timings because the timeline result is not stable, and there's the strange behavior we discussed in this issue. I evaluate performance by sample throughput. The padded allreduce method increases throughput from around 190 samples/sec to around 200 samples/sec (in one of the complex configurations).
@roastduck, can you start measuring time after the first iteration completes? Otherwise, variability during startup can make these numbers very random. You can also try playing with
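The measurement approach suggested above (exclude the first, setup-heavy iteration) can be sketched as a small timing harness. This is my own minimal illustration, not Horovod code; `op` is a hypothetical placeholder for whatever collective is being benchmarked.

```python
import time

def benchmark(op, iters=10):
    """Average per-iteration time of `op`, excluding a warm-up call."""
    op()  # warm-up: first-call setup cost (e.g. NCCL init) is not timed
    total = 0.0
    for _ in range(iters):
        t0 = time.perf_counter()
        op()
        total += time.perf_counter() - t0
    return total / iters

# Example with a cheap stand-in workload:
avg = benchmark(lambda: sum(range(100_000)))
print(f"avg per-iteration time: {avg:.6f} s")
```

Reporting the per-iteration average over many timed runs, rather than a single first run, is what makes the 10x-allgather numbers in this thread comparable across transports.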
@alsrgv, I made a mistake: I forgot to enable RDMA in MPI before performing the 3rd test above. Now it's fixed. Here's the timing of the last 9 of the 10 allgathers in the previous experiment. It doesn't vary much.
Since no fusion actually happened in this test case (10^8 float32), I disabled fusion, and the time dropped from 34251.899 ms to 27125.241 ms. I took the measurement by summing up all the bars in the timeline result.
Since you mentioned fusion, I suspect my model benefits from fusing the allgathers, which are actually padded allreduces, with the other, normal allreduces. I observed that this type of fusion really does happen, which might explain why padded allreduce performs better in my model.
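The fusion idea discussed here, packing many small tensors into one flat buffer so a single collective launch covers all of them, can be sketched as follows. This is a NumPy stand-in of my own, with a pluggable `collective` function in place of a real allreduce; it illustrates the pack/launch/unpack pattern, not Horovod's actual fusion buffer.

```python
import numpy as np

def fused(tensors, collective):
    """Pack tensors into one buffer, run one collective, unpack the results."""
    buf = np.concatenate([t.ravel() for t in tensors])  # pack into a flat buffer
    buf = collective(buf)                               # one launch for all tensors
    out, off = [], 0
    for t in tensors:                                   # unpack to original shapes
        out.append(buf[off:off + t.size].reshape(t.shape))
        off += t.size
    return out

# Example with a trivial "collective" that doubles every element:
a, b = np.ones((2, 2)), np.arange(3.0)
ra, rb = fused([a, b], lambda x: x * 2)
print(ra)  # the 2x2 ones tensor, doubled
```

The benefit is that per-operation launch and negotiation overhead is paid once per buffer instead of once per tensor, which is why fusing the padded allreduces with the ordinary allreduces could plausibly outweigh the padding waste in this model.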
@roastduck, ah, these new numbers are much better. What if you try disabling fusion for your real model: do you see performance degradation of the padded allreduce? Sounds like we should add a fused allgather.
@alsrgv, in my model there are 2 large allgathers and many small allreduces. It doesn't make sense to disable fusion for my model, because those allreduces would surely suffer a performance degradation. I'd like to see a fused allgather, but if allgathers cannot be fused with allreduces, it won't benefit me.
@roastduck, sounds like what you're doing is best for your model :-) You can still play with the various cycle time settings to find the best value for your model.
Very interesting discussion. Why not try ncclAllGather? I'll try it. #252
@anpark, I found snuspl's fork tried to add NCCL allgather support, but it doesn't perform well. You may take a look.
@roastduck, I'm doing this now. From the timeline I found two big performance problems:
In my test, large-tensor allgather is too slow and is the bottleneck; could you give some advice? One more thing: cudaMallocHost will use DMA without the CPU, so cudaMemcpy is not needed?
Let's discuss allgather optimizations in this thread: #430.
I think each Allreduce op in the timeline should have exactly 2 lines: NEGOTIATE_ALLREDUCE or ALLREDUCE above, and the specific operations below. But from time to time I get 4 lines. See the image.
There's a NEGOTIATE_ALLREDUCE phase inside an ALLREDUCE phase. Is this normal, or a bug? Running with v0.13.4.