
Something unexpected in timeline #316

Closed
roastduck opened this issue Jun 16, 2018 · 22 comments

@roastduck

I think each Allreduce op in the timeline should have exactly two rows: NEGOTIATE_ALLREDUCE or ALLREDUCE on top, and the specific operations below. But from time to time I get four rows. See the screenshot.

[screenshot: timeline showing a NEGOTIATE_ALLREDUCE phase nested inside an ALLREDUCE phase]

There's a NEGOTIATE_ALLREDUCE phase inside an ALLREDUCE phase. Is this normal or a bug? I'm running v0.13.4.
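
For context, a minimal sketch of how such a timeline is captured: the documented route is to pass -x HOROVOD_TIMELINE=/path/to/file.json to mpirun; setting the variable in-process before hvd.init() is assumed to be equivalent, since Horovod reads it at startup, and the output path below is just a placeholder.

```python
import os

# Assumed equivalent to `mpirun -x HOROVOD_TIMELINE=...`: Horovod reads this
# environment variable when hvd.init() starts its background thread.
os.environ.setdefault("HOROVOD_TIMELINE", "/tmp/timeline.json")  # placeholder path

import horovod.tensorflow as hvd

hvd.init()
# ... build the model, wrap the optimizer in hvd.DistributedOptimizer(...),
# and train as usual; the coordinator rank writes a Chrome-trace JSON file
# to the path above, which can be inspected in chrome://tracing.
```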

@alsrgv
Member

alsrgv commented Jun 16, 2018

@roastduck, this is quite interesting. Could you share your timeline? I'm surprised to see the next negotiation start before the last iteration has ended. Are you trying to do 1-late SGD or something along those lines?

@roastduck
Author

timeline_decomp.zip

Here's the timeline file.

I'm doing data parallelism with strong consistency, not 1-late SGD.

Please don't be confused by that v0/tower_0/HorovodAllreduce_v0_tower_0_HorovodAllgather_v0_tower_0_cg_Reshape_0_0 op. I'm just doing allgather via allreduce (padding with zeros and then performing allreduce), because there is no NCCL support for allgather.
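
For readers following along, here is a rough sketch of that zero-padding trick against the horovod.tensorflow API. The helper name padded_allgather is made up for illustration, and it assumes every rank contributes an equally sized, statically shaped tensor; it is not necessarily the exact code used here.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def padded_allgather(tensor):
    """Emulate allgather with a sum-allreduce by zero-padding.

    Each rank drops its flattened slice into the rank-th position of a buffer
    of size hvd.size() * slice_size and zero-fills the rest; summing the
    buffers across ranks then yields the concatenation of all slices.
    Call only after hvd.init().
    """
    world_size = hvd.size()
    rank = hvd.rank()
    local = tf.reshape(tensor, [-1])
    n = local.shape.num_elements()  # assumes a statically known shape
    padded = tf.concat([
        tf.zeros([rank * n], dtype=local.dtype),
        local,
        tf.zeros([(world_size - 1 - rank) * n], dtype=local.dtype),
    ], axis=0)
    # average=False keeps a plain sum, so each rank's slice passes through intact.
    return hvd.allreduce(padded, average=False)
```

Note that the buffer being reduced is hvd.size() times larger than each rank's slice, which is why a native allgather would be expected to be cheaper in principle.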

@roastduck
Author

One more thing: I merged the fp16_divide_before_sum branch to use fp16 allreduce.

@alsrgv
Member

alsrgv commented Jun 19, 2018

@roastduck, do you get better performance with this padded allreduce than with the plain MPI allgather that is already supported? Since allgather is focused on data transmission rather than on computing a sum, I would expect less benefit from running it on the GPU.

@roastduck
Author

As described in docs/running.md, if I use RDMA for allreduce (with NCCL), then I cannot use RDMA for allgather (with MPI). So I think the padded allreduce is better, because it uses RDMA. Am I right?

@alsrgv
Member

alsrgv commented Jun 19, 2018

@roastduck, that's not entirely true: while MPI will not do GPU-to-GPU RDMA, it can still do GPU->CPU->RDMA->CPU->GPU, which still offloads the data transfer to the NIC. Which MPI and NIC are you using?

Regardless, you are still likely to get a lot of benefit from using allgather because of its simplicity. Did you try comparing the performance of the two approaches?

@roastduck
Author

@alsrgv, padded allreduce does gain some performance, although the difference is not significant.

@roastduck
Author

roastduck commented Jun 20, 2018

@alsrgv, I'm confused by this passage in docs/running.md:

> -mca pml ob1 and -mca btl ^openib flags force the use of TCP for MPI communication. This avoids many multiprocessing issues that Open MPI has with RDMA which typically result in segmentation faults. Using TCP for MPI does not have noticeable performance impact since most of the heavy communication is done by NCCL, which will use RDMA via RoCE or InfiniBand if they're available

Does MPI really use RDMA in this scenario?

By the way, I think this padded allreduce is just a workaround; using NCCL's native allgather would be the best solution. I found that snuspl's fork tried to add NCCL allgather support. I tested it, but it was very slow.

@alsrgv
Member

alsrgv commented Jun 20, 2018

@roastduck, can you share numerically how much better padded NCCL allreduce over RDMA is than MPI allgather over TCP? 5%, 10%, 50%?

The instructions we have for Open MPI do indeed disable RDMA for MPI. The reason is that the default OpenIB BTL doesn't work with multi-threading. I've just pushed a branch / PR that adds a flag to disable threading: #324

If you merge that PR into your working copy, you can use MPI RDMA by removing -mca btl ^openib and adding -x HOROVOD_MPI_THREADS_DISABLE=1. Can you try that and report whether it improves MPI allgather performance?

@roastduck
Author

roastduck commented Jun 20, 2018

@alsrgv, sorry, my claim that padded allreduce gains some performance was wrong. That result came from testing a complete DNN model, where there are too many factors influencing performance.

This time I ran a dedicated test for allgather: an allgather of 10^9 float32 items over EDR InfiniBand. See below:

| Method | Negotiation (ms) | Transmission (ms) | Total (ms) |
| --- | --- | --- | --- |
| Original | 1121.820 | 12314.193 | 13436.013 |
| Padded allreduce | 1641.637 | 21470.863 | 23112.500 |
| PR #324 | 1049.802 | 4173.146 | 5222.948 |

Padded allreduce is actually slower, and MPI RDMA is much faster. However, when training my own model, padded allreduce really is faster (I tested it again) :(

@alsrgv
Member

alsrgv commented Jun 20, 2018

@roastduck, how many allgathers did you run for this test? The first operation can carry setup overhead, especially in the case of NCCL. Can you quantify how much faster your own model trains with padded allreduce?

@roastduck
Author

roastduck commented Jun 21, 2018

@alsrgv, the last result came from testing only one allgather. I have now run a new test that allgathers 10^8 float32 items 10 times in the same environment.

| Method | Total negotiation + transmission time (ms) |
| --- | --- |
| Original | 14870.601 |
| Padded allreduce | 38593.184 |
| PR #324 | 3036.732 |

As for my own model, I cannot get exact timings because the timeline result is not stable, and there's the strange behavior we discussed in this issue, so I evaluate performance by sample throughput. The padded allreduce increases throughput from around 190 samples/sec to around 200 samples/sec (in one of the more complex configurations).
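
For reference, a dedicated micro-benchmark along these lines could look roughly like the sketch below (TF 1.x graph mode, as used with Horovod 0.13; the element count and iteration count mirror the test above, and the first run is treated as warm-up so one-time NCCL/MPI setup cost is excluded):

```python
import time

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

N = 10 ** 8  # float32 elements per rank, as in the test above
data = tf.ones([N], dtype=tf.float32)
gathered = hvd.allgather(data)  # or the padded-allreduce variant sketched earlier

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    sess.run(gathered)  # warm-up run, not measured
    start = time.time()
    for _ in range(10):
        sess.run(gathered)
    if hvd.rank() == 0:
        print("mean allgather time: %.3f s" % ((time.time() - start) / 10))
```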

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, can you start measuring time after the first iteration completes? Otherwise, variability during startup can make these numbers very noisy.

You can also try playing with the -x HOROVOD_CYCLE_TIME=[...] environment variable, which dictates the delay between Horovod cycles, as described in https://github.com/uber/horovod/blob/master/docs/tensor-fusion.md. The default value is 5, meaning a 5 millisecond delay. I suspect that padded allreduce may simply make one of the cycles longer. You can try increasing and decreasing the value to find potentially better performance.
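
One way to sweep that setting is a small launcher like the sketch below; the GPU count and train.py are placeholders, and -x simply re-exports the variable to all ranks:

```python
# Sweep HOROVOD_CYCLE_TIME (milliseconds between Horovod cycles) and rerun
# the same training script for each value.
import os
import subprocess

for cycle_ms in ("1", "2", "5", "10", "20"):
    env = dict(os.environ, HOROVOD_CYCLE_TIME=cycle_ms)
    subprocess.run(
        ["mpirun", "-np", "4", "-x", "HOROVOD_CYCLE_TIME",
         "python", "train.py"],  # train.py is a placeholder script
        env=env, check=True)
    print("finished run with HOROVOD_CYCLE_TIME =", cycle_ms)
```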

@roastduck
Author

roastduck commented Jun 21, 2018

@alsrgv, I made a mistake: I forgot to enable RDMA in MPI before running the third test above. That is now fixed.

Here's the timing of the last 9 of the 10 allgathers from the previous experiment. It doesn't vary much.

| Method | Total negotiation + transmission time (ms) |
| --- | --- |
| Original | 12266.511 |
| Padded allreduce | 34251.899 |
| Padded allreduce (fusion disabled) | 27125.241 |
| PR #324 | 2465.120 |

Since no fusion actually happens in this test case (a single tensor of 10^8 float32 values at a time), I disabled fusion, and the time dropped from 34251.899 ms to 27125.241 ms.

I did the measurement by summing up all the bars in the timeline result.

@roastduck
Author

Since you mentioned fusion, I guess my model might benefit from fusing the allgathers (which are actually padded allreduces) with the other, normal allreduces. I observed that this type of fusion really does happen, which might explain why padded allreduce performs better in my model.

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, ah, these new numbers are much better. What happens if you disable fusion for your real model? Do you see a performance degradation with the padded allreduce? Sounds like we should add a fused allgather.

@roastduck
Author

@alsrgv, in my model there are 2 large allgathers and many small allreduces. It doesn't make sense to disable fusion for my model, because those allreduces would surely suffer a performance degradation.

I'd like to see a fused allgather, but if allgathers cannot be fused with allreduces, it won't benefit me.

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, sounds like what you're doing is best for your model :-) You can still play with various cycle time settings to find the best value for your model.

@anpark

anpark commented Aug 4, 2018

Very interesting discussion.

Why not try ncclAllGather? I'll try it: #252
@roastduck @alsrgv

@roastduck
Author

@anpark, I found that snuspl's fork tried to add NCCL allgather support, but it doesn't perform well. You may want to take a look.

@anpark

anpark commented Aug 5, 2018

@roastduck, I'm working on this now.

From the timeline I found two big performance problems:

  1. new/delete of large buffers is not efficient; for allgather, for example, new/delete of a ~300 MB buffer costs a lot of time, so a buffer pool should be used
  2. MPI_Allgather should be replaced with ncclAllGather; in my tests, MPI_Allgather is too slow for large buffers

In my test, allgather of large tensors is too slow and is the bottleneck. Could you give some advice? I have 16 GPUs, and every batch each GPU has 41000 * 128 floats to allgather. This comes from a large embedding lookup (the gradient is IndexedSlices, so allgather is used), so each GPU will receive 15 x 128 x 41000 x 4 bytes = 305+ MB (GPU -> CPU -> network -> CPU -> GPU). But if we scale to 32 or 64 GPUs, maybe it grows to 3 GB and becomes much slower? (Rough numbers below.)
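
To put those sizes in perspective, here is the back-of-the-envelope arithmetic behind the 305 MB figure, assuming each GPU's slice stays fixed at 41000 x 128 float32 values as the GPU count grows:

```python
# Per-GPU allgather volume for a fixed 41000 x 128 float32 slice per rank.
slice_bytes = 41000 * 128 * 4  # bytes contributed by each GPU (~21 MB)

for gpus in (16, 32, 64):
    received = (gpus - 1) * slice_bytes  # bytes each GPU receives from its peers
    print("%d GPUs: %.0f MB received per GPU" % (gpus, received / 1e6))

# 16 GPUs -> ~315 MB, in line with the 305M+ figure above; growth is roughly
# linear in the GPU count, so 64 GPUs would be on the order of 1.3 GB per GPU.
```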

One more thing: cudaMallocHost uses DMA without involving the CPU, so is cudaMemcpy not needed?

@alsrgv
Member

alsrgv commented Aug 9, 2018

Let's discuss allgather optimizations in this thread: #430.
