
Something unexpected in timeline #316

Closed
roastduck opened this issue Jun 16, 2018 · 22 comments

@roastduck

I think each Allreduce op in the timeline should have exactly two rows: NEGOTIATE_ALLREDUCE or ALLREDUCE on top, and the specific operations below. But from time to time I get four rows. See the screenshot.

[screenshot: timeline showing a NEGOTIATE_ALLREDUCE phase nested inside an ALLREDUCE phase]

There's a NEGOTIATE_ALLREDUCE phase inside an ALLREDUCE phase. Is this normal or a bug? I'm running v0.13.4.
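
For context, a minimal sketch of how such a timeline is captured: the documented route is to pass -x HOROVOD_TIMELINE=/path/to/file.json to mpirun; setting the variable in-process before hvd.init() is assumed to be equivalent, since Horovod reads it at startup, and the output path below is just a placeholder.

```python
import os

# Assumed equivalent to `mpirun -x HOROVOD_TIMELINE=...`: Horovod reads this
# environment variable when hvd.init() starts its background thread.
os.environ.setdefault("HOROVOD_TIMELINE", "/tmp/timeline.json")  # placeholder path

import horovod.tensorflow as hvd

hvd.init()
# ... build the model, wrap the optimizer in hvd.DistributedOptimizer(...),
# and train as usual; the coordinator rank writes a Chrome-trace JSON file
# to the path above, which can be inspected in chrome://tracing.
```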

@alsrgv
Member

alsrgv commented Jun 16, 2018

@roastduck, this is quite interesting. Could you share your timeline? I'm surprised to see the next negotiation start before the last iteration has ended. Are you trying to do 1-late SGD or something along those lines?

@roastduck
Author

timeline_decomp.zip

Here's the timeline file.

I'm doing data parallelism with strong consistency, not 1-late SGD.

Please don't be confused by that v0/tower_0/HorovodAllreduce_v0_tower_0_HorovodAllgather_v0_tower_0_cg_Reshape_0_0 op. I'm just doing allgather via allreduce (padding with zeros and then performing allreduce), because there is no NCCL support for allgather.
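
For readers following along, here is a rough sketch of that zero-padding trick against the horovod.tensorflow API. The helper name padded_allgather is made up for illustration, and it assumes every rank contributes an equally sized, statically shaped tensor; it is not necessarily the exact code used here.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def padded_allgather(tensor):
    """Emulate allgather with a sum-allreduce by zero-padding.

    Each rank drops its flattened slice into the rank-th position of a buffer
    of size hvd.size() * slice_size and zero-fills the rest; summing the
    buffers across ranks then yields the concatenation of all slices.
    Call only after hvd.init().
    """
    world_size = hvd.size()
    rank = hvd.rank()
    local = tf.reshape(tensor, [-1])
    n = local.shape.num_elements()  # assumes a statically known shape
    padded = tf.concat([
        tf.zeros([rank * n], dtype=local.dtype),
        local,
        tf.zeros([(world_size - 1 - rank) * n], dtype=local.dtype),
    ], axis=0)
    # average=False keeps a plain sum, so each rank's slice passes through intact.
    return hvd.allreduce(padded, average=False)
```

Note that the buffer being reduced is hvd.size() times larger than each rank's slice, which is why a native allgather would be expected to be cheaper in principle.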

@roastduck
Author

One more thing: I merged the fp16_divide_before_sum branch to use fp16 allreduce.

@alsrgv
Member

alsrgv commented Jun 19, 2018

@roastduck, do you get better performance with this padded allreduce than with the plain MPI allgather that is already supported? Since allgather is focused on data transmission rather than on computing a sum, I would expect less benefit from running it on the GPU.

@roastduck
Author

As described in docs/running.md, if I use RDMA for allreduce (with NCCL), then I cannot use RDMA for allgather (with MPI). So I think the padded allreduce is better, because it uses RDMA. Am I right?

@alsrgv
Member

alsrgv commented Jun 19, 2018

@roastduck, that's not entirely true: while MPI will not do GPU-to-GPU RDMA, it can still do GPU->CPU->RDMA->CPU->GPU, which still offloads the data transfer to the NIC. Which MPI and NIC are you using?

Regardless, you are still likely to get a lot of benefit from using allgather because of its simplicity. Did you try comparing the performance of the two approaches?

@roastduck
Author

@alsrgv, padded allreduce does gain some performance, although the difference is not significant.

@roastduck
Author

roastduck commented Jun 20, 2018

@alsrgv, I'm confused by this passage in docs/running.md:

> -mca pml ob1 and -mca btl ^openib flags force the use of TCP for MPI communication. This avoids many multiprocessing issues that Open MPI has with RDMA which typically result in segmentation faults. Using TCP for MPI does not have noticeable performance impact since most of the heavy communication is done by NCCL, which will use RDMA via RoCE or InfiniBand if they're available

Does MPI really use RDMA in this scenario?

By the way, I think this padded allreduce is just a workaround; using NCCL's native allgather would be the best solution. I found that snuspl's fork tried to add NCCL allgather support. I tested it, but it was very slow.

@alsrgv
Member

alsrgv commented Jun 20, 2018

@roastduck, can you share numerically how much better padded NCCL allreduce over RDMA is than MPI allgather over TCP? 5%, 10%, 50%?

The instructions we have for Open MPI do indeed disable RDMA for MPI. The reason is that the default OpenIB BTL doesn't work with multi-threading. I've just pushed a branch / PR that adds a flag to disable threading: #324

If you merge that PR into your working copy, you can use MPI RDMA by removing -mca btl ^openib and adding -x HOROVOD_MPI_THREADS_DISABLE=1. Can you try that and report whether it improves MPI allgather performance?

@roastduck
Author

roastduck commented Jun 20, 2018

@alsrgv, sorry, my claim that padded allreduce gains some performance was wrong. That result came from testing a complete DNN model, where there are too many factors influencing performance.

This time I ran a dedicated test for allgather: an allgather of 10^9 float32 items over EDR InfiniBand. See below:

| Method | Negotiation (ms) | Transmission (ms) | Total (ms) |
| --- | --- | --- | --- |
| Original | 1121.820 | 12314.193 | 13436.013 |
| Padded allreduce | 1641.637 | 21470.863 | 23112.500 |
| PR #324 | 1049.802 | 4173.146 | 5222.948 |

Padded allreduce is actually slower, and MPI RDMA is much faster. However, when training my own model, padded allreduce really is faster (I tested it again) :(

@alsrgv
Member

alsrgv commented Jun 20, 2018

@roastduck, how many allgathers did you run for this test? The first operation can carry setup overhead, especially in the case of NCCL. Can you quantify how much faster your own model trains with padded allreduce?

@roastduck
Author

roastduck commented Jun 21, 2018

@alsrgv, the last result came from testing only one allgather. I have now run a new test that allgathers 10^8 float32 items 10 times in the same environment.

| Method | Total negotiation + transmission time (ms) |
| --- | --- |
| Original | 14870.601 |
| Padded allreduce | 38593.184 |
| PR #324 | 3036.732 |

As for my own model, I cannot get exact timings because the timeline result is not stable, and there's the strange behavior we discussed in this issue, so I evaluate performance by sample throughput. The padded allreduce increases throughput from around 190 samples/sec to around 200 samples/sec (in one of the more complex configurations).
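
For reference, a dedicated micro-benchmark along these lines could look roughly like the sketch below (TF 1.x graph mode, as used with Horovod 0.13; the element count and iteration count mirror the test above, and the first run is treated as warm-up so one-time NCCL/MPI setup cost is excluded):

```python
import time

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

N = 10 ** 8  # float32 elements per rank, as in the test above
data = tf.ones([N], dtype=tf.float32)
gathered = hvd.allgather(data)  # or the padded-allreduce variant sketched earlier

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    sess.run(gathered)  # warm-up run, not measured
    start = time.time()
    for _ in range(10):
        sess.run(gathered)
    if hvd.rank() == 0:
        print("mean allgather time: %.3f s" % ((time.time() - start) / 10))
```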

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, can you start measuring time after the first iteration completes? Otherwise, variability during startup can make these numbers very noisy.

You can also try playing with the -x HOROVOD_CYCLE_TIME=[...] environment variable, which dictates the delay between Horovod cycles, as described in https://github.com/uber/horovod/blob/master/docs/tensor-fusion.md. The default value is 5, meaning a 5 millisecond delay. I suspect that padded allreduce may simply make one of the cycles longer. You can try increasing and decreasing the value to find potentially better performance.
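
One way to sweep that setting is a small launcher like the sketch below; the GPU count and train.py are placeholders, and -x simply re-exports the variable to all ranks:

```python
# Sweep HOROVOD_CYCLE_TIME (milliseconds between Horovod cycles) and rerun
# the same training script for each value.
import os
import subprocess

for cycle_ms in ("1", "2", "5", "10", "20"):
    env = dict(os.environ, HOROVOD_CYCLE_TIME=cycle_ms)
    subprocess.run(
        ["mpirun", "-np", "4", "-x", "HOROVOD_CYCLE_TIME",
         "python", "train.py"],  # train.py is a placeholder script
        env=env, check=True)
    print("finished run with HOROVOD_CYCLE_TIME =", cycle_ms)
```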

@roastduck
Author

roastduck commented Jun 21, 2018

@alsrgv, I made a mistake: I forgot to enable RDMA in MPI before running the third test above. That is now fixed.

Here's the timing of the last 9 of the 10 allgathers from the previous experiment. It doesn't vary much.

| Method | Total negotiation + transmission time (ms) |
| --- | --- |
| Original | 12266.511 |
| Padded allreduce | 34251.899 |
| Padded allreduce (fusion disabled) | 27125.241 |
| PR #324 | 2465.120 |

Since no fusion actually happens in this test case (a single tensor of 10^8 float32 values at a time), I disabled fusion, and the time dropped from 34251.899 ms to 27125.241 ms.

I did the measurement by summing up all the bars in the timeline result.

@roastduck
Author

Since you mentioned fusion, I guess my model might benefit from fusing the allgathers (which are actually padded allreduces) with the other, normal allreduces. I observed that this type of fusion really does happen, which might explain why padded allreduce performs better in my model.

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, ah, these new numbers are much better. What happens if you disable fusion for your real model? Do you see a performance degradation with the padded allreduce? Sounds like we should add a fused allgather.

@roastduck
Author

@alsrgv, in my model there are 2 large allgathers and many small allreduces. It doesn't make sense to disable fusion for my model, because those allreduces would surely suffer a performance degradation.

I'd like to see a fused allgather, but if allgathers cannot be fused with allreduces, it won't benefit me.

@alsrgv
Member

alsrgv commented Jun 21, 2018

@roastduck, sounds like what you're doing is best for your model :-) You can still play with various cycle time settings to find the best value for your model.

@anpark

anpark commented Aug 4, 2018

Very interesting discussion.

Why not try ncclAllGather? I'll try it: #252
@roastduck @alsrgv

@roastduck
Author

@anpark, I found that snuspl's fork tried to add NCCL allgather support, but it doesn't perform well. You may want to take a look.

@anpark

anpark commented Aug 5, 2018

@roastduck, I'm working on this now.

From the timeline I found two big performance problems:

  1. new/delete of large buffers is not efficient; for allgather, for example, new/delete of a ~300 MB buffer costs a lot of time, so a buffer pool should be used
  2. MPI_Allgather should be replaced with ncclAllGather; in my tests, MPI_Allgather is too slow for large buffers

In my test, allgather of large tensors is too slow and is the bottleneck. Could you give some advice? I have 16 GPUs, and every batch each GPU has 41000 * 128 floats to allgather. This comes from a large embedding lookup (the gradient is IndexedSlices, so allgather is used), so each GPU will receive 15 x 128 x 41000 x 4 bytes = 305+ MB (GPU -> CPU -> network -> CPU -> GPU). But if we scale to 32 or 64 GPUs, maybe it grows to 3 GB and becomes much slower? (Rough numbers below.)
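
To put those sizes in perspective, here is the back-of-the-envelope arithmetic behind the 305 MB figure, assuming each GPU's slice stays fixed at 41000 x 128 float32 values as the GPU count grows:

```python
# Per-GPU allgather volume for a fixed 41000 x 128 float32 slice per rank.
slice_bytes = 41000 * 128 * 4  # bytes contributed by each GPU (~21 MB)

for gpus in (16, 32, 64):
    received = (gpus - 1) * slice_bytes  # bytes each GPU receives from its peers
    print("%d GPUs: %.0f MB received per GPU" % (gpus, received / 1e6))

# 16 GPUs -> ~315 MB, in line with the 305M+ figure above; growth is roughly
# linear in the GPU count, so 64 GPUs would be on the order of 1.3 GB per GPU.
```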

One more thing: cudaMallocHost uses DMA without involving the CPU, so is cudaMemcpy not needed?

@alsrgv
Member

alsrgv commented Aug 9, 2018

Let's discuss allgather optimizations in this thread: #430.
