Will grouped_allgather be supported in the future? #3325

Closed
wuyujiji opened this issue Dec 16, 2021 · 3 comments

wuyujiji commented Dec 16, 2021

Hello, Horovod 0.21.0 supports grouped allreduce to enable more efficient tensor fusion and deterministic training. Will the group mechanism also be extended to allgather?

The motivation: I find that the forward pass of SyncBatchNorm executes three allgathers (count, mean, and invstd), which costs a lot of time, since these three allgathers generate three separate requests in Horovod, whereas a grouped_allgather would generate only one request (analogous to grouped_allreduce).

  # calculate mean/invstd for input.
  mean, invstd = torch.batch_norm_stats(input, eps)

  # launch one allgather request per tensor (allgather_async/synchronize from horovod.torch)
  count_handle = allgather_async(count.unsqueeze(0), name='sync_batch_norm.count')
  mean_handle = allgather_async(mean.unsqueeze(0), name='sync_batch_norm.mean')
  invstd_handle = allgather_async(invstd.unsqueeze(0), name='sync_batch_norm.invstd')

  # wait on the async communication to finish
  count_all = synchronize(count_handle)
  mean_all = synchronize(mean_handle)
  invstd_all = synchronize(invstd_handle)
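For illustration only, if a grouped allgather mirrored the grouped_allreduce interface, the forward pass could issue a single request. The name grouped_allgather_async and the single-handle synchronize below are assumptions, not an existing Horovod API:

  # Hypothetical API sketch -- grouped_allgather_async does not exist in Horovod at the time of writing.
  handle = grouped_allgather_async(
      [count.unsqueeze(0), mean.unsqueeze(0), invstd.unsqueeze(0)],
      name='sync_batch_norm.count_mean_invstd')

  # one synchronize would return all three gathered tensors from the single request
  count_all, mean_all, invstd_all = synchronize(handle)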

In the backward pass, I can replace the two allreduces for sum_dy and sum_dy_xmu with a single grouped_allreduce, as shown below:

# The original implementation using two separate allreduces
sum_dy_handle = allreduce_async(sum_dy, op=Sum, name='sync_batch_norm.sum_dy')
sum_dy_xmu_handle = allreduce_async(sum_dy_xmu, op=Sum, name='sync_batch_norm.sum_dy_xmu')
sum_dy = synchronize(sum_dy_handle)
sum_dy_xmu = synchronize(sum_dy_xmu_handle)

# The equivalent implementation using a single grouped_allreduce
sum_dy, sum_dy_xmu = grouped_allreduce([sum_dy, sum_dy_xmu], op=Sum, name='sync_batch_norm.sum_dy_and_dy_xmu')
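To keep the asynchronous pattern of the original code, horovod.torch also provides grouped_allreduce_async. A minimal sketch, assuming synchronize on the grouped handle returns the list of reduced tensors as in the non-grouped case:

from horovod.torch import grouped_allreduce_async, synchronize, Sum

# both reductions go out as one grouped request; a single handle is awaited
handle = grouped_allreduce_async([sum_dy, sum_dy_xmu], op=Sum,
                                 name='sync_batch_norm.sum_dy_and_dy_xmu')
sum_dy, sum_dy_xmu = synchronize(handle)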

For comparison, the apex SyncBatchNorm implementation achieves a grouped allgather in the forward pass and a grouped allreduce in the backward pass by concatenating tensors and calling torch.distributed directly. In my experiments, the performance of apex SyncBatchNorm is better than Horovod's.

# forward
count_t = torch.empty(1, dtype=mean.dtype, device=mean.device).fill_(count)
# concatenate mean, biased variance, and count so a single all_gather is enough
combined = torch.cat([mean.view(-1), var_biased.view(-1), count_t], dim=0)
combined_list = [torch.empty_like(combined) for k in range(world_size)]
torch.distributed.all_gather(combined_list, combined, process_group)
combined = torch.stack(combined_list, dim=0)
mean_all, invstd_all, count_all = torch.split(combined, num_channels, dim=1)

# backward
num_channels = sum_dy.shape[0]
# concatenate the two gradient statistics so a single all_reduce is enough
combined = torch.cat([sum_dy, sum_dy_xmu], dim=0)
torch.distributed.all_reduce(combined, torch.distributed.ReduceOp.SUM, process_group, async_op=False)
sum_dy, sum_dy_xmu = torch.split(combined, num_channels)
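Until a native grouped_allgather is available, the same concatenation trick could be applied with Horovod's existing allgather so that only one request is issued per forward pass. This is a sketch, not Horovod's SyncBatchNorm code; the tensor name 'sync_batch_norm.combined_stats' is made up, and count is assumed to already be a tensor as in the snippet above:

import torch
from horovod.torch import allgather_async, synchronize

# pack mean, invstd, and count into one tensor so a single allgather request is issued
num_channels = mean.numel()
count_t = count.reshape(-1).to(mean.dtype)
combined = torch.cat([mean.view(-1), invstd.view(-1), count_t], dim=0).unsqueeze(0)

handle = allgather_async(combined, name='sync_batch_norm.combined_stats')
combined_all = synchronize(handle)  # shape: [world_size, 2 * num_channels + count_t.numel()]

mean_all, invstd_all, count_all = torch.split(
    combined_all, [num_channels, num_channels, count_t.numel()], dim=1)
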
wuyujiji (Author) commented:

@tgaddair @romerojosh


stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Feb 18, 2022
stale bot closed this as completed on Feb 25, 2022
maxhgerlach (Collaborator) commented:

@wuyujiji, I've started to work on this in #3594. Support for PyTorch is still missing, though.
