Sync Batch Norm for PyTorch #1923
Conversation
This is awesome! Just a few nits.
horovod/torch/sync_batch_norm.py
            self.eps, self.momentum)

    def forward(self, input):
        # currently only GPU input is supported
Can you add a TODO stating what would be needed to get it to work on CPU?
This is constrained by PyTorch internal kernels that we use. They're only implemented on GPU. If we want to make it work on CPU, we should consider rolling our own implementation w/o dependency on PyTorch internals.
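For reference, a CPU path could in principle compute the local per-channel statistics with plain tensor ops instead of the internal kernels - a rough sketch only, not part of this PR (batch_norm_stats_cpu is a hypothetical helper):

import torch

def batch_norm_stats_cpu(input, eps):
    # Hypothetical helper (not in this PR): per-channel mean and inverse
    # standard deviation computed with plain tensor ops, so it also runs on CPU.
    c = input.size(1)
    x = input.transpose(0, 1).reshape(c, -1)  # flatten everything except channels
    mean = x.mean(dim=1)
    var = x.var(dim=1, unbiased=False)
    invstd = torch.rsqrt(var + eps)
    return mean, invstd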
@@ -1667,6 +1667,59 @@ def test_horovod_join_broadcast(self):
            ret = hvd.join(hvd.local_rank())
        else:
            ret = hvd.join()

    def test_horovod_sync_batch_norm(self):
You may also want to include a skip if the Python version is < 3 (the integration tests pass only because we no longer have any Python 2 GPU tests).
Added skip, but we should deprecate all Py2 tests :-)
Agreed. Tentative plan is to drop Python 2 in 0.20.0.
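For reference, the guard discussed above could look roughly like this (a minimal sketch; the class and test names mirror this PR, but the exact condition used may differ):

import sys
import unittest

class TorchTests(unittest.TestCase):
    @unittest.skipIf(sys.version_info < (3, 0),
                     'Sync BatchNorm is only tested on Python 3')
    def test_horovod_sync_batch_norm(self):
        # the real test body lives in test_torch.py
        pass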
Signed-off-by: Alex Sergeev <alexander.sergeev@live.com>
LGTM! Let's merge it!
        count_handle = allgather_async(count.unsqueeze(0), name='sync_batch_norm.count')
        mean_handle = allgather_async(mean.unsqueeze(0), name='sync_batch_norm.mean')
        invstd_handle = allgather_async(invstd.unsqueeze(0), name='sync_batch_norm.invstd')
We would have a lot of BatchNorm instances in a network. Would we need to add the instance id to the name arguments in the allreduce_async calls here and below (lines 157, 158)?
It hasn't seemed necessary so far - the current implementation of SyncBN requires each worker to execute the model in the same sequence with respect to batch norm layers.
Got it, thanks!
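If per-instance names ever did become necessary, one option would be to suffix each collective name with a per-layer counter - a hypothetical sketch, not part of this PR:

import itertools

# Hypothetical: a module-level counter each SyncBatchNorm instance could draw
# from to disambiguate the names of its allgather/allreduce operations.
_instance_ids = itertools.count()

def collective_names(instance_id):
    # e.g. {'count': 'sync_batch_norm.0.count', 'mean': 'sync_batch_norm.0.mean', ...}
    return {stat: 'sync_batch_norm.%d.%s' % (instance_id, stat)
            for stat in ('count', 'mean', 'invstd')}

names = collective_names(next(_instance_ids))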
        # before 1.6.0, sum_dy was the sum of means from every worker, so we
        # just need to divide it by the number of workers
        mean_dy = sum_dy / size()
        mean_dy_xmu = sum_dy_xmu / size()
I think we should divide by count_all.sum() for both PyTorch 1.5 and 1.6.
For PyTorch 1.5, right now if I comment out the _run_bn code path (not using torch.nn.BatchNorm) and run the following on v0.19.3 locally:
torch.cuda.manual_seed(2020)
bn = torch.nn.BatchNorm2d(10).cuda()
bn_hvd = SyncBatchNorm(10).cuda()
x = torch.rand(3, 10, 8, 8).cuda()
x1 = x.clone().requires_grad_()
x2 = x.clone().requires_grad_()
y = bn(x1)
y_hvd = bn_hvd(x2)
y.sum().backward()
y_hvd.sum().backward()
print((x1.grad - x2.grad).abs().sum())
I got
tensor(1298650., device='cuda:0')
But if we use count_all.sum() here, the result is more reasonable:
tensor(0.0004, device='cuda:0')
Great catch, I've raised #1980 to address this. Apparently they made breaking changes in both 1.5.0 and 1.6.0.
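For reference, a standalone illustration of the fix being discussed - normalizing by the total example count across workers instead of by the worker count (all tensors below are made-up placeholders, not the PR's variables):

import torch

count_all = torch.tensor([32., 32.])   # gathered per-worker example counts (assumed)
sum_dy = torch.randn(10)               # reduced gradient sums (assumed)
sum_dy_xmu = torch.randn(10)

# divide by the total number of examples rather than by the number of workers,
# so workers with different batch sizes are weighted correctly
total_count = count_all.sum()
mean_dy = sum_dy / total_count
mean_dy_xmu = sum_dy_xmu / total_count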
Does SyncBatchNorm support the use case where I have 2 nodes, each with 8 GPUs, and the sync only happens intra-node instead of inter-node (i.e., syncing 2 groups of 8 workers instead of all 16)?
@eric-haibin-lin, this implementation does not, but please feel free to extend it!
@alsrgv @eric-haibin-lin I think the main limitation here is that Horovod is currently designed around performing collectives across the global communicator, with no option to perform collectives on subsets of workers. @tgaddair, perhaps this is something we should begin to think about.
Good point, @romerojosh. Added #2139 to track.
Implementation of https://pytorch.org/docs/stable/nn.html#syncbatchnorm using Horovod. The current version uses the optimized CUDA kernels written for PyTorch and has two different invocations because their signatures changed between PyTorch releases. We can consider making our own implementation if it turns out to be a hassle.
Why Sync Batch Norm?
As evidenced by https://arxiv.org/abs/1804.07612, small batches are great for training neural networks. However, https://arxiv.org/abs/1803.08494 noted that multi-GPU training with small-batch BatchNorm is detrimental to performance.
SyncBatchNorm improves training where each worker can only hold a few examples (1-4) and the total number of workers is not too high - beyond that point BatchNorm starts to lose its regularization benefits.
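Typical usage is a drop-in replacement for the framework BatchNorm layers - a short sketch, assuming the class is exported as hvd.SyncBatchNorm and keeping in mind that only GPU inputs are currently supported:

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Per-channel statistics are reduced across all Horovod workers during training.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 10, kernel_size=3),
    hvd.SyncBatchNorm(10),  # assumed export name; see horovod/torch/sync_batch_norm.py
    torch.nn.ReLU(),
).cuda()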
Fixes #1384