Add process set support for MXNet #3043

maxhgerlach · 2021-07-13T09:17:04Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

This is a straightforward extension of #2839, adding the process set feature to the MXNet API.

Included are the ops allgather, allreduce, alltoall, broadcast, and grouped_allreduce, as well as the DistributedOptimizer.

There's also the DistributedTrainer, which I didn't look into. I don't have much experience with MXNet, but a Gluon Trainer seems to be a higher-level concept meant to operate on an entire neural net. In that case an overall process_set argument might not be the right call, as users will typically want to aggregate gradients over a process subset for only some of their parameters, not for the entire model.
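To illustrate the point about per-parameter aggregation, here is a minimal pure-Python toy model. It is not Horovod or MXNet code: `subset_allreduce`, the parameter names, and the `process_sets` mapping are all hypothetical and only mimic the idea of averaging over a subset of ranks. It shows one parameter reduced over all workers and another reduced only over ranks 0 and 1, which a single trainer-wide `process_set` argument could not express.

```python
from typing import Dict, List, Optional


def subset_allreduce(values: List[float],
                     ranks: Optional[List[int]] = None) -> List[float]:
    """Toy stand-in for an averaging allreduce (hypothetical helper).

    values[r] is the value held by rank r. Ranks listed in `ranks` end up
    with the mean over those ranks; all other ranks keep their own value.
    ranks=None means "all ranks", i.e. the global case.
    """
    members = list(range(len(values))) if ranks is None else ranks
    mean = sum(values[r] for r in members) / len(members)
    return [mean if r in members else v for r, v in enumerate(values)]


# Per-rank gradients of two made-up parameters on 4 workers.
grads: Dict[str, List[float]] = {
    "shared_weight": [1.0, 2.0, 3.0, 4.0],
    "expert_weight": [10.0, 20.0, 30.0, 40.0],
}

# Assumed mapping from parameter to the ranks that aggregate it.
process_sets: Dict[str, Optional[List[int]]] = {
    "shared_weight": None,    # aggregate globally
    "expert_weight": [0, 1],  # aggregate only on ranks 0 and 1
}

reduced = {name: subset_allreduce(g, process_sets[name])
           for name, g in grads.items()}

print(reduced["shared_weight"])  # [2.5, 2.5, 2.5, 2.5]
print(reduced["expert_weight"])  # [15.0, 15.0, 30.0, 40.0]
```

In a real distributed run, ranks outside a process set would simply not take part in that collective; the toy keeps their values unchanged to make the difference visible in a single process.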

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

github-actions · 2021-07-23T23:28:41Z

Unit Test Results

    784 files ±0     784 suites ±0 6h 37m 3s ⏱️ ±0s
    639 tests ±0     596 ✔️ ±0     42 💤 ±0 1 ❌ ±0
16 088 runs ±0 12 458 ✔️ ±0 3 629 💤 ±0 1 ❌ ±0

For more details on these failures, see this check.

Results for commit 0a42194. ± Comparison against base commit 0a42194.

♻️ This comment has been updated with latest results.

maxhgerlach mentioned this pull request on Jul 13, 2021, in "Concurrently running collective operations on process subsets [TensorFlow]" #2839 (merged, 21 tasks)

maxhgerlach added 4 commits July 13, 2021 18:00

Add process set support for MXNet ops

1f053bf

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Add process set support for MXNet DistributedOptimizer

4a93a62

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Fixes for MXNet < 2 and building without GPU support

375fb65

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Fix test_optimizer_process_set to work with mxnet==1.5.1.post0

9823377

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

maxhgerlach force-pushed the subset-mxnet-pr branch from 2f3f79d to 9823377 Compare July 13, 2021 16:01

tgaddair approved these changes Jul 23, 2021

tgaddair merged commit 0a42194 into horovod:master Jul 23, 2021
