
Use torch.nn.DataParallel for intra-node computation #46

Closed
tlin-taolin opened this issue Apr 3, 2020 · 4 comments

tlin-taolin commented Apr 3, 2020

It might be a good choice to use torch.nn.DataParallel for intra-node computation (across the multiple GPUs of a node) and intra-node gradient aggregation, and then use a different communication backend for inter-node communication.
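A rough sketch of what this hybrid setup could look like (the tiny model and the `aggregate_across_nodes` helper are placeholders, not part of the mlbench codebase; it assumes one process per node with the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables set):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

# Intra-node: DataParallel replicates the model over the node's visible GPUs
# and reduces the gradients onto the default GPU during backward().
model = nn.DataParallel(nn.Linear(128, 10).cuda())  # placeholder model

# Inter-node: one process per node joins a process group that is used only
# for cross-node communication; the backend is independent of DataParallel.
dist.init_process_group(backend="gloo", init_method="env://")

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def aggregate_across_nodes(module):
    # Hypothetical helper: average gradients across nodes; the intra-node
    # reduction was already done by DataParallel during backward().
    world_size = dist.get_world_size()
    for p in module.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

# One training step (sketch).
inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
aggregate_across_nodes(model)
optimizer.step()
optimizer.zero_grad()
```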

tlin-taolin changed the title from "Use Data" to "Use torch.nn.DataParallel for in-node computation" on Apr 3, 2020
tlin-taolin changed the title from "Use torch.nn.DataParallel for in-node computation" to "Use torch.nn.DataParallel for intra-node computation" on Apr 3, 2020
ehoelzl (Contributor) commented Apr 3, 2020

Wouldn't it be more appropriate to use DistributedDataParallel, as referenced here?
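For reference, a minimal DistributedDataParallel setup looks roughly like this (one process per GPU; it assumes a launcher that sets LOCAL_RANK and the rendezvous environment variables, e.g. torchrun or torch.distributed.launch with --use_env; the model is a placeholder):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; rank, world size and local rank come from the launcher.
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])  # DDP all-reduces gradients during backward()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step (sketch); gradient synchronization happens inside backward().
inputs = torch.randn(32, 128).cuda(local_rank)
targets = torch.randint(0, 10, (32,)).cuda(local_rank)
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```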

tlin-taolin (Author) commented

I think it should be a standard baseline for us to compare with, and we should check both DataParallel and DistributedDataParallel for the intra-node case. That said, I am not sure whether we can directly use DistributedDataParallel for intra-node communication in our current framework (it also requires running init_process_group).

BTW, I think DistributedDataParallel is mainly designed/optimized for centralized training across multiple nodes, and it loses the flexibility to use different communication strategies (e.g. compressed gradients) or different communication topologies (e.g. a ring topology for decentralized training). It would be good to get results from DistributedDataParallel (for distributed training) and compare them with our sync scheme.
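To make the flexibility point concrete, here is a rough sketch of the kind of neighbor-only (ring) gradient averaging that plain torch.distributed point-to-point primitives allow but DDP's built-in all-reduce does not expose. `ring_average_gradients` is a hypothetical helper, not our actual sync scheme, and it assumes a backend with send/recv support (e.g. Gloo, with gradients on CPU):

```python
import torch
import torch.distributed as dist

def ring_average_gradients(model):
    # Decentralized sketch: each worker averages its gradients only with its
    # two neighbors on a ring, instead of the global all-reduce done by DDP.
    rank, world = dist.get_rank(), dist.get_world_size()
    left, right = (rank - 1) % world, (rank + 1) % world
    for p in model.parameters():
        if p.grad is None:
            continue
        from_left = torch.empty_like(p.grad)
        from_right = torch.empty_like(p.grad)
        # Non-blocking send/recv to both neighbors to avoid deadlocks.
        reqs = [
            dist.isend(p.grad, dst=right),
            dist.isend(p.grad, dst=left),
            dist.irecv(from_left, src=left),
            dist.irecv(from_right, src=right),
        ]
        for req in reqs:
            req.wait()
        # Local averaging with the two neighbors only.
        p.grad.copy_((p.grad + from_left + from_right) / 3.0)
```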

martinjaggi (Member) commented
DDP overlaps gradient computation with communication; the effect should be noticeable. How does it compare to our reference implementations?
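One crude way to check how much of the communication is hidden is to time a step with and without gradient synchronization (DDP's no_sync() context manager skips the all-reduce); if the two timings are close, the all-reduce is well overlapped with backward(). A sketch, assuming ddp_model is already wrapped as in the snippet above:

```python
import contextlib
import time
import torch
import torch.nn as nn

def time_step(ddp_model, inputs, targets, sync=True):
    # Times one forward/backward pass; with sync=False, DDP's no_sync()
    # disables the (overlapped) gradient all-reduce entirely.
    ctx = contextlib.nullcontext() if sync else ddp_model.no_sync()
    torch.cuda.synchronize()
    start = time.time()
    with ctx:
        loss = nn.functional.cross_entropy(ddp_model(inputs), targets)
        loss.backward()
    torch.cuda.synchronize()
    ddp_model.zero_grad()
    return time.time() - start
```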

ehoelzl (Contributor) commented Dec 4, 2020

Closing in favor of mlbench/mlbench-benchmarks#69

ehoelzl closed this as completed on Dec 4, 2020
The mlbench-3.1.0 automation moved this from To do to Done on Dec 4, 2020
ehoelzl removed this from Done in mlbench-3.1.0 on Dec 4, 2020