
PyTorch v1.0.0 multi-GPU compatibility issue #13

Open
L0SG opened this issue Dec 21, 2018 · 5 comments · Fixed by #22
Comments

@L0SG
Collaborator

L0SG commented Dec 21, 2018

Currently, we cannot run multi-GPU training on PyTorch v1.0.0 due to a strange null-gradient issue.

@candlewill

candlewill commented Dec 21, 2018

Oh my God. I have been training the multi-GPU version for a week on all four of my GPUs, and only one checkpoint was generated in the params/flowavenet/ directory.

Thanks for pointing this out.

@L0SG
Collaborator Author

L0SG commented Dec 21, 2018

Oops, sorry about the delayed issue post in this repo. I filed a report in the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.

@L0SG
Collaborator Author

L0SG commented Feb 12, 2019

Update: the issue persists in the latest 1.0.1 release.

1ytic added a commit to 1ytic/FloWaveNet that referenced this issue Apr 22, 2019
The Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture.

List of changes made in train.py (sketched in code after this commit message):
1. Determine local_rank and world_size for torch.distributed.init_process_group
2. Set a current device with torch.cuda.set_device
3. Wrap dataset with torch.utils.data.distributed.DistributedSampler
4. Apply amp.scale_loss at each backward pass
5. Clip gradient with amp.master_params
6. Divide step_size by world_size (not sure if this is necessary)
7. Initialize model and optimizer with amp.initialize
8. Wrap model with apex.parallel.DistributedDataParallel
9. Handle evaluation and messages on the first node using args.local_rank

Resolves: ksw0306#13
See also: ksw0306#16
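
For reference, the changes listed above map roughly onto the minimal sketch below. It uses a tiny placeholder linear model, random data, and generic hyperparameters rather than the actual FloWaveNet code, so treat it as an outline of the pattern; the real implementation is train_apex.py in this repo.

```python
# Minimal sketch of the Apex-based distributed setup described in the commit above.
# Launch with: python -m torch.distributed.launch --nproc_per_node=NUM_GPUS this_script.py
# The linear model, random data, and hyperparameters are placeholders, not FloWaveNet code.
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# 1-2. Initialize the process group and pin this process to a single GPU.
torch.distributed.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)
world_size = torch.distributed.get_world_size()

model = torch.nn.Linear(16, 1).cuda()  # placeholder model
# 6. Divide the step size (learning rate) by world_size.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 / world_size)

# 7-8. Mixed-precision initialization, then Apex's DistributedDataParallel wrapper.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = DistributedDataParallel(model)

# 3. Shard the (placeholder) dataset across processes with DistributedSampler.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = torch.nn.MSELoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x.cuda()), y.cuda())
    # 4. Scale the loss so FP16 gradients do not underflow, then backpropagate.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # 5. Clip gradients on the FP32 master parameters.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 1.0)
    optimizer.step()
    # 9. Report/evaluate only on the first process.
    if args.local_rank == 0:
        print(loss.item())
```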
@L0SG L0SG closed this as completed in #22 Apr 23, 2019
@L0SG L0SG reopened this Apr 23, 2019
@L0SG
Collaborator Author

L0SG commented Apr 23, 2019

Note: the DistributedDataParallel implementation from @1ytic circumvents the multi-GPU issue, so please use train_apex.py on the master branch until the DataParallel issue in train.py is resolved.

@L0SG
Collaborator Author

L0SG commented Oct 10, 2019

Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.
