
PyTorch v1.0.0 multi-GPU compatibility issue #13

Open
L0SG opened this issue Dec 21, 2018 · 5 comments · Fixed by #22
Comments

@L0SG
Collaborator

L0SG commented Dec 21, 2018

Currently, we cannot run multi-GPU training on PyTorch v1.0.0 due to a strange null-gradient issue.

@candlewill

candlewill commented Dec 21, 2018

Oh my God. I have been training the multi-GPU version for a week on all four of my GPUs, and only one checkpoint was generated in the params/flowavenet/ directory.

Thanks for pointing this out.

@L0SG
Collaborator Author

L0SG commented Dec 21, 2018

Oops, sorry about the delayed issue post in this repo. I filed a report in the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.

@L0SG
Collaborator Author

L0SG commented Feb 12, 2019

Update: the issue persists in the latest 1.0.1 release.

1ytic added a commit to 1ytic/FloWaveNet that referenced this issue Apr 22, 2019
The Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture.

List of changes made in train.py (sketched in code after this commit message):
1. Determine local_rank and world_size for torch.distributed.init_process_group
2. Set a current device with torch.cuda.set_device
3. Wrap dataset with torch.utils.data.distributed.DistributedSampler
4. Apply amp.scale_loss at each backward pass
5. Clip gradient with amp.master_params
6. Divide step_size by world_size (not sure if this is necessary)
7. Initialize model and optimizer with amp.initialize
8. Wrap model with apex.parallel.DistributedDataParallel
9. Handle evaluation and messages on the first node using args.local_rank

Resolves: ksw0306#13
See also: ksw0306#16
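
For reference, the changes listed above map roughly onto the minimal sketch below. It uses a tiny placeholder linear model, random data, and generic hyperparameters rather than the actual FloWaveNet code, so treat it as an outline of the pattern; the real implementation is train_apex.py in this repo.

```python
# Minimal sketch of the Apex-based distributed setup described in the commit above.
# Launch with: python -m torch.distributed.launch --nproc_per_node=NUM_GPUS this_script.py
# The linear model, random data, and hyperparameters are placeholders, not FloWaveNet code.
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# 1-2. Initialize the process group and pin this process to a single GPU.
torch.distributed.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)
world_size = torch.distributed.get_world_size()

model = torch.nn.Linear(16, 1).cuda()  # placeholder model
# 6. Divide the step size (learning rate) by world_size.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 / world_size)

# 7-8. Mixed-precision initialization, then Apex's DistributedDataParallel wrapper.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model = DistributedDataParallel(model)

# 3. Shard the (placeholder) dataset across processes with DistributedSampler.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = torch.nn.MSELoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x.cuda()), y.cuda())
    # 4. Scale the loss so FP16 gradients do not underflow, then backpropagate.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # 5. Clip gradients on the FP32 master parameters.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 1.0)
    optimizer.step()
    # 9. Report/evaluate only on the first process.
    if args.local_rank == 0:
        print(loss.item())
```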
@L0SG L0SG closed this as completed in #22 Apr 23, 2019
@L0SG L0SG reopened this Apr 23, 2019
@L0SG
Collaborator Author

L0SG commented Apr 23, 2019

Note: the DistributedDataParallel implementation from @1ytic circumvents the multi-GPU issue, so please use train_apex.py on the master branch until the DataParallel issue in train.py is resolved.

@L0SG
Collaborator Author

L0SG commented Oct 10, 2019

Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.
