PyTorch v1.0.0 multi-GPU compatibility issue #13
Comments
Oh my God. I have been training on the multi-GPU version for one week with all four of my GPUs. Thanks for pointing this out.
Oops, sorry about the delayed issue post in this repo. I filed a report to the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.
Update: the issue still persists in the latest 1.0.1 release.
The Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture. List of changes made in train.py (a sketch of these changes follows below):
1. Determine local_rank and world_size for torch.distributed.init_process_group.
2. Set the current device with torch.cuda.set_device.
3. Wrap the dataset with torch.utils.data.distributed.DistributedSampler.
4. Apply amp.scale_loss at each backward pass.
5. Clip gradients with amp.master_params.
6. Divide step_size by world_size (not sure if this is necessary).
7. Initialize the model and optimizer with amp.initialize.
8. Wrap the model with apex.parallel.DistributedDataParallel.
9. Handle evaluation and messages on the first node using args.local_rank.

Resolves: ksw0306#13
See also: ksw0306#16
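For concreteness, here is a minimal sketch of how those nine changes fit together, assuming one process per GPU launched via torch.distributed.launch (which supplies --local_rank). The `model`, `dataset`, `step_size`, and `num_epochs` names stand in for the real objects built in train.py, and `compute_loss` and `evaluate` are hypothetical helpers, not the repo's actual code:

```python
# Sketch of the nine changes above; assumes torch.distributed.launch
# spawns one process per GPU and passes --local_rank to each process.
import argparse
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# 1-2. Initialize the process group and pin this process to one GPU.
dist.init_process_group(backend='nccl', init_method='env://')
world_size = dist.get_world_size()
torch.cuda.set_device(args.local_rank)

# 3. Shard the dataset so each process sees a distinct slice.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# 6-7. Scale the step size by world_size, then let amp patch the
# model and optimizer for mixed precision.
model = model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=step_size / world_size)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 8. Wrap the model so gradients are all-reduced across processes.
model = DistributedDataParallel(model)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for x, c in loader:
        optimizer.zero_grad()
        loss = compute_loss(model, x.cuda(), c.cuda())  # hypothetical helper
        # 4. Scale the loss so fp16 gradients do not underflow.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        # 5. Clip gradients on amp's fp32 master parameters.
        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 10.0)
        optimizer.step()

    # 9. Log and evaluate only on the first process.
    if args.local_rank == 0:
        evaluate(model)  # hypothetical helper
```

A run would then look like `python -m torch.distributed.launch --nproc_per_node=4 train.py ...`, with one process per GPU.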
Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.
Currently, we cannot run multi-GPU training on PyTorch v1.0.0 due to a strange null-gradient issue.
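For reference, one quick way to observe the symptom is to inspect parameter gradients after a single multi-GPU backward pass. This diagnostic is an illustration only, not code from the repo; `model`, `x`, and `c` are assumed to come from the usual train.py setup, and the scalar loss is a stand-in for the model's real objective:

```python
# Illustrative diagnostic: on v1.0.0, a multi-GPU backward pass leaves
# some parameters with missing or all-zero gradients.
import torch

model = torch.nn.DataParallel(model).cuda()
loss = model(x.cuda(), c.cuda()).mean()  # stand-in scalar objective
loss.backward()

for name, param in model.named_parameters():
    if param.grad is None or param.grad.abs().sum().item() == 0.0:
        print(f'null gradient: {name}')
```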