
Distributed training hang with TITAN X and TITAN Xp GPUs #159

Closed

thangvubk opened this issue Dec 9, 2018 · 2 comments

@thangvubk (Contributor)

I can run distributed training with V100 GPUs, but with TITAN X or TITAN Xp GPUs it hangs, even though the machines have the same NVIDIA driver, CUDA, gcc, and PyTorch versions. I have read other threads about multi-GPU hangs, but in my case the experiment cannot run even a single iteration (it is not a random hang like the others). I tried updating the driver, CUDA, and PyTorch (to 1.0.0), but it doesn't help.

loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=16.05s)
creating index...
Done (t=16.15s)
creating index...
index created!
index created!
Done (t=18.86s)
creating index...
Done (t=18.97s)
creating index...
index created!
index created!
2018-12-09 21:55:56,641 - INFO - Start running, host: thangvu@slsp-node-2, work_dir: /mnt/hdd/thangvu/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2018-12-09 21:55:56,641 - INFO - workflow: [('train', 1)], max: 12 epochs
[0] TITAN Xp         | 55'C, 100 % |   865 / 12196 MB | thangvu(855M)
[1] TITAN Xp         | 54'C, 100 % |   865 / 12196 MB | thangvu(855M)
[2] TITAN Xp         | 51'C, 100 % |   861 / 12196 MB | thangvu(851M)
[3] TITAN Xp         | 51'C, 100 % |   867 / 12196 MB | thangvu(857M)

My system is:

  • Ubuntu 16.04 LTS
  • pytorch 0.4.1
  • cuda 9.0
  • nvidia-driver 396.51
  • gcc 5.4
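A hang before the first iteration, with all GPUs pinned at 100% utilization but almost no memory used (as in the log above), often points to NCCL initialization rather than the training code. A minimal sketch of launching with NCCL's debug logging turned on, so the point of the hang becomes visible in the output. The script path, config name, and the `NCCL_P2P_DISABLE` workaround are assumptions for illustration, not something confirmed in this thread:

```python
import os
import subprocess

def build_dist_train_cmd(config, num_gpus, script="tools/train.py"):
    """Build a torch.distributed.launch command with NCCL debug logging enabled.

    NCCL_DEBUG=INFO makes NCCL print its topology detection and any
    peer-to-peer problems during init, which is usually where this kind
    of pre-first-iteration hang comes from.
    """
    env = dict(os.environ)
    env["NCCL_DEBUG"] = "INFO"      # verbose NCCL init logging
    env["NCCL_P2P_DISABLE"] = "1"   # assumption: rules out a broken P2P path between GPUs
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        script, config, "--launcher", "pytorch",
    ]
    return cmd, env

cmd, env = build_dist_train_cmd("configs/faster_rcnn_r50_fpn_1x.py", 4)
# subprocess.run(cmd, env=env)  # uncomment to actually launch
```

If the log then stops right after NCCL's ring/topology lines, the problem is in GPU-to-GPU communication rather than in the model or data loading.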
@hellock (Member) commented Dec 10, 2018

We have TITAN X and Xp servers with exactly the same OS and software environment, and they work fine.
How was PyTorch installed? Via conda or from source?

@thangvubk (Contributor, Author)

I installed PyTorch using pip. I tried installing it again (with the same procedure as before), but in a clean Docker container, and now it works. It is still strange to me, but it can serve as a workaround.
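The clean-container workaround can be sketched as a Dockerfile. The base-image tag and pinned versions below are assumptions chosen to match the environment reported above (CUDA 9.0, Ubuntu 16.04, PyTorch 0.4.1), not the actual container used:

```dockerfile
# Assumed base image matching the reported CUDA 9.0 / Ubuntu 16.04 setup
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

RUN apt-get update && apt-get install -y python3 python3-pip git

# Pin the PyTorch version listed in the system info
RUN pip3 install torch==0.4.1 torchvision

WORKDIR /workspace
```

Reinstalling inside a fresh image like this rules out stale compiled extensions or mismatched CUDA libraries left over from earlier installs on the host.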
