
Distributed training hang with TITAN X and TITAN Xp GPUs #159

Closed

thangvubk opened this issue Dec 9, 2018 · 2 comments

@thangvubk (Contributor)

I can run distributed training with V100 GPUs, but with TITAN X or TITAN Xp GPUs it hangs, even though the machines have the same NVIDIA driver, CUDA, gcc, and PyTorch versions. I have read other threads about multi-GPU hangs, but in my case the experiment cannot run even a single iteration (it is not a random hang like the others). I tried updating the driver, CUDA, and PyTorch (to 1.0.0), but it doesn't help.

loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=16.05s)
creating index...
Done (t=16.15s)
creating index...
index created!
index created!
Done (t=18.86s)
creating index...
Done (t=18.97s)
creating index...
index created!
index created!
2018-12-09 21:55:56,641 - INFO - Start running, host: thangvu@slsp-node-2, work_dir: /mnt/hdd/thangvu/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2018-12-09 21:55:56,641 - INFO - workflow: [('train', 1)], max: 12 epochs
[0] TITAN Xp         | 55'C, 100 % |   865 / 12196 MB | thangvu(855M)
[1] TITAN Xp         | 54'C, 100 % |   865 / 12196 MB | thangvu(855M)
[2] TITAN Xp         | 51'C, 100 % |   861 / 12196 MB | thangvu(851M)
[3] TITAN Xp         | 51'C, 100 % |   867 / 12196 MB | thangvu(857M)

My system is:

  • Ubuntu 16.04 LTS
  • pytorch 0.4.1
  • cuda 9.0
  • nvidia-driver 396.51
  • gcc 5.4
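A hang before the first iteration, with all GPUs pinned at 100% utilization but almost no memory used (as in the log above), often points to NCCL initialization rather than the training code. A minimal sketch of launching with NCCL's debug logging turned on, so the point of the hang becomes visible in the output. The script path, config name, and the `NCCL_P2P_DISABLE` workaround are assumptions for illustration, not something confirmed in this thread:

```python
import os
import subprocess

def build_dist_train_cmd(config, num_gpus, script="tools/train.py"):
    """Build a torch.distributed.launch command with NCCL debug logging enabled.

    NCCL_DEBUG=INFO makes NCCL print its topology detection and any
    peer-to-peer problems during init, which is usually where this kind
    of pre-first-iteration hang comes from.
    """
    env = dict(os.environ)
    env["NCCL_DEBUG"] = "INFO"      # verbose NCCL init logging
    env["NCCL_P2P_DISABLE"] = "1"   # assumption: rules out a broken P2P path between GPUs
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        script, config, "--launcher", "pytorch",
    ]
    return cmd, env

cmd, env = build_dist_train_cmd("configs/faster_rcnn_r50_fpn_1x.py", 4)
# subprocess.run(cmd, env=env)  # uncomment to actually launch
```

If the log then stops right after NCCL's ring/topology lines, the problem is in GPU-to-GPU communication rather than in the model or data loading.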
@hellock (Member) commented Dec 10, 2018

We have TITAN X and Xp servers with exactly the same OS and software environment, and they work fine.
How was PyTorch installed? Via conda or from source?

@thangvubk (Contributor, Author)

I installed PyTorch using pip. I tried installing it again (with the same procedure as before), but in a clean Docker container, and now it works. It is still strange to me, but it can serve as a workaround.
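The clean-container workaround can be sketched as a Dockerfile. The base-image tag and pinned versions below are assumptions chosen to match the environment reported above (CUDA 9.0, Ubuntu 16.04, PyTorch 0.4.1), not the actual container used:

```dockerfile
# Assumed base image matching the reported CUDA 9.0 / Ubuntu 16.04 setup
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

RUN apt-get update && apt-get install -y python3 python3-pip git

# Pin the PyTorch version listed in the system info
RUN pip3 install torch==0.4.1 torchvision

WORKDIR /workspace
```

Reinstalling inside a fresh image like this rules out stale compiled extensions or mismatched CUDA libraries left over from earlier installs on the host.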
