I can run distributed training on V100 GPUs, but when running on TITAN X or TITAN Xp GPUs it hangs, even though both machines have the same NVIDIA driver, CUDA, GCC, and PyTorch versions. I have read the other threads about multi-GPU hangs, but in my case the experiment cannot run even a single iteration (it is not a random hang like the others). I have tried updating the driver, CUDA, and PyTorch 1.0.0, but it does not help.
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=16.05s)
creating index...
Done (t=16.15s)
creating index...
index created!
index created!
Done (t=18.86s)
creating index...
Done (t=18.97s)
creating index...
index created!
index created!
2018-12-09 21:55:56,641 - INFO - Start running, host: thangvu@slsp-node-2, work_dir: /mnt/hdd/thangvu/workspace/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2018-12-09 21:55:56,641 - INFO - workflow: [('train', 1)], max: 12 epochs
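Since the run stalls right after these setup lines, one way to check whether the hang is in NCCL communication itself rather than in mmdetection is a bare `all_reduce` smoke test. This is a minimal sketch, assuming the same `torch.distributed.launch` launcher; the script name `nccl_smoke_test.py` is only an illustration:

```python
# nccl_smoke_test.py -- minimal NCCL all_reduce check (diagnostic sketch,
# not part of mmdetection). If this also hangs on the TITAN X / Xp box,
# the problem is below the detector code (NCCL / driver / GPU topology).
import argparse
import os

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Verbose NCCL logs make it easier to see where communication stalls
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")  # env:// init set up by the launcher

    # One scalar per GPU; the all_reduce should return almost instantly.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print("rank {}: all_reduce result = {}".format(dist.get_rank(), x.item()))


if __name__ == "__main__":
    main()
```

Launched with something like `python -m torch.distributed.launch --nproc_per_node=2 nccl_smoke_test.py`, it should print one line per rank. If it blocks instead, the `NCCL_DEBUG=INFO` output (and, often worth trying, `NCCL_P2P_DISABLE=1`) usually narrows the hang down to peer-to-peer transport on that GPU topology.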
I installed PyTorch using pip. I tried installing it again (with the same procedure as before) but inside a clean Docker container, and now it works. It is still strange to me, but it can serve as a workaround.
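Since a clean pip install inside Docker fixed it, diffing the exact library versions between the broken host environment and the working container might show what actually changed. A small sketch (not from the original report) that prints the versions relevant to NCCL-based training:

```python
# Print the versions that matter for distributed training, so the host
# environment can be compared against the working Docker container.
import torch

print("torch        :", torch.__version__)
print("CUDA (torch) :", torch.version.cuda)
print("cuDNN        :", torch.backends.cudnn.version())
print("GPU 0        :", torch.cuda.get_device_name(0))
print("GPU count    :", torch.cuda.device_count())
```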
My system is: