I'm using distributed training on mmrotate, but got the error several times.
Command to run
CUDA_VISIBLE_DEVICES=2,4 nohup tools/dist_train.sh /home/ymdong/mmrotate/configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90.py 2
Errors
I have received this error message many times. Following the issue open-mmlab/mmdetection#6534, I changed
dist_params = dict(backend='nccl')
to
dist_params = dict(backend='gloo')
but it still doesn't work.
I have 3 computers: 2 of them always report this error, and the third never does. Does anyone have any idea?
Could you please try once more with a terminal multiplexer like tmux or screen? Trust me, these tools (especially tmux) bring a much better visual experience than nohup.
The problem lies in nohup: the distributed training processes launched under nohup still receive the above SIGHUP signal when the terminal is closed, even though nohup was specified. Switching to tmux resolves this issue.
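For anyone who wants to try the tmux route, here is a minimal sketch of how the same training command could be started in a detached tmux session. The helper name `run_in_tmux` and the session name `train` are made up for illustration and are not part of mmrotate. The sketch assumes tmux is installed and that you run it from the mmrotate repository root.

```shell
# Hypothetical helper (not part of mmrotate): start a command in a detached
# tmux session. The tmux server keeps running after the terminal is closed,
# so the job should not be hung up together with the SSH session.
run_in_tmux() {
    session="$1"
    shift
    # "$*" joins the remaining arguments into one command string for tmux.
    tmux new-session -d -s "$session" "$*"
}

# Usage, with the command from the original report and nohup dropped:
# run_in_tmux train "CUDA_VISIBLE_DEVICES=2,4 tools/dist_train.sh /home/ymdong/mmrotate/configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90.py 2"
# tmux attach -t train    # reattach later to watch the logs
```

After attaching, Ctrl-b followed by d detaches again and leaves the training running in the background.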