Error: Received 1 death signal, shutting down workers #210

Closed
adonis-dym opened this issue Apr 12, 2022 · 2 comments
Labels
community discussion help wanted Extra attention is needed

Comments

@adonis-dym

adonis-dym commented Apr 12, 2022

I'm running distributed training with mmrotate, and I have hit the following error several times.

Command to run
CUDA_VISIBLE_DEVICES=2,4 nohup tools/dist_train.sh /home/ymdong/mmrotate/configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90.py 2
Errors

/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2022-04-12 14:34:01,098 - mmrotate - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
CUDA available: True
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29920130_0
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.8
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMRotate: 0.1.1+b78bab2
------------------------------------------------------------

2022-04-12 14:34:02,973 - mmrotate - INFO - Distributed training: True
2022-04-12 14:34:04,498 - mmrotate - INFO - Config:
dataset_type = 'DOTADataset'
data_root = '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1024, 1024)),
    dict(
        type='RRandomFlip',
        flip_ratio=[0.25, 0.25, 0.25],
        direction=['horizontal', 'vertical', 'diagonal'],
        version='le90'),
    dict(
        type='PolyRandomRotate',
        rotate_ratio=0.5,
        angles_range=180,
        auto_bound=False,
        rect_classes=[9, 11],
        version='le90'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='DOTADataset',
        ann_file=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/trainval/annfiles/',
        img_prefix=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/trainval/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1024, 1024)),
            dict(
                type='RRandomFlip',
                flip_ratio=[0.25, 0.25, 0.25],
                direction=['horizontal', 'vertical', 'diagonal'],
                version='le90'),
            dict(
                type='PolyRandomRotate',
                rotate_ratio=0.5,
                angles_range=180,
                auto_bound=False,
                rect_classes=[9, 11],
                version='le90'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        version='le90'),
    val=dict(
        type='DOTADataset',
        ann_file=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/trainval/annfiles/',
        img_prefix=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/trainval/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 1024),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        version='le90'),
    test=dict(
        type='DOTADataset',
        ann_file=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/test/images/',
        img_prefix=
        '/home/ymdong/mmrotate/data/DOTA/split_1024_dota1_0/test/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1024, 1024),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        version='le90'))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys=dict(backbone=dict(lr_mult=0.1, decay_mult=1.0))))
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[50])
runner = dict(type='EpochBasedRunner', max_epochs=90)
checkpoint_config = dict(interval=30)
log_config = dict(
    interval=50,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='gloo')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
angle_version = 'le90'
model = dict(
    type='RotatedRetinaNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        zero_init_residual=False,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_input',
        num_outs=5),
    bbox_head=dict(
        type='RotatedRetinaHead',
        num_classes=15,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        assign_by_circumhbbox=None,
        anchor_generator=dict(
            type='RotatedAnchorGenerator',
            octave_base_scale=4,
            scales_per_octave=3,
            ratios=[1.0, 0.5, 2.0],
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHAOBBoxCoder',
            angle_range='le90',
            norm_factor=None,
            edge_swap=True,
            proj_xy=True,
            target_means=(0.0, 0.0, 0.0, 0.0, 0.0),
            target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.4,
            min_pos_iou=0,
            ignore_iof_thr=-1,
            iou_calculator=dict(type='RBboxOverlaps2D')),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.1),
        max_per_img=2000))
work_dir = './work_dirs/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90'
auto_resume = False
gpu_ids = range(0, 2)

2022-04-12 14:34:14,635 - mmrotate - INFO - Set random seed to 105276235, deterministic: False
2022-04-12 14:34:15,294 - mmrotate - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2022-04-12 14:34:15,295 - mmcv - INFO - load model from: torchvision://resnet50
2022-04-12 14:34:15,295 - mmcv - INFO - load checkpoint from torchvision path: torchvision://resnet50
2022-04-12 14:34:15,520 - mmcv - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2022-04-12 14:34:15,563 - mmrotate - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2022-04-12 14:34:15,644 - mmrotate - INFO - initialize RotatedRetinaHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01, 'override': {'type': 'Normal', 'name': 'retina_cls', 'std': 0.01, 'bias_prob': 0.01}}
2022-04-12 14:34:26,879 - mmrotate - INFO - Start running, host: ymdong@ubuntu-zlin-12, work_dir: /home1/ymdong/mmrotate/work_dirs/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90
2022-04-12 14:34:26,880 - mmrotate - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) DistSamplerSeedHook                
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
before_val_epoch:
(NORMAL      ) DistSamplerSeedHook                
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
(VERY_LOW    ) TensorboardLoggerHook              
 -------------------- 
2022-04-12 14:34:26,881 - mmrotate - INFO - workflow: [('train', 1)], max: 90 epochs
2022-04-12 14:34:26,882 - mmrotate - INFO - Checkpoints will be saved to /home1/ymdong/mmrotate/work_dirs/rotated_retinanet_obb_r50_fpn_1x_dota_ms_rr_le90 by HardDiskBackend.
2022-04-12 14:35:03,972 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2022-04-12 14:35:27,911 - mmrotate - INFO - Epoch [1][50/3200]	lr: 3.987e-05, eta: 4 days, 1:29:30, time: 1.219, data_time: 0.667, memory: 4200, loss_cls: 1.1572, loss_bbox: 1.1078, loss: 2.2650, grad_norm: 3.2876
2022-04-12 14:35:51,338 - mmrotate - INFO - Epoch [1][100/3200]	lr: 4.653e-05, eta: 2 days, 19:28:19, time: 0.469, data_time: 0.011, memory: 4200, loss_cls: 1.0677, loss_bbox: 1.0642, loss: 2.1319, grad_norm: 7.2970
2022-04-12 14:36:14,491 - mmrotate - INFO - Epoch [1][150/3200]	lr: 5.320e-05, eta: 2 days, 9:18:53, time: 0.463, data_time: 0.011, memory: 4200, loss_cls: 1.0120, loss_bbox: 1.0651, loss: 2.0772, grad_norm: 7.6860
2022-04-12 14:36:38,440 - mmrotate - INFO - Epoch [1][200/3200]	lr: 5.987e-05, eta: 2 days, 4:33:04, time: 0.479, data_time: 0.012, memory: 4200, loss_cls: 0.9809, loss_bbox: 1.0151, loss: 1.9960, grad_norm: 7.5417
2022-04-12 14:37:01,813 - mmrotate - INFO - Epoch [1][250/3200]	lr: 6.653e-05, eta: 2 days, 1:30:21, time: 0.467, data_time: 0.011, memory: 4200, loss_cls: 0.9418, loss_bbox: 1.0400, loss: 1.9817, grad_norm: 10.1703
2022-04-12 14:37:24,914 - mmrotate - INFO - Epoch [1][300/3200]	lr: 7.320e-05, eta: 1 day, 23:24:06, time: 0.462, data_time: 0.012, memory: 4200, loss_cls: 0.9159, loss_bbox: 1.0029, loss: 1.9188, grad_norm: 9.6490
2022-04-12 14:37:48,272 - mmrotate - INFO - Epoch [1][350/3200]	lr: 7.987e-05, eta: 1 day, 21:57:19, time: 0.467, data_time: 0.011, memory: 4200, loss_cls: 0.8957, loss_bbox: 0.9661, loss: 1.8618, grad_norm: 10.7941
2022-04-12 14:38:12,062 - mmrotate - INFO - Epoch [1][400/3200]	lr: 8.653e-05, eta: 1 day, 20:57:12, time: 0.476, data_time: 0.011, memory: 4200, loss_cls: 0.9220, loss_bbox: 0.9910, loss: 1.9130, grad_norm: 6.9357
2022-04-12 14:38:35,571 - mmrotate - INFO - Epoch [1][450/3200]	lr: 9.320e-05, eta: 1 day, 20:07:23, time: 0.470, data_time: 0.010, memory: 4200, loss_cls: 0.8605, loss_bbox: 0.9852, loss: 1.8458, grad_norm: 11.9324
2022-04-12 14:38:59,713 - mmrotate - INFO - Epoch [1][500/3200]	lr: 9.987e-05, eta: 1 day, 19:33:37, time: 0.483, data_time: 0.011, memory: 4200, loss_cls: 0.8954, loss_bbox: 0.9542, loss: 1.8496, grad_norm: 7.5572
2022-04-12 14:39:23,870 - mmrotate - INFO - Epoch [1][550/3200]	lr: 1.000e-04, eta: 1 day, 19:06:03, time: 0.483, data_time: 0.011, memory: 4200, loss_cls: 0.8464, loss_bbox: 1.0325, loss: 1.8788, grad_norm: 7.6418
2022-04-12 14:39:47,727 - mmrotate - INFO - Epoch [1][600/3200]	lr: 1.000e-04, eta: 1 day, 18:40:39, time: 0.477, data_time: 0.010, memory: 4200, loss_cls: 0.8096, loss_bbox: 1.0087, loss: 1.8183, grad_norm: 8.4474
2022-04-12 14:40:11,718 - mmrotate - INFO - Epoch [1][650/3200]	lr: 1.000e-04, eta: 1 day, 18:20:01, time: 0.480, data_time: 0.010, memory: 4200, loss_cls: 0.8311, loss_bbox: 0.9780, loss: 1.8091, grad_norm: 8.6337
2022-04-12 14:40:35,670 - mmrotate - INFO - Epoch [1][700/3200]	lr: 1.000e-04, eta: 1 day, 18:02:02, time: 0.479, data_time: 0.010, memory: 4200, loss_cls: 0.7882, loss_bbox: 0.9562, loss: 1.7443, grad_norm: 7.1134
2022-04-12 14:40:59,099 - mmrotate - INFO - Epoch [1][750/3200]	lr: 1.000e-04, eta: 1 day, 17:43:02, time: 0.469, data_time: 0.010, memory: 4200, loss_cls: 0.8212, loss_bbox: 0.9642, loss: 1.7853, grad_norm: 10.3099
2022-04-12 14:41:23,104 - mmrotate - INFO - Epoch [1][800/3200]	lr: 1.000e-04, eta: 1 day, 17:29:38, time: 0.479, data_time: 0.010, memory: 4200, loss_cls: 0.7197, loss_bbox: 0.8624, loss: 1.5821, grad_norm: 7.7448
2022-04-12 14:41:46,832 - mmrotate - INFO - Epoch [1][850/3200]	lr: 1.000e-04, eta: 1 day, 17:16:32, time: 0.475, data_time: 0.011, memory: 4200, loss_cls: 0.7427, loss_bbox: 0.8950, loss: 1.6377, grad_norm: 6.6609
2022-04-12 14:42:11,026 - mmrotate - INFO - Epoch [1][900/3200]	lr: 1.000e-04, eta: 1 day, 17:07:04, time: 0.483, data_time: 0.010, memory: 4200, loss_cls: 0.6902, loss_bbox: 0.9706, loss: 1.6607, grad_norm: 7.5592
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3340504 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3340505 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/ymdong/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ymdong/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/ymdong/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3340472 got signal: 1

I have received this error message many times. Following the issue open-mmlab/mmdetection#6534, I changed dist_params = dict(backend='nccl') to dist_params = dict(backend='gloo'), but it still doesn't work.

I have 3 computers: 2 of them always report this error, and the other one never does. Does anyone have any idea?

@zhanggefan
Collaborator

@adonis-dym
This seems to be a known issue of PyTorch according to the discussion here:

https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/12

Could you please try again inside a terminal multiplexer such as tmux or screen? Trust me, these tools (especially tmux) offer a much better visual experience than nohup.

@adonis-dym
Author

The problem lies with nohup: a distributed training process launched under nohup still receives the SIGHUP signal shown above when the terminal is closed, even though nohup is supposed to shield it. Switching to tmux resolves the issue.
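
A possible explanation, judging only from the traceback above (where `_terminate_process_handler` raises `SignalException` for signal 1), is that the launcher installs its own SIGHUP handler, which would replace the "ignore" disposition that nohup sets before starting the command. The sketch below illustrates that general mechanism in plain Python on Linux. It is not PyTorch's code: `run_child`, the `install`/`keep` modes, and the exit code 42 are hypothetical names and values chosen for the illustration.

```python
import signal
import subprocess
import sys
import textwrap

# Source of a stand-in "launcher". In "install" mode it registers its own
# SIGHUP handler (an assumption about what the real launcher does); in
# "keep" mode it leaves the inherited disposition untouched.
CHILD_SOURCE = textwrap.dedent(
    """
    import signal, sys, time

    def on_hangup(signum, frame):
        sys.exit(42)  # stand-in for raising SignalException

    if sys.argv[1] == "install":
        signal.signal(signal.SIGHUP, on_hangup)

    print("ready", flush=True)
    time.sleep(30)
    """
)


def ignore_sighup():
    # Mimics nohup: the child starts with SIGHUP set to "ignore".
    signal.signal(signal.SIGHUP, signal.SIG_IGN)


def run_child(mode, timeout=2.0):
    """Start the stand-in launcher, send it SIGHUP, and report the outcome.

    Returns the exit code if the child terminated, or None if it was still
    alive after `timeout` seconds (in which case it is killed).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, mode],
        stdout=subprocess.PIPE,
        preexec_fn=ignore_sighup,
    )
    proc.stdout.readline()  # wait until the child has finished its setup
    proc.send_signal(signal.SIGHUP)  # simulate the terminal closing
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return None
    finally:
        proc.stdout.close()


if __name__ == "__main__":
    print("own handler installed:", run_child("install"))
    print("inherited ignore kept:", run_child("keep"))
```

If this explanation holds, the child that installs its own handler exits on SIGHUP despite the nohup-style setup, while the child that keeps the inherited disposition survives. Running the training inside a tmux session sidesteps the problem because closing the terminal window no longer hangs up the session. For example, one could start a session with `tmux new -s train` (the session name is arbitrary), run the same `tools/dist_train.sh` command without nohup, and detach with Ctrl-b d.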
