
Model training stops after validation after 4000 iterations #27

Closed
purbayankar opened this issue Sep 21, 2021 · 19 comments

@purbayankar

After training for 4000 iterations, validation runs, and then training stops with the following error:

raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_05:54:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2210236)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I am training with 2 GPUs. Do you have any insight into why this error is being thrown?

@MendelXu
Collaborator

There is not enough information in that snippet for me to tell what the error is. Could you post the entire log here?

@purbayankar
Author

Thank you very much for the prompt reply. I am sharing the entire error log as you asked:

2021-09-22 08:08:54,500 - mmdet.ssod - INFO - Saving checkpoint at 4000 iterations
2021-09-22 08:09:01,005 - mmdet.ssod - INFO - Exp name: soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py
2021-09-22 08:09:01,005 - mmdet.ssod - INFO - Iter(val) [4000]  ema_momentum: 0.9990, sup_loss_rpn_cls: 0.2146, sup_loss_rpn_bbox: 0.1890, sup_loss_cls: 0.5416, sup_acc: 81.9336, sup_loss_bbox: 0.5341, unsup_loss_rpn_cls: 0.1201, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0561, unsup_acc: 99.8535, unsup_loss_bbox: 0.0753, loss: 1.7308
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 49.7 task/s, elapsed: 101s, ETA:     0s
2021-09-22 08:10:44,352 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.11s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=17.67s).
Accumulating evaluation results...
DONE (t=4.36s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.015
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.036
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.021
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.051
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.050
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 49.0 task/s, elapsed: 102s, ETA:     0s
Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/base.py", line 152, in after_train_iter
    self.log(runner)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 177, in log
    self._log_info(log_dict, runner)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 97, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
2021-09-22 08:12:52,009 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.12s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2253381 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2253382) of binary: /home/pankaj/anaconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_08:12:59
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2253382)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

@leeeeeng

I ran into this problem and solved it by installing mmcv==1.3.9, as specified in the README. The problem occurred with mmcv 1.3.12.

I couldn't pin down the exact cause, but using a lower version of mmcv works for me.
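
In case it helps, here is a minimal sketch for checking which mmcv build the training environment actually picks up before and after downgrading (the README pins 1.3.9; I saw the failure with 1.3.12):

```python
# Minimal version check (sketch); run inside the environment used for training.
import mmcv

print(mmcv.__version__)  # expected: 1.3.9 after downgrading per the README
```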

@MendelXu
Collaborator

MendelXu commented Sep 22, 2021

I found that some code in the new commit causes the problem, but I have no idea how to solve it properly right now, so I will revert to the old version...
It should be fixed now. On my local machine I have tested with mmcv 1.3.9 and mmcv 1.3.12 for 8000 iterations, and both work fine. Could you update the code and try again?

@purbayankar
Author

Thank you very much @leeeeeng and @MendelXu. I got the error resolved by following your suggestions. As I am currently using 2 GPUs, the estimated time for complete training is 3 days. Could you please tell me in which folder the model weights will be saved after training is completed?

@MendelXu
Collaborator

By default, the weights are stored in work_dirs/{CFG_NAME}{PERCENT}/{FOLD}. If you run the code with the option --work-dir {WORK_DIR}, the weights are stored in {WORK_DIR} instead. By the way, since you are using 2 GPUs, have you tried adjusting the learning rate (for example, reducing it to 1/4 of the original)? The default learning rate may be too large for a smaller batch size.
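
As a rough sketch of the kind of override (the base value below is an assumption; take the actual one from the base config and scale it roughly linearly with your effective batch size):

```python
# Sketch: scale the learning rate for a smaller effective batch size (2 GPUs instead of 8).
# Assumes an mmdet-style SGD optimizer; check the base config for the real default lr.
optimizer = dict(type="SGD", lr=0.01 / 4, momentum=0.9, weight_decay=0.0001)
```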

@purbayankar
Author

@MendelXu thank you very much for the prompt reply. In the work_dirs/{CFG_NAME}{PERCENT}/{FOLD} folder I can only see .log and .log.json files, but no .pth file. Do the weights only get saved after the full 180k iterations are completed? And thank you for your suggestion; I am already using a smaller learning rate since my batch size is small.

@MendelXu
Collaborator

No. The weights are saved every 4000 iterations:

checkpoint_config = dict(by_epoch=False, interval=4000, max_keep_ckpts=20)

You can change the interval to a smaller number, such as 50, to check whether a checkpoint gets saved at all.

@purbayankar
Author

@MendelXu Thank you very much. The issue is solved.

@purbayankar
Author

I am encountering another issue. Whenever I try to train in the fully labeled data setting with bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 2, I get the following error message:
FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/annotations/instances_unlabeled2017.json'
I downloaded the dataset from the official website, but I cannot find instances_unlabeled2017.json in the Downloads section.

@MendelXu
Collaborator

@purbayankar you can download it from http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip and put the extracted JSON file in coco/annotations/. Then run bash tools/dataset/prepare_coco_data.sh conduct.
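
If it is more convenient, roughly the same can be done from Python (a sketch; the paths follow the error message above, and the archive is assumed to unpack as annotations/image_info_unlabeled2017.json):

```python
# Sketch: download and unpack the unlabeled-image annotation file under data/coco/.
import os
import urllib.request
import zipfile

url = "http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip"
os.makedirs("data/coco/annotations", exist_ok=True)
zip_path = "data/coco/annotations/image_info_unlabeled2017.zip"
urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("data/coco")  # expected to yield data/coco/annotations/image_info_unlabeled2017.json

# Afterwards, run: bash tools/dataset/prepare_coco_data.sh conduct
# to generate instances_unlabeled2017.json.
```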

@duany049

> @purbayankar you can download it from http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip and put the extracted JSON file in coco/annotations/. Then run bash tools/dataset/prepare_coco_data.sh conduct.

If I train the model in the fully labeled data setting, is it still necessary to execute bash tools/dataset/prepare_coco_data.sh conduct? (The percent values in prepare_coco_data.sh are drawn from {1%, 5%, 10%}, with no 100% option, so I did not run prepare_coco_data.sh before training on the fully labeled data.)

@purbayankar
Author

@duany049 I'm not sure whether it is strictly necessary, but following @MendelXu's instruction, executing bash tools/dataset/prepare_coco_data.sh conduct solved the issue for me.

@MendelXu
Collaborator

@duany049 It is still necessary.

@duany049

> @duany049 It is still necessary.

If I added more data, but don't execute prepare_coco_data.sh, it means that I still train model with old data which is mall ?

@MendelXu
Collaborator

MendelXu commented Sep 23, 2021

"I still train model with old data which is mall ?" Do you want to say 'small'?

We provide different config files for different portions of the data, so if you train with the corresponding config file (like https://github.com/microsoft/SoftTeacher/blob/main/configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py), it should raise errors like the ones @psvnlsaikumar reported. However, all of this applies to the COCO dataset. If you want to add external data, you have to edit the config file and change the dataset settings.
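
As a very rough sketch of what that edit looks like (the sup/unsup layout mirrors the repo's COCO configs but the exact field names may differ, and the my_dataset paths are placeholders for your own data):

```python
# Sketch: point the unsupervised branch at an external unlabeled dataset.
# Verify the field names against the config you inherit from; paths are placeholders.
data = dict(
    train=dict(
        sup=dict(
            ann_file="data/coco/annotations/instances_train2017.json",
            img_prefix="data/coco/train2017/",
        ),
        unsup=dict(
            ann_file="data/my_dataset/annotations/unlabeled.json",
            img_prefix="data/my_dataset/images/",
        ),
    ),
)
```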

@duany049

> Sorry, but what is "I still train model with old data which is mall ?"

Thank you for your reply.
I used unlabeled dataset A at first and executed prepare_coco_data.sh; then I switched to unlabeled dataset B but did not execute prepare_coco_data.sh again. So I am still training with unlabeled dataset A now, right?

@MendelXu
Collaborator

The script only does two things: 1) prepare the data splits for the partially labeled COCO settings, and 2) convert image_info_unlabeled2017.json to instances_unlabeled2017.json. So there is no point in running it when adding any other dataset.
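
Conceptually, step 2 amounts to something like the following (an illustrative sketch, not the repo's actual script; the assumption is that the unlabeled images simply get an empty annotation list in the instances_* format):

```python
# Illustrative sketch of the image_info -> instances conversion (not the repo's script).
import json

with open("data/coco/annotations/image_info_unlabeled2017.json") as f:
    coco = json.load(f)

coco["annotations"] = []  # assumption: unlabeled images carry no annotations

with open("data/coco/annotations/instances_unlabeled2017.json", "w") as f:
    json.dump(coco, f)
```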

@duany049

> The script only does two things: 1) prepare the data splits for the partially labeled COCO settings, and 2) convert image_info_unlabeled2017.json to instances_unlabeled2017.json. So there is no point in running it when adding any other dataset.

Thank you, that answers my question.
