
Model training stops after validation after 4000 iterations #27

Closed
purbayankar opened this issue Sep 21, 2021 · 19 comments

@purbayankar

After training for 4000 iterations, validation runs, and then training stops with the following error:

raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_05:54:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2210236)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I am training with 2 GPUs. Do you have any insight into why this error is being thrown?

@MendelXu
Collaborator

There is not enough information in that snippet for me to tell what the error is. Could you post the entire log here?

@purbayankar
Author

Thank you very much for the prompt reply. I am sharing the entire error log as you asked:

2021-09-22 08:08:54,500 - mmdet.ssod - INFO - Saving checkpoint at 4000 iterations
2021-09-22 08:09:01,005 - mmdet.ssod - INFO - Exp name: soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py
2021-09-22 08:09:01,005 - mmdet.ssod - INFO - Iter(val) [4000]  ema_momentum: 0.9990, sup_loss_rpn_cls: 0.2146, sup_loss_rpn_bbox: 0.1890, sup_loss_cls: 0.5416, sup_acc: 81.9336, sup_loss_bbox: 0.5341, unsup_loss_rpn_cls: 0.1201, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0561, unsup_acc: 99.8535, unsup_loss_bbox: 0.0753, loss: 1.7308
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 49.7 task/s, elapsed: 101s, ETA:     0s
2021-09-22 08:10:44,352 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.11s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=17.67s).
Accumulating evaluation results...
DONE (t=4.36s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.015
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.036
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.008
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.021
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.046
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.021
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.051
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.050
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 49.0 task/s, elapsed: 102s, ETA:     0s
Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/base.py", line 152, in after_train_iter
    self.log(runner)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 177, in log
    self._log_info(log_dict, runner)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/hooks/logger/text.py", line 97, in _log_info
    f'data_time: {log_dict["data_time"]:.3f}, '
KeyError: 'data_time'
2021-09-22 08:12:52,009 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.12s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2253381 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2253382) of binary: /home/pankaj/anaconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pankaj/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_08:12:59
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2253382)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

@leeeeeng

I ran into this problem and solved it by installing mmcv==1.3.9, as specified in the README. The problem occurred with mmcv 1.3.12.

I couldn't pin down the exact cause, but using a lower version of mmcv works for me.
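
In case it helps, here is a minimal sketch for checking which mmcv build the training environment actually picks up before and after downgrading (the README pins 1.3.9; I saw the failure with 1.3.12):

```python
# Minimal version check (sketch); run inside the environment used for training.
import mmcv

print(mmcv.__version__)  # expected: 1.3.9 after downgrading per the README
```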

@MendelXu
Collaborator

MendelXu commented Sep 22, 2021

I found that some code in the new commit causes the problem, but I have no idea how to solve it properly right now, so I will revert to the old version...
It should be fixed now. On my local machine I have tested with mmcv 1.3.9 and mmcv 1.3.12 for 8000 iterations, and both work fine. Could you update the code and try again?

@purbayankar
Author

Thank you very much @leeeeeng and @MendelXu. I got the error resolved by following your suggestions. As I am currently using 2 GPUs, the estimated time for complete training is 3 days. Could you please tell me in which folder the model weights will be saved after training is completed?

@MendelXu
Collaborator

By default, the weights are stored in work_dirs/{CFG_NAME}{PERCENT}/{FOLD}. If you run the code with the option --work-dir {WORK_DIR}, the weights are stored in {WORK_DIR} instead. By the way, since you are using 2 GPUs, have you tried adjusting the learning rate (for example, reducing it to 1/4 of the original)? The default learning rate may be too large for a smaller batch size.
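
As a rough sketch of the kind of override (the base value below is an assumption; take the actual one from the base config and scale it roughly linearly with your effective batch size):

```python
# Sketch: scale the learning rate for a smaller effective batch size (2 GPUs instead of 8).
# Assumes an mmdet-style SGD optimizer; check the base config for the real default lr.
optimizer = dict(type="SGD", lr=0.01 / 4, momentum=0.9, weight_decay=0.0001)
```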

@purbayankar
Author

@MendelXu thank you very much for the prompt reply. In the work_dirs/{CFG_NAME}{PERCENT}/{FOLD} folder I can only see .log and .log.json files, but no .pth file. Do the weights only get saved after the full 180k iterations are completed? And thank you for your suggestion; I am already using a smaller learning rate since my batch size is small.

@MendelXu
Collaborator

No. The weights are saved every 4000 iterations:

checkpoint_config = dict(by_epoch=False, interval=4000, max_keep_ckpts=20)

You can change the interval to a smaller number, such as 50, to check whether a checkpoint gets saved at all.

@purbayankar
Author

@MendelXu Thank you very much. The issue is solved.

@purbayankar
Author

I am encountering another issue. Whenever I try to train in the fully labeled data setting with bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 2, I get the following error message:
FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/annotations/instances_unlabeled2017.json'
I downloaded the dataset from the official website, but I cannot find instances_unlabeled2017.json in the Downloads section.

@MendelXu
Collaborator

@purbayankar you can download it from http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip and put the extracted JSON file in coco/annotations/. Then run bash tools/dataset/prepare_coco_data.sh conduct.
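
If it is more convenient, roughly the same can be done from Python (a sketch; the paths follow the error message above, and the archive is assumed to unpack as annotations/image_info_unlabeled2017.json):

```python
# Sketch: download and unpack the unlabeled-image annotation file under data/coco/.
import os
import urllib.request
import zipfile

url = "http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip"
os.makedirs("data/coco/annotations", exist_ok=True)
zip_path = "data/coco/annotations/image_info_unlabeled2017.zip"
urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("data/coco")  # expected to yield data/coco/annotations/image_info_unlabeled2017.json

# Afterwards, run: bash tools/dataset/prepare_coco_data.sh conduct
# to generate instances_unlabeled2017.json.
```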

@duany049

> @purbayankar you can download it from http://images.cocodataset.org/annotations/image_info_unlabeled2017.zip and put the extracted JSON file in coco/annotations/. Then run bash tools/dataset/prepare_coco_data.sh conduct.

If I train the model in the fully labeled data setting, is it still necessary to execute bash tools/dataset/prepare_coco_data.sh conduct? (The percent values in prepare_coco_data.sh are drawn from {1%, 5%, 10%}, with no 100% option, so I did not run prepare_coco_data.sh before training on the fully labeled data.)

@purbayankar
Author

@duany049 I'm not sure whether it is strictly necessary, but following @MendelXu's instruction, executing bash tools/dataset/prepare_coco_data.sh conduct solved the issue for me.

@MendelXu
Collaborator

@duany049 It is still necessary.

@duany049

> @duany049 It is still necessary.

If I added more data, but don't execute prepare_coco_data.sh, it means that I still train model with old data which is mall ?

@MendelXu
Collaborator

MendelXu commented Sep 23, 2021

"I still train model with old data which is mall ?" Do you want to say 'small'?

We provide different config files for different portions of the data, so if you train with the corresponding config file (like https://github.com/microsoft/SoftTeacher/blob/main/configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py), it should raise errors like the ones @psvnlsaikumar reported. However, all of this applies to the COCO dataset. If you want to add external data, you have to edit the config file and change the dataset settings.
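
As a very rough sketch of what that edit looks like (the sup/unsup layout mirrors the repo's COCO configs but the exact field names may differ, and the my_dataset paths are placeholders for your own data):

```python
# Sketch: point the unsupervised branch at an external unlabeled dataset.
# Verify the field names against the config you inherit from; paths are placeholders.
data = dict(
    train=dict(
        sup=dict(
            ann_file="data/coco/annotations/instances_train2017.json",
            img_prefix="data/coco/train2017/",
        ),
        unsup=dict(
            ann_file="data/my_dataset/annotations/unlabeled.json",
            img_prefix="data/my_dataset/images/",
        ),
    ),
)
```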

@duany049

> Sorry, but what is "I still train model with old data which is mall ?"

Thank you for your reply.
I used unlabeled dataset A at first and executed prepare_coco_data.sh; then I switched to unlabeled dataset B but did not execute prepare_coco_data.sh again. So I am still training with unlabeled dataset A now, right?

@MendelXu
Collaborator

The script only does two things: 1) prepare the data splits for the partially labeled COCO settings, and 2) convert image_info_unlabeled2017.json to instances_unlabeled2017.json. So there is no point in running it when adding any other dataset.
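
Conceptually, step 2 amounts to something like the following (an illustrative sketch, not the repo's actual script; the assumption is that the unlabeled images simply get an empty annotation list in the instances_* format):

```python
# Illustrative sketch of the image_info -> instances conversion (not the repo's script).
import json

with open("data/coco/annotations/image_info_unlabeled2017.json") as f:
    coco = json.load(f)

coco["annotations"] = []  # assumption: unlabeled images carry no annotations

with open("data/coco/annotations/instances_unlabeled2017.json", "w") as f:
    json.dump(coco, f)
```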

@duany049

> The script only does two things: 1) prepare the data splits for the partially labeled COCO settings, and 2) convert image_info_unlabeled2017.json to instances_unlabeled2017.json. So there is no point in running it when adding any other dataset.

Thank you, that answers my question.
