Bug of corner case of proposals #1

Open · jshilong opened this issue Jan 16, 2023 · 24 comments
@jshilong

Hi,
Thanks for your amazing work. I tried to retrain the model on VG; however, there seems to be a corner case that raises an error:

[01/16 12:04:41 d2.utils.events]:  eta: 1 day, 11:49:23  iter: 1360  total_loss: 2.975  loss_box_reg_stage0: 0.2477  loss_box_reg_stage1: 0.3255  loss_box_reg_stage2: 0.2068  loss_centernet_agn_neg: 0.0414  loss_centernet_agn_pos: 0.1851  loss_centernet_loc: 0.3947  loss_cls_stage0: 0.2062  loss_cls_stage1: 0.1867  loss_cls_stage2: 0.1439  loss_mask: 0.3913  text_decoder_loss: 0.6096  time: 0.7084  data_time: 0.0160  lr: 7.7501e-07  max_mem: 21398M
[01/16 12:04:42] grit.modeling.roi_heads.grit_roi_heads INFO: all proposals are background at stage 2
Traceback (most recent call last):
  File "train_deepspeed.py", line 263, in <module>
    launch_deepspeed(
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 67, in launch_deepspeed
    mp.spawn(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 251, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 175, in do_train
    loss_dict = model(data)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1656, in forward
    loss = self.module(*inputs, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/meta_arch/grit.py", line 59, in forward
    proposals, roihead_textdecoder_losses = self.roi_heads(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 302, in forward
    losses = self._forward_box(features, proposals, targets, task=targets_task)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 173, in _forward_box
    proposals = self.check_if_all_background(proposals, targets, k)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 142, in check_if_all_background
    proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
IndexError: index 0 is out of bounds for dimension 0 with size 0

The error seems to indicate that there are no proposals for this batch. It can be easily reproduced with single-node training at around iteration 1360.

Would you mind checking it? I'm not familiar enough with this repo.

@jshilong changed the title from "Bug of corner case" to "Bug of corner case of proposals" on Jan 16, 2023
@JialianW
Owner

JialianW commented Jan 16, 2023

Thanks for your interest in GRiT and for re-training it on VG.

Do you know whether this error comes from "proposals[0].proposal_boxes.tensor[0, :]" or from "targets[0].gt_boxes.tensor[0, :]"? If it is from the former, I haven't encountered a case where there are no proposals; there should always be some. Can you check whether it is because there isn't any ground truth?

It would be great if you could print out the tensors at this line of code to determine whether the issue comes from the proposals or from the ground truth.
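
For reference, a minimal debugging sketch of what one could print just before the failing line in "check_if_all_background" (the variable names follow the traceback above; purely illustrative):

    # Hypothetical debug print inside check_if_all_background, placed right before
    # the failing assignment: report how many proposals and ground-truth boxes
    # each image in the batch actually has.
    for i, (p, t) in enumerate(zip(proposals, targets)):
        print(
            f"image {i}: num_proposals={len(p.proposal_boxes.tensor)}, "
            f"num_gt_boxes={len(t.gt_boxes.tensor)}"
        )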

@Solacex

Solacex commented Mar 15, 2023

Hello, I have run into the same issue here. Is there any workaround yet?

@Solacex

Solacex commented Mar 15, 2023

Hello, I think the problem is on the proposal side.

As shown in the code, the function "check_if_all_background" is called twice and the error occurs on the second call. Because "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e. no proposals are generated.

@JialianW
Owner

> Hello, I think the problem is on the proposal side.
>
> As shown in the code, the function "check_if_all_background" is called twice and the error occurs on the second call. Because "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e. no proposals are generated.

Do you mean that at the beginning of the "_forward_box" function, "check_if_all_background" works fine? Once it enters the ROI head, the number of proposals should not change, regardless of which cascade stage it is at.

@Solacex

Solacex commented Mar 15, 2023

> Hello, I think the problem is on the proposal side.
> As shown in the code, the function "check_if_all_background" is called twice and the error occurs on the second call. Because "targets" does not change and the first call works fine, I think the issue arises on the proposal side, i.e. no proposals are generated.

> Do you mean that at the beginning of the "_forward_box" function, "check_if_all_background" works fine? Once it enters the ROI head, the number of proposals should not change, regardless of which cascade stage it is at.

Yes, I think the problem arises on the proposal side because the first call seems fine. Do you mean the problem is caused by wrong ground truth?

This still looks strange because the ground truth is not modified in this function. Do you have any idea how to solve it? This seems to be a common issue when running the object detection task: #5 (comment)

@Solacex

Solacex commented Mar 15, 2023

The error comes from the GT being empty Instances:
Instances(num_instances=0, image_height=1006, image_width=1024, fields=[gt_boxes: Boxes(tensor([], device='cuda:1', size=(0, 4))), gt_classes: tensor([], device='cuda:1', dtype=torch.int64), gt_masks: PolygonMasks(num_instances=0), gt_object_descriptions: ObjDescription([])])

So could you share the COCO JSON file that you used with us? @JialianW

@JialianW
Owner

> The error comes from the GT being empty Instances: Instances(num_instances=0, image_height=1006, image_width=1024, fields=[gt_boxes: Boxes(tensor([], device='cuda:1', size=(0, 4))), gt_classes: tensor([], device='cuda:1', dtype=torch.int64), gt_masks: PolygonMasks(num_instances=0), gt_object_descriptions: ObjDescription([])])
>
> So could you share the COCO JSON file that you used with us? @JialianW

We used the official annotations from the COCO website. Images without ground truth should already be discarded, as shown at

if len(record["annotations"]) == 0:
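
For context, a minimal sketch of the kind of dataset-level filter that line implements (a guess at the surrounding logic, assuming a list of detectron2-style record dicts; not the repo's actual code):

    def filter_empty_records(dataset_dicts):
        # Drop records that carry no annotations, mirroring the check quoted above.
        return [
            record for record in dataset_dicts
            if len(record.get("annotations", [])) > 0
        ]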

Can you post your config file?

@Solacex

Solacex commented Mar 16, 2023

I also use the official JSON files from COCO and run this code without any modifications.
This error looks so weird given that null instances are excluded, as you pointed out.

@JialianW
Owner

The reason the first call of "check_if_all_background" does not raise an error may be that it never entered the "if all_background:" branch; the ground truth was probably empty from the beginning. In that case, the ground truth may have been removed during image augmentation, so that a background-only crop was fed into the model. Did you use our provided config file without any change?

@Solacex

Solacex commented Mar 16, 2023

Yes, without any change. This error also shows up when other people run it.

@JialianW
Owner

JialianW commented Mar 16, 2023

Can you make a change here to make sure the input image does have ground truth:

dataset_dict_out = self.prepare_data(dataset_dict)

Can you add some code after that line, like:

    while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0:
        dataset_dict_out = self.prepare_data(dataset_dict)

This is to ensure "self.prepare_data" does not empty the ground truth when preparing data.
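
For anyone applying this, here is a slightly fuller, self-contained sketch of the retry guard in a mapper-style class (the "prepare_data" name follows the snippet above; the class name and the retry cap are my own assumptions, not part of GRiT):

    import copy

    class MapperWithNonEmptyGT:
        """Illustrative wrapper: re-run data preparation until the sampled
        augmentation keeps at least one ground-truth box."""

        def __init__(self, base_mapper, max_retries=50):
            self.base_mapper = base_mapper    # object exposing prepare_data(dataset_dict)
            self.max_retries = max_retries    # safety cap so a degenerate image cannot loop forever

        def __call__(self, dataset_dict):
            dataset_dict = copy.deepcopy(dataset_dict)  # keep the original record untouched
            out = self.base_mapper.prepare_data(dataset_dict)
            retries = 0
            # Re-sample the augmentation while the crop removed every GT box.
            while len(out["instances"].gt_boxes.tensor) == 0 and retries < self.max_retries:
                out = self.base_mapper.prepare_data(dataset_dict)
                retries += 1
            return out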

@Solacex

Solacex commented Mar 16, 2023

Okay, I will try it as you suggested.

@Solacex

Solacex commented Mar 16, 2023

It looks fine so far. I will let you know later whether it is fixed.

@Solacex

Solacex commented Mar 16, 2023

The model has been trained for 10k iterations and proceeds smoothly, so I believe this bug has been fixed.

@JialianW
Owner

> The model has been trained for 10k iterations and proceeds smoothly, so I believe this bug has been fixed.

Thanks for the update. I did not add the above suggested code when I trained the model, and I am not sure why this became an issue for you. I would appreciate an update when you complete the training.

@Solacex

Solacex commented Mar 16, 2023

When training reaches 22k iterations, an OOM (out of memory) error occurs. GPU memory usage keeps increasing as training progresses, as shown below:

[03/16 10:20:04 d2.utils.events]: eta: 1 day, 23:52:09 iter: 80 total_loss: 7.998 loss_box_reg_stage0: 0.06707 loss_box_reg_stage1: 0.06987 loss_box_reg_stage2: 0.02389 loss_centernet_agn_neg: 0.05723 loss_centernet_agn_pos: 0.3456 loss_centernet_loc: 0.7329 loss_cls_stage0: 0.2401 loss_cls_stage1: 0.2133 loss_cls_stage2: 0.1666 loss_mask: 0.6922 text_decoder_loss: 5.463 time: 0.9160 last_time: 1.0366 data_time: 0.0180 last_data_time: 0.0144 lr: 4.878e-08 max_mem: 4452M
[03/16 10:20:24 d2.utils.events]: eta: 2 days, 0:05:33 iter: 100 total_loss: 6.563 loss_box_reg_stage0: 0.09858 loss_box_reg_stage1: 0.1065 loss_box_reg_stage2: 0.03283 loss_centernet_agn_neg: 0.03842 loss_centernet_agn_pos: 0.3317 loss_centernet_loc: 0.7042 loss_cls_stage0: 0.1995 loss_cls_stage1: 0.1513 loss_cls_stage2: 0.1101 loss_mask: 0.6907 text_decoder_loss: 4.111 time: 0.9317 last_time: 1.0245 data_time: 0.0176 last_data_time: 0.0070 lr: 6.4266e-08 max_mem: 4476M

[03/16 13:20:32 d2.utils.events]: eta: 2 days, 1:20:37 iter: 10460 total_loss: 2.685 loss_box_reg_stage0: 0.1964 loss_box_reg_stage1: 0.2311 loss_box_reg_stage2: 0.1316 loss_centernet_agn_neg: 0.04058 loss_centernet_agn_pos: 0.2017 loss_centernet_loc: 0.4001 loss_cls_stage0: 0.179 loss_cls_stage1: 0.159 loss_cls_stage2: 0.113 loss_mask: 0.4384 text_decoder_loss: 0.6743 time: 1.0258 last_time: 1.0547 data_time: 0.0179 last_data_time: 0.0608 lr: 7.687e-07 max_mem: 21002M
[03/16 13:20:53 d2.utils.events]: eta: 2 days, 1:20:39 iter: 10480 total_loss: 2.784 loss_box_reg_stage0: 0.2283 loss_box_reg_stage1: 0.2343 loss_box_reg_stage2: 0.1302 loss_centernet_agn_neg: 0.04495 loss_centernet_agn_pos: 0.2133 loss_centernet_loc: 0.3937 loss_cls_stage0: 0.1948 loss_cls_stage1: 0.1652 loss_cls_stage2: 0.1137 loss_mask: 0.4339 text_decoder_loss: 0.6384 time: 1.0259 last_time: 1.0150 data_time: 0.0162 last_data_time: 0.0061 lr: 7.6868e-07 max_mem: 21002M

[03/16 16:58:19 d2.utils.events]: eta: 1 day, 23:35:12 iter: 22300 total_loss: 2.63 loss_box_reg_stage0: 0.2285 loss_box_reg_stage1: 0.2795 loss_box_reg_stage2: 0.1659 loss_centernet_agn_neg: 0.04165 loss_centernet_agn_pos: 0.1809 loss_centernet_loc: 0.3561 loss_cls_stage0: 0.1952 loss_cls_stage1: 0.1732 loss_cls_stage2: 0.1347 loss_mask: 0.395 text_decoder_loss: 0.4412 time: 1.0581 last_time: 1.2151 data_time: 0.0211 last_data_time: 0.0035 lr: 7.4622e-07 max_mem: 37269M
[03/16 16:58:41 d2.utils.events]: eta: 1 day, 23:34:02 iter: 22320 total_loss: 2.535 loss_box_reg_stage0: 0.2358 loss_box_reg_stage1: 0.2703 loss_box_reg_stage2: 0.1736 loss_centernet_agn_neg: 0.044 loss_centernet_agn_pos: 0.1872 loss_centernet_loc: 0.3547 loss_cls_stage0: 0.1911 loss_cls_stage1: 0.1689 loss_cls_stage2: 0.131 loss_mask: 0.3955 text_decoder_loss: 0.4017 time: 1.0581 last_time: 1.2096 data_time: 0.0244 last_data_time: 0.0601 lr: 7.4617e-07 max_mem: 37269M

My experiments are run on 8xA100 GPUs.

How many GPUs do you use for training? Have you encountered this before?
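
(As an aside, a minimal sketch, not from this thread, of how one might log current versus peak GPU memory each iteration to see whether the allocation itself keeps growing:)

    import torch

    def log_gpu_memory(iteration, device=0):
        # Print current and peak allocated GPU memory in MB (peak is a running high-water mark).
        current = torch.cuda.memory_allocated(device) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
        print(f"iter {iteration}: allocated={current:.0f}MB, peak={peak:.0f}MB")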

@JialianW
Owner

Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.

@Solacex

Solacex commented Mar 17, 2023

The above bug seems to be fixed. The evaluation results at the 20000th iteration are:

[03/17 09:13:19 d2.evaluation.testing]: AP: 11.6693, AP50: 20.4049, AP75: 11.5888, APs: 4.6322, APm: 12.3034, APl: 17.4644

These results were trained with 8 x A100 cards. Could you share your results for the same checkpoint, so as to verify that the bug is fixed?

Besides, the training breaks at the 29980th iteration with the following error:

Traceback (most recent call last):
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/xxx//guangrui/gDeco/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/xxx//guangrui/gDeco/train_deepspeed.py", line 252, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/xxx//guangrui/gDeco/train_deepspeed.py", line 209, in do_train
    periodic_checkpointer.step(iteration)
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 416, in step
    self.checkpointer.save(
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 106, in save
    data[key] = obj.state_dict()
  File "/xxx//anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in state_dict
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
  File "/xxx/anaconda3/envs/guangrui_cuda11/lib/python3.8/site-packages/torch/optim/optimizer.py", line 120, in <dictcomp>
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
KeyError: 139902493578720

Have you met this before?

@JialianW
Owner

I haven't encountered this error before. It looks like the error comes from saving the checkpoint. Was your previous checkpoint saved successfully?

@Solacex

Solacex commented Mar 17, 2023

Yes, it saved successfully at both the 10000th and 20000th iterations, which makes this look so weird.

I could only find a similar issue here: pytorch/pytorch#42428
It seems to be an issue with the PyTorch version, so is the torch version you used < 1.6.0?
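
(For what it's worth, a trivial sketch, not from this thread, for checking the installed version against that threshold; it assumes the "packaging" package is available:)

    import torch
    from packaging import version

    current = version.parse(torch.__version__)   # e.g. "1.9.0+cu111" parses fine
    print(f"torch {torch.__version__} is",
          ">= 1.6.0" if current >= version.parse("1.6.0") else "< 1.6.0")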

@JialianW
Owner

JialianW commented Mar 17, 2023

> Yes, it saved successfully at both the 10000th and 20000th iterations, which makes this look so weird.
>
> I could only find a similar issue here: pytorch/pytorch#42428 It seems to be an issue with the PyTorch version, so is the torch version you used < 1.6.0?

Please refer to the Installation instructions for our PyTorch version.

@Wykay

Wykay commented Mar 19, 2023

I have trained the model for the description task on Visual Genome successfully.
My environment setup follows INSTALL.md.

@hellowordo

@Evenyyy Hello, I'm sorry to bother you. Could you please tell me more details about evaluating the vg_instances_results.json file, or share your code? Thank you very much!

@yubo97

yubo97 commented Dec 30, 2023

> The model has been trained for 10k iterations and proceeds smoothly, so I believe this bug has been fixed.
>
> Thanks for the update. I did not add the above suggested code when I trained the model, and I am not sure why this became an issue for you. I would appreciate an update when you complete the training.

> Can you make a change here to make sure the input image does have ground truth:
>
> dataset_dict_out = self.prepare_data(dataset_dict)
>
> Can you add some code after that line, like: while len(dataset_dict_out["instances"].gt_boxes.tensor) == 0: dataset_dict_out = self.prepare_data(dataset_dict)
>
> This is to ensure "self.prepare_data" does not empty the ground truth when preparing data.

Thank you for your suggestion. I also encountered this issue. This problem has now been solved.
