Bug in a corner case of proposals #1
Comments
Thanks for your interest in GRiT and for re-training it on VG. Do you know whether this error comes from "proposals[0].proposal_boxes.tensor[0, :]" or "targets[0].gt_boxes.tensor[0, :]"? If it is from the former, I haven't seen a case where there are no proposals; there should always be some. Can you check whether it is because there isn't any ground truth? It would be great if you could print out this line of code to determine whether the issue is from the proposals or the ground truth.
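A minimal sketch of such a debug check, assuming detectron2-style `Instances` with `proposal_boxes` and `gt_boxes` fields as in the snippet quoted above (the helper itself is hypothetical, not from the repo):

```python
# Hypothetical debug helper: report how many boxes each image in the batch
# carries on the proposal side and on the ground-truth side, just before
# the failing indexing. `proposals` / `targets` are lists of detectron2
# Instances as passed into the ROI head.
def report_batch(proposals, targets):
    for i, (p, t) in enumerate(zip(proposals, targets)):
        print(f"image {i}: proposals={len(p.proposal_boxes)}, gt={len(t.gt_boxes)}")
    # proposals == 0 -> crash is at proposals[0].proposal_boxes.tensor[0, :]
    # gt == 0        -> crash is at targets[0].gt_boxes.tensor[0, :]
```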
Hello, I've met the same issue here. Is there any workaround yet?
Hello, I think the problem is on the proposal side. As shown in the code, the function "check_if_all_background" is used twice, and the error occurs the second time. Because "targets" doesn't change and the first call works fine, I think the issue arises on the proposal side, where no proposals are generated.
Do you mean that "check_if_all_background" works fine at the beginning of the "_forward_box" function? Once it enters the ROI head, the number of proposals shouldn't change regardless of which cascade stage it is at.
Yes, I think the problem arises on the proposal side, because the first call seems fine. Do you mean the problem is caused by wrong ground truth? That still looks strange, because the ground truth is not modified in this function. Do you have any idea how to solve this? It seems to be a common issue when running the object detection task: #5 (comment)
The error comes from the GT being empty instances. So can you share the COCO JSON that you used with us? @JialianW
We used the official annotations from the COCO website. Images without ground truth should already be discarded, as shown in GRiT/grit/data/datasets/grit_coco.py, line 93 (commit 39b33db).
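For reference, an illustrative sketch of that kind of dataset-side filtering (not the actual contents of grit_coco.py, which should be consulted directly):

```python
# Illustrative only (not the actual grit_coco.py code): skip images whose
# annotation list is empty when building the dataset dicts, so that no
# ground-truth-free image enters training from the dataset side.
def filter_images_without_gt(dataset_dicts):
    kept = [d for d in dataset_dicts if len(d.get("annotations", [])) > 0]
    print(f"kept {len(kept)}/{len(dataset_dicts)} images with ground truth")
    return kept
```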
Can you post your config file?
I also use the official JSON files from COCO and run this code without any modifications.
The reason the first call of "check_if_all_background" doesn't raise an error may be that it didn't enter "if all_background:". Probably the ground truth is empty from the beginning. In that case, the ground truth may have been removed while the images were being augmented, so that a background crop was fed into the model. Did you use our provided config file without any changes?
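A hedged sketch of how augmentation can empty the ground truth, using standard detectron2 utilities rather than GRiT's own mapper (the augmentation choice and crop size here are illustrative assumptions):

```python
# Sketch: after a random crop, boxes outside the crop are filtered away,
# so a crop over pure background leaves zero ground-truth instances.
import detectron2.data.transforms as T
from detectron2.data import detection_utils as utils

aug = T.AugmentationList([T.RandomCrop("relative_range", (0.5, 0.5))])

def map_with_crop(dataset_dict):
    image = utils.read_image(dataset_dict["file_name"], format="BGR")
    aug_input = T.AugInput(image)
    transforms = aug(aug_input)
    annos = [
        utils.transform_instance_annotations(a, transforms, aug_input.image.shape[:2])
        for a in dataset_dict["annotations"]
    ]
    instances = utils.annotations_to_instances(annos, aug_input.image.shape[:2])
    return utils.filter_empty_instances(instances)  # may contain 0 boxes
```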
Yes, without any change. And this error also shows up when other people run it.
Can you make a change here to make sure the input image does have ground truth: GRiT/grit/data/custom_dataset_mapper.py, line 53 (commit 62ee07f).
Can you add some code after that line to ensure "self.prepare_data" does not empty the ground truth when preparing data?
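A minimal sketch of such a guard, assuming a detectron2-style "instances" key in the prepared sample and using the `prepare_data` method quoted above (a hypothetical reconstruction of the idea, not the author's actual snippet):

```python
# Hypothetical guard for custom_dataset_mapper.py: keep re-running data
# preparation until the augmented sample still carries at least one
# ground-truth instance, so an all-background crop never reaches the model.
import copy

def prepare_nonempty(mapper, dataset_dict, max_retries=50):
    for _ in range(max_retries):
        data = mapper.prepare_data(copy.deepcopy(dataset_dict))
        if len(data["instances"]) > 0:  # assumes detectron2-style "instances"
            return data
    return data  # fall back to the last attempt if nothing survives
```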
Okay, I will try it as you suggested.
It looks fine so far. I will let you know later whether it is fixed.
The model has been trained for 10k iterations and proceeds smoothly, so I believe this bug has been fixed.
Thanks for the update. I didn't add the above suggested code when I trained the model, so I'm not sure why this became an issue for you guys. I would appreciate an update when you complete the training.
When it gets to 22k iterations, an OOM (out of memory) error occurs. GPU memory usage increases as training progresses, as shown below:

```
[03/16 10:20:04 d2.utils.events]: eta: 1 day, 23:52:09  iter: 80  total_loss: 7.998  loss_box_reg_stage0: 0.06707  loss_box_reg_stage1: 0.06987  loss_box_reg_stage2: 0.02389  loss_centernet_agn_neg: 0.05723  loss_centernet_agn_pos: 0.3456  loss_centernet_loc: 0.7329  loss_cls_stage0: 0.2401  loss_cls_stage1: 0.2133  loss_cls_stage2: 0.1666  loss_mask: 0.6922  text_decoder_loss: 5.463  time: 0.9160  last_time: 1.0366  data_time: 0.0180  last_data_time: 0.0144  lr: 4.878e-08  max_mem: 4452M
[03/16 13:20:32 d2.utils.events]: eta: 2 days, 1:20:37  iter: 10460  total_loss: 2.685  loss_box_reg_stage0: 0.1964  loss_box_reg_stage1: 0.2311  loss_box_reg_stage2: 0.1316  loss_centernet_agn_neg: 0.04058  loss_centernet_agn_pos: 0.2017  loss_centernet_loc: 0.4001  loss_cls_stage0: 0.179  loss_cls_stage1: 0.159  loss_cls_stage2: 0.113  loss_mask: 0.4384  text_decoder_loss: 0.6743  time: 1.0258  last_time: 1.0547  data_time: 0.0179  last_data_time: 0.0608  lr: 7.687e-07  max_mem: 21002M
[03/16 16:58:19 d2.utils.events]: eta: 1 day, 23:35:12  iter: 22300  total_loss: 2.63  loss_box_reg_stage0: 0.2285  loss_box_reg_stage1: 0.2795  loss_box_reg_stage2: 0.1659  loss_centernet_agn_neg: 0.04165  loss_centernet_agn_pos: 0.1809  loss_centernet_loc: 0.3561  loss_cls_stage0: 0.1952  loss_cls_stage1: 0.1732  loss_cls_stage2: 0.1347  loss_mask: 0.395  text_decoder_loss: 0.4412  time: 1.0581  last_time: 1.2151  data_time: 0.0211  last_data_time: 0.0035  lr: 7.4622e-07  max_mem: 37269M
```

My experiments are run on 8x A100 GPUs. How many GPUs do you use for training? Or have you met this before?
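For readers who want to track the same growth themselves, a small hedged sketch of how the `max_mem` figure above can be read directly in PyTorch (this snippet is illustrative and not part of the thread):

```python
# Illustrative: detectron2's max_mem column reports peak allocated CUDA
# memory; the same number can be read (and the counter reset) directly.
import torch

peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak GPU memory so far: {peak_mb:.0f} MB")
torch.cuda.reset_peak_memory_stats()  # start a fresh measurement window
```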
Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.
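A back-of-the-envelope check of the effective batch size implied by those settings (my own arithmetic, not taken from the GRiT configs):

```python
# Effective batch sizes implied by the quoted training settings.
settings = {
    "ViT-B":   {"gpus": 32, "images_per_gpu": 2},
    "ViT-L/H": {"gpus": 64, "images_per_gpu": 1},
}
for name, s in settings.items():
    print(name, "effective batch size =", s["gpus"] * s["images_per_gpu"])
# Both give 64 images per iteration, so an 8-GPU run would need 8 images
# per GPU (or gradient accumulation) to keep the same schedule.
```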
The above bug seems to be fixed; the results for the 20000th iteration are as follows (results attachment). They were trained with 8 x A100 cards. Can you share the results for the same checkpoint, so as to verify the bug is fixed? Besides, the training breaks at the 29980th iteration with the following error: Traceback (most recent call last): ... Have you met this before?
I haven't met this error before. It looks like the error comes from saving the checkpoint. Was your previous checkpoint saved successfully?
Yes, it saved successfully at both the 10000th and 20000th iterations, so it looks quite weird. I could only find a similar issue here: pytorch/pytorch#42428
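One defensive pattern for flaky checkpoint writes, offered here as an assumption-laden sketch rather than something suggested in the thread, is to write to a temporary file and rename it into place:

```python
# Hedged workaround sketch (not from the thread): a failed torch.save then
# never clobbers the previous good checkpoint.
import os
import torch

def safe_save(state, path):
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```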
Please refer to the Installation instructions for our PyTorch version.
I have successfully trained the model for the description task on Visual Genome.
@Evenyyy Hello, I'm sorry to bother you. Could you please tell me more details about evaluating the vg_instances_results.json file, or share your code? Thank you very much!
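The poster's evaluation code is not included in the thread; a hedged sketch of a COCO-style box evaluation of such a results file, assuming a Visual Genome validation set already converted to COCO format (the file name "vg_val_coco.json" is an assumption):

```python
# Hypothetical sketch: COCO-style box evaluation of vg_instances_results.json.
# Assumes the results file is in the standard COCO detection-results format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("vg_val_coco.json")
coco_dt = coco_gt.loadRes("vg_instances_results.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
```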
Thank you for your suggestion. I also encountered this issue. This problem has now been solved.
Hi,
Thanks for your amazing work. I tried to retrain the model on VG; however, there seems to be a corner case that raises an error.
The error seems to indicate that there aren't any proposals for this batch, and it can be easily reproduced by single-node training at around iteration 1360.
Would you mind checking it, as I'm not familiar enough with this repo?