Allow for images to contain zero true detections #1531
Conversation
The dashboards seem to be failing for all builds. I don't see any change in the history that might have caused this. Is there an issue with Travis?
Seems to be caused by the prebuilt PyTorch 1.3, which requires CUDA 10.1. It should be fixed in #1534.
Maybe we can also add an augmentation param to prevent pipelines.transforms.RandomCrop from returning None when the image has no annotations left?
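As a thought experiment on that suggestion, here is a minimal, hedged sketch (not mmdet code; the function name and xyxy bbox convention are assumptions for illustration) of a crop transform that returns the crop together with an explicitly empty (0, 4) bbox array instead of None, which the empty-gt-tolerant losses in this PR could then consume:

```python
import numpy as np

def random_crop_allow_empty(img, bboxes, crop):
    """Hypothetical crop that tolerates losing every annotation.

    img:    HxWxC array, bboxes: (N, 4) xyxy array, crop: (x0, y0, w, h).
    Returns the cropped image and the surviving boxes; if none survive,
    returns an empty (0, 4) array rather than None.
    """
    x0, y0, w, h = crop
    patch = img[y0:y0 + h, x0:x0 + w]
    if bboxes.size:
        # shift boxes into the crop's coordinate frame
        shifted = bboxes - np.array([x0, y0, x0, y0], dtype=bboxes.dtype)
        # keep boxes that still intersect the crop window
        keep = ((shifted[:, 0] < w) & (shifted[:, 1] < h) &
                (shifted[:, 2] > 0) & (shifted[:, 3] > 0))
        kept = np.clip(shifted[keep], 0, [w, h, w, h])
    else:
        kept = bboxes.reshape(0, 4)
    if kept.size == 0:
        # the key behavior: empty annotations, not None
        kept = np.zeros((0, 4), dtype=np.float32)
    return patch, kept
```

With this convention, downstream code never has to special-case a None annotation; an empty array flows through the same code path as a populated one.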
@AAnoosheh my main concern in this PR is to increase the robustness of the models regardless of the underlying dataloading / training loop. This PR doesn't fix the mmdet training loop because I think the dataloader explicitly ignores empty images. Perhaps the fix to the augmenter / data loader would be best addressed by a separate PR after this one is merged?
@Erotemic Also another question: do these changes still enforce a background-class loss on images without bounding boxes? Or do they just allow an image to pass through without error, but without computing a background loss? The codebase is confusing enough that I can't figure out what's actually going on.
@AAnoosheh, this codebase is certainly complex, but it's not insurmountable. Things are pretty well compartmentalized and most functions have exactly one job (I give a lot of credit to @hellock et al for this disciplined design), which makes it possible to grok things in small chunks. To answer your question ("do empty-gt batch items generate loss?"): yes, you can see this in my tests. If you want to ensure that the builtin mmdet trainer works with empty-gt, then I think you'll have to look in the datasets to ensure that no-gt items aren't skipped. You'll probably have to test that it works, but I think that's all that needs to be done.
Thanks for the PR. To test Faster R-CNN (R-50) when there is no gt, I hacked the program by inserting test code at the beginning of forward_train in two_stage.py, but the program fails. It seems there are some logic errors. For example, even when there is no gt, the sampler of the bbox head will still sample some proposals for training. As a result, cls_score.numel() is not zero and pos_inds is also not zero. Could you check and test it again?
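The behavior described above is expected: with zero ground-truth boxes the sampler still draws (all-negative) proposals, so the classification loss is computed over background rows while the box-regression loss must be guarded. A hedged toy sketch of that guard pattern (simplified stand-ins, not mmdet's actual BBoxHead.loss) looks like this:

```python
import numpy as np

def bbox_head_loss_sketch(cls_scores, labels, bbox_preds, bbox_targets):
    """Toy sketch of the guard pattern discussed in this PR.

    Classification loss is always computed (all-background rows still
    contribute a background-class loss), but the bbox regression loss
    is skipped when there are no positive samples to regress against.
    Label 0 is assumed to mean background here.
    """
    losses = {}
    if cls_scores.size > 0:
        # toy softmax cross-entropy over all sampled proposals
        probs = np.exp(cls_scores - cls_scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        losses['loss_cls'] = float(
            -np.log(probs[np.arange(len(labels)), labels]).mean())
    pos_inds = np.nonzero(labels > 0)[0]
    if pos_inds.size > 0:
        # toy L1 regression loss, positives only
        diff = bbox_preds[pos_inds] - bbox_targets[pos_inds]
        losses['loss_bbox'] = float(np.abs(diff).mean())
    return losses
```

With all-background labels the returned dict contains only 'loss_cls', which matches the "empty-gt items still generate loss" answer earlier in the thread.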
Hello! I forked your pull request and tried to train Cascade RCNN + Guided Anchoring, and I got the following error. This error is 100% connected with background pictures, because it went away when I changed the flag. I guess this problem occurred not because of your PR, but because the background-pictures problem is much deeper and there is still a lot of work to be done.
@yhcao6 you are right, my faster_rcnn test is broken. Unfortunately it doesn't run on Travis CI because the RoIAlign forward implementation needs a GPU, or else I did have a logic error somewhere. Correct me if I'm wrong, but I think that's the issue.

@LitvinchukAndrey Yes, there are probably several other code paths where empty bounding boxes will still cause a problem. Because this problem is pretty big, it might make sense to fix it incrementally, which will help prevent the PRs from becoming monolithic. In this PR I'm trying to focus only on issues in models. Furthermore, I'm only fixing code where I encounter the problem and can write a unit test demonstrating that the problem is fixed. I'm explicitly not fixing the code in apis and datasets. Once we have unit tests demonstrating that the loss functions can handle empty truth, it should be easier to go back and fix those other components of the system. However, it does look like the problem you are encountering is in models/guided_anchor_head. Perhaps you can follow my examples there and submit a fix?
Very excited for this PR to get merged. I'm seeing strange errors when trying to train Cascade RCNN on 32 GPUs that may be addressed by this PR.
@Erotemic I tried to remove all features like Guided Anchoring and Libra RCNN and train vanilla Cascade RCNN and Faster RCNN, but with both of them I meet the same problem. I debugged it and found the cause; either that needs to change, or we need to fix the relevant code.
@LitvinchukAndrey I am completely unable to reproduce your issue. I went as far as to use the mmdet train scripts to attempt to reproduce it. I used configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc0712.py (I don't have the COCO dataset on my machine) and modified the annotation loading to return empty ground truth:

ann_info['bboxes'] = np.empty(shape=(0, 4), dtype=ann_info['bboxes'].dtype)
ann_info['labels'] = np.empty(shape=(0), dtype=ann_info['labels'].dtype)
ann_info['bboxes_ignore'] = np.empty(shape=(0, 4), dtype=ann_info['bboxes_ignore'].dtype)
ann_info['labels_ignore'] = np.empty(shape=(0), dtype=ann_info['labels_ignore'].dtype)

I also added print debugging:

print('gt_bboxes[{}] = {}'.format(i, gt_bboxes[i]))
print('gt_labels[{}] = {}'.format(i, gt_labels[i]))
print('gt_bboxes_ignore[i] = {}'.format(gt_bboxes_ignore[i]))
print('proposal_list[i].shape = {}'.format(proposal_list[i].shape))
print('sampling_result.bboxes.shape = {!r}'.format(sampling_result.bboxes.shape))
print('sampling_result.neg_bboxes.shape = {!r}'.format(sampling_result.neg_bboxes.shape))
print('sampling_result.neg_inds.shape = {!r}'.format(sampling_result.neg_inds.shape))
print('sampling_result.num_gts = {!r}'.format(sampling_result.num_gts))
print('sampling_result.pos_assigned_gt_inds = {!r}'.format(sampling_result.pos_assigned_gt_inds))
print('sampling_result.pos_bboxes = {!r}'.format(sampling_result.pos_bboxes))
print('sampling_result.pos_gt_bboxes = {!r}'.format(sampling_result.pos_gt_bboxes))
print('sampling_result.pos_gt_labels = {!r}'.format(sampling_result.pos_gt_labels))
print('sampling_result.pos_inds = {!r}'.format(sampling_result.pos_inds))
print('sampling_result.pos_is_gt = {!r}'.format(sampling_result.pos_is_gt))
print('sampling_result = {!r}'.format(sampling_result))

And got output like:

gt_bboxes[0] = tensor([], device='cuda:0', size=(0, 4))
gt_labels[0] = tensor([], device='cuda:0', dtype=torch.int64)
gt_bboxes_ignore[i] = None
proposal_list[i].shape = torch.Size([2000, 5])
sampling_result.bboxes.shape = torch.Size([512, 4])
sampling_result.neg_bboxes.shape = torch.Size([512, 4])
sampling_result.neg_inds.shape = torch.Size([512])
sampling_result.num_gts = 0
sampling_result.pos_assigned_gt_inds = tensor([], device='cuda:0', dtype=torch.int64)
sampling_result.pos_bboxes = tensor([], device='cuda:0', size=(0, 4))
sampling_result.pos_gt_bboxes = tensor([], device='cuda:0', size=(0, 4))
sampling_result.pos_gt_labels = None
sampling_result.pos_inds = tensor([], device='cuda:0', dtype=torch.int64)
sampling_result.pos_is_gt = tensor([], device='cuda:0', dtype=torch.uint8)
sampling_result = <mmdet.core.bbox.samplers.sampling_result.SamplingResult object at 0x7faf26787510>
gt_bboxes[1] = tensor([], device='cuda:0', size=(0, 4))
gt_labels[1] = tensor([], device='cuda:0', dtype=torch.int64)
gt_bboxes_ignore[i] = None
proposal_list[i].shape = torch.Size([2000, 5])
sampling_result.bboxes.shape = torch.Size([512, 4])
sampling_result.neg_bboxes.shape = torch.Size([512, 4])
sampling_result.neg_inds.shape = torch.Size([512])
sampling_result.num_gts = 0
sampling_result.pos_assigned_gt_inds = tensor([], device='cuda:0', dtype=torch.int64)
sampling_result.pos_bboxes = tensor([], device='cuda:0', size=(0, 4))
sampling_result.pos_gt_bboxes = tensor([], device='cuda:0', size=(0, 4))
sampling_result.pos_gt_labels = None
sampling_result.pos_inds = tensor([], device='cuda:0', dtype=torch.int64)
sampling_result.pos_is_gt = tensor([], device='cuda:0', dtype=torch.uint8)
sampling_result = <mmdet.core.bbox.samplers.sampling_result.SamplingResult object at 0x7faeac7a64d0>
loss_bbox = {'loss_cls': tensor(3.0426, device='cuda:0', grad_fn=<MulBackward0>), 'acc': tensor([1.8555], device='cuda:0')}

So, this experiment shows that the mmdet trainer does work with the current version of this PR (given that you set the annotations to be empty as above).
I've rebased on master, removed test_forward2, and consolidated it with test_forward. I also determined that my initial guess on how to handle the empty case was wrong; the correct behavior seems to be creating an empty result instead.
Thanks very much for your hard work. I found another inconsistency. To test the OHEM sampler when there is no gt, I inserted test code at the beginning of forward_train in two_stage.py, but the program fails. Could you check that?
@yhcao6 I think I've addressed the issue. When I was looking into it I found that I may have been setting the assign result slightly incorrectly. Previously I was setting AssignResult.max_overlaps to an empty tensor when there were no truth boxes; however, I believe it should be a 1-D zero tensor with shape equal to the number of predicted boxes, to indicate that no predicted box had any overlap with the truth.

While debugging this I had to inspect the contents of the AssignResult class often. To make this easier I added a __repr__ method. While inspecting AssignResult, I also noticed what seemed to be a bug, and I added a doctest covering it. Finally, I added a standalone test for samplers.

Please take a look and let me know if there are any other outstanding issues.
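The convention described above can be sketched concretely. This is a hedged illustration in plain NumPy (not mmdet's actual AssignResult code; the function name is made up): with zero truth boxes, every prediction is assigned to background (index 0, not -1, which means "don't care"), and max_overlaps is a 1-D zero vector with one entry per prediction.

```python
import numpy as np

def empty_assign_result(num_preds, gt_labels=None):
    """Sketch of the empty-assignment convention discussed above.

    gt_inds:      0 for every prediction (0 == background, -1 == ignore)
    max_overlaps: 1-D zeros, one per prediction (no overlap with any truth)
    labels:       background label for every prediction, but only when
                  gt_labels was supplied (mirroring assign-with-labels)
    """
    gt_inds = np.zeros(num_preds, dtype=np.int64)
    max_overlaps = np.zeros(num_preds, dtype=np.float32)
    labels = np.zeros(num_preds, dtype=np.int64) if gt_labels is not None else None
    return gt_inds, max_overlaps, labels
```

Keeping max_overlaps sized to the predictions (rather than empty) means downstream consumers that index it per-prediction work unchanged whether or not truth boxes exist.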
return ', '.join(parts)

def __repr__(self):
    devnice = self.__nice__()

Review comment: devnice --> device
self.max_overlaps = torch.cat(
-    [self.max_overlaps.new_ones(self.num_gts), self.max_overlaps])
+    [self.max_overlaps.new_ones(len(gt_labels)), self.max_overlaps])

Review comment: why is num_gts not equal to len(gt_labels)?
if num_squares == 0 or num_gts == 0:
    # No predictions and/or truth, return empty assignment
    overlaps = approxs.new(num_gts, num_squares)

Review comment: the overlaps initialization is not consistent across assigners.
In approx_max_iou_assigner: overlaps = approxs.new(num_gts, num_squares)
In max_iou_assigner: max_overlaps = overlaps.new_zeros((num_bboxes, ))
In point_assigner: max_overlaps = None
>>> assert tuple(x.shape) == (0, 1)
"""
if x.numel() == 0:
    num_trailing = reduce(mul, x.shape[1:], 1)

Review comment: reduce and mul introduce two extra packages (functools and operator); there should be a better way to implement the product of x.shape[1:].

Reply: How about x = x.flatten(1)?
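The edge case this review thread is discussing also exists in NumPy, which makes it easy to demonstrate without a GPU. Inferring a dimension with -1 is ambiguous when the array has zero elements (0 == 0 * anything), so reshape refuses, just as torch's view does; computing the trailing size explicitly sidesteps the inference. (Torch's x.flatten(1), suggested above, avoids the problem for the same reason: it never needs to infer a size.) A minimal demonstration:

```python
import numpy as np
from functools import reduce
from operator import mul

x = np.zeros((0, 2, 3), dtype=np.float32)

# -1 is ambiguous for a zero-element array: NumPy raises ValueError,
# analogous to torch's x.view(x.shape[0], -1) failing on empty tensors.
try:
    x.reshape(x.shape[0], -1)
    inferred = True
except ValueError:
    inferred = False

# the workaround discussed above: compute the trailing size explicitly
num_trailing = reduce(mul, x.shape[1:], 1)  # 2 * 3 == 6
flat = x.reshape(x.shape[0], num_trailing)
```

The explicit product makes the target shape fully determined, so the reshape of an empty array succeeds with the expected (0, 6) shape.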
@yhcao6 I fixed the linting errors causing Travis to fail. Any chance this could get merged in the near future?
@Erotemic Thanks for your contribution. This is an important and non-trivial improvement, so we have to be cautious. We are now doing a final test, and it is very near to being merged.
@hellock I agree that it is best to be cautious. I think merging this change is likely to break something, albeit that something should be small, considering that the test cases cover a large portion of this functionality. I've actually run into one of these issues recently: training Mask R-CNN using segmentation masks breaks when there is no truth. The fix is simple (just don't compute the mask loss if you have no masks), but I doubt this is the only remaining issue with enabling this feature.
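The "just don't compute the mask loss" fix mentioned above can be sketched in a few lines. This is a hedged toy illustration (not mmdet's actual mask head; the function name and the squared-error stand-in are made up for clarity):

```python
import numpy as np

def mask_head_loss_sketch(mask_preds, gt_masks):
    """Toy sketch of skipping the mask loss when there is no truth.

    If the image has no ground-truth masks there is nothing to regress
    against, so the mask loss term is omitted from the loss dict
    entirely instead of being computed on an empty tensor (which is
    what crashed).
    """
    losses = {}
    if len(gt_masks) > 0:
        # toy stand-in for the real per-pixel mask loss
        losses['loss_mask'] = float(np.mean((mask_preds - gt_masks) ** 2))
    return losses
```

Because the overall loss is the sum of whatever terms each head contributes, simply leaving 'loss_mask' out of the dict for empty-truth images keeps the backward pass well-defined.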
Yes, the core logic for supporting empty gts is not likely to break, since the coverage of the test cases is quite good. We mainly need to verify that other parts, like data pre-processing and logging, are working well. I trained two models for person detection, with and without using images that have no person annotations. It is confirmed that those images are definitely used for training. The performance of the two models is similar, and I believe that hyper-parameters need to be tuned when using additional background images, which is out of the scope of this PR. Overall, this PR looks good to me now. I've also reproduced the errors for empty masks. You may push a quick fix.
On a software note: I pushed up the fix for empty masks.

On a research note: I'm not surprised that models trained with and without empty images are comparable. I think that, on average, the negative cases in an image without any objects of interest won't provide SGD with much more information than images with objects of interest. However, it does open the possibility of finding and including truly difficult examples that will benefit SGD; I think that is where adding this feature will really shine in terms of improving PR / ROC curves. It also makes mmdet more robust to unseen datasets, which will often contain images without any annotations. (The main reason I'm interested in getting this merged in a timely fashion is that we want to let the users of our VIAME project train models on their custom datasets, which will almost certainly contain empty images.)
Yes, hard negative mining is usually applied to a pool of background images in practice. I fixed the padding transform for empty masks; you may have a check. I will merge it if there are no further issues.
@Erotemic Thanks for your enthusiasm and nice work! It finally got merged.
https://github.com/open-mmlab/mmdetection/tree/v1.0.0 I use version 1.0.0 and added unlabeled background image training. There is no problem when using a single GPU, but when I try to train with multiple GPUs by running the script "./tools/dist_train.sh ./configs/mask_rcnn_r50_fpn_1x.py 4", I hit a problem and it gets stuck for hours. Please help.
And the occupancy rate of the GPUs is 100%. I tested this myself: as soon as multi-GPU training includes unlabeled background images, it gets stuck.
@yangninghua
* Allow for images to contain zero true detections
* Allow for empty assignment in PointAssigner
* Allow ApproxMaxIouAssigner to return an empty result
* Fix CascadeRNN forward when entire batch has no truth
* Correctly assign boxes to background when there is no truth
* Fix assignment tests
* Make flatten robust
* Fix bbox loss with empty pred/truth
* Fix logic error in BBoxHead.loss
* Add tests for empty truth cases
* tests faster rcnn empty forward
* Skip roipool forward tests if torchvision is not installed
* Add tests for bbox/anchor heads
* Consolidate test_forward and test_forward2
* Fix assign_results.labels = None when gt_labels is given; Add test for this case
* Fix OHEM Sampler with zero truth
* remove xdev
* resolve 3 reviews
* Fix flake8
* refactoring
* fix yaml format
* add filter flag
* minor fix
* delete redundant code in load anno
* fix flake8 errors
* quick fix for empty truth with masks
* fix yapf error
* fix mask padding for empty masks

Co-authored-by: Cao Yuhang <yhcao6@gmail.com>
Co-authored-by: Kai Chen <chenkaidev@gmail.com>
How should I format a COCO-style dataset JSON to take into account these "pure background" images?
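In COCO format, a pure-background image is simply listed in "images" while no entry in "annotations" references its id; there is no special marker. Below is a minimal, hedged example (the file names, sizes, and category are made up for illustration). Note that the PR's commit log mentions adding a filter flag; in recent mmdetection versions the dataset config's filter_empty_gt option controls whether such images are dropped, so it must be disabled for them to be used in training.

```python
import json

# Minimal COCO-style dataset where image id 2 is pure background:
# it appears in "images" but no annotation references it.
coco = {
    "images": [
        {"id": 1, "file_name": "with_objects.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "background_only.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [10, 20, 100, 80], "area": 8000, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "person"}],
}

# Round-trip through JSON and identify the background-only images.
loaded = json.loads(json.dumps(coco))
annotated_ids = {a["image_id"] for a in loaded["annotations"]}
background_ids = [im["id"] for im in loaded["images"]
                  if im["id"] not in annotated_ids]
```

Here background_ids contains only image id 2, confirming that "no annotations pointing at the image" is the whole encoding.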
When I went to train a CascadeRCNN on my dataset, the loss computation failed when it loaded an image that had no truth boxes on it. I'm a bit surprised that this was an issue. Perhaps I'm using the library incorrectly? Does this library expect that there exists some magic negative bounding box in cases where there are really no objects of interest in an image?
If this is indeed a real issue, I think I fixed it. I also added tests to ensure that these corner cases don't break in the future. The main issue was in MaxIoUAssigner, which explicitly disallowed both the number of predicted boxes being zero and the number of truth boxes being zero. I modified the code so it instead returns an appropriate empty assignment if either the truth or the predictions have no boxes.

There was also an issue in bbox_head, where it asserted that all images had truth. I simply removed this check, and I believe the rest of the code still functions correctly (but it would be good if someone could double check this).
Lastly I added some docs to AssignResult to make it clear what the object contains.
EDIT 2019-10-18: I also fixed the ApproxMaxIoUAssigner and PointAssigner and added corresponding test cases.
There were issues in CascadeRCNN, where it would crash when trying to RoiAlign the assigned ROIs in the case where the assignment was empty. I simply added some logic to skip that step, which is the correct thing to do.
There was an issue in loss_bbox, where it failed to compute a bbox loss when all boxes are assigned to the background. Again, the fix for this case is a simple check that skips the computation of that loss term.
There was also an issue in my previous code where, if there was no truth, all predicted boxes got a gt_ind of -1, which means "don't care". I fixed this so they now correctly get assigned 0, which means background.
Lastly, I added two tests to make sure cascade rcnn could compute losses for batches that had no truth boxes. I also added a test case for AnchorHead loss to ensure it computes background loss correctly in the case where the batch has no truth.
EDIT 2019-10-21: I found another edge case in convfc_bbox_head, where x.view(x.shape[0], -1) raised an error when x.shape[0] was 0. I added a function _view_flat_trailing_dims which tests for and handles this case.

EDIT 2019-10-29: I fixed a logic error where I wrote pos_inds.numel() instead of pos_inds.any(), rebased on master, and added tests for BBoxHead.