
all-zero gt_boxes are loaded and cause runtime error #330

Closed
gardenbaby opened this issue Oct 25, 2020 · 11 comments
Labels
enhancement New feature or request to be closed

Comments

@gardenbaby

Hi, team,

I noticed that in dataset.py the following code was added to prevent loading a frame without gt_boxes.

if len(data_dict['gt_boxes']) == 0:
    new_index = np.random.randint(self.__len__())
    return self.__getitem__(new_index)

But it still happens that the gt_boxes in batch_dict for certain frames are all zeros before the batch reaches the model. Some operations (such as torch.max()) cannot handle this situation.

Here I added the following code in point_rcnn.py at the beginning of forward():

import pdb  # needed for the breakpoint below

def forward(self, batch_dict):
    for gt_box in batch_dict['gt_boxes']:
        if gt_box.max() == gt_box.min() == 0:
            pdb.set_trace()
    ...

Every time it stopped here, I printed batch_dict['gt_boxes'] and found one of the frames with all-zero gt_boxes, as follows.

(Pdb) gt_boxes = batch_dict['gt_boxes']
(Pdb) gt_boxes[0]
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')

Since PointRCNN uses PointResidualCoder, if gt_boxes is all zeros, no foreground gt_boxes will be selected and the input to PointResidualCoder will be empty. The following assert in encode_torch() then raises a RuntimeError:

assert gt_classes.max() <= self.mean_size.shape[0]

RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /opt/conda/conda-bld/pytorch_1587428270644/work/aten/src/THC/generic/THCTensorMathReduce.cu:85

Here I trained the model with CLASS_NAMES: ['Cyclist'].

Is this expected, and what could be the reason behind it?
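To make the failure mode concrete, here is a minimal NumPy sketch (an illustration, not the OpenPCDet code itself): selecting foreground boxes from an all-zero gt_boxes array yields an empty array, and reducing over it raises, analogous to the torch error above.

```python
import numpy as np

# Hypothetical stand-in for one frame's gt_boxes: 9 boxes x 8 values
# (7 box parameters + 1 class label), all zeros, as in the pdb dump above.
gt_boxes = np.zeros((9, 8))

# Foreground boxes are those with a non-zero class label (last column).
fg_mask = gt_boxes[:, -1] > 0
fg_boxes = gt_boxes[fg_mask]
print(fg_boxes.shape)  # (0, 8) -- no foreground boxes survive

# Reducing an empty array fails, just like gt_classes.max() in encode_torch().
try:
    fg_boxes[:, -1].max()
except ValueError as e:
    print("reduction on empty selection fails:", e)
```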

@gardenbaby
Author

I reviewed the code and found that the problem comes from DataBaseSampler.

The IoUs of the sampled boxes are computed in __call__() (database_sampler.py), and only boxes that do not overlap with others are kept, so valid_mask (line 188) can be empty.

So when I choose only one class for training, such as Cyclist, even if the following constraint in dataset.py (line 127) is satisfied, all the remaining boxes could be anything but Cyclist.

 if len(data_dict['gt_boxes']) == 0:
     new_index = np.random.randint(self.__len__())
     return self.__getitem__(new_index)

When execution reaches the following code at line 132 of dataset.py, selected will be empty, so in the end no gt_boxes are selected.

selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
data_dict['gt_names'] = data_dict['gt_names'][selected]

So I changed the code at line 127 of dataset.py from

if len(data_dict['gt_boxes']) == 0:
    ...

to

gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)
if gt_boxes_mask.sum() == 0:
    ...

It seems to work.
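To illustrate why the mask-based check catches frames that the length check misses, here is a small self-contained sketch with hypothetical frame contents:

```python
import numpy as np

# Hypothetical frame: gt_boxes exist, but none belong to the training class.
class_names = ['Cyclist']
gt_names = np.array(['Car', 'Pedestrian', 'Car'])
gt_boxes = np.zeros((3, 7))  # placeholder geometry; the content is irrelevant here

# Original check: passes, because the frame does have gt_boxes.
assert len(gt_boxes) > 0

# Proposed check: catches the frame, because no box matches class_names.
gt_boxes_mask = np.array([n in class_names for n in gt_names], dtype=np.bool_)
print(gt_boxes_mask.sum())  # 0 -> reload another frame instead
```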

@sshaoshuai
Collaborator

@gardenbaby , thank you for the information and I will check that.

sshaoshuai added the enhancement label on Oct 27, 2020
@xjjs

xjjs commented Oct 28, 2020

So when I choose only one class for training, such as Cyclist, even if the following constraint in dataset.py (line 127) is satisfied, all the remaining boxes could be anything but Cyclist.

 if len(data_dict['gt_boxes']) == 0:
     new_index = np.random.randint(self.__len__())
     return self.__getitem__(new_index)

If you choose only one class for training (Cyclist), then len(data_dict['gt_boxes']) == 0 means there are no gt_boxes for Cyclist.
However, some bin files in the 'gt_database' folder can be empty (0 size), so I think if the original gt_boxes contain no Cyclist and the sampled bins are empty, then we hit the error: assert gt_classes.max() <= self.mean_size.shape[0]

@xjjs

xjjs commented Oct 29, 2020


I still get the runtime error with the following change:

        if gt_boxes_mask.sum() == 0:
            new_index = np.random.randint(self.__len__())
            return self.__getitem__(new_index)

        data_dict = self.data_augmentor.forward(
            data_dict={
                **data_dict,
                'gt_boxes_mask': gt_boxes_mask
            }
        )
        # if len(data_dict['gt_boxes']) == 0:
        #     new_index = np.random.randint(self.__len__())
        #     return self.__getitem__(new_index)

@MartinHahner
Contributor

@gardenbaby
I applied your fix in my codebase and I had no issues with it 👍

@gardenbaby
Author

@xjjs

If you choose only one class for training (Cyclist), then len(data_dict['gt_boxes']) == 0 means there are no gt_boxes for Cyclist.
However, some bin files in the 'gt_database' folder can be empty (0 size), so I think if the original gt_boxes contain no Cyclist and the sampled bins are empty, then we hit the error: assert gt_classes.max() <= self.mean_size.shape[0]

I've checked gt_database and there are samples for different categories. Later I found the issue may be caused by sample filtering in DataBaseSampler and by the condition that re-triggers __getitem__() in DatasetTemplate. Please refer to my comment here for details.

I still get the runtime error with the following change:

        if gt_boxes_mask.sum() == 0:
            new_index = np.random.randint(self.__len__())
            return self.__getitem__(new_index)

        data_dict = self.data_augmentor.forward(
            data_dict={
                **data_dict,
                'gt_boxes_mask': gt_boxes_mask
            }
        )
        # if len(data_dict['gt_boxes']) == 0:
        #     new_index = np.random.randint(self.__len__())
        #     return self.__getitem__(new_index)

Simply commenting out the last 3 lines may not help, since they are there to recursively load another frame when no gt_boxes are found in the current one. My solution only changed the if condition.

I replaced

if len(data_dict['gt_boxes']) == 0:

with the following 2 lines.

gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)
if gt_boxes_mask.sum() == 0:

Please note that the new gt_boxes_mask here rechecks the gt_boxes of the selected categories, and is different from the one at line 119 of dataset.py.

@gardenbaby
Author

@MartinHahner
Thanks for the recheck.

@xjjs

xjjs commented Oct 30, 2020

@gardenbaby

gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)
if gt_boxes_mask.sum() == 0:


Please note that I added a new `gt_boxes_mask` here to recheck `gt_boxes` of the selected category, which is different from the one in line 119 [dataset.py](https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/datasets/dataset.py).

Thanks for your detailed reply. I have reviewed the code and found the following in database_sampler.py (line 196):

    if total_valid_sampled_dict.__len__() > 0:
        data_dict = self.add_sampled_boxes_to_scene(data_dict, sampled_gt_boxes, total_valid_sampled_dict)

If valid_mask (line 188) is empty, then self.add_sampled_boxes_to_scene will not be called, and gt_boxes_mask will not be used to update the gt_boxes (line 120).

Thus the data_dict returned in dataset.py (line 121) still keeps gt_boxes for other categories.

Still, some of the bin files in the gt_database folder are empty, which means the corresponding gt box contains no points; such boxes could be filtered out during database generation.
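As a possible guard against the empty bins described above, one could skip zero-size .bin files when loading or sampling from the database. This is only a sketch with hypothetical paths and a made-up helper name, not the OpenPCDet API:

```python
import os

def has_points(bin_path: str) -> bool:
    """Return True if the .bin file exists and is non-empty.

    Hypothetical helper; KITTI-style .bin files store raw float32 points,
    so a zero-size file means the gt box contained no points.
    """
    return os.path.isfile(bin_path) and os.path.getsize(bin_path) > 0

# Usage sketch: filter a (hypothetical) list of database entries before sampling.
db_infos = [{'path': 'gt_database/000001_Cyclist_0.bin'},
            {'path': 'gt_database/000002_Cyclist_1.bin'}]
valid_infos = [info for info in db_infos if has_points(info['path'])]
```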

@gardenbaby
Author

@xjjs
Thanks for the information.

@xjjs

xjjs commented Nov 3, 2020

@gardenbaby

With the code change, the runtime error for "assert gt_classes.max() <= self.mean_size.shape[0]" happens less often. But I got one today; I think the problem may lie in the empty bins in the gt_database folder, which contain no points but can still be sampled.

@sshaoshuai
Collaborator

Hi all,

This bug has been fixed in #340.

Actually I moved the empty check to the end of the prepare_data function since data_processor could also modify the gt_boxes.
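The control flow of the fix (checking for empty gt_boxes only at the end of prepare_data, after every step that can drop boxes) can be sketched as follows; the function and parameter names are illustrative, not the actual OpenPCDet signatures:

```python
import numpy as np

def prepare_data_sketch(data_dict, dataset_len, reload_fn, training=True):
    """Illustrative sketch: augmentation and data_processor steps (omitted
    here) run first, since either of them may remove gt_boxes; the empty
    check happens only at the very end."""
    # ... augmentation / data_processor would modify data_dict here ...
    if training and len(data_dict['gt_boxes']) == 0:
        # Recursively load a different random frame instead of returning
        # a frame without any gt_boxes.
        new_index = np.random.randint(dataset_len)
        return reload_fn(new_index)
    return data_dict
```

Here reload_fn stands in for self.__getitem__, so an empty frame is transparently replaced by another random one.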

The error in PointResidualCoder is another bug and I also fixed it in this PR.

Note that neither of these errors affects performance.

Thank you all for the bug information.
