
Channels error when trying to train with 360 degree point cloud data (Argoverse) #44

Closed
chayangkultan96 opened this issue Apr 16, 2020 · 7 comments


@chayangkultan96

chayangkultan96 commented Apr 16, 2020

Hi @sshaoshuai,

Thanks for the great work! I am currently running into an error while trying to train with surround-view point cloud data. I am using the Argoverse dataset converted to the KITTI format: all labels have been put into the camera coordinate frame and the calibration files have been extracted. I've created a config file and redefined all the mean sizes and anchor points. I also created a dataloader and was able to generate the pkl files and the ground truth database.

I am currently able to train with the dataset only when I limit the point cloud range to [0, -40, -3, 70.4, 40, 1] (as provided). When I expand the point cloud range to [-80, -80, -10, 80, 80, 10] I get the error below.

I am training using a Docker image with 8 V100 GPUs.

Thanks!

UPDATE: I had to increase the number of input features to get it to work. Is this the correct approach?

File "train.py", line 155, in
main()
File "train.py", line 148, in main
max_ckpt_save_num=args.max_ckpt_save_num
File "/s/dat/UserFolders/ctan24/PCDet-master/tools/train_utils/train_utils.py", line 81, in train_model
leave_pbar=(cur_epoch + 1 == total_epochs)
File "/s/dat/UserFolders/ctan24/PCDet-master/tools/train_utils/train_utils.py", line 36, in train_one_epoch
loss, tb_dict, disp_dict = model_func(model, batch)
File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/init.py", line 25, in model_func
ret_dict, tb_dict, disp_dict = model(input_dict)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/detectors/PartA2_net.py", line 106, in forward
rpn_ret_dict = self.forward_rpn(**input_dict)
File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/detectors/PartA2_net.py", line 38, in forward_rpn
**{'gt_boxes': kwargs.get('gt_boxes', None)}
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/bbox_heads/rpn_head.py", line 292, in forward
x = self.blocksi
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/model_utils/pytorch_utils.py", line 88, in forward
input = module(input)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 128 256 3 3, expected input[1, 1536, 252, 252] to have 256 channels, but got 1536 channels instead

A separate issue that I ran into when training limited to the image FOV is shown below.
UPDATE: This issue seems to occur only when I train with multiple GPUs; I am able to train successfully with one GPU.

Traceback (most recent call last):
  File "train.py", line 155, in <module>
    main()
  File "train.py", line 148, in main
    max_ckpt_save_num=args.max_ckpt_save_num
  File "/s/dat/UserFolders/ctan24/PCDet-master/tools/train_utils/train_utils.py", line 81, in train_model
    leave_pbar=(cur_epoch + 1 == total_epochs)
  File "/s/dat/UserFolders/ctan24/PCDet-master/tools/train_utils/train_utils.py", line 36, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/__init__.py", line 25, in model_func
    ret_dict, tb_dict, disp_dict = model(input_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/detectors/PartA2_net.py", line 112, in forward
    batch_size, voxel_centers, coords, rpn_ret_dict, input_dict
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/detectors/PartA2_net.py", line 98, in forward_rcnn
    rcnn_ret_dict = self.rcnn_net.forward(rcnn_input_dict)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/rcnn/partA2_rcnn_net.py", line 547, in forward
    targets_dict = self.assign_targets(batch_size, rcnn_dict)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/rcnn/partA2_rcnn_net.py", line 27, in assign_targets
    targets_dict = proposal_target_layer(rcnn_dict, roi_sampler_cfg=self.rcnn_target_config)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/model_utils/proposal_target_layer.py", line 14, in proposal_target_layer
    sample_rois_for_rcnn(rois, gt_boxes, roi_raw_scores, roi_labels, roi_sampler_cfg)
  File "/s/dat/UserFolders/ctan24/PCDet-master/pcdet/models/model_utils/proposal_target_layer.py", line 141, in sample_rois_for_rcnn
    gt_of_bg_rois = cur_gt[gt_assignment[bg_inds]]
IndexError: index is out of bounds for dimension with size 0

@sshaoshuai
Collaborator

(1) For training the SECOND/PartA2 configurations, you should make sure the number of voxelized channels in the height direction matches MODEL.RPN.RPN_HEAD.ARGS['num_input_features'], or simply make sure there are 40 voxelized height channels; for example, in my default config the channels are (1 - (-3)) / 0.1 = 40 (see the sketch below).
(2) Are there any scenes that don't have any ground truth boxes in your training data? If so, you may need to update the dataloader to make sure each input training sample has at least one ground truth box.
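To make the arithmetic in (1) concrete, here is a minimal standalone sketch of the check; the values mirror the default PartA2 config mentioned above, but this is illustrative code, not the repo's:

```python
# Standalone sanity check, assuming the default point cloud range and voxel size.
point_cloud_range = [0, -40, -3, 70.4, 40, 1]   # x_min, y_min, z_min, x_max, y_max, z_max
voxel_size = [0.05, 0.05, 0.1]                  # vx, vy, vz

z_min, z_max = point_cloud_range[2], point_cloud_range[5]
height_channels = round((z_max - z_min) / voxel_size[2])
print(height_channels)  # (1 - (-3)) / 0.1 = 40, as in the default config

# The expanded range [-80, -80, -10, 80, 80, 10] would give
# (10 - (-10)) / 0.1 = 200 height levels instead of 40, which is what breaks
# the RPN head's expected channel count. Either keep (z_max - z_min) / vz == 40
# or adjust MODEL.RPN.RPN_HEAD.ARGS['num_input_features'] to match.
```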

@chayangkultan96
Author

chayangkultan96 commented Apr 17, 2020

> (1) For training the SECOND/PartA2 configurations, you should make sure the number of voxelized channels in the height direction matches MODEL.RPN.RPN_HEAD.ARGS['num_input_features'], or simply make sure there are 40 voxelized height channels; for example, in my default config the channels are (1 - (-3)) / 0.1 = 40.
> (2) Are there any scenes that don't have any ground truth boxes in your training data? If so, you may need to update the dataloader to make sure each input training sample has at least one ground truth box.

Thanks for your reply. Just a couple of quick follow-ups:

(1) Could you elaborate a little more on why 40 is the chosen number and how it eventually maps to num_input_features, which is set to 256 in the default config? Maybe I'm missing something. How would I go about making sure they are equal for my setting?

(2) Good point, there are some scenes without ground truth data. I'll make the changes and see how it works; a sketch of the idea follows below.
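A minimal sketch of that dataloader change, assuming a PyTorch-style dataset whose `__getitem__` returns a dict with a `gt_boxes` array; the class and field names here are assumptions, not the actual PCDet classes:

```python
import numpy as np
from torch.utils.data import Dataset

class NonEmptyGTDataset(Dataset):
    """Wraps an existing dataset and resamples frames that have no GT boxes."""

    def __init__(self, base_dataset):
        self.base = base_dataset

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        data_dict = self.base[index]
        # Resample a random frame until we get one with at least one GT box,
        # so the RCNN target assignment never sees an empty gt tensor.
        while len(data_dict.get('gt_boxes', [])) == 0:
            index = np.random.randint(len(self.base))
            data_dict = self.base[index]
        return data_dict
```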

@sshaoshuai
Collaborator

You can refer to the code at https://github.com/sshaoshuai/PCDet/blob/master/pcdet/models/rpn/rpn_unet.py#L484 for the mapping to the BEV feature channels, which makes it clearer. Just set the height range and voxel size carefully to make sure there are 40 levels after voxelization.
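For anyone reading along: the mapping at the linked line is essentially a reshape of the dense 3D backbone output from (N, C, D, H, W) into (N, C*D, H, W) BEV features. A toy sketch follows; the shapes are assumptions consistent with the default config, where 128 channels times 2 remaining height levels gives num_input_features = 256 (check the linked line for the real code):

```python
import torch

# Assumed post-backbone shape: 128 feature channels over the 2 height levels
# that remain after the sparse convolutions downsample the 40 input levels.
N, C, D, H, W = 1, 128, 2, 200, 176
x = torch.zeros(N, C, D, H, W)

# Collapse the height dimension into the channels to form the BEV feature map.
bev = x.view(N, C * D, H, W)
print(bev.shape)  # torch.Size([1, 256, 200, 176]) -> num_input_features = 256
```

Under the same assumptions, the 1536 channels in the first traceback would correspond to 128 × 12, i.e. a taller voxel grid leaving 12 height levels after downsampling instead of 2.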

@chayangkultan96
Author

chayangkultan96 commented Apr 20, 2020

Thanks, I understand it now. @sshaoshuai Another question, not related to this issue: do any other parts of the code, apart from the dataloader, depend on the KITTI label coordinate convention, i.e. the xyz center being defined at the bottom of the object instead of the true center? What is the convention for xyz throughout the code?

Thanks!
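For reference, the KITTI convention in question places the label (x, y, z) at the bottom center of the box, with the camera y-axis pointing down, so shifting to a true (gravity) center is a half-height offset. A hedged sketch of that conversion; the function and argument names are made up for illustration:

```python
import numpy as np

def bottom_to_gravity_center(xyz, h):
    """xyz: (N, 3) bottom-center box locations in KITTI camera coordinates;
    h: (N,) box heights. The camera y-axis points down, so the true center
    sits half a height above the bottom: y_center = y_bottom - h / 2."""
    center = xyz.copy()
    center[:, 1] -= h / 2
    return center
```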

@chayangkultan96
Author

> (1) For training the SECOND/PartA2 configurations, you should make sure the number of voxelized channels in the height direction matches MODEL.RPN.RPN_HEAD.ARGS['num_input_features'], or simply make sure there are 40 voxelized height channels; for example, in my default config the channels are (1 - (-3)) / 0.1 = 40.
> (2) Are there any scenes that don't have any ground truth boxes in your training data? If so, you may need to update the dataloader to make sure each input training sample has at least one ground truth box.

Also, I rechecked my dataset and there are no frames without GT data, yet I am still occasionally getting this error. Any thoughts on this? Thanks!

@sshaoshuai
Collaborator

I cannot tell why it happens from this information alone; maybe you could try to catch the bug and print the variables there.
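One low-tech way to do that is sketched below: wrap the failing line in sample_rois_for_rcnn and dump the shapes before re-raising. The variable names come from the traceback above, but treat the exact placement as an assumption:

```python
# Inside pcdet/models/model_utils/proposal_target_layer.py, around the line
# that raises the IndexError (placement is approximate):
try:
    gt_of_bg_rois = cur_gt[gt_assignment[bg_inds]]
except IndexError:
    # Print everything needed to see why cur_gt ended up empty on this rank.
    print('cur_gt shape:', tuple(cur_gt.shape))
    print('gt_assignment shape:', tuple(gt_assignment.shape))
    print('bg_inds shape:', tuple(bg_inds.shape), 'values:', bg_inds)
    raise
```

Since the error shows up only with multiple GPUs, the shapes on the failing rank may reveal a batch whose ground truth tensor was emptied somewhere between the dataloader and the proposal target layer.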

@bugerry87

Converting the Argoverse dataset to the KITTI format? Hm... while I try to get through this mess of dicts and pickles, it seems this data management requires some refactoring.

Hope I can contribute something soon.
