
CaDDN Detector #538

Merged (7 commits, May 20, 2021)
Conversation

@codyreading (Contributor) commented May 14, 2021

Summary

CaDDN is a monocular 3D object detection method that estimates categorical depth distributions in order to generate 3D feature representations for detection. It was accepted to CVPR 2021 as an oral presentation.
Paper: https://arxiv.org/abs/2103.01100
Code: https://github.com/TRAILab/CaDDN

Changes

  • Updated kitti_dataset.py and dataset.py to support image, depth map, and 2D GT box loading
  • Added GET_ITEM_LIST to specify which data items to load
  • Added image data augmentation: random_flip_horizontal
  • Added CaDDN detector
  • Added kornia and torchvision requirements
  • Added modules:
    • DepthFFE: Frustum feature extractor via depth distribution estimation
    • DDNDeepLabV3/DDNTemplate: Estimate depth distributions
    • DDNLoss: Loss for DDN
    • FrustumToVoxel: Transforms frustum to voxel grid
    • FrustumGridGenerator: Generates frustum sampling grid
    • Sampler: Samples the frustum grid
    • Conv2DCollapse: Collapses voxel grid to BEV via concat. + 1x1 conv.
    • Balancer: Loss balancer for foreground/background pixels
    • BasicBlock2D: Conv2D + BN + ReLU block
  • Added functions:
    • calib_to_matricies: Generate transformation matrices from calib objects
    • calculate_grid_size: Calculate grid_size without VoxelGenerator
    • get_pad_params: Get padding parameters for image padding
    • bin_depths: Converts depth map into depth bin indices (see sketch after this list)
    • normalize_coords: Normalize grid coordinates between [-1, 1] (see sketch after this list)
    • compute_fg_mask: Compute foreground pixel mask for images based on 2D GT boxes
    • project_to_image: Project 3D points to the image via projection matrices using PyTorch
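
For context on two of the helpers above, a minimal sketch of what they could look like. This shows a uniform depth discretization only; the actual signatures and the additional discretization modes used in the PR may differ:

import torch

def bin_depths(depth_map, depth_min, depth_max, num_bins):
    # Uniform discretization only: map continuous depths to integer bin indices.
    bin_size = (depth_max - depth_min) / num_bins
    indices = (depth_map - depth_min) / bin_size
    # Clamp out-of-range depths to a valid bin and truncate to integer indices
    return indices.clamp(min=0, max=num_bins - 1).long()

def normalize_coords(coords, shape):
    # Normalize grid coordinates from [0, shape - 1] to [-1, 1] per dimension.
    shape = torch.as_tensor(shape, dtype=coords.dtype, device=coords.device)
    return coords / (shape - 1) * 2.0 - 1.0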

Results

Car AP@0.70, 0.70, 0.70:
bbox AP:89.9449, 80.0868, 78.7468
bev  AP:34.8573, 25.5907, 24.0973
3d   AP:27.7777, 21.3760, 18.6217
aos  AP:89.06, 78.95, 77.00
Car AP_R40@0.70, 0.70, 0.70:
bbox AP:95.1921, 82.6336, 77.4336
bev  AP:31.6678, 21.5871, 19.4323
3d   AP:23.7724, 16.0700, 13.6146
aos  AP:94.14, 81.31, 75.67
Car AP@0.70, 0.50, 0.50:
bbox AP:89.9449, 80.0868, 78.7468
bev  AP:62.3596, 46.0990, 44.8178
3d   AP:57.9290, 43.5075, 37.7651
aos  AP:89.06, 78.95, 77.00
Car AP_R40@0.70, 0.50, 0.50:
bbox AP:95.1921, 82.6336, 77.4336
bev  AP:62.5936, 46.1427, 42.2161
3d   AP:57.0378, 40.7755, 36.9172
aos  AP:94.14, 81.31, 75.67
Pedestrian AP@0.50, 0.50, 0.50:
bbox AP:47.8124, 40.4024, 37.0082
bev  AP:16.9941, 13.7987, 12.6136
3d   AP:15.4504, 13.0160, 11.8772
aos  AP:35.24, 29.69, 27.19
Pedestrian AP_R40@0.50, 0.50, 0.50:
bbox AP:46.6209, 39.5637, 33.1653
bev  AP:11.8095, 8.9009, 7.0719
3d   AP:10.0425, 7.2711, 5.7442
aos  AP:32.20, 26.81, 22.44
Pedestrian AP@0.50, 0.25, 0.25:
bbox AP:47.8124, 40.4024, 37.0082
bev  AP:33.0742, 27.3150, 22.2747
3d   AP:32.9383, 26.2553, 21.9029
aos  AP:35.24, 29.69, 27.19
Pedestrian AP_R40@0.50, 0.25, 0.25:
bbox AP:46.6209, 39.5637, 33.1653
bev  AP:29.8401, 23.4260, 19.0682
3d   AP:29.4945, 22.8943, 17.8585
aos  AP:32.20, 26.81, 22.44
Cyclist AP@0.50, 0.50, 0.50:
bbox AP:35.4436, 24.0008, 22.9112
bev  AP:11.1946, 9.8259, 9.8259
3d   AP:10.8464, 9.7608, 9.0909
aos  AP:28.58, 19.83, 19.19
Cyclist AP_R40@0.50, 0.50, 0.50:
bbox AP:32.0066, 20.0532, 18.7363
bev  AP:3.0830, 1.7541, 1.5551
3d   AP:2.7691, 1.4875, 1.2074
aos  AP:24.12, 14.54, 13.68
Cyclist AP@0.50, 0.25, 0.25:
bbox AP:35.4436, 24.0008, 22.9112
bev  AP:16.6019, 11.9957, 11.9923
3d   AP:16.3234, 11.8802, 11.9318
aos  AP:28.58, 19.83, 19.19
Cyclist AP_R40@0.50, 0.25, 0.25:
bbox AP:32.0066, 20.0532, 18.7363
bev  AP:12.2843, 6.3071, 5.8288
3d   AP:11.4585, 5.8544, 5.5210
aos  AP:24.12, 14.54, 13.68

@codyreading (Contributor Author) commented May 14, 2021

Tested PointPillar inference with the following command on a Titan XP to ensure no changes:
python test.py --cfg_file cfgs/kitti_models/pointpillar.yaml --batch_size 16 --ckpt ../checkpoints/pointpillar_7728.pth

Performance

Master:
Max GPU Memory Usage: 5952 MB
Max Memory Usage: 2802 MB
Total Time: 1:39
Avg Iteration Speed: 2.38 it/s

feature/CaDDN:
Max GPU Memory Usage: 5952 MB
Max Memory Usage: 2831 MB
Total Time: 1:40
Avg Iteration Speed: 2.36 it/s

Results

Master:

Car AP@0.70, 0.70, 0.70:
bbox AP:90.7786, 89.8062, 88.7936
bev  AP:89.6590, 87.1725, 84.3762
3d   AP:86.4617, 77.2839, 74.6530
aos  AP:90.77, 89.61, 88.47
Car AP_R40@0.70, 0.70, 0.70:
bbox AP:95.6607, 92.2403, 91.3167
bev  AP:92.0399, 88.0556, 86.6625
3d   AP:87.7518, 78.3964, 75.1843
aos  AP:95.64, 92.03, 90.97
Car AP@0.70, 0.50, 0.50:
bbox AP:90.7786, 89.8062, 88.7936
bev  AP:90.7894, 90.1848, 89.4635
3d   AP:90.7894, 90.0675, 89.2495
aos  AP:90.77, 89.61, 88.47
Car AP_R40@0.70, 0.50, 0.50:
bbox AP:95.6607, 92.2403, 91.3167
bev  AP:95.6987, 94.7077, 93.9983
3d   AP:95.6874, 94.3709, 93.4244
aos  AP:95.64, 92.03, 90.97
Pedestrian AP@0.50, 0.50, 0.50:
bbox AP:66.5436, 62.4922, 59.3026
bev  AP:61.6348, 56.2747, 52.6007
3d   AP:57.7500, 52.2916, 47.9072
aos  AP:48.63, 45.62, 42.93
Pedestrian AP_R40@0.50, 0.50, 0.50:
bbox AP:66.5852, 62.4351, 58.8016
bev  AP:61.5971, 56.0143, 52.0457
3d   AP:57.3015, 51.4145, 46.8715
aos  AP:45.89, 42.99, 40.03
Pedestrian AP@0.50, 0.25, 0.25:
bbox AP:66.5436, 62.4922, 59.3026
bev  AP:72.5064, 69.5191, 66.4626
3d   AP:72.4368, 69.3244, 65.3180
aos  AP:48.63, 45.62, 42.93
Pedestrian AP_R40@0.50, 0.25, 0.25:
bbox AP:66.5852, 62.4351, 58.8016
bev  AP:73.8776, 70.4969, 66.6494
3d   AP:73.7943, 70.2258, 66.0435
aos  AP:45.89, 42.99, 40.03
Cyclist AP@0.50, 0.50, 0.50:
bbox AP:85.2661, 72.9744, 68.9914
bev  AP:82.2593, 66.1110, 62.5585
3d   AP:80.0483, 62.6080, 59.5260
aos  AP:84.72, 71.09, 67.13
Cyclist AP_R40@0.50, 0.50, 0.50:
bbox AP:88.5723, 74.0385, 69.8009
bev  AP:85.2585, 66.2439, 62.2173
3d   AP:81.5670, 62.8074, 58.8314
aos  AP:87.91, 71.98, 67.81
Cyclist AP@0.50, 0.25, 0.25:
bbox AP:85.2661, 72.9744, 68.9914
bev  AP:86.6035, 70.6055, 66.9244
3d   AP:86.6035, 70.6055, 66.9244
aos  AP:84.72, 71.09, 67.13
Cyclist AP_R40@0.50, 0.25, 0.25:
bbox AP:88.5723, 74.0385, 69.8009
bev  AP:88.8812, 71.7453, 67.7714
3d   AP:88.8812, 71.7453, 67.7714
aos  AP:87.91, 71.98, 67.81

feature/CaDDN:

Car AP@0.70, 0.70, 0.70:
bbox AP:90.7786, 89.8062, 88.7936
bev  AP:89.6590, 87.1725, 84.3762
3d   AP:86.4617, 77.2839, 74.6530
aos  AP:90.77, 89.61, 88.47
Car AP_R40@0.70, 0.70, 0.70:
bbox AP:95.6607, 92.2403, 91.3167
bev  AP:92.0399, 88.0556, 86.6625
3d   AP:87.7518, 78.3964, 75.1843
aos  AP:95.64, 92.03, 90.97
Car AP@0.70, 0.50, 0.50:
bbox AP:90.7786, 89.8062, 88.7936
bev  AP:90.7894, 90.1848, 89.4635
3d   AP:90.7894, 90.0675, 89.2495
aos  AP:90.77, 89.61, 88.47
Car AP_R40@0.70, 0.50, 0.50:
bbox AP:95.6607, 92.2403, 91.3167
bev  AP:95.6987, 94.7077, 93.9983
3d   AP:95.6874, 94.3709, 93.4244
aos  AP:95.64, 92.03, 90.97
Pedestrian AP@0.50, 0.50, 0.50:
bbox AP:66.5436, 62.4922, 59.3026
bev  AP:61.6348, 56.2747, 52.6007
3d   AP:57.7500, 52.2916, 47.9072
aos  AP:48.63, 45.62, 42.93
Pedestrian AP_R40@0.50, 0.50, 0.50:
bbox AP:66.5852, 62.4351, 58.8016
bev  AP:61.5971, 56.0143, 52.0457
3d   AP:57.3015, 51.4145, 46.8715
aos  AP:45.89, 42.99, 40.03
Pedestrian AP@0.50, 0.25, 0.25:
bbox AP:66.5436, 62.4922, 59.3026
bev  AP:72.5064, 69.5191, 66.4626
3d   AP:72.4368, 69.3244, 65.3180
aos  AP:48.63, 45.62, 42.93
Pedestrian AP_R40@0.50, 0.25, 0.25:
bbox AP:66.5852, 62.4351, 58.8016
bev  AP:73.8776, 70.4969, 66.6494
3d   AP:73.7943, 70.2258, 66.0435
aos  AP:45.89, 42.99, 40.03
Cyclist AP@0.50, 0.50, 0.50:
bbox AP:85.2661, 72.9744, 68.9914
bev  AP:82.2593, 66.1110, 62.5585
3d   AP:80.0483, 62.6080, 59.5260
aos  AP:84.72, 71.09, 67.13
Cyclist AP_R40@0.50, 0.50, 0.50:
bbox AP:88.5723, 74.0385, 69.8009
bev  AP:85.2585, 66.2439, 62.2173
3d   AP:81.5670, 62.8074, 58.8314
aos  AP:87.91, 71.98, 67.81
Cyclist AP@0.50, 0.25, 0.25:
bbox AP:85.2661, 72.9744, 68.9914
bev  AP:86.6035, 70.6055, 66.9244
3d   AP:86.6035, 70.6055, 66.9244
aos  AP:84.72, 71.09, 67.13
Cyclist AP_R40@0.50, 0.25, 0.25:
bbox AP:88.5723, 74.0385, 69.8009
bev  AP:88.8812, 71.7453, 67.7714
3d   AP:88.8812, 71.7453, 67.7714
aos  AP:87.91, 71.98, 67.81

@sshaoshuai (Collaborator) left a comment

Thank you for the contribution, great work!
Welcome to CaDDN, the first monocular 3D detection work in OpenPCDet!

Please check the comments and see how we can further improve it to be more elegant.
Thank you!

import numpy as np


def random_flip_horizontal(image, depth_map, gt_boxes, calib):
Collaborator:
How about moving this to augmentor_utils.py with the function name random_image_flip_horizontal?

Contributor Author:
Done
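
As background for the augmentation discussed above, a rough sketch of a horizontal flip over the image-space inputs. This is an illustration only: it flips the image, depth map, and 2D GT boxes, and omits the consistent 3D box/calibration update that a full implementation also needs:

import numpy as np

def random_image_flip_horizontal(image, depth_map, gt_boxes2d):
    # Flip with 50% probability so the augmentation is random per sample
    enable = np.random.choice([False, True], p=[0.5, 0.5])
    if enable:
        image = np.fliplr(image)
        depth_map = np.fliplr(depth_map)
        # Mirror 2D boxes (x1, y1, x2, y2) about the vertical image centerline
        W = image.shape[1]
        x1, x2 = gt_boxes2d[:, 0].copy(), gt_boxes2d[:, 2].copy()
        gt_boxes2d[:, 0] = W - x2
        gt_boxes2d[:, 2] = W - x1
    return image, depth_map, gt_boxes2d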


if "calib_matricies" in self.dataset_cfg.GET_ITEM_LIST:
input_dict["trans_lidar_to_cam"], input_dict["trans_cam_to_img"] = kitti_utils.calib_to_matricies(calib)

Collaborator:
GET_ITEM_LIST is a good idea for various data sources.
However, points=self.get_lidar() is a common setting for LiDAR-based 3D object detection, so I think it should be kept by default so that previous configs can still use the KittiDataset class.

This part should be something like:

get_item_list = self.dataset_cfg.get('GET_ITEM_LIST', ['points'])

# load points 
if 'points' in get_item_list: 
   xxx
# load images
xxxx
# load depth_maps
xxxx
# load calib_matricies
xxxx
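
Spelled out, that suggestion could look roughly like this. This is a hypothetical sketch: get_image and get_depth_map are assumed helper names, and the dict keys other than the calib ones shown in the diff above are illustrative:

get_item_list = self.dataset_cfg.get('GET_ITEM_LIST', ['points'])

if 'points' in get_item_list:
    input_dict['points'] = self.get_lidar(sample_idx)
if 'images' in get_item_list:
    input_dict['images'] = self.get_image(sample_idx)          # assumed helper
if 'depth_maps' in get_item_list:
    input_dict['depth_maps'] = self.get_depth_map(sample_idx)  # assumed helper
if 'calib_matricies' in get_item_list:
    input_dict['trans_lidar_to_cam'], input_dict['trans_cam_to_img'] = \
        kitti_utils.calib_to_matricies(calib)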

Contributor Author:
Done

Comment on lines 33 to 36
self.voxel_grid = kornia.utils.create_meshgrid3d(depth=self.depth,
                                                 height=self.height,
                                                 width=self.width,
                                                 normalized_coordinates=False)
Collaborator:

Is it necessary to use kornia? It seems we could simply implement this function with native PyTorch operations, for example within one file under pcdet/utils.

Contributor Author:

I could re-implement the kornia functions; however, I use seven different functions throughout the code, and adding these implementations would add extra code to this repo that I don't feel is necessary. Additionally, I already need to add a dependency (torchvision), so the requirements need to be updated anyway.

Kornia Functions:
kornia.image_to_tensor
kornia.utils.create_meshgrid3d
kornia.transform_points
kornia.normalize
kornia.losses.FocalLoss
kornia.convert_points_to_homogeneous
kornia.convert_points_from_homogeneous
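
For reference, the reviewer's suggestion of a native-PyTorch replacement could look roughly like this for create_meshgrid3d. This is a sketch only and does not claim to match kornia's exact output convention or argument handling:

import torch

def create_meshgrid3d(depth, height, width, normalized_coordinates=False):
    # Build per-axis coordinate vectors
    zs = torch.arange(depth, dtype=torch.float32)
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    if normalized_coordinates:
        # Rescale each axis from [0, size - 1] to [-1, 1]
        zs = 2.0 * zs / max(depth - 1, 1) - 1.0
        ys = 2.0 * ys / max(height - 1, 1) - 1.0
        xs = 2.0 * xs / max(width - 1, 1) - 1.0
    # 'ij' indexing is the default behavior; pass indexing='ij' explicitly on newer PyTorch
    z, y, x = torch.meshgrid(zs, ys, xs)
    grid = torch.stack((x, y, z), dim=-1)  # (D, H, W, 3), assumed (x, y, z) ordering
    return grid.unsqueeze(0)               # (1, D, H, W, 3)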


__all__ = {
    'FrustumToVoxel': FrustumToVoxel
}
Collaborator:

I think f2v is a type of vfe (voxel feature encoding/extraction), since it uses frustum features instead of point-wise features.
So how about moving f2v into the vfe folder and creating a module named something like FrustumVFE for FrustumToVoxel?

Contributor Author:

Moved f2v as a submodule of ImageVFE

@@ -6,7 +6,7 @@
from ...ops.iou3d_nms import iou3d_nms_utils
Collaborator:

Modify this file by treating f2v as a vfe module.

Contributor Author:

Done, moved f2v as a submodule of ImageVFE

@@ -0,0 +1,19 @@
import torch
Collaborator:

Maybe there is no need to create a separate file for this simple function.
For example, we could merge grid_utils.py, depth_utils.py, and transform_utils.py into a single file, transform_utils.py.

Contributor Author:

Done

@@ -0,0 +1,5 @@
from .depth_ffe import DepthFFE
Collaborator:

I'm not sure whether it would be better to also move ffe into the 'vfe' folder, since it seems ffe can only be used as a module preceding f2v.
If so, the overall framework would stay simple and clear even with the CaDDN implementation.

Contributor Author:

Moved ffe as a submodule of ImageVFE

@sshaoshuai (Collaborator) commented:

Nice code!

The only suggestion: how about fusing ffe + f2v into a new vfe module, since it also aims to extract voxel-wise features from image features?
If so, the integration of CaDDN will be natural, and I think the overall framework will be clearer and will not affect the existing OpenPCDet architecture for LiDAR-based 3D detection.

@codyreading (Contributor Author) commented May 18, 2021

Thanks for the quick review!

Sounds good. I'll make the requested changes and fuse the FFE + F2V into one module.

@codyreading (Contributor Author) commented:

This PR should be good to go. I made FFE (renamed to FFN) and F2V submodules of ImageVFE, which extracts voxel features from an image.
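
To illustrate that structure, a hypothetical skeleton of such an ImageVFE wrapper; the module and batch_dict key conventions here are assumptions for illustration, not the merged code:

import torch.nn as nn

class ImageVFE(nn.Module):
    # Composes the two submodules: FFN (image -> frustum features + depth distributions)
    # and F2V (frustum features -> voxel grid features).
    def __init__(self, ffn, f2v):
        super().__init__()
        self.ffn = ffn
        self.f2v = f2v

    def forward(self, batch_dict):
        batch_dict = self.ffn(batch_dict)  # add frustum features to the batch dict
        batch_dict = self.f2v(batch_dict)  # convert frustum features to voxel features
        return batch_dict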

@sshaoshuai (Collaborator) left a comment

Reviewed.

@sshaoshuai merged commit aaf9cbe into open-mmlab:master on May 20, 2021