# PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

## Paper Reviews

### Prior Research

3D object detection on point cloud data has followed two main approaches:
1. **Grid-based approach**: divide the point cloud into a 3D voxel grid and apply **sparse 3D convolution**   
→ computationally efficient, but the receptive field is limited by the kernel size, which lowers localization accuracy 

2. **Point-based approach**: feed the raw point cloud to **PointNet set abstraction** to obtain a representation  
→ enables tight localization, but is computationally expensive because features are computed point-wise

∴ **PV-RCNN**, a **two-stage detector** that **combines the two approaches** to get the advantages of both, is proposed   
→ the grid-based approach runs the RPN to produce bounding boxes, and the point-based approach then refines the boundaries of those boxes

<p align="center">
<img width="959" alt="1" src="https://user-images.githubusercontent.com/86907286/201465648-a36e6266-d566-4bf1-a42c-d314a9595445.png">
</p>



### 3D Voxel CNN for Efficient Feature Encoding and Proposal Generation

As the grid-based part, the input space is divided into a voxel grid and an efficient backbone built on **sparse 3D convolution** is applied  
Unlike conventional convolution, which is computed with im2col, the valid input and output points are each **stored in a hash table** and a **rulebook** defines which input contributes to which output  
→ only the necessary operations are performed on the fly, which makes this an efficient backbone for sparse data such as point clouds
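
The hash-table and rulebook idea can be sketched in a few lines. This is only a 2D, single-channel illustration under assumptions that are not in the source: the names `build_rulebook` and `sparse_conv2d` are hypothetical, and the output sites are taken to be the same as the input sites (submanifold-style). Libraries such as spconv implement the same idea on GPU with multi-channel features.

```python
from collections import defaultdict

def build_rulebook(active, kernel_offsets):
    """Map each kernel offset to (input_index, output_index) pairs.

    active: dict {(y, x): row index}, the hash table of valid sites.
    Output sites equal input sites here, an assumption of this sketch.
    """
    rulebook = defaultdict(list)
    for (y, x), out_idx in active.items():
        for k, (dy, dx) in enumerate(kernel_offsets):
            in_idx = active.get((y + dy, x + dx))  # hash lookup, no dense scan
            if in_idx is not None:
                rulebook[k].append((in_idx, out_idx))
    return rulebook

def sparse_conv2d(features, active, weights, kernel_offsets):
    """Single-channel sparse convolution driven by the rulebook."""
    out = [0.0] * len(features)
    for k, pairs in build_rulebook(active, kernel_offsets).items():
        for in_idx, out_idx in pairs:
            out[out_idx] += weights[k] * features[in_idx]
    return out

# 3x3 kernel offsets in row-major order
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
# three active sites on an otherwise empty grid
active = {(0, 0): 0, (0, 1): 1, (5, 5): 2}
features = [1.0, 2.0, 4.0]
weights = [1.0] * 9  # all-ones kernel sums the active neighbours

result = sparse_conv2d(features, active, weights, offsets)
```

Only the pairs listed in the rulebook are multiplied, so the cost scales with the number of active sites rather than with the size of the grid.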

<p align="center">
<img width="695" alt="2" src="https://user-images.githubusercontent.com/86907286/201465651-37349412-77ef-44fe-b810-fe6345924b95.png">
</p>

With this efficient sparse 3D convolution, an RPN can be built from the 3D feature volume or its 2D BEV representation together with suitable anchors  
However, the feature volume is down-sampled, so the spatial resolution of the output volume becomes too low for accurate object localization  
→ up-sampling to recover the accuracy still leaves the volume sparse and makes it computationally expensive

∴ An additional architecture is needed that keeps the sparse 3D convolution backbone and **refines** the region proposals whose accuracy was lost

### Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction

To refine the region proposals from the backbone, PointNet **set abstraction** can be used    
A small set of **keypoints** aggregates the semantics of the non-empty voxels around them  
→ **RoI-grid pooling** over these keypoints, which carry the voxel features of the backbone, then refines the bounding boxes

#### Keypoint sampling

Point clouds are inherently sparse and irregular, which makes it hard to work on the raw input directly  
Points are therefore selected with FPS (Farthest Point Sampling) to form **keypoints** that cover the scene as uniformly as possible  
→ the keypoints are **spread uniformly over the non-empty voxels**, so they make a good point subset for querying backbone features
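
A minimal sketch of greedy farthest point sampling, written in plain Python for readability. The function name `farthest_point_sampling` and the choice of the first point as the seed are assumptions of this sketch; practical implementations run on GPU.

```python
import math

def farthest_point_sampling(points, num_keypoints, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: list of (x, y, z) tuples; returns indices of the sampled keypoints.
    Starting from index `start` is an arbitrary choice for this sketch.
    """
    chosen = [start]
    # distance from every point to its nearest chosen keypoint so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_keypoints:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# a dense cluster near the origin plus two isolated points
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
keypoint_idx = farthest_point_sampling(cloud, 3)
```

With three keypoints requested, the two isolated points are selected before any second point from the dense cluster, which is the uniform-coverage behaviour described above.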

#### Voxel Set Abstraction Module

Running set abstraction on the raw keypoints alone gives too small a receptive field to obtain semantic features  
The voxel-wise features from the backbone are therefore used as the neighborhood as well, yielding **multi-scale semantic features**  
→ for each of the $n$ keypoints $\mathcal{K} = \{p_i\}$, set abstraction is also applied to its voxel neighborhood set $S_i^{(l_k)}$ at the $k$-th level, and the results are concatenated  

$$ S_i^{(l_k)} = \{[f^{(l_k)}_j; v^{(l_k)}_j - p_i]^T | \|v^{(l_k)}_j - p_i\|^2 < r_k, \forall f^{(l_k)}_j \in \mathcal{F}^{(l_k)},\forall v^{(l_k)}_j \in \mathcal{V}^{(l_k)} \}$$

Here $\mathcal{F}^{(l_k)} = \{f^{(l_k)}_j\}$ is the set of $k$-th level voxel feature vectors, and $\mathcal{V}^{(l_k)} = \{v^{(l_k)}_j\}$ holds their coordinates in the original space   
The neighborhood radius $r_k$ at the $k$-th level takes two values per level to obtain different receptive fields


From the voxel neighborhood set $S_i^{(l_k)}$, set abstraction produces the feature vector $f_i^{(pv_k)}$ that represents keypoint $p_i$  
For computational efficiency, $\mathcal{M}(\cdot)$ first randomly samples at most $T_k$ voxels from $S_i^{(l_k)}$, then the MLP $G(\cdot)$ and max pooling are applied

$$ f_i^{(pv_k)} = \max \{ G(\mathcal{M}(S_i^{(l_k)})) \}$$
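
The neighborhood query, random sampling, MLP and max pooling can be sketched for a single keypoint and a single level as follows. This is an illustration under stated assumptions, not the paper's implementation: $G(\cdot)$ is reduced to one linear layer with ReLU, the weights are random stand-ins for learned parameters, and the function name `voxel_set_abstraction` is hypothetical.

```python
import numpy as np

def voxel_set_abstraction(keypoint, voxel_xyz, voxel_feat, radius, max_samples, weight, rng):
    """f_i^(pv_k) = max{ G( M( S_i^(l_k) ) ) } for one keypoint and one level.

    G is a single linear layer + ReLU here; the paper uses a deeper MLP.
    """
    rel = voxel_xyz - keypoint                          # v_j - p_i
    # squared norm compared with r_k, as written in the formula above
    inside = np.flatnonzero(np.sum(rel ** 2, axis=1) < radius)
    if inside.size == 0:                                # empty neighborhood
        return np.zeros(weight.shape[1])
    if inside.size > max_samples:                       # M(.): keep at most T_k voxels
        inside = rng.choice(inside, size=max_samples, replace=False)
    grouped = np.concatenate([voxel_feat[inside], rel[inside]], axis=1)  # [f_j; v_j - p_i]
    hidden = np.maximum(grouped @ weight, 0.0)          # G(.)
    return hidden.max(axis=0)                           # channel-wise max pooling

rng = np.random.default_rng(0)
voxel_xyz = rng.uniform(-2.0, 2.0, size=(200, 3))       # voxel centres at one level
voxel_feat = rng.normal(size=(200, 16))                 # their 16-dim features
weight = rng.normal(size=(16 + 3, 32))                  # stand-in for learned MLP weights
feat = voxel_set_abstraction(np.zeros(3), voxel_xyz, voxel_feat,
                             radius=1.0, max_samples=16, weight=weight, rng=rng)
```

Calling the function once per radius and per level and concatenating the outputs gives the multi-scale feature of the keypoint.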

Finally, the $f_i^{(pv_k)}$ from all levels, four in the proposed architecture, are concatenated into the multi-scale semantic feature of keypoint $p_i$

$$ f_i^{(pv)} = [f_i^{(pv_1)}, f_i^{(pv_2)}, f_i^{(pv_3)}, f_i^{(pv_4)}] \text{ for } i = 1, 2, \cdots, n $$

#### Extended VSA Module

The feature vector from the Voxel Set Abstraction layers is not used alone; it is combined with two additional features, $f_i^{(raw)}$ and $f_i^{(bev)}$  
$f_i^{(raw)}$ is obtained for each keypoint from the raw point cloud and compensates for the quantization loss introduced by voxelization   
$f_i^{(bev)}$ is obtained by bilinearly interpolating the down-sampled 2D BEV feature map of the backbone at the keypoint location and adds overall scene semantics  

$$ f_i^{(p)} = [f_i^{(pv)}, f_i^{(raw)}, f_i^{(bev)}] \text{ for } i = 1, 2, \cdots, n $$ 
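
The bilinear lookup behind $f_i^{(bev)}$ can be sketched as below. The function name `bilinear_bev_feature` is hypothetical, and the keypoint is assumed to be already expressed in BEV pixel units (grid origin subtracted, divided by voxel size times stride), which the sketch does not do itself.

```python
import numpy as np

def bilinear_bev_feature(bev_map, x, y):
    """Interpolate a (C, H, W) BEV feature map at continuous pixel location (x, y).

    (x, y) is assumed to be in BEV pixel units already; the conversion from
    metric coordinates is left out of this sketch.
    """
    _, h, w = bev_map.shape
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * bev_map[:, y0, x0] + wx * bev_map[:, y0, x1]
    bottom = (1 - wx) * bev_map[:, y1, x0] + wx * bev_map[:, y1, x1]
    return (1 - wy) * top + wy * bottom

# one-channel map whose value equals the x index: f(y, x) = x
bev = np.tile(np.arange(4, dtype=float), (1, 3, 1))    # shape (1, 3, 4)
f_bev = bilinear_bev_feature(bev, x=1.25, y=0.5)
```

Because the test map is linear in x, the interpolated value equals the query coordinate, which is a quick sanity check for the weights.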

#### Predicted Keypoint Weighting

Because the keypoints are sampled with FPS, many of them represent **background** rather than the objects of interest  
Background matters less for refinement, so the features need weighting such that **foreground points contribute more**  
→ segmentation labels for the point cloud are generated from the ground-truth bounding boxes, and the supervised prediction yields the weighted feature $\tilde{f}_i^{(p)}$

Given the ground-truth bounding boxes, each keypoint gets a binary classification label indicating whether it lies inside a box  
An MLP $\mathcal{A}(\cdot)$ is trained with focal loss on these labels to predict the weight of each keypoint feature

$$ \tilde{f}_i^{(p)} = \mathcal{A}(f_i^{(p)}) \cdot f_i^{(p)} $$
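
A small sketch of the two ingredients: label generation from a box, and re-weighting of the features. Several things here are assumptions for illustration: the box is axis-aligned (real ground-truth boxes are rotated), the weight is taken to be a sigmoid confidence in (0, 1), the logits are made-up numbers rather than the output of a trained $\mathcal{A}(\cdot)$, and both function names are hypothetical.

```python
import numpy as np

def foreground_labels(keypoints, box_min, box_max):
    """1 if a keypoint lies inside the box, else 0.

    The box is axis-aligned for brevity; real ground-truth boxes also carry a yaw angle.
    """
    inside = np.all((keypoints >= box_min) & (keypoints <= box_max), axis=1)
    return inside.astype(np.float64)

def reweight(features, logits):
    """f~_i = A(f_i) * f_i, with the weight squashed into (0, 1) by a sigmoid."""
    weights = 1.0 / (1.0 + np.exp(-logits))
    return weights[:, None] * features

keypoints = np.array([[0.5, 0.5, 0.5], [3.0, 0.0, 0.0]])
labels = foreground_labels(keypoints, np.zeros(3), np.ones(3))   # segmentation targets

features = np.ones((2, 4))
logits = np.array([4.0, -4.0])      # hypothetical outputs of A(.) before the sigmoid
weighted = reweight(features, logits)
```

The first keypoint lies inside the box and keeps almost all of its feature magnitude, while the second is suppressed towards zero.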

<p align="center">
<img width="460" alt="3" src="https://user-images.githubusercontent.com/86907286/201465654-36b8e618-fc2c-462a-b1bd-28325c662e19.png">
</p>



### Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement

#### RoI-grid Pooling via Set Abstraction


The VSA module yields a small set of keypoints with 3D multi-scale features $\tilde{\mathcal{F}}=\{\tilde{f}_1^{(p)}, \cdots, \tilde{f}_n^{(p)} \}$  
To refine the 3D proposals robustly and accurately, **RoI-grid pooling** is additionally needed to carry the semantics of $\tilde{\mathcal{F}}$ into each region proposal    
$6 \times 6 \times 6$ grid points are sampled uniformly inside each 3D proposal, giving $\mathcal{G} = \{ g_1, \cdots, g_{216} \}$, and set abstraction is performed by relating them to the keypoints  
The keypoint neighborhood set $\tilde{\Psi}$ used for this set abstraction concatenates each VSA feature with the relative position of its keypoint to the grid point  

$$ \tilde{\Psi} = \{[\tilde{f}_j^{(p)}; p_j - g_i]^T | \|p_j - g_i\|^2 < \tilde{r}, \forall p_j \in \mathcal{K},\forall \tilde{f}_j^{(p)} \in \tilde{\mathcal{F}} \} $$


Set abstraction is applied to $\tilde{\Psi}$ in the same way as in Voxel Set Abstraction to obtain the **feature of grid point $g_i$**  
Again, two neighborhood radii $\tilde{r}$ are used to capture different receptive fields  

$$ \tilde{f}_i^{(g)} = \max \{ G(\mathcal{M}(\tilde{\Psi})) \}$$
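
The $6 \times 6 \times 6$ grid points of a rotated proposal can be generated as in the sketch below. The box layout `(x, y, z, l, w, h, yaw)` with `(x, y, z)` as the box centre, the placement of points at cell centres, and the name `roi_grid_points` are assumptions of this sketch.

```python
import numpy as np

def roi_grid_points(box, grid_size=6):
    """Return (grid_size**3, 3) grid points inside box = (x, y, z, l, w, h, yaw).

    Points sit at the centres of grid_size^3 equal cells; (x, y, z) is taken as
    the box centre and (l, w, h) as its extents along the box x, y, z axes.
    Both conventions are assumptions of this sketch.
    """
    cx, cy, cz, l, w, h, yaw = box
    ticks = (np.arange(grid_size) + 0.5) / grid_size - 0.5       # cell centres in (-0.5, 0.5)
    local = np.stack(np.meshgrid(ticks, ticks, ticks, indexing="ij"), axis=-1)
    local = local.reshape(-1, 3) * np.array([l, w, h])           # scale to box size
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]) # rotation about z
    return local @ rot.T + np.array([cx, cy, cz])

grid = roi_grid_points((10.0, 5.0, -1.0, 4.0, 2.0, 1.5, np.pi / 6))
```

Each of the 216 points then acts as a query centre for the set abstraction over keypoints. Note that the `sample_gridpoints` method in the implementation section of these notes draws random points with `torch.rand` instead of a regular grid.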

The features of all grid points in the same RoI are gathered and transformed into the RoI feature by a two-layer MLP with a 256-dimensional output  
Since the context comes from a receptive field around each grid point, grid points near the RoI boundary can also **see keypoints beyond the boundary**  
**Comparing the context inside and outside the boundary** matters for refinement, which is an advantage over earlier pooling schemes that average the features in the proposal or pool many uninformative zeros  

<p align="center">
<img width="462" alt="4" src="https://user-images.githubusercontent.com/86907286/201465658-9acf9330-8cc3-4585-91dc-418b7b23db48.png">
</p>

#### 3D Proposal Refinement and Confidence Prediction

With the RoI feature of each proposal, the network has to learn the residual between the proposal from the voxel CNN backbone and the ground truth  
An additional head is built from two-layer MLPs split into two branches: confidence prediction and box refinement  
The confidence prediction branch uses the 3D IoU between the $k$-th proposal box and its ground-truth box to set the confidence target $y_k \in [0, 1]$ and is trained with binary cross-entropy  

$$ y_k = \min (1, \max (0, 2\text{IoU}_k - 0.5)) $$
$$ \mathcal{L}_{\text{iou}} = -y_k \log (\tilde{y}_k) - (1 - y_k) \log (1 - \tilde{y}_k) $$
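
A short numeric sketch of the IoU-guided confidence target and the binary cross-entropy applied to it. The function names and the clamping constant `eps` are choices made for this sketch.

```python
import math

def confidence_target(iou):
    """y_k = min(1, max(0, 2 * IoU_k - 0.5))."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

def bce(target, pred, eps=1e-7):
    """Binary cross-entropy for a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)   # avoid log(0)
    return -target * math.log(pred) - (1.0 - target) * math.log(1.0 - pred)

targets = [confidence_target(iou) for iou in (0.2, 0.5, 0.75, 0.9)]
loss = bce(confidence_target(0.5), 0.5)
```

The target is 0 for IoU up to 0.25, grows linearly in between, and saturates at 1 from IoU 0.75 upwards, so the branch learns a soft quality score rather than a hard foreground label.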

The box refinement branch is trained, like a typical box regressor, to predict the box residuals with smooth-L1 loss 

$$ \text{smooth}_{L_1}(x) = \begin{cases}
  0.5x^2 & \text{if } |x| < 1, \\
  |x|-0.5 & \text{otherwise}
\end{cases}$$
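
The piecewise definition translates directly into code. The helper `box_regression_loss`, which sums the loss over the seven box residuals, and the example numbers are illustrative assumptions.

```python
def smooth_l1(x):
    """Quadratic near zero, linear elsewhere; the two pieces meet at |x| = 1."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def box_regression_loss(pred, target):
    """Sum smooth-L1 over the residuals (x, y, z, l, h, w, theta)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

pred   = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
target = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
loss = box_regression_loss(pred, target)
```

The small error of 0.5 is penalised quadratically (0.125) and the large error of 2.0 only linearly (1.5), which keeps outliers from dominating the gradient.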

### Training Losses

The overall training loss $\mathcal{L}$ consists of the region proposal loss $\mathcal{L_{\text{rpn}}}$ computed directly on the backbone output, the keypoint segmentation loss $\mathcal{L}_{\text{seg}}$, and the proposal refinement loss $\mathcal{L}_{\text{rcnn}}$  

$$ \mathcal{L} = \mathcal{L_{\text{rpn}}} + \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{rcnn}} $$ 

$\mathcal{L_{\text{rpn}}}$ consists of $\mathcal{L_{\text{cls}}}$, computed with focal loss, and the smooth-L1 loss $\mathcal{L_{\text{smooth-L1}}}$ between the predicted residual $\hat{\Delta \text{r}^a}$ and the anchor-to-ground-truth regression target $\Delta \text{r}^a$

$$ \mathcal{L_{\text{rpn}}} = \mathcal{L_{\text{cls}}} + \beta \sum_{\text{r} \in \{x, y, z, l, h, w, \theta\}} \mathcal{L_{\text{smooth-L1}}} (\hat{\Delta \text{r}^a}, \Delta \text{r}^a) $$
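
For reference, a per-prediction sketch of the binary focal loss used for $\mathcal{L_{\text{cls}}}$. The values `alpha=0.25` and `gamma=2.0` are the defaults from the focal loss paper and are assumed here; these notes do not state which values the detector uses.

```python
import math

def focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; alpha and gamma are assumed defaults."""
    p_t = prob if label == 1 else 1.0 - prob          # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.10, 1)   # confident and wrong: dominates the loss
```

The modulating factor $(1 - p_t)^\gamma$ is what keeps the many easy background anchors from swamping the few foreground ones.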

$\mathcal{L}_{\text{seg}}$ is the focal loss of the binary classification of whether a keypoint lies inside a ground-truth bounding box  
$\mathcal{L}_{\text{rcnn}}$ consists of $\mathcal{L}_{\text{iou}}$ and the smooth-L1 loss $\mathcal{L_{\text{smooth-L1}}}$ on the predicted residual $\hat{\Delta \text{r}^p}$ between the proposal box and the ground-truth box, whose target $\Delta \text{r}^p$ is encoded in the same way as $\Delta \text{r}^a$

$$ \mathcal{L}_{\text{rcnn}} = \mathcal{L}_{\text{iou}} + \sum_{\text{r} \in \{x, y, z, l, h, w, \theta\}} \mathcal{L_{\text{smooth-L1}}} (\hat{\Delta \text{r}^p}, \Delta \text{r}^p) $$

### Ablation Studies

The ablation studies of the proposed modules show the following benefits

1. voxel-to-keypoint scene encoding  
→ adding keypoint information and its segmentation supervision on top of the 3D voxel CNN helps multi-scale feature learning

<p align="center">
<img width="465" alt="5" src="https://user-images.githubusercontent.com/86907286/201465659-0277f7e8-a7eb-4f9b-97a2-bb4a85104a4c.png">
</p>


2. Voxel Set Abstraction module   
→ compared with using only the raw point feature $f_i^{(raw)}$, adding the $f_i^{(pv_k)}$ features helps, although the size of the gain varies 

<p align="center">
<img width="465" alt="6" src="https://user-images.githubusercontent.com/86907286/201465661-98dbea31-c358-4124-a12f-1fe378309075.png">
</p>


3. Predicted Keypoint Weighting module   
→ performance drops noticeably without it, which shows that separating foreground from background matters for multi-scale feature aggregation

4. RoI-grid pooling module  
→ outperforms the previous pooling approach on the Moderate and Hard cases, and setting the confidence target from IoU proves especially effective

<p align="center">
<img width="462" alt="7" src="https://user-images.githubusercontent.com/86907286/201465662-401fb979-b871-4c9a-aeb0-41274e24b7ea.png">
</p>

## Implementation Reviews

Sparse CNN

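import itertools
import math
from copy import deepcopy

import spconv  # spconv 1.x layout assumed (spconv.SparseConvTensor at top level)
import torch
from torch import nn
from torch.nn.modules.batchnorm import _BatchNorm

# The remaining helpers used in these cells (compute_grid_shape, searchsorted,
# random_choice, batched_nms_rotated, decode, MLP, PointnetSAModuleMSG,
# furthest_point_sample, gather_operation, VoxelFeatureExtractor, CNN_FACTORY,
# BEVFeatureGatherer) are defined in or imported by the vision3d repository
# listed under Reference.
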
class SparseCNNBase(nn.Module):
    """
    block      shape    stride
    0    [ 4, 8y, 8x, 41]    1
    1    [32, 4y, 4x, 21]    2
    2    [64, 2y, 2x, 11]    4
    3    [64, 1y, 1x,  5]    8
    4    [64, 1y, 1x,  2]    8
    """

    def __init__(self, cfg):
        """grid_shape given in ZYX order."""
        super(SparseCNNBase, self).__init__()
        self.cfg = cfg
        self.grid_shape = compute_grid_shape(cfg)
        self.base_voxel_size = torch.cuda.FloatTensor(cfg.VOXEL_SIZE)
        self.voxel_offset = torch.cuda.FloatTensor(cfg.GRID_BOUNDS[:3])
        self.make_blocks(cfg)

    def make_blocks(self, cfg):
        """Subclasses must implement this method."""
        raise NotImplementedError

    def maybe_bias_init(self, module, val):
        if hasattr(module, "bias") and module.bias is not None:
            nn.init.constant_(module.bias, val)

    def kaiming_init(self, module):
        nn.init.kaiming_normal_(
            module.weight, a=0, mode='fan_out', nonlinearity='relu')
        self.maybe_bias_init(module, 0)

    def batchnorm_init(self, module):
        nn.init.constant_(module.weight, 1)
        self.maybe_bias_init(module, 0)

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                self.kaiming_init(m)
            elif isinstance(m, _BatchNorm):
                self.batchnorm_init(m)

    def to_global(self, stride, volume):
        """
        Convert integer voxel indices to metric coordinates.
        Indices are reversed ijk -> kji to maintain correspondence with xyz.
        Sparse voxels are padded with subsamples to allow batch PointNet processing.
        :voxel_size length-3 tensor describing size of atomic voxel, accounting for stride.
        :voxel_offset length-3 tensor describing coordinate offset of voxel grid.
        """
        index = torch.flip(volume.indices, (1,))
        voxel_size = self.base_voxel_size * stride
        xyz = index[..., 0:3].float() * voxel_size
        xyz = (xyz + self.voxel_offset)
        xyz = self.pad_batch(xyz, index[..., -1], volume.batch_size)
        feature = self.pad_batch(volume.features, index[..., -1], volume.batch_size)
        return xyz, feature

    def compute_pad_amounts(self, batch_index, batch_size):
        """Compute padding needed to form dense minibatch."""
        helper_index = torch.arange(batch_size + 1, device=batch_index.device)
        helper_index = helper_index.unsqueeze(0).contiguous().int()
        batch_index = batch_index.unsqueeze(0).contiguous().int()
        start_index = searchsorted(batch_index, helper_index).squeeze(0)
        batch_count = start_index[1:] - start_index[:-1]
        pad = list((batch_count.max() - batch_count).cpu().numpy())
        batch_count = list(batch_count.cpu().numpy())
        return batch_count, pad

    def pad_batch(self, x, batch_index, batch_size):
        """Pad sparse tensor with subsamples to form dense minibatch."""
        if batch_size == 1:
            return x.unsqueeze(0)
        batch_count, pad = self.compute_pad_amounts(batch_index, batch_size)
        chunks = x.split(batch_count)
        pad_values = [random_choice(c, n) for (c, n) in zip(chunks, pad)]
        chunks = [torch.cat((c, p)) for (c, p) in zip(chunks, pad_values)]
        return torch.stack(chunks)

    def to_bev(self, volume):
        """Collapse z-dimension to form BEV feature map."""
        volume = volume.dense()
        N, C, D, H, W = volume.shape
        bev = volume.view(N, C * D, H, W)
        return bev

    def forward(self, features, coordinates, batch_size):
        x0 = spconv.SparseConvTensor(
            features, coordinates.int(), self.grid_shape, batch_size
        )
        x1 = self.blocks[0](x0)
        x2 = self.blocks[1](x1)
        x3 = self.blocks[2](x2)
        x4 = self.blocks[3](x3)
        x4 = self.to_bev(x4)
        args = zip(self.cfg.STRIDES, (x0, x1, x2, x3))
        x = list(itertools.starmap(self.to_global, args))
        return x, x4

Proposal head

class ProposalLayer(nn.Module):
    """
    Use BEV feature map to generate 3D box proposals.
    TODO: Fix long variable names, ugly line wraps.
    """

    def __init__(self, cfg):
        super(ProposalLayer, self).__init__()
        self.cfg = cfg
        self.conv_cls = nn.Conv2d(
            cfg.PROPOSAL.C_IN, cfg.NUM_CLASSES * cfg.NUM_YAW, 1)
        self.conv_reg = nn.Conv2d(
            cfg.PROPOSAL.C_IN, cfg.NUM_CLASSES * cfg.NUM_YAW * cfg.BOX_DOF, 1)
        self.TOPK, self.DOF = cfg.PROPOSAL.TOPK, cfg.BOX_DOF
        self._init_weights()

    def _init_weights(self):
        # Focal-loss prior: sigmoid(bias) = 0.01 at initialization.
        nn.init.constant_(self.conv_cls.bias, -math.log((1 - .01) / .01))
        nn.init.constant_(self.conv_reg.bias, 0)
        for m in (self.conv_cls.weight, self.conv_reg.weight):
            nn.init.normal_(m, std=0.01)

    def _generate_group_idx(self, B, n_cls):
        """Compute unique group_idx based on (batch_idx, class_idx) tuples."""
        batch_idx = torch.arange(B)[:, None].expand(-1, n_cls)
        class_idx = torch.arange(n_cls)[None, :].expand(B, -1)
        group_idx = class_idx + n_cls * batch_idx
        b, c, g = [x[..., None].expand(-1, -1, self.TOPK).reshape(-1)
            for x in (batch_idx, class_idx, group_idx)]
        return b, c, g

    def _above_score_thresh(self, scores, class_idx):
        """Classes may have different score thresholds."""
        thresh = scores.new_tensor([a['score_thresh'] for a in self.cfg.ANCHORS])
        mask = scores > thresh[class_idx]
        return mask

    def _multiclass_batch_nms(self, boxes, scores):
        """Only boxes with same group_idx are jointly considered in nms"""
        B, n_cls = scores.shape[:2]
        scores = scores.view(-1)
        boxes = boxes.view(-1, self.DOF)
        bev_boxes = boxes[:, [0, 1, 3, 4, 6]]
        batch_idx, class_idx, group_idx = self._generate_group_idx(B, n_cls)
        idx = batched_nms_rotated(bev_boxes, scores, group_idx, iou_threshold=0.01)
        boxes, batch_idx, class_idx, scores = \
            [x[idx] for x in (boxes, batch_idx, class_idx, scores)]
        mask = self._above_score_thresh(scores, class_idx)
        out = [x[mask] for x in (boxes, batch_idx, class_idx, scores)]
        return out

    def _decode(self, reg_map, anchors, anchor_idx):
        """Expands anchors in batch dimension and calls decode."""
        B, n_cls = reg_map.shape[:2]
        anchor_idx = anchor_idx[..., None].expand(-1, -1, -1, self.DOF)
        deltas = reg_map.reshape(B, n_cls, -1, self.cfg.BOX_DOF) \
            .gather(2, anchor_idx)
        anchors = anchors.view(1, n_cls, -1, self.cfg.BOX_DOF) \
            .expand(B, -1, -1, -1).gather(2, anchor_idx)
        boxes = decode(deltas, anchors)
        return boxes

    def inference(self, feature_map, anchors):
        """:return (boxes, batch_idx, class_idx, scores)"""
        cls_map, reg_map = self(feature_map)
        score_map = cls_map.sigmoid_()
        B, n_cls = score_map.shape[:2]
        scores, anchor_idx = score_map.view(B, n_cls, -1).topk(self.TOPK, -1)
        boxes = self._decode(reg_map, anchors, anchor_idx)
        out = self._multiclass_batch_nms(boxes, scores)
        return out

    def reshape_cls(self, cls_map):
        B, _, ny, nx = cls_map.shape
        shape = (B, self.cfg.NUM_CLASSES, self.cfg.NUM_YAW, ny, nx)
        cls_map = cls_map.view(shape)
        return cls_map

    def reshape_reg(self, reg_map):
        B, _, ny, nx = reg_map.shape
        shape = (B, self.cfg.NUM_CLASSES, self.cfg.BOX_DOF, -1, ny, nx)
        reg_map = reg_map.view(shape).permute(0, 1, 3, 4, 5, 2)
        return reg_map

    def forward(self, feature_map):
        cls_map = self.reshape_cls(self.conv_cls(feature_map))
        reg_map = self.reshape_reg(self.conv_reg(feature_map))
        return cls_map, reg_map

Refinement head

class RefinementLayer(nn.Module):
    """
    Uses pooled features to refine proposals.
    TODO: Pass class predictions from proposals since this
        module only predicts confidence.
    TODO: Implement RefinementLoss.
    TODO: Decide if decode box predictions / apply box
        deltas here or elsewhere.
    """

    def __init__(self, cfg):
        super(RefinementLayer, self).__init__()
        self.mlp = self.build_mlp(cfg)
        self.cfg = cfg

    def build_mlp(self, cfg):
        """
        TODO: Check if should use bias.
        """
        channels = cfg.REFINEMENT.MLPS + [cfg.BOX_DOF + 1]
        mlp = MLP(channels, bias=True, bn=False, relu=[True, False])
        return mlp

    def apply_refinements(self, box_deltas, boxes):
        raise NotImplementedError

    def inference(self, points, features, boxes):
        box_deltas, scores = self(points, features, boxes)
        boxes = self.apply_refinements(box_deltas, boxes)
        scores = scores.sigmoid()
        positive = 1 - scores[..., -1:]
        _, indices = torch.topk(positive, k=self.cfg.PROPOSAL.TOPK, dim=1)
        indices = indices.expand(-1, -1, self.cfg.NUM_CLASSES)
        box_indices = indices[..., None].expand(-1, -1, -1, self.cfg.BOX_DOF)
        scores = scores.gather(1, indices)
        boxes = boxes.gather(1, box_indices)
        return boxes, scores, indices

    def forward(self, points, features, boxes):
        refinements = self.mlp(features.permute(0, 2, 1))
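        # NOTE: Tensor.split(1) cuts dim 0 into size-1 chunks, so this unpacking
        # only works when that dimension has length 2. Separating the BOX_DOF
        # deltas from the confidence channel likely needs an explicit
        # split([BOX_DOF, 1], dim=...) on the channel axis of the MLP output.
        box_deltas, scores = refinements.split(1)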
        return box_deltas, scores

RoI-grid pooling

class RoiGridPool(nn.Module):
    """
    Pools features from within proposals.
    TODO: I think must be misunderstanding dimensions claimed in paper.
        If sample 216 gridpoints in each proposal, and keypoint features
        are of dim 256, and gridpoint features are vectorized before linear layer,
        causes 216 * 256 * 256 parameters in reduction...
    TODO: Document input and output sizes.
    """

    def __init__(self, cfg):
        super(RoiGridPool, self).__init__()
        self.pnet = self.build_pointnet(cfg)
        self.reduction = MLP(cfg.GRIDPOOL.MLPS_REDUCTION)
        self.cfg = cfg

    def build_pointnet(self, cfg):
        """Copy channel list because PointNet modifies it in-place."""
        pnet = PointnetSAModuleMSG(
            npoint=-1, radii=cfg.GRIDPOOL.RADII_PN,
            nsamples=cfg.SAMPLES_PN,
            mlps=deepcopy(cfg.GRIDPOOL.MLPS_PN), use_xyz=True,
        )
        return pnet

    def rotate_z(self, points, theta):
        """
        Rotate points by theta around z-axis.
        :points (b, n, m, 3)
        :theta (b, n)
        :return (b, n, m, 3)
        """
        b, n, m, _ = points.shape
        theta = theta.unsqueeze(-1).expand(-1, -1, m)
        xy, z = torch.split(points, [2, 1], dim=-1)
        c, s = torch.cos(theta), torch.sin(theta)
        R = torch.stack((c, -s, s, c), dim=-1).view(b, n, m, 2, 2)
        xy = torch.matmul(R, xy.unsqueeze(-1))
        xyz = torch.cat((xy.squeeze(-1), z), dim=-1)
        return xyz

    def sample_gridpoints(self, boxes):
        """
        Sample axis-aligned points, then rotate.
        :return (b, n, ng, 3)
        """
        b, n, _ = boxes.shape
        m = self.cfg.GRIDPOOL.NUM_GRIDPOINTS
        gridpoints = boxes[:, :, None, 3:6] * \
            (torch.rand((b, n, m, 3), device=boxes.device) - 0.5)
        gridpoints = boxes[:, :, None, 0:3] + \
            self.rotate_z(gridpoints, boxes[..., -1])
        return gridpoints

    def forward(self, proposals, keypoint_xyz, keypoint_features):
        b, n, _ = proposals.shape
        m = self.cfg.GRIDPOOL.NUM_GRIDPOINTS
        gridpoints = self.sample_gridpoints(proposals).view(b, -1, 3)
        features = self.pnet(keypoint_xyz, keypoint_features, gridpoints)[1]
        features = features.view(b, -1, n, m) \
            .permute(0, 2, 1, 3).contiguous().view(b, n, -1)
        features = self.reduction(features)
        return features

Main structure

class PV_RCNN(nn.Module):
    """
    TODO: Improve docstrings.
    TODO: Some docstrings may claim incorrect dimensions.
    TODO: Figure out clean way to handle proposals_only forward.
    """

    def __init__(self, cfg):
        super(PV_RCNN, self).__init__()
        self.pnets = self.build_pointnets(cfg)
        self.roi_grid_pool = RoiGridPool(cfg)
        self.vfe = VoxelFeatureExtractor()
        self.cnn = CNN_FACTORY[cfg.CNN](cfg)
        self.bev = BEVFeatureGatherer(
            cfg, self.cnn.voxel_offset, self.cnn.base_voxel_size)
        self.proposal_layer = ProposalLayer(cfg)
        self.refinement_layer = RefinementLayer(cfg)
        self.cfg = cfg

    def build_pointnets(self, cfg):
        """Copy list because PointNet modifies it in-place."""
        pnets = []
        for i, mlps in enumerate(cfg.PSA.MLPS):
            pnets += [PointnetSAModuleMSG(
                npoint=-1, radii=cfg.PSA.RADII[i],
                nsamples=cfg.SAMPLES_PN,
                mlps=deepcopy(mlps), use_xyz=True,
            )]
        return nn.Sequential(*pnets)

    def sample_keypoints(self, points):
        """
        fps expects points shape (B, N, 3)
        fps returns indices shape (B, K)
        gather expects features shape (B, C, N)
        """
        points = points[..., :3].contiguous()
        indices = furthest_point_sample(points, self.cfg.NUM_KEYPOINTS)
        keypoints = gather_operation(points.transpose(1, 2).contiguous(), indices)
        keypoints = keypoints.transpose(1, 2).contiguous()
        return keypoints

    def _pointnets(self, cnn_out, keypoint_xyz):
        """xyz (B, N, 3) | features (B, N, C) | new_xyz (B, M, 3) | return (B, M, Co)"""
        pnet_out = []
        for (voxel_xyz, voxel_features), pnet in zip(cnn_out, self.pnets):
            voxel_xyz = voxel_xyz.contiguous()
            voxel_features = voxel_features.transpose(1, 2).contiguous()
            out = pnet(voxel_xyz, voxel_features, keypoint_xyz)[1]
            pnet_out += [out]
        return pnet_out

    def point_feature_extract(self, item, cnn_features, bev_map):
        points_split = torch.split(item['points'], [3, 1], dim=-1)
        cnn_features = [points_split] + cnn_features
        point_features = self._pointnets(cnn_features, item['keypoints'])
        bev_features = self.bev(bev_map, item['keypoints'])
        point_features = torch.cat(point_features + [bev_features], dim=1)
        return point_features

    def proposal(self, item):
        item['keypoints'] = self.sample_keypoints(item['points'])
        features = self.vfe(item['features'], item['occupancy'])
        cnn_features, bev_map = self.cnn(features, item['coordinates'], item['batch_size'])
        scores, boxes = self.proposal_layer(bev_map)
        item.update(dict(P_cls=scores, P_reg=boxes))
        return item

    def forward(self, item):
        raise NotImplementedError

## Reference

- https://arxiv.org/abs/1912.13192
- https://towardsdatascience.com/how-does-sparse-convolution-work-3257a0a8fd1  
- https://github.com/jhultman/vision3d/tree/master/vision3d/detector