# **PointRend - Image Segmentation as Rendering**

**Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick - Facebook AI Research (FAIR)**

**Official Github**: https://github.com/facebookresearch/detectron2/tree/main/projects/PointRend

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues in these notes or scripts, please open a pull request on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 10 2022

---

### **Abstract**



<table>
  <tr>
    <td>
      <strong>Abstract</strong>
    </td>
    <td>
      <strong>Key Summary</strong>
    </td>
  </tr>
  <tr>
      <td width="400">
        *We present a new method for efficient high-quality
        image segmentation of objects and scenes. By analogizing
        classical computer graphics methods for efficient rendering
        with over- and undersampling challenges faced in pixel
        labeling tasks, we develop a unique perspective of image
        segmentation as a rendering problem. From this vantage,
        we present the PointRend (Point-based Rendering) neural
        network module: a module that performs point-based
        segmentation predictions at adaptively selected locations
        based on an iterative subdivision algorithm. PointRend
        can be flexibly applied to both instance and semantic
        segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of
        the general idea are possible, we show that a simple design
        already achieves excellent results. Qualitatively, PointRend
        outputs crisp object boundaries in regions that are oversmoothed by previous methods. Quantitatively, PointRend
        yields significant gains on COCO and Cityscapes, for both
        instance and semantic segmentation. PointRend’s efficiency
        enables output resolutions that are otherwise impractical
        in terms of memory or computation compared to existing
        approaches. Code has been made available at
        https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend.*
      </td>
      <td valign="top" width="600">
        <strong>Figure 1: Instance segmentation with PointRend.</strong>
        We introduce the PointRend (Point-based Rendering) module that makes predictions at adaptively sampled points on the image using a new point-based feature representation (see Fig. 3). PointRend is general and
        can be flexibly integrated into existing semantic and instance segmentation systems. When used to replace Mask R-CNN’s default
        mask head [19] (top-left), PointRend yields significantly more detailed results (top-right). (bottom) During inference, PointRend iteratively computes its prediction. Each step applies bilinear upsampling in smooth regions and makes higher resolution predictions
        at a small number of adaptively selected points that are likely to
        lie on object boundaries (black points). All figures in the paper are
        best viewed digitally with zoom. Image source: [41].
    </td>
  </tr>
</table>

---

### **Abstract [Key Summary]**

### **Introduction**

Image segmentation tasks involve mapping pixels sampled on a regular grid to a label map, or a set of label maps,
on the same grid. For semantic segmentation, the label map
indicates the predicted category at each pixel. In the case of
instance segmentation, a binary foreground vs. background
map is predicted for each detected object. The modern tools
of choice for these tasks are built on convolutional neural
networks (CNNs) [27, 26].

CNNs for image segmentation typically operate on regular grids: the input image is a regular grid of pixels, their
hidden representations are feature vectors on a regular grid,
and their outputs are label maps on a regular grid. Regular grids are convenient, but not necessarily computationally ideal for image segmentation. The label maps predicted by these networks should be mostly smooth, i.e.,
neighboring pixels often take the same label, because high-frequency regions are restricted to the sparse boundaries between objects. A regular grid will unnecessarily oversample
the smooth areas while simultaneously undersampling object boundaries. The result is excess computation in smooth
regions and blurry contours (Fig. 1, upper-left). Image segmentation methods often predict labels on a low-resolution
regular grid, e.g., 1/8-th of the input [35] for semantic segmentation, or 28×28 [19] for instance segmentation, as a
compromise between undersampling and oversampling.
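To make the over- and undersampling argument concrete, here is a minimal sketch (not from the paper) that measures how sparse label boundaries are on a regular grid. The disk-shaped object and the helper names `boundary_fraction` and `disk_mask` are illustrative assumptions; any binary mask would do.

```python
import numpy as np

def boundary_fraction(mask: np.ndarray) -> float:
    """Fraction of pixels whose label differs from at least one 4-neighbor."""
    diff = np.zeros(mask.shape, dtype=bool)
    # Horizontal label changes: mark the pixels on both sides of each change.
    horiz = mask[:, 1:] != mask[:, :-1]
    diff[:, 1:] |= horiz
    diff[:, :-1] |= horiz
    # Vertical label changes, handled the same way.
    vert = mask[1:, :] != mask[:-1, :]
    diff[1:, :] |= vert
    diff[:-1, :] |= vert
    return float(diff.mean())

def disk_mask(size: int) -> np.ndarray:
    """Hypothetical object: a centered disk on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    center = (size - 1) / 2.0
    return (xs - center) ** 2 + (ys - center) ** 2 <= (0.35 * size) ** 2

# Compare a 28x28 mask with a 224x224 mask of the same shape.
for size in (28, 224):
    print(size, round(boundary_fraction(disk_mask(size)), 3))
```

The boundary length grows linearly with resolution while the pixel count grows quadratically, so the share of pixels that actually need a fine-grained decision shrinks as the grid gets denser. This is the imbalance the paragraph above describes.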


<table>
  <tr>
      <td>
        <img src="./imgs/figure1.png" width="350px" />
      </td>
      <td valign="bottom" width="700">
        <strong>Figure 1: Instance segmentation with PointRend.</strong>
        We introduce the PointRend (Point-based Rendering) module that makes predictions at adaptively sampled points on the image using a new point-based feature representation (see Fig. 3). PointRend is general and
        can be flexibly integrated into existing semantic and instance segmentation systems. When used to replace Mask R-CNN’s default
        mask head [19] (top-left), PointRend yields significantly more detailed results (top-right). (bottom) During inference, PointRend iteratively computes its prediction. Each step applies bilinear upsampling in smooth regions and makes higher resolution predictions
        at a small number of adaptively selected points that are likely to
        lie on object boundaries (black points). All figures in the paper are
        best viewed digitally with zoom. Image source: [41].
    </td>
  </tr>
</table>


Analogous sampling issues have been studied for
decades in computer graphics. For example, a renderer
maps a model (e.g., a 3D mesh) to a rasterized image, i.e., a
regular grid of pixels. While the output is on a regular grid,
computation is not allocated uniformly over the grid. Instead, a common graphics strategy is to compute pixel values at an irregular subset of adaptively selected points in the
image plane. The classical subdivision technique of [48], as
an example, yields a quadtree-like sampling pattern that efficiently renders an anti-aliased, high-resolution image.

The central idea of this paper is to view image segmentation as a rendering problem and to adapt classical
ideas from computer graphics to efficiently “render” high-quality label maps (see Fig. 1, bottom-left). We encapsulate this computational idea in a new neural network
module, called PointRend, that uses a subdivision strategy
to adaptively select a non-uniform set of points at which
to compute labels. PointRend can be incorporated into
popular meta-architectures for both instance segmentation
(e.g., Mask R-CNN [19]) and semantic segmentation (e.g.,
FCN [35]). Its subdivision strategy efficiently computes
high-resolution segmentation maps using an order of magnitude fewer floating-point operations than direct, dense
computation.
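The subdivision idea can be sketched in a few lines of NumPy. This is a simplified illustration that loosely follows the inference procedure in the Figure 1 caption (bilinear upsampling, then re-prediction at a few selected points); it is not the authors' implementation. The 7×7 starting grid, the five steps, the budget of 28² points per step, the "closest to 0.5" uncertainty criterion, and the analytic `point_predictor` standing in for a learned point head are all assumptions made for this sketch.

```python
import numpy as np

def point_predictor(xs, ys):
    """Hypothetical stand-in for a point head: soft foreground probability of a disk."""
    dist = np.sqrt((xs - 0.5) ** 2 + (ys - 0.5) ** 2)
    return 1.0 / (1.0 + np.exp((dist - 0.3) / 0.01))

def upsample2x(p):
    """Bilinear 2x upsampling of a 2D array (pixel-center convention)."""
    h, w = p.shape
    # Output pixel centers expressed in input pixel coordinates, clamped at the border.
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = p[y0][:, x0] * (1 - wx) + p[y0][:, x1] * wx
    bottom = p[y1][:, x0] * (1 - wx) + p[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def render(coarse_size=7, steps=5, points_per_step=28 * 28):
    """Coarse-to-fine prediction; returns the final map and the number of point evaluations."""
    centers = (np.arange(coarse_size) + 0.5) / coarse_size
    gx, gy = np.meshgrid(centers, centers)
    probs = point_predictor(gx, gy)          # dense prediction on the coarse grid only
    evaluations = probs.size
    for _ in range(steps):
        probs = upsample2x(probs)            # cheap interpolation everywhere
        size = probs.shape[0]
        n = min(points_per_step, probs.size)
        # Most uncertain points: interpolated probability closest to 0.5.
        uncertainty = -np.abs(probs - 0.5)
        flat_idx = np.argpartition(uncertainty.ravel(), -n)[-n:]
        iy, ix = np.unravel_index(flat_idx, probs.shape)
        # Re-predict only at the selected points.
        probs[iy, ix] = point_predictor((ix + 0.5) / size, (iy + 0.5) / size)
        evaluations += n
    return probs, evaluations

probs, n_evals = render()
print(f"point evaluations: {n_evals}, dense evaluations: {probs.size}")
```

With these assumed settings the predictor is evaluated at 49 + 196 + 4 × 784 = 3,381 points, compared with 50,176 for a dense 224×224 prediction, which is in line with the order-of-magnitude saving claimed above. The selected points concentrate near the disk boundary, where interpolation alone is unreliable.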

PointRend is a general module that admits many possible implementations. Viewed abstractly, a PointRend
module accepts one or more typical CNN feature maps
$f(x_i, y_i)$ that are defined over regular grids, and outputs
high-resolution predictions $p(x'_i, y'_i)$ over a finer grid. Instead of making excessive predictions over all points on the
output grid, PointRend makes predictions only on carefully
selected points. To make these predictions, it extracts a
point-wise feature representation for the selected points by
interpolating f, and uses a small point head subnetwork to
predict output labels from the point-wise features. We will
present a simple and effective PointRend implementation.
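As a rough illustration of the two ingredients named above, the sketch below bilinearly interpolates a feature map `f` at real-valued points and passes the resulting point-wise features through a tiny MLP. Everything here is an assumption for illustration: the function names, the normalized `[0, 1]` coordinate convention, the layer sizes, and the random untrained weights. The official detectron2 code is organized differently.

```python
import numpy as np

def sample_point_features(feature_map, points):
    """Bilinearly interpolate a (C, H, W) feature map at N points.

    points: (N, 2) array of (x, y) in [0, 1], with pixel centers at (i + 0.5) / size.
    Returns an (N, C) array of point-wise features.
    """
    _, h, w = feature_map.shape
    # Convert normalized coordinates to pixel coordinates, clamped at the border.
    x = np.clip(points[:, 0] * w - 0.5, 0, w - 1)
    y = np.clip(points[:, 1] * h - 0.5, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T

def point_head(features, w1, b1, w2, b2):
    """Hypothetical two-layer MLP shared across points: (N, C) -> (N, K) logits."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 8, 8))          # C=16 channels on an 8x8 grid
pts = rng.random((5, 2))                     # 5 arbitrary points, not on the grid
feats = sample_point_features(f, pts)        # shape (5, 16)
w1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 3)), np.zeros(3)
logits = point_head(feats, w1, b1, w2, b2)   # shape (5, 3): K=3 class scores per point
print(feats.shape, logits.shape)
```

Because the head operates on individual points rather than on a grid, the same weights can be applied to any number of points at any location, which is what allows predictions to be restricted to a selected subset.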

We evaluate PointRend on instance and semantic segmentation tasks using the COCO [29] and Cityscapes [9]
benchmarks. Qualitatively, PointRend efficiently computes
sharp boundaries between objects, as illustrated in Fig. 2
and Fig. 8. We also observe quantitative improvements even
though the standard intersection-over-union based metrics
for these tasks (mask AP and mIoU) are biased towards
object-interior pixels and are relatively insensitive to boundary improvements. PointRend improves strong Mask R-CNN and DeepLabV3 [5] models by a significant margin.

