
Explanation of Pyramid ROI Align #128

Closed · hondaathma opened this issue Dec 9, 2017 · 8 comments
@hondaathma

Hi,
Thanks a lot for this great implementation. I have some doubts about how you implemented the ROI Align layer.
Specifically, this routine in PyramidROIAlign:

[two screenshots of the PyramidROIAlign code, with regions highlighted in yellow]

I read the paper and understood how Kaiming interpolates using the 4 corner points, but I have trouble understanding how you have implemented it. Why is there a need for log2 (in yellow box 1)? And I don't understand the second yellow box at all. Do you mind explaining it with an example? Given an example image (800x800x3), a single feature pyramid map (1x256x100x100, so 1/8th scale), and a single bounding box (x1: 22, y1: 22, x2: 50, y2: 50), how do I understand the interpolation?

Thanks a lot!

@waleedka
Collaborator

The crop_and_resize() function in TensorFlow is doing all the heavy lifting here. It crops and resizes an image and handles the bilinear interpolation. The Mask R-CNN paper mentions picking 4 points and interpolating between them, but it also mentions that picking one point only is "nearly as effective", which is what crop_and_resize() does. At least that's my understanding of how it works, and I've seen it used in Google's own models.
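For concreteness, here is a minimal standalone sketch of calling it (the box values reuse the 800x800 example from the question; this is not the repo's exact invocation):

```python
import tensorflow as tf

# Hypothetical inputs: one 100x100 feature map with 256 channels (roughly the
# 1/8-scale map from the question) and one box in normalized [y1, x1, y2, x2]
# coordinates, i.e. the pixel box (22, 22)-(50, 50) on an 800x800 image.
feature_map = tf.random.normal([1, 100, 100, 256])
boxes = tf.constant([[22 / 800, 22 / 800, 50 / 800, 50 / 800]])
box_indices = tf.constant([0], dtype=tf.int32)  # which image each box crops from

# Bilinearly samples a fixed-size 7x7 crop from the feature map for each box.
pooled = tf.image.crop_and_resize(feature_map, boxes, box_indices, crop_size=[7, 7])
print(pooled.shape)  # (1, 7, 7, 256)
```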

The log2 in the first box is not related to ROIAlign directly. It simply tries to decide which of the different levels of the feature pyramid to use depending on the area of the ROI.
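For intuition, here is a pure-Python sketch of that level-selection idea (my own paraphrase of Eq. 1 from the FPN paper, not the exact code from this repo):

```python
import math

def fpn_level(roi_h_px, roi_w_px, k0=4, canonical=224.0):
    """Assign an ROI to a pyramid level from its area (FPN Eq. 1),
    clipped to the levels this model actually builds (P2..P5)."""
    k = k0 + math.log2(math.sqrt(roi_h_px * roi_w_px) / canonical)
    return min(5, max(2, round(k)))

print(fpn_level(224, 224))  # 4: the canonical ImageNet crop size maps to P4
print(fpn_level(112, 112))  # 3: half the side length -> one level down
print(fpn_level(448, 448))  # 5: double the side length -> one level up
```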

@hondaathma
Author

Thanks a lot for the informative answer! Just one last doubt. For the ROI pyramid level selection, the formula uses 224 in the denominator. Do you know why they used that? (I know they mention it is because of ImageNet.) Is it fair to use the same value for COCO or Cityscapes when the image size is 1024 or 800?

@babakEhteshami

I also have the same question. Why is 224 hardcoded in the denominator?

@FruVirus

In the FPN paper, Section 4.2, they adapt the assignment strategy of region-based detectors, similar to that in Fast R-CNN. In Fast R-CNN, they experimented with three pre-trained ImageNet networks (AlexNet, VGG_CNN_M_1024, and VGG-16).

Mask R-CNN also sets hyper-parameters following the Fast/Faster R-CNN papers, which implies that they fine-tuned the base CNN network (i.e., ResNet-101) using ImageNet data. Hence, I would assume that 224 is hard-coded in this implementation of Mask R-CNN because of this logic.

As to why 224 and not some other number, I believe it has to do with how fine-tuning is done when using ImageNet data. Specifically, in ImageNet classification fine-tuning, 224 x 224 crops are randomly sampled for data augmentation. In addition, the FPN paper follows the ResNet paper's approach, where C4 is used as the single-scale feature map. This means that a 224 x 224 crop is assigned to the 4th level of the FPN (i.e., P4), and smaller/larger crops are assigned accordingly. Hence the use of k0 = 4 and 224 in Eq. 1 of the FPN paper, Section 4.2.

If your base CNN network is fine-tuned using ImageNet data, I would assume that 224 is a safe assumption. As to the setting of k0, this would depend on the dynamic scale of the objects you are trying to detect. For example, if we set k0 = 4 and use P2 as the bottom level of the FPN, we have

2 = 4 + log2(sqrt(w * w) / 224) ---> w ~= 30. So this means that under the default settings, the FPN can account for objects as small as 30 pixels by using the P2 level feature map.

On the other hand, if we set k0 = 3, then a 30 pixel object would end up getting mapped to P1 (which we don't calculate).
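To make the k0 trade-off concrete, here is a quick sketch that inverts Eq. 1 for a square box of side w (my own arithmetic, not code from the repo):

```python
# Invert k = k0 + log2(w / 224) to get the box side w that nominally maps to level k:
def boundary_size(k, k0=4, canonical=224.0):
    return canonical * 2.0 ** (k - k0)

print(boundary_size(2, k0=4))  # 56.0 px: the nominal P2 box size with k0 = 4
print(boundary_size(2, k0=3))  # 112.0 px: with k0 = 3, smaller boxes fall to P1
```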

@waleedka
Collaborator

waleedka commented Apr 2, 2018

Thanks @FruVirus. Great explanation. Closing this issue.

waleedka closed this as completed Apr 2, 2018
@DevinCheung

@FruVirus Thanks for your explanation! That really helps me a lot. I have some confusion about the example.

2 = 4 + log2(sqrt(w * w) / 224) ---> w ~= 30. Is this equation wrong? w should be about 60 (w ~= 60).
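Checking by hand (plain arithmetic):

```python
import math

# 2 = 4 + log2(w / 224)  =>  log2(w / 224) = -2  =>  w = 224 / 4
w = 224 * 2 ** (2 - 4)
print(w)                       # 56.0, i.e. about 60, not 30
print(4 + math.log2(w / 224))  # 2.0: a ~56 px box maps to P2
```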

Also, if I want to detect small things (5x5 pixels, for example), should I increase k0? Is that right?

@kleingeo

Is the 224 even necessary here? In the original paper, since h and w were in pixels, not normalized coordinates, dividing by 224 essentially normalized sqrt(w * h), which would account for some length if w = h. But in this case, h and w are already normalized, so dividing by 224 and multiplying by sqrt(image_area) is redundant, especially if sqrt(image_area) is 224. Wouldn't this also fix the issue of having the 224 hard-coded in? If you are feeding an image that is 1024 by 1024, then shouldn't 224 be replaced with 1024? And if so, that would be the same as changing the 224 to sqrt(image_area), which when divided by itself is just 1.
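A small numeric illustration of that point (my own sketch, assuming a 1024x1024 input and a 224x224 ROI):

```python
import math

image_area = 1024.0 * 1024.0
h, w = 224 / 1024, 224 / 1024  # normalized box height and width

# What the code computes (normalized coordinates):
ratio_code = math.sqrt(h * w) / (224.0 / math.sqrt(image_area))
# What the paper writes (pixel coordinates):
ratio_paper = math.sqrt(224.0 * 224.0) / 224.0
print(ratio_code, ratio_paper)  # 1.0 1.0 -> the same level assignment

# Replacing 224 with sqrt(image_area) would instead give plain sqrt(h * w),
# an image-size-relative rule rather than an absolute-pixel-size rule:
print(math.sqrt(h * w))  # 0.21875
```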

@shah-scalpel

shah-scalpel commented Jun 9, 2021

@kleingeo

```python
image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
# h and w are in normalized coordinates, so we need to multiply by the
# image area to get the actual ROI area in pixels:
roi_image_area = h * w * image_area
roi_level = log2_graph(tf.sqrt(roi_image_area) / 224.0)
```

which is equivalent to

```python
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
```

The 224 x 224 crop is coming from the FPN paper, where the authors assign this particular crop size to the 4th level in the FPN (i.e., P4). One can definitely play with this crop size as a parameter, depending on the particular application.
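Putting it together, the level selection around that line looks roughly like this (paraphrased from memory, not a verbatim copy of the repo's model.py):

```python
import tensorflow as tf

def log2_graph(x):
    # log2 isn't a primitive TF op, so the repo defines it via natural logs
    return tf.math.log(x) / tf.math.log(2.0)

def select_roi_level(h, w, image_shape):
    """h, w: normalized box height/width. Returns the FPN level (2..5)."""
    image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
    roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
    # Round to the nearest level and clip to the pyramid that was built (P2..P5):
    return tf.minimum(5, tf.maximum(2, 4 + tf.cast(tf.round(roi_level), tf.int32)))

# e.g. a 224x224 box on a 1024x1024 image maps to P4:
print(select_roi_level(tf.constant(224 / 1024), tf.constant(224 / 1024), (1024, 1024)))
```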

@DevinCheung
There is an inherent problem with Mask R-CNN: it cannot detect small objects. Increasing k_0 will not help in that case. The maximum value of k_0 is 4 in this setting; k_0 can only be chosen from {3, 4}.
