
Explanation of Pyramid ROI Align #128

Closed · hondaathma opened this issue Dec 9, 2017 · 8 comments
@hondaathma

Hi,
Thanks a lot for this great implementation. I have some doubts about how you implemented the ROI Align layer.
Specifically, this routine in PyramidROIAlign:

[two screenshots of the PyramidROIAlign code, with regions highlighted in yellow]

I read the paper and understood how Kaiming interpolates using the 4 corner points, but I have trouble understanding how you have implemented it. Why is there a need for log2 (in yellow box 1)? And I don't understand the second yellow box at all. Do you mind explaining it with an example? Given an example image (800x800x3), a single feature pyramid map (1x256x100x100, so 1/8th scale), and a single bounding box (x1: 22, y1: 22, x2: 50, y2: 50), how do I understand the interpolation?

Thanks a lot!

@waleedka
Collaborator

The crop_and_resize() function in TensorFlow is doing all the heavy lifting here. It crops and resizes an image and handles the bilinear interpolation. The Mask R-CNN paper mentions picking 4 points and interpolating between them, but it also mentions that picking one point only is "nearly as effective", which is what crop_and_resize() does. At least that's my understanding of how it works, and I've seen it used in Google's own models.
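For concreteness, here is a minimal standalone sketch of calling it (the box values reuse the 800x800 example from the question; this is not the repo's exact invocation):

```python
import tensorflow as tf

# Hypothetical inputs: one 100x100 feature map with 256 channels (roughly the
# 1/8-scale map from the question) and one box in normalized [y1, x1, y2, x2]
# coordinates, i.e. the pixel box (22, 22)-(50, 50) on an 800x800 image.
feature_map = tf.random.normal([1, 100, 100, 256])
boxes = tf.constant([[22 / 800, 22 / 800, 50 / 800, 50 / 800]])
box_indices = tf.constant([0], dtype=tf.int32)  # which image each box crops from

# Bilinearly samples a fixed-size 7x7 crop from the feature map for each box.
pooled = tf.image.crop_and_resize(feature_map, boxes, box_indices, crop_size=[7, 7])
print(pooled.shape)  # (1, 7, 7, 256)
```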

The log2 in the first box is not related to ROIAlign directly. It simply tries to decide which of the different levels of the feature pyramid to use depending on the area of the ROI.
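For intuition, here is a pure-Python sketch of that level-selection idea (my own paraphrase of Eq. 1 from the FPN paper, not the exact code from this repo):

```python
import math

def fpn_level(roi_h_px, roi_w_px, k0=4, canonical=224.0):
    """Assign an ROI to a pyramid level from its area (FPN Eq. 1),
    clipped to the levels this model actually builds (P2..P5)."""
    k = k0 + math.log2(math.sqrt(roi_h_px * roi_w_px) / canonical)
    return min(5, max(2, round(k)))

print(fpn_level(224, 224))  # 4: the canonical ImageNet crop size maps to P4
print(fpn_level(112, 112))  # 3: half the side length -> one level down
print(fpn_level(448, 448))  # 5: double the side length -> one level up
```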

@hondaathma
Author

Thanks a lot for the informative answer! Just one last doubt. For the ROI pyramid level selection, the formula uses 224 in the denominator. Do you know why they used that? (I know they mention it is because of ImageNet.) Is it fair to use the same value for COCO or Cityscapes when the image size is 1024 or 800?

@babakEhteshami

I also have the same question. Why is 224 hardcoded in the denominator?

@FruVirus

In the FPN paper, Section 4.2, they adapt the assignment strategy of region-based detectors, similar to that in Fast R-CNN. In Fast R-CNN, they experimented with three pre-trained ImageNet networks (AlexNet, VGG_CNN_M_1024, and VGG-16).

Mask R-CNN also sets hyper-parameters following the Fast/Faster R-CNN papers, which implies that they fine-tuned the base CNN network (i.e., ResNet-101) using ImageNet data. Hence, I would assume that 224 is hard-coded in this implementation of Mask R-CNN because of this logic.

As to why 224 and not some other number, I believe it has to do with how fine-tuning is done when using ImageNet data. Specifically, in ImageNet classification fine-tuning, 224 x 224 crops are randomly sampled for data augmentation. In addition, the FPN paper follows the ResNet paper's approach, where C4 is used as the single-scale feature map. This means that a 224 x 224 crop is assigned to the 4th level of the FPN (i.e., P4), and smaller/larger crops are assigned accordingly. Hence the use of k0 = 4 and 224 in Eq. 1 of the FPN paper, Section 4.2.

If your base CNN network is fine-tuned using ImageNet data, I would assume that 224 is a safe assumption. As to the setting of k0, this would depend on the dynamic scale of the objects you are trying to detect. For example, if we set k0 = 4 and use P2 as the bottom level of the FPN, we have

2 = 4 + log2(sqrt(w * w) / 224) ---> w ~= 30. So this means that under the default settings, the FPN can account for objects as small as 30 pixels by using the P2 level feature map.

On the other hand, if we set k0 = 3, then a 30 pixel object would end up getting mapped to P1 (which we don't calculate).
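To make the k0 trade-off concrete, here is a quick sketch that inverts Eq. 1 for a square box of side w (my own arithmetic, not code from the repo):

```python
# Invert k = k0 + log2(w / 224) to get the box side w that nominally maps to level k:
def boundary_size(k, k0=4, canonical=224.0):
    return canonical * 2.0 ** (k - k0)

print(boundary_size(2, k0=4))  # 56.0 px: the nominal P2 box size with k0 = 4
print(boundary_size(2, k0=3))  # 112.0 px: with k0 = 3, smaller boxes fall to P1
```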

@waleedka
Collaborator

waleedka commented Apr 2, 2018

Thanks @FruVirus. Great explanation. Closing this issue.

waleedka closed this as completed Apr 2, 2018
@DevinCheung

@FruVirus Thanks for your explanation! That really helps me a lot. I have some confusion about the example.

2 = 4 + log2(sqrt(w * w) / 224) ---> w ~= 30. Is this equation wrong? w should be about 60 (w ~= 60).
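Checking by hand (plain arithmetic):

```python
import math

# 2 = 4 + log2(w / 224)  =>  log2(w / 224) = -2  =>  w = 224 / 4
w = 224 * 2 ** (2 - 4)
print(w)                       # 56.0, i.e. about 60, not 30
print(4 + math.log2(w / 224))  # 2.0: a ~56 px box maps to P2
```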

Also, if I want to detect small things (5x5 pixels, for example), should I increase k0? Is that right?

@kleingeo

Is the 224 even necessary here? In the original paper, since h and w were in pixels, not normalized coordinates, dividing by 224 essentially normalized sqrt(w * h), which would account for some length if w = h. But in this case, h and w are already normalized, so dividing by 224 and multiplying by sqrt(image_area) is redundant, especially if sqrt(image_area) is 224. Wouldn't this also fix the issue of having the 224 hard-coded in? If you are feeding an image that is 1024 by 1024, then shouldn't 224 be replaced with 1024? And if so, that would be the same as changing the 224 to sqrt(image_area), which when divided by itself is just 1.
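A small numeric illustration of that point (my own sketch, assuming a 1024x1024 input and a 224x224 ROI):

```python
import math

image_area = 1024.0 * 1024.0
h, w = 224 / 1024, 224 / 1024  # normalized box height and width

# What the code computes (normalized coordinates):
ratio_code = math.sqrt(h * w) / (224.0 / math.sqrt(image_area))
# What the paper writes (pixel coordinates):
ratio_paper = math.sqrt(224.0 * 224.0) / 224.0
print(ratio_code, ratio_paper)  # 1.0 1.0 -> the same level assignment

# Replacing 224 with sqrt(image_area) would instead give plain sqrt(h * w),
# an image-size-relative rule rather than an absolute-pixel-size rule:
print(math.sqrt(h * w))  # 0.21875
```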

@shah-scalpel

shah-scalpel commented Jun 9, 2021

@kleingeo

```python
image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
# h and w are in normalized coordinates, so we need to multiply by the
# image area to get the actual ROI area in pixels:
roi_image_area = h * w * image_area
roi_level = log2_graph(tf.sqrt(roi_image_area) / 224.0)
```

which is equivalent to

```python
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
```

The 224 x 224 crop is coming from the FPN paper, where the authors assign this particular crop size to the 4th level in the FPN (i.e., P4). One can definitely play with this crop size as a parameter, depending on the particular application.
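Putting it together, the level selection around that line looks roughly like this (paraphrased from memory, not a verbatim copy of the repo's model.py):

```python
import tensorflow as tf

def log2_graph(x):
    # log2 isn't a primitive TF op, so the repo defines it via natural logs
    return tf.math.log(x) / tf.math.log(2.0)

def select_roi_level(h, w, image_shape):
    """h, w: normalized box height/width. Returns the FPN level (2..5)."""
    image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
    roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
    # Round to the nearest level and clip to the pyramid that was built (P2..P5):
    return tf.minimum(5, tf.maximum(2, 4 + tf.cast(tf.round(roi_level), tf.int32)))

# e.g. a 224x224 box on a 1024x1024 image maps to P4:
print(select_roi_level(tf.constant(224 / 1024), tf.constant(224 / 1024), (1024, 1024)))
```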

@DevinCheung
There is an inherent problem with Mask R-CNN: it cannot detect small objects. Increasing k_0 will not help in that case. The maximum value of k_0 is 4 in this setting; k_0 can only be chosen from {3, 4}.
