
Inference Time of Deformable Detr with Swin-base #1

Closed
ilovecv opened this issue Dec 4, 2021 · 5 comments

Comments

@ilovecv

ilovecv commented Dec 4, 2021

Hi,
From the results you provided on OpenReview, the inference speed of Deformable DETR with Swin-base is 4.8 FPS. However, from my testing, it is 8.1 FPS. I am using a Tesla V100 GPU with batch size = 1.

[Screenshot: inference timing results]
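For reference, this is roughly how I measured FPS (a simplified sketch; the model and input shape are placeholders, not my exact benchmark script):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=10, iters=100):
    """Rough batch-size-1 FPS measurement on a single GPU."""
    model.eval()
    for _ in range(warmup):      # warm-up passes so lazy CUDA init is excluded
        model(images)
    torch.cuda.synchronize()     # flush queued kernels before starting the clock
    start = time.time()
    for _ in range(iters):
        model(images)
    torch.cuda.synchronize()     # wait for the last forward pass to finish
    return iters / (time.time() - start)

# Usage (placeholder input; a real run feeds resized COCO images):
# fps = measure_fps(detector.cuda(), torch.randn(1, 3, 800, 1216).cuda())
```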

@songhwanjun
Collaborator

songhwanjun commented Dec 6, 2021

Hi,
The inference time of Deformable DETR depends heavily on how the multi-scale features are defined, because of the high computational overhead of the deformable encoder. From the Swin Transformer, we can use five scale feature maps in total: one from the patch embedding and one from each of the four Swin block stages. In our implementation, we used the last four scale feature maps as the input to deformable attention, and the inference speed was 4.8 FPS. We suspect you used a different combination of multi-scale features. Could you try again with our setting? Similarly, for DeiT we used four scale feature maps obtained at layers 2, 4, 10, and 12.
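To make the setting concrete, here is a rough sketch (tensor shapes are illustrative for Swin-base on an 800x1216 input, not our exact code) of selecting the last four of the five available scale maps:

```python
import torch

# Illustrative shapes for a 3x800x1216 input to Swin-base (embed_dim=128);
# five scale maps are available: the patch embedding plus the four stages.
all_scales = [
    torch.randn(1, 128, 200, 304),   # patch embedding, stride 4
    torch.randn(1, 128, 200, 304),   # stage 1, stride 4
    torch.randn(1, 256, 100, 152),   # stage 2, stride 8
    torch.randn(1, 512, 50, 76),     # stage 3, stride 16
    torch.randn(1, 1024, 25, 38),    # stage 4, stride 32
]

# The 4.8 FPS setting feeds the last four maps to the deformable-attention
# neck; each extra high-resolution map adds many more encoder query
# locations, which is where the FPS difference comes from.
neck_inputs = all_scales[-4:]
print([tuple(f.shape) for f in neck_inputs])
```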

Unlike Deformable DETR, ViDT achieves a nearly constant inference time regardless of the combination of multi-scale feature maps, because it removes the computationally heavy neck encoder.

@ilovecv
Author

ilovecv commented Dec 6, 2021

Thanks for your clarification. I used the last three scale feature maps, because I found out_indices=[1, 2, 3] in your Swin code: https://github.com/naver-ai/vidt/blob/main/methods/swin_w_ram.py#L720. Also, in the Deformable DETR code here, https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/backbone.py#L76, only three scale feature maps are used.
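For clarity, here is my understanding of the out_indices convention (illustrative values only; indices refer to the four Swin stage outputs):

```python
# Assumed convention, following swin_w_ram.py: out_indices indexes the four
# Swin stage outputs (0 -> stride 4, 1 -> stride 8, 2 -> stride 16, 3 -> stride 32).
out_indices = (1, 2, 3)      # three scales, as in swin_w_ram.py#L720
out_indices = (0, 1, 2, 3)   # four scales, the setting described above
```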

Is the performance gap large when using three scale feature maps instead of four on the COCO dataset? Thanks.

@songhwanjun
Collaborator

songhwanjun commented Dec 6, 2021

Thanks for the follow-up questions. There was a meaningful performance gap: with four scale feature maps, the deformable encoder at the neck can mix more fine-grained (low-level) features with semantically coarse-grained (high-level) features. In other words, the number of scales is a design choice that trades off AP against FPS. We think it would be nice to add an analysis of different multi-scale feature-map setups in the next version.

We are sorry for the unclear state of the code. The current version is slightly different from the submission version because we are trying to achieve a better AP/FPS trade-off while the ICLR decision is still pending (e.g., out_indices=[1, 2, 3] is deprecated, and new components such as an FPN have been added). We are still updating the code and will release a faster and more accurate version soon.

@ilovecv
Author

ilovecv commented Dec 6, 2021

Got it, thank you very much for your detailed reply. Good luck with your paper!

@edgar2597

@SanghyukChun @ilovecv Hi, where did you find the script for training Deformable DETR with a Swin backbone?
