
Inference Time of Deformable Detr with Swin-base #1

Closed
ilovecv opened this issue Dec 4, 2021 · 5 comments

Comments

@ilovecv

ilovecv commented Dec 4, 2021

Hi,
From the results you provided on OpenReview, the inference speed of Deformable DETR with Swin-base is 4.8 FPS. However, from my testing, it is 8.1 FPS. I am using a Tesla V100 GPU with batch size = 1.

[Screenshot: inference timing results]
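For reference, this is roughly how I measured FPS (a simplified sketch; the model and input shape are placeholders, not my exact benchmark script):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=10, iters=100):
    """Rough batch-size-1 FPS measurement on a single GPU."""
    model.eval()
    for _ in range(warmup):      # warm-up passes so lazy CUDA init is excluded
        model(images)
    torch.cuda.synchronize()     # flush queued kernels before starting the clock
    start = time.time()
    for _ in range(iters):
        model(images)
    torch.cuda.synchronize()     # wait for the last forward pass to finish
    return iters / (time.time() - start)

# Usage (placeholder input; a real run feeds resized COCO images):
# fps = measure_fps(detector.cuda(), torch.randn(1, 3, 800, 1216).cuda())
```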

@songhwanjun
Collaborator

songhwanjun commented Dec 6, 2021

Hi,
The inference time of Deformable DETR depends heavily on how the multi-scale features are defined, because of the high computational overhead of the deformable encoder. From the Swin Transformer, we can use five scale feature maps in total: one from the patch embedding and one from each of the four Swin block stages. In our implementation, we used the last four scale feature maps as the input to deformable attention, and the inference speed was 4.8 FPS. We suspect you used a different combination of multi-scale features. Could you try again with our setting? Similarly, for DeiT we used four scale feature maps obtained at layers 2, 4, 10, and 12.
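To make the setting concrete, here is a rough sketch (tensor shapes are illustrative for Swin-base on an 800x1216 input, not our exact code) of selecting the last four of the five available scale maps:

```python
import torch

# Illustrative shapes for a 3x800x1216 input to Swin-base (embed_dim=128);
# five scale maps are available: the patch embedding plus the four stages.
all_scales = [
    torch.randn(1, 128, 200, 304),   # patch embedding, stride 4
    torch.randn(1, 128, 200, 304),   # stage 1, stride 4
    torch.randn(1, 256, 100, 152),   # stage 2, stride 8
    torch.randn(1, 512, 50, 76),     # stage 3, stride 16
    torch.randn(1, 1024, 25, 38),    # stage 4, stride 32
]

# The 4.8 FPS setting feeds the last four maps to the deformable-attention
# neck; each extra high-resolution map adds many more encoder query
# locations, which is where the FPS difference comes from.
neck_inputs = all_scales[-4:]
print([tuple(f.shape) for f in neck_inputs])
```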

Unlike Deformable DETR, ViDT achieves a nearly constant inference time regardless of the combination of multi-scale feature maps, because it removes the computationally heavy neck encoder.

@ilovecv
Author

ilovecv commented Dec 6, 2021

Thanks for your clarification. I used the last three scale feature maps, because I found out_indices=[1, 2, 3] in your Swin code: https://github.com/naver-ai/vidt/blob/main/methods/swin_w_ram.py#L720. Also, in the Deformable DETR code here, https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/backbone.py#L76, only three scale feature maps are used.
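For clarity, here is my understanding of the out_indices convention (illustrative values only; indices refer to the four Swin stage outputs):

```python
# Assumed convention, following swin_w_ram.py: out_indices indexes the four
# Swin stage outputs (0 -> stride 4, 1 -> stride 8, 2 -> stride 16, 3 -> stride 32).
out_indices = (1, 2, 3)      # three scales, as in swin_w_ram.py#L720
out_indices = (0, 1, 2, 3)   # four scales, the setting described above
```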

Is the performance gap large when using three scale feature maps instead of four on the COCO dataset? Thanks.

@songhwanjun
Collaborator

songhwanjun commented Dec 6, 2021

Thanks for the follow-up questions. There was a meaningful performance gap: with four scale feature maps, the deformable encoder at the neck can mix more fine-grained (low-level) features with semantically coarse-grained (high-level) features. In other words, the number of scales is a design choice that trades off AP against FPS. We think it would be nice to add an analysis of different multi-scale feature-map setups in the next version.

We are sorry for the unclear state of the code. The current version is slightly different from the submission version because we are trying to achieve a better AP/FPS trade-off while the ICLR decision is still pending (e.g., out_indices=[1, 2, 3] is deprecated, and new components such as an FPN have been added). We are still updating the code and will release a faster and more accurate version soon.

@ilovecv
Author

ilovecv commented Dec 6, 2021

Got it, thank you very much for your detailed reply. Good luck with your paper!

@edgar2597

@SanghyukChun @ilovecv Hi, where did you find the script for training Deformable DETR with a Swin backbone?
