As mentioned in our paper, YOLOS is not designed to be a sophisticated, high-performance object detector. On the contrary, we purposefully make as few modifications as possible to a given pre-trained ViT / DeiT, precisely to unveil the versatility and transferability of the Transformer from image recognition to object detection. In that sense, our paper is more about the Transformer than about object detection. The change is minimal, roughly as sketched below.
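For concreteness, here is a minimal, self-contained sketch of the recipe the paper describes: append learnable [DET] tokens to the patch embeddings, run the plain encoder, and decode each [DET] token with small classification and box heads. The encoder, sizes, and variable names here are illustrative stand-ins, not the repository's actual modules.

```python
import torch
import torch.nn as nn

# Illustrative sizes; YOLOS-Base uses embed_dim=768 and 100 [DET] tokens.
embed_dim, num_det_tokens, num_classes, num_patches = 768, 100, 91, 196

# Stand-in encoder; in YOLOS this is the pre-trained ViT / DeiT body, reused unchanged.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
class_head = nn.Linear(embed_dim, num_classes + 1)               # +1 "no object" class
box_head = nn.Sequential(nn.Linear(embed_dim, 4), nn.Sigmoid())  # (cx, cy, w, h) in [0, 1]

patch_embeddings = torch.randn(2, num_patches, embed_dim)        # output of the patch projection
tokens = torch.cat([det_tokens.expand(2, -1, -1), patch_embeddings], dim=1)
encoded = encoder(tokens)                                        # plain seq2seq encoding, no 2D priors
det_out = encoded[:, :num_det_tokens]                            # each [DET] token predicts one object
logits, boxes = class_head(det_out), box_head(det_out)
print(logits.shape, boxes.shape)  # torch.Size([2, 100, 92]) torch.Size([2, 100, 4])
```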
We argue that 2D object detection is quite a hard task for a naïve Transformer, since ViT always performs seq2seq modeling: it has to perceive a higher-dimensional visual signal from a lower-dimensional, sequence perspective (see the patchify sketch below). Nevertheless, we observe that ViT can accomplish this task.
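To make the seq2seq point concrete, here is a minimal sketch (not code from this repository) of the patchify step that turns a 2D image into the 1D token sequence the Transformer actually consumes; the patch size of 16 matches the usual ViT / DeiT configuration.

```python
import torch

# A 2D image becomes a 1D token sequence: this flattened view is the only
# one the Transformer ever sees, which is why 2D detection is non-trivial for it.
image = torch.randn(1, 3, 224, 224)                               # (B, C, H, W)
patch = 16
# Cut into non-overlapping 16x16 patches, then flatten each patch into a token.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)
print(patches.shape)  # torch.Size([1, 196, 768]): a sequence of 196 tokens of dim 768
```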
Transformers can benefit from very large models and very large-scale pre-training. In our paper, we only use the mid-sized ImageNet-1k as the pre-training dataset, and the largest model we study has 128M parameters. Whether object detection results can benefit from the excellent scalability of Transformers is an interesting open question.
❔Question
Congratulations on publishing good work.
How does YOLOS perform compared with YOLOv5 and the other YOLO-series detectors, and where does it stand on the object detection leaderboard?