
[DiT] Question regarding image resolutions on different tasks #959

Open
mrvoh opened this issue Dec 23, 2022 · 3 comments

Comments


mrvoh commented Dec 23, 2022

First of all, thanks for your great work on document AI in general, and on DiT in particular.

In the paper, it says

Since the image resolution for object detection tasks is much larger than classification, we limit the batch size to 16.

which confuses me. If I understand correctly, when using a pretrained version of DiT, one is ALWAYS limited to the 224x224 image resolution, since this is constrained by the number of learned position embeddings (similar to how e.g. BERT-base simply can't go beyond 512 tokens due to its position embeddings). So regardless of the original size of the image, the input the model actually sees is always limited to this predefined 224x224.
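
To make my assumption explicit, this is the back-of-the-envelope calculation I have in mind (standard ViT/DiT-base numbers, i.e. 16x16 patches and a [CLS] token; not taken from the DiT code, so please correct me if it's off):

```python
# Illustrative only: a 224x224 input with 16x16 patches under a ViT-style backbone.
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196 patches
num_positions = num_patches + 1                 # + [CLS] token = 197
print(num_positions)  # under this assumption, 197 position slots are fixed at pre-training time
```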

IF this reasoning is correct, then I cannot comprehend the logic behind resizing random crops of an image as described in the paper:

Specifically, the input image is cropped with probability 0.5 to a random rectangular patch which is then resized again such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1,333.
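
For concreteness, this is roughly how I picture that augmentation (a rough sketch only; the crop sizes and the scale sampling are my guesses, not the actual Detectron2 config from the repo):

```python
import random
from PIL import Image

def resize_shortest_edge(img: Image.Image, short: int, max_long: int = 1333) -> Image.Image:
    w, h = img.size
    scale = short / min(w, h)
    if max(w, h) * scale > max_long:   # cap the longest side at 1,333 px
        scale = max_long / max(w, h)
    return img.resize((round(w * scale), round(h * scale)))

def detection_augment(img: Image.Image) -> Image.Image:
    if random.random() < 0.5:          # crop a random rectangular patch with probability 0.5
        w, h = img.size
        cw, ch = random.randint(w // 2, w), random.randint(h // 2, h)   # crop range is a guess
        x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
        img = img.crop((x0, y0, x0 + cw, y0 + ch))
    # resize so the shortest side lands in [480, 800] while the longest stays <= 1,333
    return resize_shortest_edge(img, short=random.choice(range(480, 801, 32)))
```

If the backbone only ever sees 224x224 anyway, I don't see what these multi-scale sizes buy us, hence my confusion.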

Any clarification for this would be very much appreciated, thanks in advance!


mrvoh commented Dec 23, 2022

@NielsRogge sorry to bother you out of the blue, but since you've done so much of the groundwork in making this research easily accessible, I'm hopeful your insights can point out the flaw in my reasoning.

senthil-r-10 commented

@mrvoh Do you know the reason now?

NielsRogge commented

Hi,

Note that one can interpolate the pre-trained position embeddings to use the model at higher resolutions. This is probably what the authors did to run the model at a resolution different from the one used during pre-training.
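
As a minimal sketch in plain PyTorch (assuming a ViT-style checkpoint with a [CLS] token, 16x16 patches and absolute position embeddings; the function name and reshaping below are mine, not code from the DiT repo), interpolating the 2D position-embedding grid looks roughly like this:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """pos_embed: (1, 1 + old_side**2, dim) pre-trained at 224x224; new_hw: target patch grid (h, w)."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    old_side = int(patch_pos.shape[1] ** 0.5)                 # e.g. 14 for 224 / 16
    # (1, N, dim) -> (1, dim, old_side, old_side) so the grid can be resampled like an image
    grid = patch_pos.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], dim)
    return torch.cat([cls_tok, grid], dim=1)

# e.g. going from 224x224 to roughly 800x1333 with 16x16 patches: grid 14x14 -> 50x83
new_pos = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 768), (50, 83))
print(new_pos.shape)  # torch.Size([1, 4151, 768])
```

After resampling the grid this way, the backbone can consume much larger inputs than 224x224, which is what makes the detection resolutions in the paper possible despite the 224x224 pre-training.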
