First of all, thanks for your great work on document AI in general, and on DiT specifically.
In the paper, it says
Since the image resolution for object detection tasks is much larger than classification, we limit the batch size to 16.
which confuses me. If I understand correctly, a pretrained DiT is ALWAYS limited to a 224x224 input resolution, since that is fixed by the number of pre-trained position embeddings (similar to how e.g. BERT-base simply can't go beyond 512 tokens due to its position embeddings). So regardless of the original size of the image, the input the model receives is always resized to this predefined 224x224.
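To make the constraint concrete, here is a minimal sketch (assuming a standard ViT-base-style setup with 16x16 patches, which is what DiT uses) of why a fixed position-embedding table ties the model to one resolution:

```python
# Hypothetical numbers for a ViT/DiT-base pre-trained at 224x224 with 16x16 patches.
image_size = 224
patch_size = 16

# The image is split into (224 // 16) ** 2 = 14 * 14 = 196 patches.
num_patches = (image_size // patch_size) ** 2

# The learned position-embedding table has exactly one row per patch
# (plus one for the [CLS] token), so it only covers 197 positions --
# analogous to BERT-base's 512 learned position embeddings.
num_pos_embeddings = num_patches + 1
print(num_patches, num_pos_embeddings)  # 196 197
```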
IF this reasoning is correct, then I don't understand the logic behind resizing random crops of an image as described in the paper:
Specifically, the input image is cropped with probability 0.5 to a random rectangular patch which is then resized again such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1,333.
Any clarification for this would be very much appreciated, thanks in advance!
@NielsRogge sorry to bother you out of the blue, but since you've done so much of the groundwork to make this research easily available, I'm hopeful your insights can show me the flaw in my reasoning.
Note that one can interpolate the pre-trained position embeddings to use the model at higher resolutions. That is most likely what the authors did in order to fine-tune for object detection at a higher resolution than the one used during pre-training.
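For illustration, here is a minimal sketch of that interpolation trick (not the authors' exact code; the weights, grid sizes, and hidden dimension below are placeholder assumptions for a 224x224 / patch-16 base model scaled up to a detection-sized input):

```python
import torch
import torch.nn.functional as F

old_grid = 14   # 224 // 16 patches per side at pre-training resolution
new_grid = 50   # e.g. 800 // 16 patches per side at detection resolution
dim = 768       # hidden size of a base-sized ViT/DiT

# Placeholder for the pre-trained table: 1 [CLS] position + 14*14 patch positions.
pos_embed = torch.randn(1, old_grid * old_grid + 1, dim)
cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]

# Reshape the patch positions to a 2D grid, interpolate bicubically
# to the new grid size, then flatten back to a sequence.
patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                          mode="bicubic", align_corners=False)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# The resized table now covers 50*50 + 1 = 2501 positions.
new_pos_embed = torch.cat([cls_pos, patch_pos], dim=1)
print(new_pos_embed.shape)  # torch.Size([1, 2501, 768])
```

So the 224x224 limit is only a property of the stored embedding table, not of the transformer itself: after resizing the table, the model can attend over as many patches as the new grid provides.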