Input image size #636
Replies: 2 comments
-
Try the ViT-L-14-336 model, which resizes inputs to 336x336, and see if that works for your use case.
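To make the resolution trade-off concrete, here is a quick back-of-the-envelope on how the patch-token count grows with input size (a sketch assuming the standard 14x14 patch size of ViT-L/14; the function name is illustrative, not from the library):

```python
def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces (excluding the CLS token)."""
    grid = image_size // patch_size  # patches per side
    return grid * grid

for size in (224, 336, 1024):
    print(size, vit_token_count(size))
# 224 -> 256 tokens, 336 -> 576 tokens, 1024 -> 5329 tokens
```

Attention cost scales roughly quadratically in the token count, which is one reason the pretrained models stay at these modest resolutions.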
-
@MarkIsDoingIt Such is life in the world of NNs; we're still stuck with relatively small images at current compute levels. The original models were trained at the stated image sizes, so the extra 'detail' that would be lost from yours wouldn't be of much use to them anyway, since they never saw such detail in training... you'd need to train from scratch at 1024x1024, and that is incredibly expensive. The 336x336 L/14 is a good suggestion, and I trained some ConvNeXt large and base models at 320x320. The ConvNeXt models that were trained with augreg handle image sizes larger than their training resolution a bit better than most, and the ConvNeXt models will accept any image size since they are fully convolutional. You'll still see a significant drop in accuracy by 1024x1024, though, since the feature scales will be very different. Moving to discussions for future reference.
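The "fully convolutional" point can be illustrated with a toy trunk (a minimal sketch, not the actual ConvNeXt architecture): convolutions plus a global pool produce a fixed-size feature vector regardless of input resolution, whereas a plain ViT is tied to its positional-embedding grid.

```python
import torch
import torch.nn as nn

# Toy fully-convolutional trunk: a strided conv "stem" followed by a
# global average pool, so any input H x W yields the same feature size.
trunk = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=4, stride=4),  # patchify-style stem
    nn.GELU(),
    nn.AdaptiveAvgPool2d(1),                   # global pool over the spatial grid
    nn.Flatten(),
)

with torch.no_grad():
    for size in (224, 320, 1024):
        x = torch.randn(1, 3, size, size)
        print(size, tuple(trunk(x).shape))  # always (1, 8)
```

The model runs at any resolution, but as noted above, accuracy still degrades far from the training size because the learned filters see objects at very different feature scales.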
-
I want to use a pretrained vision transformer from CLIP to extract features from images. My original image size is 1024x1024. What is the largest input image size for any pretrained CLIP version? Resizing the image to the classical 224x224 will lose information. Thanks!