Possible bug in computing positional embeddings for patches #226

Open
ByrdOfAFeather opened this issue Dec 29, 2022 · 1 comment

ByrdOfAFeather commented Dec 29, 2022

Hi all - I'm currently looking into fine-tuning this model and have run into an issue with images of varying sizes.
For this example:

max_height = 192, max_width = 672, patch_size=16

The line causing the error is here:

        x += self.pos_embed[:, pos_emb_ind]

(pix2tex.models.hybrid line 25 in CustomVisionTransformer forward_features)

If I have an image of size 522 x 41, this line throws an error.
x consists of 99 patches (plus the cls token), making its size [100, 256].

However, the positional embedding index tensor has only 66 entries. I am currently investigating this issue but don't quite understand the formula used to compute how many positional embedding indices are needed. Right now it computes 66 indices when we should be getting 100. I think the issue arises when convolutions from the ResNet embedder overlap and the formula doesn't account for this (the formula only works when the image dimensions are divisible by patch_size).

If anyone has any thoughts on how to fix this, let me know! I'm definitely no computer vision expert, but I believe a simple change to account for overlapping convolutions in the embedding may be enough to fix this.
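
To make the mismatch concrete, here is a rough sketch of the patch arithmetic for the 522 x 41 example. It is not the formula from pix2tex; `patch_grid` is a hypothetical helper that only compares rounding the patch grid up (which reproduces the 99 patches reported above) with rounding it down (which gives fewer positions than the embedder produces).

```python
import math

def patch_grid(width, height, patch_size, rounding):
    """Return (columns, rows) of the patch grid under a given rounding rule.

    Hypothetical helper for illustration only; not part of pix2tex.
    """
    return rounding(width / patch_size), rounding(height / patch_size)

# Example from this issue: a 522 x 41 image with patch_size = 16.
cols_up, rows_up = patch_grid(522, 41, 16, math.ceil)
cols_down, rows_down = patch_grid(522, 41, 16, math.floor)

ceil_patches = cols_up * rows_up        # grid rounded up
floor_patches = cols_down * rows_down   # grid rounded down

print("rounded up:  ", cols_up, "x", rows_up, "=", ceil_patches)
print("rounded down:", cols_down, "x", rows_down, "=", floor_patches)
```

Any index formula based on integer division will come up short of the 99 patches whenever a dimension is not a multiple of patch_size, which is consistent with the shape error described here.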

@lukas-blecher
Owner

Hello,

For an explanation of the positional embeddings see the discussion here: #130

Your error shows up when the image dimensions are not divisible by the patch size.
It is more efficient to pad the images beforehand, but you can also set pad to true in the settings file.
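
As a minimal sketch of the "pad beforehand" option, the snippet below computes the size an image would need to be padded to so that both dimensions are multiples of the patch size. `padded_size` is a hypothetical helper, not a pix2tex function, and the actual padding of pixel data (for example with an image library) is left out.

```python
def padded_size(width, height, patch_size):
    """Round width and height up to the next multiple of patch_size.

    Hypothetical helper for illustration only; not part of pix2tex.
    """
    pad_w = -width % patch_size   # pixels to add on the width
    pad_h = -height % patch_size  # pixels to add on the height
    return width + pad_w, height + pad_h

# Example from this issue: 522 x 41 with patch_size = 16.
new_w, new_h = padded_size(522, 41, 16)
n_patches = (new_w // 16) * (new_h // 16)

print(new_w, new_h, n_patches)
```

After padding, integer division and rounding up give the same patch grid, so the number of positional indices matches the number of patches. The padded size also stays within the max_width = 672 and max_height = 192 limits quoted in the issue.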
