Hi all - I'm currently looking into fine-tuning this model and have run into an issue with images of varying sizes.
For this example:
max_height = 192, max_width = 672, patch_size=16
The line causing the error is:
x+=self.pos_embed[:, pos_emb_ind]
(pix2tex.models.hybrid line 25 in CustomVisionTransformer forward_features)
If I have an image of size 522 x 41, this line throws an error.
x consists of 99 patches (plus the cls token), making it size [100, 256].
However, the positional embedding indices are only 66 in length. I am still investigating, but I don't quite understand the formula used to compute how many positional embedding indices we need: right now it computes 66 indices when we should be getting 100. I think the issue arises when convolutions from the ResNet embedder overlap and the formula doesn't account for this (it seems to require each image dimension to be divisible by patch_size for the formula to work).
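One way the 99-vs-66 mismatch could arise is a ceil/floor disagreement between the patch grid the backbone actually produces and the grid the index formula assumes. This is a hypothetical reconstruction, not the actual pix2tex code: the function names and the choice of which dimension gets floored are my guesses, but the arithmetic reproduces the counts from this example (patch_size=16, image 522 x 41).

```python
import math

def num_patches(h, w, patch_size=16):
    # Patch grid actually produced for a non-divisible image: each
    # dimension is effectively rounded UP (hypothetical assumption).
    rows = math.ceil(h / patch_size)
    cols = math.ceil(w / patch_size)
    return rows * cols

def num_pos_indices(h, w, patch_size=16):
    # If the positional-embedding index formula instead rounds the
    # height DOWN, it undercounts whenever h is not a multiple of
    # patch_size (again, a guess at where the discrepancy comes from).
    rows = h // patch_size
    cols = math.ceil(w / patch_size)
    return rows * cols

# For a 522 x 41 image: 3 * 33 = 99 patches (+1 cls token = 100 tokens),
# but only 2 * 33 = 66 positional indices -> shape mismatch on x += pos_embed.
print(num_patches(41, 522), num_pos_indices(41, 522))
```

If this reading is right, the fix would be to make both computations round the same way (or to pad the image so the question never arises).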
If anyone has any thoughts on how to fix this, let me know! I'm definitely no computer vision expert, but I believe a simple change to account for overlapping convolutions in the embedding may be enough to fix this.
For an explanation of the positional embeddings see the discussion here: #130
The error shows up when the image dimensions are not divisible by the patch size.
It is more efficient to pad the images beforehand, but you can also set pad to true in the settings file.
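Padding beforehand just means rounding each dimension up to the next multiple of the patch size. A minimal sketch of the target size computation (the helper name is mine, not part of pix2tex):

```python
def padded_size(h, w, patch_size=16):
    # Round each dimension up to the next multiple of patch_size,
    # so the image splits into whole patches with no remainder.
    pad_h = (-h) % patch_size
    pad_w = (-w) % patch_size
    return h + pad_h, w + pad_w

# The 522 x 41 image from above would be padded to 528 x 48,
# i.e. an exact 33 x 3 patch grid.
print(padded_size(41, 522))
```

The actual padding (with white pixels, on the right and bottom edges, say) can then be done with any image library before the image reaches the model.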