
[Question] About the bounding box of the VG data #606

Closed

Deaddawn opened this issue Oct 18, 2023 · 9 comments

Comments


Deaddawn commented Oct 18, 2023

Question

Hi there. I'm wondering whether the bounding box [x, y, w, h] values in the VG data have been modified because of the resizing applied during training. Can you please elaborate on this detail? Thanks a lot!

@Deaddawn (Author)

It seems its range is [0, 1].

@Deaddawn (Author)

Hi, I've noticed that too. It seems to be the ratio of the coordinates divided by the picture side.

In VG's region annotations, the region is provided as x, y, width, height, and the picture has a width and a height (say im_width and im_height, which you can obtain from PIL.Image.open(...).size).

So in the LLaVA instructions it is:

  • x / im_width
  • y / im_height
  • (x + width) / im_width
  • (y + height) / im_height

I did some calculations and verified this against LLaVA's data.

OK, thanks man, I'll go try visualizing it.
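
(A minimal sketch of the quoted simple-ratio normalization, with placeholder region values; as later comments in this thread show, the image actually has to be padded to a square first, so this version will be slightly off.)

from PIL import Image

# Placeholder VG region [x, y, w, h] in pixels; the real values come from
# VG's region_descriptions annotations.
x, y, w, h = 100, 50, 80, 60
im_width, im_height = Image.open("vg/VG_100K/2334275.jpg").size

# Simple-ratio hypothesis: divide each coordinate by the corresponding image side.
box = [x / im_width,
       y / im_height,
       (x + w) / im_width,
       (y + h) / im_height]
print([round(v, 2) for v in box])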

@Deaddawn (Author)


It seems wrong, can you provide your example?


Deaddawn commented Oct 20, 2023

Is it possible that the normalization is based on an image shape of 336x336, since CLI inference returns a 336x336 image tensor? Does that sound more reasonable? I have tried several samples and it seems to work.


Maxlinn commented Oct 22, 2023

I did some serious digging and found out it is indeed based on an image shape of 336x336 (or any square), but it is NOT as simple as resizing. The shorter edge is padded out to match the longer one, with processor.image_mean filled into the padding area. You can refer to the code here in train.

The four floats are actually w_a, h_a, w_b, h_b, where a is the top-left corner, b is the bottom-right corner, w is the coordinate along the width axis, and h is the coordinate along the height axis. Together they describe a rectangle on the padded image.

You can use this case to test:

{'id': 'VG_100K/2334275',
 'image': 'vg/VG_100K/2334275.jpg',
 'conversations': [
...
  {'from': 'human',
   'value': 'Please provide the bounding box coordinate of the region this sentence describes: 01 on the train.'},
  {'from': 'gpt', 'value': '[0.65, 0.29, 0.78, 0.43]'}
...]}

With simple resizing, the box would only partially cover the "01" in this region; after padding, it is correct.

[Two attached screenshots visualizing the bounding box on the image.]
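
(A minimal sketch of the padded-square normalization described above. It assumes centered padding in the style of LLaVA's pad-to-square preprocessing; the region values passed in are placeholders, since the real ones come from VG's region_descriptions annotations.)

from PIL import Image

def vg_box_to_llava(box_xywh, image_path):
    """Convert a VG region [x, y, w, h] (pixels) to the normalized
    [x1, y1, x2, y2] used in the LLaVA instructions, assuming the image
    is first padded to a square with the original image centered."""
    x, y, w, h = box_xywh
    im_w, im_h = Image.open(image_path).size
    side = max(im_w, im_h)
    # Centered padding: the same offset is added to both corners.
    off_x = (side - im_w) // 2
    off_y = (side - im_h) // 2
    return [round((x + off_x) / side, 2),
            round((y + off_y) / side, 2),
            round((x + w + off_x) / side, 2),
            round((y + h + off_y) / side, 2)]

# Example usage (placeholder region values):
print(vg_box_to_llava([100, 60, 200, 150], "vg/VG_100K/2334275.jpg"))

With the actual VG region for "01 on the train" in image 2334275, the result should land close to the [0.65, 0.29, 0.78, 0.43] shown in the instruction above.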

@haotian-liu (Owner)

Thank you @Maxlinn, your understanding is correct. We will clarify this detail in the revised paper.

@gapjialin


Hello, I also encountered the same problem. Does this mean that the coordinate descriptions in the fine-tuning dataset are based on a resolution of 336x336?


Maxlinn commented Mar 25, 2024

The answer is yes if you use a 336px CLIP; if you use a 224px CLIP, it is 224px.

The coordinates in the instructions are fractions (between 0 and 1) of the padded square image.
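
(For visualizing such a box on the original image, a minimal sketch of the inverse mapping under the same centered-padding assumption, using the coordinates and path from the test case above.)

from PIL import Image, ImageDraw

def llava_box_to_pixels(box_norm, image_path):
    """Map normalized [x1, y1, x2, y2] on the padded square back to pixel
    coordinates on the original (unpadded) image, assuming centered padding."""
    im_w, im_h = Image.open(image_path).size
    side = max(im_w, im_h)
    off_x = (side - im_w) // 2
    off_y = (side - im_h) // 2
    x1, y1, x2, y2 = [v * side for v in box_norm]
    return [x1 - off_x, y1 - off_y, x2 - off_x, y2 - off_y]

# Draw the "01 on the train" box from the test case above.
path = "vg/VG_100K/2334275.jpg"
img = Image.open(path).convert("RGB")
ImageDraw.Draw(img).rectangle(llava_box_to_pixels([0.65, 0.29, 0.78, 0.43], path),
                              outline="red", width=3)
img.save("vg_2334275_box.jpg")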

@gapjialin

Thank you!!
