
[Question] About the bounding box of the VG data #606

Closed

Deaddawn opened this issue Oct 18, 2023 · 9 comments

Comments


Deaddawn commented Oct 18, 2023

Question

Hi there. I'm wondering whether the bounding box [x, y, w, h] values in the VG data have been modified because of the resizing applied during training. Can you please elaborate on this detail? Thanks a lot!

@Deaddawn (Author)

It seems its range is [0, 1].

@Deaddawn (Author)

Hi, I've noticed that too. It seems to be the ratio of the coordinates divided by the picture side.

In VG's region annotations, the region is provided as x, y, width, height, and the picture has a width and a height (say im_width and im_height, which you can obtain from PIL.Image.open(...).size).

So in the LLaVA instructions it is:

  • x / im_width
  • y / im_height
  • (x + width) / im_width
  • (y + height) / im_height

I did some calculations and verified this against LLaVA's data.

OK, thanks man, I'll go try visualizing it.
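
(A minimal sketch of the quoted simple-ratio normalization, with placeholder region values; as later comments in this thread show, the image actually has to be padded to a square first, so this version will be slightly off.)

from PIL import Image

# Placeholder VG region [x, y, w, h] in pixels; the real values come from
# VG's region_descriptions annotations.
x, y, w, h = 100, 50, 80, 60
im_width, im_height = Image.open("vg/VG_100K/2334275.jpg").size

# Simple-ratio hypothesis: divide each coordinate by the corresponding image side.
box = [x / im_width,
       y / im_height,
       (x + w) / im_width,
       (y + h) / im_height]
print([round(v, 2) for v in box])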

@Deaddawn (Author)


It seems wrong, can you provide your example?


Deaddawn commented Oct 20, 2023

Is it possible that the normalization is based on an image shape of 336x336, since CLI inference returns a 336x336 image tensor? Does that sound more reasonable? I have tried several samples and it seems to work.


Maxlinn commented Oct 22, 2023

I did some serious digging and found out it is indeed based on an image shape of 336x336 (or any square), but it is NOT as simple as resizing. The shorter edge is padded out to match the longer one, with processor.image_mean filled into the padding area. You can refer to the code here in train.

The four floats are actually w_a, h_a, w_b, h_b, where a is the top-left corner, b is the bottom-right corner, w is the coordinate along the width axis, and h is the coordinate along the height axis. Together they describe a rectangle on the padded image.

You can use this case to test:

{'id': 'VG_100K/2334275',
 'image': 'vg/VG_100K/2334275.jpg',
 'conversations': [
...
  {'from': 'human',
   'value': 'Please provide the bounding box coordinate of the region this sentence describes: 01 on the train.'},
  {'from': 'gpt', 'value': '[0.65, 0.29, 0.78, 0.43]'}
...]}

With simple resizing, the box would only partially cover the "01" in this region; after padding, it is correct.

[Two attached screenshots visualizing the bounding box on the image.]
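
(A minimal sketch of the padded-square normalization described above. It assumes centered padding in the style of LLaVA's pad-to-square preprocessing; the region values passed in are placeholders, since the real ones come from VG's region_descriptions annotations.)

from PIL import Image

def vg_box_to_llava(box_xywh, image_path):
    """Convert a VG region [x, y, w, h] (pixels) to the normalized
    [x1, y1, x2, y2] used in the LLaVA instructions, assuming the image
    is first padded to a square with the original image centered."""
    x, y, w, h = box_xywh
    im_w, im_h = Image.open(image_path).size
    side = max(im_w, im_h)
    # Centered padding: the same offset is added to both corners.
    off_x = (side - im_w) // 2
    off_y = (side - im_h) // 2
    return [round((x + off_x) / side, 2),
            round((y + off_y) / side, 2),
            round((x + w + off_x) / side, 2),
            round((y + h + off_y) / side, 2)]

# Example usage (placeholder region values):
print(vg_box_to_llava([100, 60, 200, 150], "vg/VG_100K/2334275.jpg"))

With the actual VG region for "01 on the train" in image 2334275, the result should land close to the [0.65, 0.29, 0.78, 0.43] shown in the instruction above.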

@haotian-liu (Owner)

Thank you @Maxlinn, your understanding is correct. We will clarify this detail in the revised paper.

@gapjialin


Hello, I also encountered the same problem. Does this mean that the coordinate descriptions in the fine-tuning dataset are based on a resolution of 336x336?


Maxlinn commented Mar 25, 2024

The answer is yes if you use a 336px CLIP; if you use a 224px CLIP, it is 224px.

The coordinates in the instructions are fractions (between 0 and 1) of the padded square image.
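
(For visualizing such a box on the original image, a minimal sketch of the inverse mapping under the same centered-padding assumption, using the coordinates and path from the test case above.)

from PIL import Image, ImageDraw

def llava_box_to_pixels(box_norm, image_path):
    """Map normalized [x1, y1, x2, y2] on the padded square back to pixel
    coordinates on the original (unpadded) image, assuming centered padding."""
    im_w, im_h = Image.open(image_path).size
    side = max(im_w, im_h)
    off_x = (side - im_w) // 2
    off_y = (side - im_h) // 2
    x1, y1, x2, y2 = [v * side for v in box_norm]
    return [x1 - off_x, y1 - off_y, x2 - off_x, y2 - off_y]

# Draw the "01 on the train" box from the test case above.
path = "vg/VG_100K/2334275.jpg"
img = Image.open(path).convert("RGB")
ImageDraw.Draw(img).rectangle(llava_box_to_pixels([0.65, 0.29, 0.78, 0.43], path),
                              outline="red", width=3)
img.save("vg_2334275_box.jpg")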

@gapjialin

Thank you!!
