
Objects Coordinate input #114

Closed
chenxwh opened this issue May 26, 2022 · 7 comments
chenxwh commented May 26, 2022

Hi,

Congratulations on the ICML acceptance!

I would like to feed the model several sets of coordinates along with the input image and ask a question about the objects specified by those coordinates, for example: what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? Is it possible for OFA to attend to the objects given the coordinate information? If so, what would be the best input format for this?

Having read the paper, I think the grounded captioning pre-training task might be the most relevant, but I don't see such examples in pretrain_data_examples, so it is still not clear what the best practice is for feeding the model multiple coordinates in one example. I also failed to replicate the grounded question answering results shown in Figure 10 of the Appendix. Which model was used for these? And is the input exactly in the format shown under the images, e.g. what color is the car in the region? region: <loc301> <loc495> <loc501> <loc596>? I assume 301, 495, 501, 596 are the x1, y1, x2, y2 coordinates? I tried asking questions about regions this way on customised images, but the model does not seem to focus on the region provided.

Thanks!

@logicwong
Member

@chenxwh Hi,

  1. The format what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? is OK, but I think you would need to collect some training data to fine-tune OFA;
  2. Grounded captioning (GC) is the inverse task of visual grounding (VG), so the VG samples in pretrain_data_examples will be used for both the GC and VG tasks;
  3. <loc301> <loc495> <loc501> <loc596> are not raw coordinates but quantized coordinates. You can reproduce the results in Figure 10 through this colab: https://colab.research.google.com/drive/1jogyZ-2rdHU3XxZOf3TBfhex1XHqX-1m?usp=sharing#scrollTo=A4zeA_MmgNqa
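
For illustration, here is a rough sketch of the quantization mentioned in point 3 (this is not the repo's exact code; the bin count and the <locN> token spelling are assumptions based on the figure):

```python
# Illustrative sketch only: map a raw bounding box to quantized location tokens.
def box_to_location_tokens(box, img_w, img_h, num_bins=1000):
    """box = (x1, y1, x2, y2) in original-image pixels."""
    x1, y1, x2, y2 = box
    # Normalize each corner coordinate to [0, 1] ...
    normalized = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    # ... then uniformly discretize into num_bins integer location tokens.
    return " ".join("<loc{}>".format(round(c * (num_bins - 1))) for c in normalized)

print(box_to_location_tokens((120, 198, 200, 238), img_w=400, img_h=400))
# prints something like: <loc300> <loc495> <loc500> <loc594>
```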

@chenxwh
Author

chenxwh commented May 28, 2022

Hi @logicwong,

Thanks for the reply.

I see that in the colab the checkpoint is ofa_large_384.pt; is this the same as the large model available for download in the Model Card?

Regarding the bounding box, though, the paper says:
Apart from representing images, it is also essential to represent objects within images as there are a series of region related tasks. Following [61], we represent objects as a sequence of discrete tokens. To be more specific, for each object, we extract its label and its bounding box. The continuous corner coordinates of the bounding box are uniformly discretized to integers as location tokens <x1; y1; x2; y2>. To improve simplicity, we use a unified vocabulary for all the linguistic and visual tokens, including subwords, image codes, and location tokens.

It sounds like, for each image, a list of objects with their corresponding bounding boxes is taken as input? I wonder how the bounding boxes are used as input; could you point me to the relevant code, since I see most tasks only take patched images as input?

Thank you in advance!

@logicwong
Member

logicwong commented May 29, 2022

@chenxwh ofa_large_384.pt is the checkpoint used in our paper. For the large model in the Model Card, we continued pretraining ofa_large_384.pt at a resolution of 480 x 480 for better results. For how bounding boxes are used in our model, you can refer to data/pretrain_data/unify_dataset.py. The corresponding code snippets are as follows:

[screenshots: two code snippets from data/pretrain_data/unify_dataset.py]
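
For readers who cannot see the screenshots, the rough idea in unify_dataset.py is that the quantized box is spliced into the source text for grounded captioning, while visual grounding reverses the direction. The sketch below paraphrases this; the prompt wording and helper names are approximations, not the repo's exact code:

```python
# Paraphrased sketch of how GC/VG samples pair a quantized box with text;
# prompt wording and helper names are approximations, not the repo's exact code.

def build_gc_sample(region_tokens, caption):
    # Grounded captioning: the box goes into the encoder's text input,
    # the caption is the decoder target.
    src = " what does the region describe? region: {}".format(region_tokens)
    tgt = " {}".format(caption)
    return src, tgt

def build_vg_sample(region_tokens, caption):
    # Visual grounding: the caption goes into the encoder's text input,
    # the quantized box is the decoder target.
    src = ' which region does the text " {} " describe?'.format(caption)
    tgt = " {}".format(region_tokens)
    return src, tgt

src, tgt = build_gc_sample("<loc300> <loc495> <loc500> <loc594>", "a man in a red shirt")
```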

@chenxwh
Author

chenxwh commented May 29, 2022

@logicwong Hi,

  1. Thanks for the snippets, I have seen these too. Has the model been pre-trained with a LIST of detected objects and their bounding boxes as INPUT, together with the patched images? That is what the paragraph in the paper sounds like. (Not the refcoco task.)

  2. Basically, I am using OFA on images and asking questions about SEVERAL regions of the image in one question. I do have the bounding box for each object, but I find OFA struggles to associate the boxes with the regions, so all these questions are really about whether I could use OFA better for that task.

@logicwong
Member

@chenxwh

  1. In grounded captioning (GC), we use the bounding box as the encoder's input together with the patched images, but we haven't used both detected objects and their bounding boxes as the encoder's input (except for the detection task, where we use detected objects and bounding boxes as the decoder's input).

  2. Of course, you can construct a question with several regions, like what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? (just remember to encode the object coordinates in parentheses as quantized coordinates). In addition, I think you can also make a copy of the encoder's input as the decoder's input (as we did in the VQA task, the code snippet is shown below), which would help you get better results.

[screenshot: code snippet from the VQA task]
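
To make the suggestion above concrete, here is a small sketch (identifiers and token values are illustrative, not the OFA API): encode each object's box as quantized location tokens, splice them into the question, and reuse the question text as the decoder's prompt.

```python
# Illustrative sketch only; token values and variable names are made up.
person1_box = "<loc120> <loc300> <loc350> <loc900>"   # quantized box for person1
person2_box = "<loc500> <loc280> <loc760> <loc910>"   # quantized box for person2

question = "what are person1 {} and person2 {} doing?".format(person1_box, person2_box)

# Encoder input: the question text (the patched image is fed separately).
src_text = " " + question
# Decoder prompt: a copy of the question, as in the VQA task, so the model
# only has to generate the answer that follows it.
decoder_prompt = src_text
```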

@chenxwh
Author

chenxwh commented May 29, 2022

Hi @logicwong,

Thank you for the clarification, although I think in VQA the prompt_type is prev_output, i.e. without the eos?

Another question about the code, not really related to coordinate input: I see there is patch_images_2 in unify_transformer.py, but I don't think it is mentioned or explained anywhere. What does this input take? Does it mean the model supports multiple input images?

Thank you!

@logicwong
Member

@chenxwh Oh... You are right, in VQA the prompt_type is prev_output (excluding eos). patch_images_2 means it supports processing two input images, but we haven't tested this feature.
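
A toy illustration of that eos detail (made-up token ids, not OFA code): with prompt_type prev_output, the decoder prompt drops the trailing eos so generation continues directly with the answer.

```python
# Toy example with made-up token ids; not OFA code.
EOS = 2
prompt_ids = [101, 7, 42, 300, 495, 9]   # pretend ids for the encoded question

encoder_input = prompt_ids + [EOS]       # source sequence keeps its eos
decoder_prompt = prompt_ids              # prompt_type='prev_output': eos excluded
```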
