
Objects Coordinate input #114

Closed
chenxwh opened this issue May 26, 2022 · 7 comments
chenxwh commented May 26, 2022

Hi,

Congratulations on the ICML acceptance!

I would like to feed the model several sets of coordinates along with the input image and ask a question about the objects specified by those coordinates, for example: what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? Is it possible for OFA to attend to the objects given the coordinate information? If so, what would be the best input format for this?

Having read the paper, I think the grounded captioning pre-training task might be the most relevant, but I don't see such examples in pretrain_data_examples, so it is still not clear what the best practice is for feeding the model multiple coordinates in one example. I also failed to replicate the grounded question answering results shown in Figure 10 of the Appendix. Which model was used for these? And is the input exactly in the format shown under the images, e.g. what color is the car in the region? region: <loc301> <loc495> <loc501> <loc596>? I assume 301, 495, 501, 596 are the x1, y1, x2, y2 coordinates? I tried asking questions about regions this way on customised images, but the model does not seem to focus on the region provided.

Thanks!

@logicwong
Member

@chenxwh Hi,

  1. The format what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? is OK, but I think you would need to collect some training data to fine-tune OFA;
  2. Grounded captioning (GC) is the inverse task of visual grounding (VG), so the VG samples in pretrain_data_examples will be used for both the GC and VG tasks;
  3. <loc301> <loc495> <loc501> <loc596> are not raw coordinates but quantized coordinates. You can reproduce the results in Figure 10 through this colab: https://colab.research.google.com/drive/1jogyZ-2rdHU3XxZOf3TBfhex1XHqX-1m?usp=sharing#scrollTo=A4zeA_MmgNqa
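
For illustration, here is a rough sketch of the quantization mentioned in point 3 (this is not the repo's exact code; the bin count and the <locN> token spelling are assumptions based on the figure):

```python
# Illustrative sketch only: map a raw bounding box to quantized location tokens.
def box_to_location_tokens(box, img_w, img_h, num_bins=1000):
    """box = (x1, y1, x2, y2) in original-image pixels."""
    x1, y1, x2, y2 = box
    # Normalize each corner coordinate to [0, 1] ...
    normalized = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    # ... then uniformly discretize into num_bins integer location tokens.
    return " ".join("<loc{}>".format(round(c * (num_bins - 1))) for c in normalized)

print(box_to_location_tokens((120, 198, 200, 238), img_w=400, img_h=400))
# prints something like: <loc300> <loc495> <loc500> <loc594>
```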

@chenxwh
Author

chenxwh commented May 28, 2022

Hi @logicwong,

Thanks for the reply.

I see that in the colab the checkpoint is ofa_large_384.pt; is this the same as the large model available for download in the Model Card?

Regarding the bounding box, though, the paper says:
Apart from representing images, it is also essential to represent objects within images as there are a series of region related tasks. Following [61], we represent objects as a sequence of discrete tokens. To be more specific, for each object, we extract its label and its bounding box. The continuous corner coordinates of the bounding box are uniformly discretized to integers as location tokens <x1; y1; x2; y2>. To improve simplicity, we use a unified vocabulary for all the linguistic and visual tokens, including subwords, image codes, and location tokens.

It sounds like, for each image, a list of objects with their corresponding bounding boxes is taken as input? I wonder how the bounding boxes are used as input; could you point me to the relevant code, since I see most tasks only take patched images as input?

Thank you in advance!

@logicwong
Member

logicwong commented May 29, 2022

@chenxwh ofa_large_384.pt is the checkpoint used in our paper. For the large model in the Model Card, we continued pretraining ofa_large_384.pt at a resolution of 480 x 480 for better results. For how bounding boxes are used in our model, you can refer to data/pretrain_data/unify_dataset.py. The corresponding code snippets are as follows:

[screenshots: two code snippets from data/pretrain_data/unify_dataset.py]
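
For readers who cannot see the screenshots, the rough idea in unify_dataset.py is that the quantized box is spliced into the source text for grounded captioning, while visual grounding reverses the direction. The sketch below paraphrases this; the prompt wording and helper names are approximations, not the repo's exact code:

```python
# Paraphrased sketch of how GC/VG samples pair a quantized box with text;
# prompt wording and helper names are approximations, not the repo's exact code.

def build_gc_sample(region_tokens, caption):
    # Grounded captioning: the box goes into the encoder's text input,
    # the caption is the decoder target.
    src = " what does the region describe? region: {}".format(region_tokens)
    tgt = " {}".format(caption)
    return src, tgt

def build_vg_sample(region_tokens, caption):
    # Visual grounding: the caption goes into the encoder's text input,
    # the quantized box is the decoder target.
    src = ' which region does the text " {} " describe?'.format(caption)
    tgt = " {}".format(region_tokens)
    return src, tgt

src, tgt = build_gc_sample("<loc300> <loc495> <loc500> <loc594>", "a man in a red shirt")
```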

@chenxwh
Author

chenxwh commented May 29, 2022

@logicwong Hi,

  1. Thanks for the snippets, I have seen these too. Has the model been pre-trained with a LIST of detected objects and their bounding boxes as INPUT, together with the patched images? That is what the paragraph in the paper sounds like. (Not the refcoco task.)

  2. Basically, I am using OFA on images and asking questions about SEVERAL regions of the image in one question. I do have the bounding box for each object, but I find OFA struggles to associate the boxes with the regions, so all these questions are really about whether I could use OFA better for that task.

@logicwong
Member

@chenxwh

  1. In grounded captioning (GC), we use the bounding box as the encoder's input together with the patched images, but we haven't used both detected objects and their bounding boxes as the encoder's input (except for the detection task, where we use detected objects and bounding boxes as the decoder's input).

  2. Of course, you can construct a question with several regions, like what are person1 (corresponding to coord1) and person2 (corresponding to coord2) doing? (just remember to encode the object coordinates in parentheses as quantized coordinates). In addition, I think you can also make a copy of the encoder's input as the decoder's input (as we did in the VQA task, the code snippet is shown below), which would help you get better results.

[screenshot: code snippet from the VQA task]
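
To make the suggestion above concrete, here is a small sketch (identifiers and token values are illustrative, not the OFA API): encode each object's box as quantized location tokens, splice them into the question, and reuse the question text as the decoder's prompt.

```python
# Illustrative sketch only; token values and variable names are made up.
person1_box = "<loc120> <loc300> <loc350> <loc900>"   # quantized box for person1
person2_box = "<loc500> <loc280> <loc760> <loc910>"   # quantized box for person2

question = "what are person1 {} and person2 {} doing?".format(person1_box, person2_box)

# Encoder input: the question text (the patched image is fed separately).
src_text = " " + question
# Decoder prompt: a copy of the question, as in the VQA task, so the model
# only has to generate the answer that follows it.
decoder_prompt = src_text
```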

@chenxwh
Author

chenxwh commented May 29, 2022

Hi @logicwong,

Thank you for the clarification, although I think in VQA the prompt_type is prev_output, i.e. without the eos?

Another question about the code, not really related to coordinate input: I see there is patch_images_2 in unify_transformer.py, but I don't think it is mentioned or explained anywhere. What does this input take? Does it mean the model supports multiple input images?

Thank you!

@logicwong
Member

@chenxwh Oh... You are right, in VQA the prompt_type is prev_output (excluding eos). patch_images_2 means it supports processing two input images, but we haven't tested this feature.
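
A toy illustration of that eos detail (made-up token ids, not OFA code): with prompt_type prev_output, the decoder prompt drops the trailing eos so generation continues directly with the answer.

```python
# Toy example with made-up token ids; not OFA code.
EOS = 2
prompt_ids = [101, 7, 42, 300, 495, 9]   # pretend ids for the encoded question

encoder_input = prompt_ids + [EOS]       # source sequence keeps its eos
decoder_prompt = prompt_ids              # prompt_type='prev_output': eos excluded
```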
