
VINVL image captioning features #94

EddieKro opened this issue Apr 21, 2021 · 13 comments

@EddieKro

Hello!
I have a question about extracting region features for image captioning:

  • The VinVL paper states that the 2048-dimensional region features are concatenated with 6 positional features (the bounding box coordinates plus its height & width).
    How exactly are these 6 features encoded? I've explored the COCO test images, but I can't match the boxes in test.label.tsv with the ones in test.feature.tsv (they contain different numbers of boxes). I guessed the features are encoded as (x_top_left/img_height, y_top_left/img_height, ..., box_height/img_height, box_width/img_width), but unfortunately that didn't work.
@nihirv

nihirv commented Apr 22, 2021

I was about to open an issue with a similar question... In fact, I'm struggling to see how we can get the 2048/2054-dimensional vector for captioning.

So it seems that in run_captioning.py#115:

features = np.frombuffer(base64.b64decode(feat_info['features']), np.float32).reshape((num_boxes, -1))

features will be of dimension 1027. Whereas if we look at the VQA example (run_vqa.py#413):

feat = np.frombuffer(base64.b64decode(arr[2]), dtype=np.float32).reshape((-1, self.args.img_feature_dim))

self.args.img_feature_dim = 2054.

With the image-captioning code, we can't reshape to (-1, 2054) because of shape mismatches, although reshaping to (-1, 1027) works fine. But I'm confused about where the 3 extra dimensions come from (assuming 1024 is the feature dimension).

It would also be good to get clarification on whether the number of feature boxes differs from the number of objects in the image (which comes from X.label.tsv), because the object list from X.label.tsv is a set rather than a list. (In which case the bounding boxes would only be valid for one instance of each object in the image?)

EDIT: It seems that the pred files generated by running run_captioning.py contain the 2054-dimensional vectors 👍. To weigh in on your problem, OP: maybe the feature vectors we are given have already been processed by a model, and thus we can't trivially recover the spatial positions?
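
One possible explanation for the 1027 number, though it isn't confirmed anywhere in this thread: if the serialized features were written out in a 16-bit dtype, decoding the same bytes as float32 halves the per-box element count (2054 → 1027). A quick self-contained check of that arithmetic:

```python
import base64
import numpy as np

num_boxes, feat_dim = 10, 2054
feats = np.random.rand(num_boxes, feat_dim)

# Serialize as float16, decode as float32: half as many elements per box.
buf = base64.b64encode(feats.astype(np.float16).tobytes())
print(np.frombuffer(base64.b64decode(buf), np.float32).size // num_boxes)  # 1027

# Serialize as float32, decode as float32: the expected 2054 per box.
buf = base64.b64encode(feats.astype(np.float32).tobytes())
print(np.frombuffer(base64.b64decode(buf), np.float32).size // num_boxes)  # 2054
```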

@EddieKro
Author

EddieKro commented Apr 26, 2021

I've managed to run inference on custom images by extracting a 2048-dimensional feature vector for each bbox and then concatenating to it the box coordinates divided by the image width and height, plus the normalized width and height of the box: [xtl/w, ytl/h, xbr/w, ybr/h, (xbr-xtl)/w, (ybr-ytl)/h], where w, h are the image's width and height and xtl, ytl, xbr, ybr are the coordinates of the bbox. The resulting captions were good, so I guess I got it right. The key to getting the 2048+6 features accepted is to make sure they are stored as float32.
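
For concreteness, a minimal numpy sketch of that concatenation (the helper name and signature are mine, not from the repo):

```python
import numpy as np

def add_box_features(region_feats, boxes, img_w, img_h):
    """region_feats: (num_boxes, 2048); boxes: (num_boxes, 4) as [xtl, ytl, xbr, ybr] in pixels."""
    xtl, ytl, xbr, ybr = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    pos = np.stack([
        xtl / img_w, ytl / img_h,   # top-left corner, normalized
        xbr / img_w, ybr / img_h,   # bottom-right corner, normalized
        (xbr - xtl) / img_w,        # box width / image width
        (ybr - ytl) / img_h,        # box height / image height
    ], axis=1)
    # Keep everything in float32 so the downstream base64 decode reads it back correctly.
    return np.concatenate([region_feats, pos], axis=1).astype(np.float32)  # (num_boxes, 2054)
```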

@nihirv

nihirv commented Apr 26, 2021

> I've managed to run inference on custom images by extracting a 2048-dimensional feature vector for each bbox and concatenating the normalized box coordinates to it [...]

Thank you!!! Very useful information, and very timely. 👍

@liutianling

liutianling commented Apr 28, 2021

@EddieKro Can you give a demo of how to extract the features for an input image?
Or how to run prediction on an input image?
Thanks a lot.

@EddieKro
Author

@liutianling it's quite a process :)

  1. Extract image features for a folder of images using sg_benchmark as described [here](https://github.com/microsoft/scene_graph_benchmark/issues/7#issuecomment-819357369) (you'll have to create some .tsv and .lineindex files first and edit the yaml config file). Note that it is better to create an empty test.label file, because otherwise inference won't work.
  2. sg_benchmark will create a predictions.tsv file, from which we need the features, boxes, and the class and confidence for each box.
  3. To run VinVL inference you'll have to create feature.tsv, label.tsv, and a .yaml file using the info from predictions.tsv. Note that to add the 6 additional features you need to know the height and width of each image, which are stored in the hw.tsv file required by sg_benchmark. Here's the [gist with the example code](https://gist.github.com/EddieKro/903ad08e85d670ff2b140a888d8c67c0); a rough sketch of the row format follows below.

Note: I only managed to run run_captioning.py using COCO; other tasks and datasets may require different inputs.
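
Here is a rough sketch of what writing one row of feature.tsv and label.tsv could look like. The dictionary keys for the feature row ("num_boxes", "features") are inferred from the decode in run_captioning.py quoted above; the label fields are an assumption, so check the gist for the exact layout:

```python
import base64
import json
import numpy as np

def write_rows(image_id, full_feats, boxes, classes, confs, feat_fp, label_fp):
    """full_feats: (num_boxes, 2054) float32 array (2048-d region features + 6 box features)."""
    feat_info = {
        "num_boxes": int(full_feats.shape[0]),
        "features": base64.b64encode(full_feats.astype(np.float32).tobytes()).decode("utf-8"),
    }
    feat_fp.write(f"{image_id}\t{json.dumps(feat_info)}\n")

    labels = [
        {"class": c, "conf": float(s), "rect": [float(v) for v in box]}
        for c, s, box in zip(classes, confs, boxes)
    ]
    label_fp.write(f"{image_id}\t{json.dumps(labels)}\n")
```

Remember to also build the matching .lineindex files (the byte offset of each row) so the TSV readers can seek into the files.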

@liutianling

@EddieKro Thanks a lot for your reply and the detailed steps!
I will give it a try!

@akkapakasaikiran

I needed to generate input files for run_retrieval.py from a predictions.tsv file output by test_sg_net.py of scene_graph_benchmark (a modification of step 3 above). This is a bit different from run_captioning.py, so I made a gist for it, similar to and based on the one provided by @EddieKro. The gist can be found here.
Differences: labels.tsv also contains image_h and image_w and leaves out conf, and features.tsv splits the encoding and num_rows into separate columns instead of using a dictionary. No .yaml file is needed, but an image_id2idx.json file is used. I tested this on a custom dataset. A hypothetical sketch of the row layout is below.
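
A hypothetical sketch of the retrieval-style rows just described (the exact field names live in the gist; the ones below are my guesses based on the differences listed above):

```python
import base64
import json
import numpy as np

def write_retrieval_rows(image_id, full_feats, boxes, classes, img_h, img_w, feat_fp, label_fp):
    # features.tsv: num_boxes and the base64 encoding as separate columns, no dictionary.
    encoded = base64.b64encode(full_feats.astype(np.float32).tobytes()).decode("utf-8")
    feat_fp.write(f"{image_id}\t{full_feats.shape[0]}\t{encoded}\n")

    # labels.tsv: includes image_h/image_w, drops the per-box conf.
    label_info = {
        "image_h": img_h,
        "image_w": img_w,
        "objects": [{"class": c, "rect": [float(v) for v in b]} for c, b in zip(classes, boxes)],
    }
    label_fp.write(f"{image_id}\t{json.dumps(label_info)}\n")
```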

@DavidInWuhanChina

Can you show me the complete inference file?

@akkapakasaikiran

> Can you show me the complete inference file?

Sorry, I'm not sure I understand what you mean. The inference file I used was oscar/run_retrieval.py.

@Jennifer-6

In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

@akkapakasaikiran

> In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

Follow this, this, and this, in that order (they link to each other in a chain). You basically have to create the file yourself.
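
For orientation, the file you end up creating might look roughly like the sketch below. The key names are an assumption on my part (and the paths are placeholders), so verify them against how run_captioning.py actually loads the yaml:

```yaml
# Hypothetical train.yaml -- check the key names against the dataset loader.
feature: train.feature.tsv   # base64-encoded region features (2054-d per box)
label: train.label.tsv       # per-box classes and rects
caption: train_caption.json  # captions per image_id used for training
```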

@BigHyf

BigHyf commented Oct 24, 2021

> In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

@akkapakasaikiran @Jennifer-6
Hello, have you solved this problem? Can you tell me the details of vinvl_x152c4.yaml?

@ginlov

ginlov commented Dec 27, 2021

How can I organize caption.json to fine-tune on a new dataset?
