
VINVL image captioning features #94

EddieKro opened this issue Apr 21, 2021 · 13 comments

@EddieKro

Hello!
I have a question about extracting region features for image captioning:

  • The VinVL paper states that the 2048-dimensional region features are concatenated with 6 positional features (the bounding box coordinates plus its height & width).
    How exactly are these 6 features encoded? I've explored the COCO test images, but I can't match the boxes in test.label.tsv with the ones in test.feature.tsv (they contain different numbers of boxes). I guessed the features are encoded as (x_top_left/img_height, y_top_left/img_height, ..., box_height/img_height, box_width/img_width), but unfortunately that didn't work.
@nihirv

nihirv commented Apr 22, 2021

I was about to open an issue with a similar question... In fact, I'm struggling to see how we can get the 2048/2054-dimensional vector for captioning.

So it seems that in run_captioning.py#115:

features = np.frombuffer(base64.b64decode(feat_info['features']), np.float32).reshape((num_boxes, -1))

features will be of dimension 1027. Whereas if we look at the VQA example (run_vqa.py#413):

feat = np.frombuffer(base64.b64decode(arr[2]), dtype=np.float32).reshape((-1, self.args.img_feature_dim))

self.args.img_feature_dim = 2054.

With the image-captioning code, we can't reshape to (-1, 2054) because of shape mismatches, although reshaping to (-1, 1027) works fine. But I'm confused about where the 3 extra dimensions come from (assuming 1024 is the feature dimension).

It would also be good to get clarification on whether the number of feature boxes differs from the number of objects in the image (which comes from X.label.tsv), because the object list from X.label.tsv is a set rather than a list. (In which case the bounding boxes would only be valid for one instance of each object in the image?)

EDIT: It seems that the pred files generated by running run_captioning.py contain the 2054-dimensional vectors 👍. To weigh in on your problem, OP: maybe the feature vectors we are given have already been processed by a model, and thus we can't trivially recover the spatial positions?
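
One possible explanation for the 1027 number, though it isn't confirmed anywhere in this thread: if the serialized features were written out in a 16-bit dtype, decoding the same bytes as float32 halves the per-box element count (2054 → 1027). A quick self-contained check of that arithmetic:

```python
import base64
import numpy as np

num_boxes, feat_dim = 10, 2054
feats = np.random.rand(num_boxes, feat_dim)

# Serialize as float16, decode as float32: half as many elements per box.
buf = base64.b64encode(feats.astype(np.float16).tobytes())
print(np.frombuffer(base64.b64decode(buf), np.float32).size // num_boxes)  # 1027

# Serialize as float32, decode as float32: the expected 2054 per box.
buf = base64.b64encode(feats.astype(np.float32).tobytes())
print(np.frombuffer(base64.b64decode(buf), np.float32).size // num_boxes)  # 2054
```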

@EddieKro
Author

EddieKro commented Apr 26, 2021

I've managed to run inference on custom images by extracting a 2048-dimensional feature vector for each bbox and then concatenating to it the box coordinates divided by the image width and height, plus the normalized width and height of the box: [xtl/w, ytl/h, xbr/w, ybr/h, (xbr-xtl)/w, (ybr-ytl)/h], where w, h are the image's width and height and xtl, ytl, xbr, ybr are the coordinates of the bbox. The resulting captions were good, so I guess I got it right. The key to getting the 2048+6 features accepted is to make sure they are stored as float32.
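
For concreteness, a minimal numpy sketch of that concatenation (the helper name and signature are mine, not from the repo):

```python
import numpy as np

def add_box_features(region_feats, boxes, img_w, img_h):
    """region_feats: (num_boxes, 2048); boxes: (num_boxes, 4) as [xtl, ytl, xbr, ybr] in pixels."""
    xtl, ytl, xbr, ybr = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    pos = np.stack([
        xtl / img_w, ytl / img_h,   # top-left corner, normalized
        xbr / img_w, ybr / img_h,   # bottom-right corner, normalized
        (xbr - xtl) / img_w,        # box width / image width
        (ybr - ytl) / img_h,        # box height / image height
    ], axis=1)
    # Keep everything in float32 so the downstream base64 decode reads it back correctly.
    return np.concatenate([region_feats, pos], axis=1).astype(np.float32)  # (num_boxes, 2054)
```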

@nihirv

nihirv commented Apr 26, 2021

> I've managed to run inference on custom images by extracting a 2048-dimensional feature vector for each bbox and concatenating the normalized box coordinates to it [...]

Thank you!!! Very useful information, and very timely. 👍

@liutianling

liutianling commented Apr 28, 2021

@EddieKro Can you give a demo of how to extract the features for an input image?
Or how to run prediction on an input image?
Thanks a lot.

@EddieKro
Author

@liutianling it's quite a process :)

  1. Extract image features for a folder of images using sg_benchmark as described [here](https://github.com/microsoft/scene_graph_benchmark/issues/7#issuecomment-819357369) (you'll have to create some .tsv and .lineindex files first and edit the yaml config file). Note that it is better to create an empty test.label file, because otherwise inference won't work.
  2. sg_benchmark will create a predictions.tsv file, from which we need the features, boxes, and the class and confidence for each box.
  3. To run VinVL inference you'll have to create feature.tsv, label.tsv, and a .yaml file using the info from predictions.tsv. Note that to add the 6 additional features you need to know the height and width of each image, which are stored in the hw.tsv file required by sg_benchmark. Here's the [gist with the example code](https://gist.github.com/EddieKro/903ad08e85d670ff2b140a888d8c67c0); a rough sketch of the row format follows below.

Note: I only managed to run run_captioning.py using COCO; other tasks and datasets may require different inputs.
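
Here is a rough sketch of what writing one row of feature.tsv and label.tsv could look like. The dictionary keys for the feature row ("num_boxes", "features") are inferred from the decode in run_captioning.py quoted above; the label fields are an assumption, so check the gist for the exact layout:

```python
import base64
import json
import numpy as np

def write_rows(image_id, full_feats, boxes, classes, confs, feat_fp, label_fp):
    """full_feats: (num_boxes, 2054) float32 array (2048-d region features + 6 box features)."""
    feat_info = {
        "num_boxes": int(full_feats.shape[0]),
        "features": base64.b64encode(full_feats.astype(np.float32).tobytes()).decode("utf-8"),
    }
    feat_fp.write(f"{image_id}\t{json.dumps(feat_info)}\n")

    labels = [
        {"class": c, "conf": float(s), "rect": [float(v) for v in box]}
        for c, s, box in zip(classes, confs, boxes)
    ]
    label_fp.write(f"{image_id}\t{json.dumps(labels)}\n")
```

Remember to also build the matching .lineindex files (the byte offset of each row) so the TSV readers can seek into the files.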

@liutianling

@EddieKro Thanks a lot for your reply and the detailed steps!
I will give it a try!

@akkapakasaikiran

I needed to generate input files for run_retrieval.py from a predictions.tsv file output by test_sg_net.py of scene_graph_benchmark (a modification of step 3 above). This is a bit different from run_captioning.py, so I made a gist for it, similar to and based on the one provided by @EddieKro. The gist can be found here.
Differences: labels.tsv also contains image_h and image_w and leaves out conf, and features.tsv splits the encoding and num_rows into separate columns instead of using a dictionary. No .yaml file is needed, but an image_id2idx.json file is used. I tested this on a custom dataset. A hypothetical sketch of the row layout is below.
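
A hypothetical sketch of the retrieval-style rows just described (the exact field names live in the gist; the ones below are my guesses based on the differences listed above):

```python
import base64
import json
import numpy as np

def write_retrieval_rows(image_id, full_feats, boxes, classes, img_h, img_w, feat_fp, label_fp):
    # features.tsv: num_boxes and the base64 encoding as separate columns, no dictionary.
    encoded = base64.b64encode(full_feats.astype(np.float32).tobytes()).decode("utf-8")
    feat_fp.write(f"{image_id}\t{full_feats.shape[0]}\t{encoded}\n")

    # labels.tsv: includes image_h/image_w, drops the per-box conf.
    label_info = {
        "image_h": img_h,
        "image_w": img_w,
        "objects": [{"class": c, "rect": [float(v) for v in b]} for c, b in zip(classes, boxes)],
    }
    label_fp.write(f"{image_id}\t{json.dumps(label_info)}\n")
```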

@DavidInWuhanChina

Can you show me the complete inference file?

@akkapakasaikiran

> Can you show me the complete inference file?

Sorry, I'm not sure I understand what you mean. The inference file I used was oscar/run_retrieval.py.

@Jennifer-6

In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

@akkapakasaikiran

> In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

Follow this, this, and this, in that order (they link to each other in a chain). You basically have to create the file yourself.
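
For orientation, the file you end up creating might look roughly like the sketch below. The key names are an assumption on my part (and the paths are placeholders), so verify them against how run_captioning.py actually loads the yaml:

```yaml
# Hypothetical train.yaml -- check the key names against the dataset loader.
feature: train.feature.tsv   # base64-encoded region features (2054-d per box)
label: train.label.tsv       # per-box classes and rects
caption: train_caption.json  # captions per image_id used for training
```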

@BigHyf

BigHyf commented Oct 24, 2021

> In order to run run_captioning.py, a train.yaml file is needed. train.yaml points to some required data (image features, captions, labels). Where is train.yaml, or how can I create it?

@akkapakasaikiran @Jennifer-6
Hello, have you solved this problem? Can you tell me the details of vinvl_x152c4.yaml?

@ginlov

ginlov commented Dec 27, 2021

How can I organize caption.json to fine-tune on a new dataset?
