Generating label.tsv and feature.tsv from image #33
The information is somewhat dispersed across the issues, so I will summarize it here for anyone looking in the future. The features are extracted using the bottom-up attention model from https://github.com/peteanderson80/bottom-up-attention. I am attaching the file that I used for this purpose; it generates label.tsv as well. You may have to change the code depending on your data location and format. I still had some issues because csv.DictWriter emits strings with single quotes while json.loads in run_captioning.py requires double quotes, so I modified run_captioning.py to make it work. If you have a better solution, let me know. Finally, to generate label.lineidx and feature.lineidx, make use of the following function
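The attached function is not reproduced here, but a minimal sketch of a lineidx generator, assuming each line of the .lineidx file stores the byte offset of the corresponding line in the .tsv (so the .tsv can be read with random access via seek()), could look like:

```python
def generate_lineidx(tsv_path, lineidx_path):
    """Write the byte offset of every line in tsv_path, one offset per line."""
    with open(tsv_path, "rb") as fin, open(lineidx_path, "w") as fout:
        offset = 0
        for line in fin:
            fout.write(str(offset) + "\n")
            offset += len(line)
```

You would call it once per file, e.g. `generate_lineidx("label.tsv", "label.lineidx")` and the same for feature.tsv.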
Thanks!
@shravan1394, what is the command line you used to generate the captions after having the right features? Also, could you share the modifications to run_captioning.py?
After using this script to generate the feature and label tsv files, and after resolving the issue with single quotes, I received the following error.
I solved it by removing
@EByrdS, you can convert the single quotes to double quotes following #49 (comment) or #49 (comment)
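One illustrative workaround for the quoting mismatch (a sketch, not necessarily the exact fix from the linked comments) is to parse the single-quoted string as a Python literal with `ast.literal_eval` and re-serialize it as proper JSON:

```python
import ast
import json

def to_valid_json(s):
    """Re-serialize a Python-repr string (single quotes) as valid JSON."""
    return json.dumps(ast.literal_eval(s))

row = "{'class': 'dog', 'conf': 0.98}"  # the single-quoted form csv.DictWriter tends to emit
fixed = to_valid_json(row)              # '{"class": "dog", "conf": 0.98}'
json.loads(fixed)                       # now parses without error
```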
Thanks for the summary of information here! To anyone wishing to extract features on a custom dataset who stumbled on this thread and is potentially struggling with the Caffe environment, I'd recommend using the Docker environment built for LXMERT. Follow its instructions to set up the environment, then rewrite the
Hi guys, I am trying to generate my own features.tsv and labels.tsv for my dataset, but I am stuck on the following:
I am slightly confused about what exactly these features are. From reading the Oscar paper, I understand that each bounding box has a feature of the form (v', z), where v' is P-dimensional (2048) and z is 6-dimensional (position).
I have difficulty understanding where these 2048 features come from. Initially, I thought they came from the FC layer of Faster R-CNN, but upon checking, the FC layer size in Faster R-CNN is 4096.
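For concreteness, here is a sketch of how the (v', z) region feature described above could be assembled. The exact 6-d position encoding (normalized corners plus relative width/height) is an assumption based on common practice, not quoted from the paper:

```python
import numpy as np

def region_feature(v, box, img_w, img_h):
    """Concatenate a 2048-d visual feature v with a 6-d position encoding
    z = (x1/W, y1/H, x2/W, y2/H, w/W, h/H), giving a 2054-d vector."""
    x1, y1, x2, y2 = box
    z = np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                  (x2 - x1) / img_w, (y2 - y1) / img_h])
    return np.concatenate([v, z])

feat = region_feature(np.random.rand(2048), (10, 20, 110, 220), 640, 480)
assert feat.shape == (2054,)
```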
The Oscar paper says: "Specifically, v and q are generated as follows. Given an image with K regions of objects (normally over-sampled and noisy), Faster R-CNN [28] is used to extract the visual semantics of each region." I am slightly confused about how these K regions are determined. Are they the bounding boxes output by Faster R-CNN?
I am relatively new to this area. Any help would be appreciated.