AttnGrounder: Talking to Cars with Attention by Vivek Mittal.
Accepted at the ECCV'20 C4AV Workshop. The Talk2Car dataset used in this paper is available at https://talk2car.github.io/.
Abstract:
We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image to construct a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use our visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated from the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
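To make the two ideas in the abstract concrete, here is a minimal, illustrative PyTorch sketch of a visual-text attention module and of the rectangular ground-truth mask used for the auxiliary task. This is not the code from this repository: the class VisualTextAttention, the helper rectangular_gt_mask, the feature dimensions, and the 1x1-convolution mask head are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAttention(nn.Module):
    """Illustrative sketch (not the repo's code): relates every word in the
    query to every spatial region of the image feature map, producing a
    region-dependent text representation and a 1-channel attention mask."""

    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        # Project word embeddings into the visual feature space (assumed design).
        self.word_proj = nn.Linear(txt_dim, vis_dim)
        # Auxiliary mask prediction head (assumed to be a 1x1 convolution).
        self.mask_head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, C, H, W) image regions; txt_feats: (B, T, D) word features.
        B, C, H, W = vis_feats.shape
        regions = vis_feats.flatten(2).transpose(1, 2)      # (B, H*W, C)
        words = self.word_proj(txt_feats)                   # (B, T, C)
        # Similarity of each region with each word.
        scores = torch.bmm(regions, words.transpose(1, 2))  # (B, H*W, T)
        attn = F.softmax(scores / C ** 0.5, dim=-1)         # attend over words, per region
        # Region-dependent text representation: each region gets its own mix of words.
        region_txt = torch.bmm(attn, words)                 # (B, H*W, C)
        region_txt = region_txt.transpose(1, 2).reshape(B, C, H, W)
        # Auxiliary attention mask around the referred object.
        mask_logits = self.mask_head(region_txt)            # (B, 1, H, W)
        return region_txt, mask_logits


def rectangular_gt_mask(boxes, H, W):
    """Rectangular training target built from ground-truth box coordinates
    (x1, y1, x2, y2), assumed here to be given in feature-map units."""
    mask = torch.zeros(boxes.shape[0], 1, H, W)
    for i, (x1, y1, x2, y2) in enumerate(boxes.long().tolist()):
        mask[i, 0, y1:y2 + 1, x1:x2 + 1] = 1.0
    return mask

Under these assumptions, the predicted mask_logits would be supervised against rectangular_gt_mask with a binary cross-entropy loss (e.g., F.binary_cross_entropy_with_logits), matching the auxiliary-task setup described in the abstract.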
The preprocessed Talk2Car data is available at this link; extract it under the ln_data folder. Download the images following the instructions given at this link, and extract all of them into the ln_data/images folder.
All hyperparameters are preset; just run the following command from the working directory (if you run into memory issues, try decreasing the batch size).
python train_yolo.py --batch_size 14
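For example, if a batch size of 14 does not fit in your GPU memory, a smaller value can be passed the same way:

python train_yolo.py --batch_size 8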
Parts of the code and models are adapted from DMS, MAttNet, Yolov3, Pytorch-yolov3, and One Stage Grounding.