Hierarchical Question-Image Co-Attention for Visual Question Answering

Train a Hierarchical Co-Attention model for Visual Question Answering. The current code reaches 62.1 on Open-Ended and 66.1 on Multiple-Choice on the VQA test-standard split. On COCO-QA, it reaches 65.4 accuracy. For more information, please refer to the paper.


This code is written in Lua and requires Torch. The preprocessing code is in Python, and you need to install NLTK if you want to use NLTK to tokenize the questions.

You also need to install the following packages in order to run the code successfully.


We have prepared everything for you ;)

Download Dataset

The first thing you need to do is to download the data and do some preprocessing. Head over to the data/ folder and run

For VQA:

$ python --download 1 --split 1

--download 1 (true) downloads the VQA data from the VQA website. --split 1 trains on the COCO train set and evaluates on the validation set; --split 2 trains on the COCO train+val set and evaluates on the test set. This step generates two files under the data folder: vqa_raw_train.json and vqa_raw_test.json.
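
As a quick reference, the two --split values described above can be written as a small lookup. This is only an illustrative sketch: the names `SPLITS` and `split_to_sets` and the subset labels are hypothetical and are not taken from the download script.

```python
# Hypothetical helper: maps the --split value to the COCO subsets
# used for training and evaluation, as described in the text above.
SPLITS = {
    1: {"train": ["train"], "eval": ["val"]},
    2: {"train": ["train", "val"], "eval": ["test"]},
}


def split_to_sets(split):
    """Return the train/eval subsets for a given --split value."""
    if split not in SPLITS:
        raise ValueError("--split must be 1 or 2, got %r" % (split,))
    return SPLITS[split]


if __name__ == "__main__":
    for s in (1, 2):
        sets = split_to_sets(s)
        print("--split %d: train on %s, evaluate on %s"
              % (s, "+".join(sets["train"]), "+".join(sets["eval"])))
```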


For COCO-QA:

$ python --download 1

This downloads the COCO-QA dataset and generates two files under the data folder: cocoqa_raw_train.json and cocoqa_raw_test.json.

Download Image Model

We use the VGG_ILSVRC_19_layers model and the Deep Residual Network model implemented by Facebook.

Head over to the image_model folder and run

$ python --download 'VGG' 

This downloads the VGG_ILSVRC_19_layers model into the image_model folder. To download the Deep Residual model instead, change VGG to Residual.

Generate Image/Question Features

Head over to the prepro folder and run

For VQA:

$ python --input_train_json ../data/vqa_raw_train.json --input_test_json ../data/vqa_raw_test.json --num_ans 1000

to get the question features. --num_ans specifies how many of the most frequent answers to use during training. You will also see some question and answer statistics in the terminal output. This generates two files in the data/ folder: vqa_data_prepro.h5 and vqa_data_prepro.json.
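
To illustrate what restricting training to the top answers means, here is a minimal sketch. It is not the preprocessing script itself: the function names `top_answers` and `filter_by_answer` are hypothetical, and dropping questions whose answer falls outside the chosen set is an assumption made for this example.

```python
from collections import Counter


def top_answers(answers, num_ans):
    """Return the num_ans most frequent answers, most frequent first."""
    counts = Counter(answers)
    return [ans for ans, _ in counts.most_common(num_ans)]


def filter_by_answer(examples, vocab):
    """Keep only (question, answer) pairs whose answer is in vocab.

    Assumption: out-of-vocabulary answers are simply dropped.
    """
    vocab = set(vocab)
    return [(q, a) for q, a in examples if a in vocab]


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    examples = [
        ("is it sunny", "yes"),
        ("is it a dog", "yes"),
        ("is it raining", "no"),
        ("is the door open", "yes"),
        ("is this a cat", "no"),
        ("how many cups", "2"),
    ]
    vocab = top_answers([a for _, a in examples], num_ans=2)
    kept = filter_by_answer(examples, vocab)
    print("answer vocabulary:", vocab)
    print("kept %d of %d examples" % (len(kept), len(examples)))
```

A larger --num_ans keeps more of the training questions at the cost of a larger output layer.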


For COCO-QA:

$ python --input_train_json ../data/cocoqa_raw_train.json --input_test_json ../data/cocoqa_raw_test.json

COCO-QA uses all the answers in the training set, so there is no --num_ans option. This generates two files in the data/ folder: cocoqa_data_prepro.h5 and cocoqa_data_prepro.json.

Then we are ready to extract the image features.

For VGG image feature:

$ th prepro_img_vgg.lua -input_json ../data/vqa_data_prepro.json -image_root /home/jiasenlu/data/ -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel

You can change -gpuid, -backend and -batch_size based on your GPU.

For Deep Residual image feature:

Train the model

We now have everything ready to train the VQA and COCO-QA models. Go back to the main folder and run

$ th train.lua -input_img_train_h5 data/vqa_data_img_vgg_train.h5 -input_img_test_h5 data/vqa_data_img_vgg_test.h5 -input_ques_h5 data/vqa_data_prepro.h5 -input_json data/vqa_data_prepro.json -co_atten_type Alternating -feature_type VGG

to train the Alternating co-attention model on VQA using VGG image features. You can train the Parallel co-attention model by setting -co_atten_type Parallel. Parallel co-attention usually takes more time than alternating co-attention.

  • The Deep Residual image feature is 4 times larger than the VGG feature. Make sure you have enough RAM when you extract or load the features.
  • If you do not have enough RAM, replace require 'misc.DataLoader' (Line 11 in train.lua) with require 'misc.DataLoaderDisk'. The model will then read the data directly from the hard disk (SSD preferred).
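
To judge whether the features fit in RAM, a rough back-of-the-envelope estimate can help. The sketch below is not part of the repository. The feature shapes (14x14x512 for VGG, 14x14x2048 for the residual network), the float32 storage, and the image count are all assumptions chosen for illustration; they are only meant to be consistent with the 4x ratio stated above.

```python
def feature_bytes(num_images, spatial, channels, bytes_per_value=4):
    """Bytes needed to hold one spatial x spatial x channels map per image."""
    return num_images * spatial * spatial * channels * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / float(2 ** 30)


if __name__ == "__main__":
    num_images = 100000  # hypothetical image count, not a dataset statistic

    # Assumed shapes; check the real ones in your extracted .h5 files.
    vgg = feature_bytes(num_images, spatial=14, channels=512)
    residual = feature_bytes(num_images, spatial=14, channels=2048)

    print("VGG features:      %.1f GiB" % to_gib(vgg))
    print("Residual features: %.1f GiB" % to_gib(residual))
    print("ratio: %.1fx" % (residual / float(vgg)))
```

If the estimate exceeds the available memory, the disk-based loader mentioned above is the safer choice.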


Evaluate using Pre-trained Model

The pre-trained model can be downloaded here. Note: if you use the VQA train model, you should use the corresponding json file from here.

If you use the VQA train+val model, you should use the corresponding json file from here.


To evaluate VQA, you need to download the VQA evaluation tool. To evaluate COCO-QA, you can use the script under the metric/ folder. If you need to evaluate based on WUPS, download the evaluation script from here.
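
For a quick sanity check of COCO-QA predictions, plain accuracy can be computed in a few lines. This sketch assumes the reported Accuracy is the percentage of exact string matches; the function name `accuracy` is hypothetical, and this is not the script in the metric/ folder. For VQA, use the official evaluation tool.

```python
def accuracy(predictions, ground_truth):
    """Percentage of predictions that exactly match the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth differ in length")
    if not ground_truth:
        return 0.0
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return 100.0 * correct / len(ground_truth)


if __name__ == "__main__":
    # Toy answers, made up for illustration only.
    preds = ["cat", "dog", "two", "red"]
    truth = ["cat", "dog", "three", "red"]
    print("accuracy: %.1f" % accuracy(preds, truth))
```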

VQA on Single Image with Free Form Question

We use iTorch to demo visual question answering with the pre-trained model. The script only does basic tokenization, so please make sure the question is all lowercase and split by spaces. (It is better to use NLTK to tokenize and transform the question; see the preprocessing code for more details.)
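
The difference between basic tokenization and a punctuation-aware tokenizer can be seen in a short sketch. Both functions below are hypothetical illustrations. `regex_tokenize` is a simple regular-expression stand-in, not NLTK and not the repository's preprocessing code, so its output may differ from what the model was trained on.

```python
import re


def basic_tokenize(question):
    """Lowercase and split on whitespace, as the demo script expects."""
    return question.lower().split()


def regex_tokenize(question):
    """Lowercase, then separate words from punctuation marks.

    A stand-in for a real tokenizer such as NLTK's; results can differ.
    """
    return re.findall(r"\w+|[^\w\s]", question.lower())


if __name__ == "__main__":
    q = "What is the man holding?"
    # With whitespace splitting, the question mark stays attached
    # to the last word, which may not match the training vocabulary.
    print(basic_tokenize(q))
    print(regex_tokenize(q))
```

With the basic approach, writing the question as "what is the man holding ?" keeps the punctuation as its own token.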

In the root folder, open the iTorch notebook; you can then load any image and ask questions in the notebook.

Some of the data files can be downloaded here.

Attention Visualization


If you use this code as part of any published research, please acknowledge the following paper:

author = {Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi},
title = {Hierarchical Question-Image Co-Attention for Visual Question Answering},
journal = {arXiv preprint arXiv:1606.00061v2},
year = {2016}

Attention Demo

[Figure: teaser results]