Skip to content
Visual question answering for CVPR16 VQA Challenge.
Branch: master
Clone or download
Latest commit 977e79c Nov 5, 2016

README.md

Visual Question Answering

This is the project page of the UPC team participating in the VQA challenge for CVPR 2016. Details of the proposed solutions will be posted after the deadline.

Issey Masuda Mora Xavier Giró-i-Nieto Santiago Pascual de la Puente
Main contributor Advisor Co-advisor
Issey Masuda Mora Xavier Giró-i-Nieto Santiago Pascual de la Puente

Institution: Universitat Politècnica de Catalunya.

Universitat Politècnica de Catalunya

News

A Deep Learning abstract framework has been published in beta version at DeepFramework.

Abstract

Deep learning techniques have been proven to be a great success for some basic perceptual tasks like object detection and recognition. They have also shown good performance on tasks such as image captioning but these models are not that good when a higher reasoning is needed.

Visual Question-Answering tasks require the model to have a much deeper comprehension and understanding of the scene and the realtions between the objects in it than that required for image captioning. The aim of these tasks is to be able to predict an answer given a question related to an image.

Different scenarios have been proposed to tackle this problem, from multiple-choice to open-ended questions. Here we have only addressed the open-ended model.

What are you going to find here

This project gives a baseline code to start developing on Visual Question-Answering tasks, specifically those focused on the VQA challenge. Here you will find an example on how to implement your models with Keras and train, validate and test them on the VQA dataset. Note that we are still building things upon this project so the code is not ready to be imported as a module but we would like to share it with the community to give a starting point for newcomers.

Dependencies

This project is build using the Keras library for Deep Learning, which can use as a backend both Theano and TensorFlow.

We have used Theano in order to develop the project and it has not been tested with TensorFlow.

For a further and more complete of all the dependencies used within this project, check out the requirements.txt provided within the project. This file will help you to recreate the exact same Python environment that we worked with.

Project structure

The main files/dir are the following:

  • bin: all the scripts that uses the vqa module are here. This is your entry point.
  • data: the get_dataset.py script will download the whole dataset for you and place it where it can access it. Alternatively, you can provide the route of the dataset if you have already downloaded it. The vqa module created some directory structure to place preprocessed files
  • vqa: this is a python package with the core of the project
  • requirements.txt: to be able to reproduce the python environment. You only need to do pip install in the project's root and it will install all the dependencies needed

The model

We have participated into the VQA challenge with the following model.

Our model is composed of two branches, one leading with the question and the other one with the image, that are merged to predict the answer. The question branch takes the question as tokens and obtains the word embedding of each token. Then, we feed these word embeddings into a LSTM and we take its last state (once it has seen all the question) as our question representation, which is a sentence embedding. For the image branch, we have first precomputed the visual features of the images with a Kernalized CNNs (KCNNs) [Liu 2015]. We project these features into the same space as the question embedding using a fully-connected layer with ReLU activation function.

Once we have both the visual and text features, we merge them suming both vectors as they belong to the same space. This final representation is given to another fully-connected layer softmax to predict the answer, which will be a one-hot representation of the word (we are predicting a single word as our answer).

Model architecture

Related work

Acknowledgements

We would like to especially thank Albert Gil Moreno and Josep Pujal from our technical support team at the Image Processing Group at the UPC.

Albert Gil Josep Pujal
Albert Gil Josep Pujal

Contact

If you have any general doubt about our work or code which may be of interest for other researchers, please use the issues section on this github repo. Alternatively, drop us an e-mail at xavier.giro@upc.edu.

You can’t perform that action at this time.