Contains approaches introduced in the MovieQA benchmark dataset paper

MovieQA - Answering (CVPR 2016)

MovieQA: Understanding Stories in Movies through Question-Answering
M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler
Computer Vision and Pattern Recognition (CVPR), June 2016.
Project page | Read the paper | Register and download

This repository contains approaches introduced in the above paper.

To replicate the different models and results, please follow the instructions below:

Add the dataset folder to path

Change the path in

Encode QAs and Text

Represent the QAs and different text sources using TFIDF (vocabulary embedding), Word2Vec (word embedding), and SkipThoughts (sentence embedding).


  • Word2Vec model trained on 1364 movie plot synopses. Download here and store it in the "models" folder.
  • Skip-Thought encoder (Github repo). Please follow the instructions in that repository. To encode on a GPU (for SkipThoughts), you may want to use THEANO_FLAGS=device=gpu python
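
As a rough illustration of the TFIDF (vocabulary embedding) representation mentioned above, here is a minimal, standard-library-only sketch. It is not the repository's encoder: the function name `tfidf_vectors`, the whitespace tokenization, and the smoothed IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one L2-normalised TF-IDF vector per document.

    Hypothetical helper for illustration; tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocabulary, vectors


# Toy "QA and story" texts, made up for this example.
vocab, vecs = tfidf_vectors([
    "who killed the king",
    "the prince killed the king",
    "the queen fled the castle",
])
```

Each text becomes a vector over the shared vocabulary, so questions, answers, and story sentences can be compared directly with a dot product.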

Hasty Student

Try to answer questions without looking at the story. This allows us to analyze the bias in the dataset collection. We evaluate different options here, including answering based on: (i) the length of the answers; (ii) within-answer similarity or distinctness; and (iii) question-answer similarity.

python -h
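
The three story-free heuristics described above can be sketched as follows. This is only an illustration under stated assumptions: bag-of-words cosine similarity stands in for the TFIDF, Word2Vec, or SkipThoughts representations, and all function names and the toy question are hypothetical, not taken from the repository.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts (a stand-in for the real representations)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer_by_length(answers, longest=True):
    """(i) Pick the longest (or shortest) answer option."""
    lengths = [len(a.split()) for a in answers]
    return lengths.index(max(lengths) if longest else min(lengths))


def answer_by_within_similarity(answers, most_similar=True):
    """(ii) Pick the option most similar to (or most distinct from) the others."""
    vecs = [bow(a) for a in answers]
    scores = [
        sum(cosine(v, other) for j, other in enumerate(vecs) if j != i)
        for i, v in enumerate(vecs)
    ]
    return scores.index(max(scores) if most_similar else min(scores))


def answer_by_question_similarity(question, answers):
    """(iii) Pick the option most similar to the question itself."""
    q = bow(question)
    scores = [cosine(q, bow(a)) for a in answers]
    return scores.index(max(scores))


# Toy multiple-choice question, made up for this example.
question = "who does the prince fight in the castle"
answers = ["the king", "a dragon that guards the castle gate", "nobody",
           "his brother", "the queen"]
by_length = answer_by_length(answers)
by_within = answer_by_within_similarity(answers)
by_question = answer_by_question_similarity(question, answers)
```

None of these functions sees the story, so any accuracy above chance on real data points to bias in how the answer options were written.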

Searching Student

Answer questions by searching for the story sentence (or set of sentences) that best matches the question and answer option. We evaluate different options here, such as: (i) story sources: split_plot, subtitle, script, dvs; (ii) representations: tfidf, word2vec, skipthought; and (iii) window size for the story.

python -h
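
A minimal sketch of the search idea described above: slide a window over the story sentences, score each window against the question and each answer option, and pick the option with the best-scoring window. The scoring rule (question similarity plus answer similarity over the same window), the bag-of-words representation, and every name below are assumptions for illustration, not the repository's implementation.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words counts; a stand-in for tfidf, word2vec or skipthought vectors."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def searching_student(story, question, answers, window=1):
    """Return the index of the answer whose best story window scores highest."""
    sentences = [bow(s) for s in story]
    window = max(1, min(window, len(sentences)))
    q = bow(question)
    answer_vecs = [bow(a) for a in answers]
    best = [float("-inf")] * len(answers)
    for start in range(len(sentences) - window + 1):
        # Merge `window` consecutive sentences into one bag of words.
        merged = sum(sentences[start:start + window], Counter())
        q_sim = cosine(merged, q)
        for j, a in enumerate(answer_vecs):
            best[j] = max(best[j], q_sim + cosine(merged, a))
    return best.index(max(best))


# Toy story and question, made up for this example.
story = [
    "The prince rides to the castle.",
    "A dragon guards the gate.",
    "The prince fights the dragon and wins.",
]
question = "Who does the prince fight?"
answers = ["the king", "the dragon", "the queen"]
predicted = searching_student(story, question, answers, window=1)
```

A larger `window` lets the question and the answer match different neighbouring sentences, at the cost of diluting each window's representation.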

Searching Student with Convolutional Brain

Emulates the searching student, but uses multiple 1x1 convolutional layers to combine the similarities computed from <story,question> and <story,answer> products. Max and average pooling across sentences are applied at the end. We evaluate different options, including: (i) story sources: split_plot, subtitle, script, dvs; and (ii) representations: tfidf, word2vec, skipthought.

python -h
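
To make the architecture description above more concrete, here is a small forward-pass sketch in plain Python. A 1x1 convolution applies the same tiny network at every (sentence, answer) position, so it is written here as a per-position function. The layer sizes, the fixed weights, the ReLU nonlinearity, and the way max and average pooling are combined (summed) are all assumptions for illustration; the actual model is trained with Theano and Lasagne and may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sscb_forward(story, question, answers, w_hidden, w_out):
    """Return a probability per answer option.

    story: list of sentence vectors; question: vector; answers: list of vectors.
    w_hidden: list of (w_question, w_answer, bias) tuples, one per hidden channel.
    w_out: one weight per hidden channel, mapping the hidden channels to a score.
    """
    scores = []
    for answer in answers:
        per_sentence = []
        for sentence in story:
            # Two input channels: <story,question> and <story,answer> products.
            sq, sa = dot(sentence, question), dot(sentence, answer)
            # First 1x1 conv (with ReLU): same weights at every position.
            hidden = [max(0.0, wq * sq + wa * sa + b) for wq, wa, b in w_hidden]
            # Second 1x1 conv: collapse the hidden channels to one score.
            per_sentence.append(dot(hidden, w_out))
        # Pool across sentences; summing max and average is an assumption.
        scores.append(max(per_sentence) + sum(per_sentence) / len(per_sentence))
    return softmax(scores)


# Toy 2-d vectors and hand-picked weights, made up for this example.
story = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [1.0, 0.0]
answers = [[1.0, 0.0], [0.0, 1.0]]
w_hidden = [(0.5, 1.0, 0.0), (1.0, -0.5, 0.1)]
w_out = [1.0, 0.5]
probs = sscb_forward(story, question, answers, w_hidden, w_out)
```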

The inputs include a pad_percentile option. It controls how stories are truncated or zero-padded to create batches. The value can range from 90 to 100 (the default); choose it depending on your GPU memory.
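
One plausible reading of this option, sketched below: take the given percentile of story lengths (in sentences) as the common batch length, then truncate longer stories and zero-pad shorter ones. The helper names and the nearest-rank percentile rule are assumptions for illustration; the script itself may compute this differently.

```python
import math


def target_length(story_lengths, pad_percentile=100):
    """Nearest-rank percentile of the story lengths (hypothetical helper)."""
    ordered = sorted(story_lengths)
    rank = max(1, math.ceil(pad_percentile * len(ordered) / 100))
    return ordered[rank - 1]


def pad_or_truncate(story, length, dim):
    """Cut a story to `length` sentence vectors, or pad it with zero vectors."""
    fixed = [list(sentence) for sentence in story[:length]]
    fixed.extend([0.0] * dim for _ in range(length - len(fixed)))
    return fixed


# Nine short stories and one very long one (lengths made up for this example).
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
full = target_length(lengths, pad_percentile=100)
trimmed = target_length(lengths, pad_percentile=90)
```

With a value of 100 every story is kept whole and the batch is as long as the longest story; lowering it truncates only the few longest stories, which can shrink the batch tensor considerably.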

Quirks: SSCB seems to be fairly sensitive to initialization. We overcome this issue by training several networks (each from a random start) and picking the model that shows the best performance on the internal dev set.

Modified Memory Networks

Answer questions using a modified version of the End-To-End Memory Network (arXiv). The modifications include the use of a fixed word embedding layer along with a shared linear projection, and the ability to pick one among multiple-choice, multi-word answers. The memory network supports answering from all story sources. The main options to run this program are: (i) story sources: split_plot, subtitle, script, dvs; (ii) number of memory layers (although this did not affect performance much); and (iii) training parameters: batch size, learning rate, #epochs.

For more details please refer to:

python -h
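
As a rough illustration of how a memory network can choose among multiple-choice answers, the sketch below runs the attention-and-readout step in plain Python. It assumes the story sentences, question, and answers have already been embedded as fixed-size vectors (the fixed word embedding and shared projection are left out), and all names and toy values are hypothetical. It is a sketch of the general technique, not the repository's model.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(memories, u):
    """One memory layer: attend over the memories, then add the readout to u."""
    attention = softmax([dot(m, u) for m in memories])
    readout = [sum(p * m[k] for p, m in zip(attention, memories))
               for k in range(len(u))]
    return [o + q for o, q in zip(readout, u)]


def answer_probs(memories, question, answers, hops=1):
    """Score every answer vector against the final controller state."""
    u = question
    for _ in range(hops):
        u = memory_hop(memories, u)
    return softmax([dot(u, g) for g in answers])


# Toy 2-d embeddings, made up for this example.
memories = [[1.0, 0.0], [0.0, 1.0]]
question = [1.0, 0.0]
answers = [[1.0, 0.0], [0.0, 1.0]]
probs = answer_probs(memories, question, answers, hops=1)
```

Raising `hops` stacks more memory layers, each re-attending over the same story with the updated state.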

Video-based Answering

Releasing the code for this is fairly complicated, as it comes from several projects. We are still working on updating this.

Requirements (in one place)

  • Word2Vec: Python installation using pip install word2vec
  • SkipThoughts: Github repo
  • Theano: Github repo, tested on some versions of Theano-0.7 and 0.8.
  • Lasagne: Github repo, tested with Theano-0.8 and Lasagne-0.1 for SSCB.
  • scikit-learn (PCA)
  • python-gflags
  • progressbar
  • optparse
  • nltk
  • scipy
  • numpy