Implementation of caption-image retrieval from the paper "Order-Embeddings of Images and Language"


Theano implementation of caption-image retrieval from the paper "Order-Embeddings of Images and Language".

(If you're looking for the other experiments, the textual entailment code is in a separate branch, and the hypernym code is here)

Like visual-semantic-embedding, of which this repository is a fork, we map images and their captions into a common vector space. The main difference, as explained in the paper, is that we model the caption-image relationship as an (asymmetric) partial order rather than a symmetric similarity relation.
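To make the asymmetry concrete, here is a minimal NumPy sketch of an order-violation penalty of the form ||max(0, y - x)||^2, which is our reading of the penalty described in the paper. The function name and the example vectors are hypothetical and are not taken from this repository's code.

```python
import numpy as np

def order_violation(x, y):
    """Penalty ||max(0, y - x)||^2.

    It is zero exactly when x >= y in every coordinate, so swapping the
    arguments generally changes the result (the relation is asymmetric).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = np.maximum(0.0, y - x)
    return float(np.sum(diff ** 2))

# Hypothetical 2-d embeddings, chosen only for illustration.
specific = np.array([2.0, 3.0])  # e.g. the more specific item
general = np.array([1.0, 0.5])   # e.g. the more general item

print(order_violation(specific, general))  # 0.0, the order is satisfied
print(order_violation(general, specific))  # 7.25, the reversed pair is penalized
```

A symmetric similarity such as cosine similarity would return the same value for both argument orders, which is the contrast the paragraph above draws.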

The code differs from visual-semantic-embedding in a number of other ways, including using 10-crop averaged VGG features for the image representation, and adding a visualization server.


This code is written in Python. To use it you will need:

  • Python 2.7
  • Theano 0.7
  • A recent version of NumPy and SciPy

Replicating the paper

Getting data

Download the dataset files (1 GB), including 10-crop VGG19 features, by running


Note that we use the splits produced by Andrej Karpathy. The full COCO dataset can be obtained here.

Unzip the downloaded file - if not in the project directory, you'll need to change the datasets_dir variable in

Note for Toronto users: just run ln -s /ais/gobi1/vendrov/order/coco data/coco instead.

Evaluating pre-trained models

Download two pre-trained models (the full model and the symmetric baseline, 124 MB) and associated visualization data by running


Unzip the file in the project directory, and evaluate by running

    import tools, evaluation
    model = tools.load_model('snapshots/order')
    evaluation.ranking_eval_5fold(model, split='test')

Computing image and sentence vectors

Suppose you have a list of strings that you would like to embed into the learned vector space. To embed them, run the following:

    sentence_vectors = tools.encode_sentences(model, s, verbose=True)

Where s is the list of strings. Note that the strings should already be pre-tokenized, so that str.split() returns the tokens.
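As a rough illustration of what "pre-tokenized" means here, the sketch below lowercases each string and separates punctuation with spaces, so that str.split() returns one token per word or punctuation mark. The helper name pretokenize is hypothetical, and the lowercasing and punctuation rules are assumptions; the tokenization used to build the provided data may differ, in which case some words would fall outside the model's vocabulary.

```python
import re

def pretokenize(text):
    """Lowercase and pad punctuation with spaces so str.split() yields tokens."""
    # Put a space on both sides of any character that is neither a word
    # character nor whitespace, then collapse repeated whitespace.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text.lower())
    return " ".join(spaced.split())

raw = ["A dog runs, fast!", "Two cats sleep."]
s = [pretokenize(t) for t in raw]
print(s)  # ['a dog runs , fast !', 'two cats sleep .']
```

The resulting list s is in the shape the paragraph above describes: plain strings whose tokens are separated by single spaces.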

As the vectors are being computed, it will print some numbers. The code works by extracting vectors in batches of sentences that have the same length, so each number is the sentence length currently being processed. If you want to turn this off, set verbose=False when calling encode_sentences.
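The length-based batching described above can be sketched as follows. This is not the repository's implementation; group_by_length, batches_by_length and the default batch size are hypothetical names and values used only to show the idea of bucketing sentences by token count.

```python
from collections import defaultdict

def group_by_length(sentences):
    """Map each token count to the indices of the sentences with that length."""
    buckets = defaultdict(list)
    for idx, sent in enumerate(sentences):
        buckets[len(sent.split())].append(idx)
    return dict(buckets)

def batches_by_length(sentences, batch_size=128):
    """Yield (length, indices) pairs, shortest sentences first."""
    for length, indices in sorted(group_by_length(sentences).items()):
        for start in range(0, len(indices), batch_size):
            yield length, indices[start:start + batch_size]

sentences = ["a b c", "d e", "f g h", "i j"]
for length, indices in batches_by_length(sentences, batch_size=2):
    print(length, indices)
```

Because every sentence in a batch has the same length, no padding or masking is needed inside the batch; the indices let you put the resulting vectors back in the original order.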

To encode images, run the following instead:

    image_vectors = tools.encode_images(model, im)

Where im is a NumPy array of VGG features. Note that the VGG features were scaled to unit norm prior to training the models.
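If your own features are not yet normalized, something like the following sketch could be applied before calling encode_images. It assumes "unit norm" means each row (one feature vector per image) is divided by its Euclidean norm; the helper name l2_normalize_rows and the eps guard are hypothetical additions.

```python
import numpy as np

def l2_normalize_rows(features, eps=1e-8):
    """Scale every row of a 2-d array to unit Euclidean norm."""
    features = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # eps keeps all-zero rows from causing a division by zero.
    return features / np.maximum(norms, eps)

# Toy stand-in for a (num_images, dim_image) feature matrix.
raw_features = np.array([[3.0, 4.0], [0.0, 0.0]])
im = l2_normalize_rows(raw_features)
print(im)
```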

Training new models

To train your own models, simply run

    import train

As the model trains, it will periodically evaluate on the development set and re-save the model each time performance on the development set increases. Once the models are saved, you can load and evaluate them in the same way as the pre-trained models.
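The save-on-improvement behaviour described above follows a common pattern, sketched below in plain Python. This is not the code in train.py: train_with_checkpointing and its callback arguments are hypothetical, and the real trainer's loop structure may differ.

```python
def train_with_checkpointing(num_updates, valid_freq, train_step, evaluate, save):
    """Run updates, evaluating every valid_freq updates and saving on improvement."""
    best_score = float("-inf")
    for update in range(1, num_updates + 1):
        train_step()
        if update % valid_freq == 0:
            score = evaluate()
            # Only overwrite the snapshot when the development score improves.
            if score > best_score:
                best_score = score
                save()
    return best_score

# Toy usage with fake development scores instead of a real model.
fake_scores = iter([0.1, 0.3, 0.2, 0.4])
saved = []
best = train_with_checkpointing(
    num_updates=8,
    valid_freq=2,
    train_step=lambda: None,
    evaluate=lambda: next(fake_scores),
    save=lambda: saved.append(True),
)
print(best, len(saved))
```

With this pattern the snapshot on disk always corresponds to the best development score seen so far, not to the final update.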

train.trainer has many hyperparameters; see for the ones used in the paper. Descriptions of each hyperparameter follow:

Saving / Loading

  • name: a string describing the model, used for saving + visualization
  • save_dir: the location to save model snapshots
  • load_from: location of model from which to load existing parameters
  • dispFreq: How often to display training progress (in batches)
  • validFreq: How often to evaluate on the development set


Dataset

  • data: The dataset to train on (currently only 'coco' is supported)
  • cnn: The name of the CNN features to use, if you want to evaluate different image features


Architecture

  • dim: The dimensionality of the learned embedding space (also the size of the RNN state)
  • dim_image: The dimensionality of the image features. This will be 4096 for VGG
  • dim_word: The dimensionality of the learned word embeddings
  • encoder: The type of RNN to use to encode sentences (currently only 'gru' is supported)
  • margin: The margin used for computing the pairwise ranking loss
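For readers unfamiliar with the role of margin, here is a minimal NumPy sketch of a margin-based pairwise ranking loss over a matrix of caption-image scores. It assumes scores[i, j] is the score of caption i with image j and that matching pairs lie on the diagonal; the function name and the toy numbers are hypothetical, and the repository's Theano implementation may be organized differently.

```python
import numpy as np

def pairwise_ranking_loss(scores, margin):
    """Sum of hinge penalties for mismatched pairs scoring within `margin` of true pairs."""
    scores = np.asarray(scores, dtype=float)
    diag = np.diag(scores)
    # For caption i, compare its true image (diagonal) with every other image j.
    cost_images = np.maximum(0.0, margin - diag[:, None] + scores)
    # For image j, compare its true caption (diagonal) with every other caption i.
    cost_captions = np.maximum(0.0, margin - diag[None, :] + scores)
    # The true pairs themselves contribute nothing.
    np.fill_diagonal(cost_images, 0.0)
    np.fill_diagonal(cost_captions, 0.0)
    return float(cost_images.sum() + cost_captions.sum())

well_separated = [[1.0, 0.0], [0.0, 1.0]]
too_close = [[1.0, 0.8], [0.0, 1.0]]
print(pairwise_ranking_loss(well_separated, margin=0.5))
print(pairwise_ranking_loss(too_close, margin=0.5))
```

A larger margin demands a wider gap between matching and non-matching pairs before the loss reaches zero.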


Optimization

  • optimizer: The optimization method to use (currently only 'adam' is supported)
  • batch_size: The size of a minibatch.
  • max_epochs: The number of epochs used for training
  • lrate: Learning rate
  • grad_clip: Magnitude at which to clip the gradient

Training on different datasets

To train on a different dataset, put tokenized sentences and image features in the same format as those provided for COCO, add the relevant paths to, and modify to handle your dataset correctly.

If you're training on Flickr8k or Flickr30k, just put Karpathy's dataset_flickr{8,30}k.json file in the dataset directory, and run the scripts and The latter script requires a working Caffe installation, as well as the VGG19 model spec and weights.

The evaluation and batching code assume that there are exactly 5 captions per image; if your dataset doesn't have this property, you will need to modify both.
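One way to see why the 5-captions assumption matters is the index arithmetic it permits. The sketch below assumes captions are stored contiguously, grouped by image, so that caption k belongs to image k // 5. That layout is an assumption made for illustration, and the helper names are hypothetical.

```python
CAPTIONS_PER_IMAGE = 5  # the fixed ratio assumed by the evaluation and batching code

def image_index_for_caption(caption_index, captions_per_image=CAPTIONS_PER_IMAGE):
    """Index of the image that a given caption describes."""
    return caption_index // captions_per_image

def caption_indices_for_image(image_index, captions_per_image=CAPTIONS_PER_IMAGE):
    """Indices of all captions that describe a given image."""
    start = image_index * captions_per_image
    return list(range(start, start + captions_per_image))

print(image_index_for_caption(7))
print(caption_indices_for_image(2))
```

A dataset with a variable number of captions per image would need an explicit caption-to-image lookup table in place of this arithmetic.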


You can view plots of training errors and ranking metrics, as well as ROC curves for Image Retrieval, by running the visualization server. See the vis directory for more details.


If you found this code useful, please cite the following paper:

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun. "Order-Embeddings of Images and Language." arXiv preprint arXiv:1511.06361 (2015).

    @article{vendrov2015order,
      title={Order-embeddings of images and language},
      author={Vendrov, Ivan and Kiros, Ryan and Fidler, Sanja and Urtasun, Raquel},
      journal={arXiv preprint arXiv:1511.06361},
      year={2015}
    }


Apache License 2.0