
Simple yet high-performing image captioning model using Caffe and Python. Using image features from bottom-up attention, this model achieved state-of-the-art performance on all metrics of the COCO captions test leaderboard in July 2017 (SPICE 21.5, CIDEr 117.9, BLEU_4 36.9). The architecture (2-layer LSTM with attention) is described in Section 3.2 of:


If you use this code in your research, please cite our paper:

  author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
  title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  year = {2018}


This code is released under the MIT License (refer to the LICENSE file for details).

Requirements: software

  1. Important: Please use the version of Caffe provided as a submodule within this repository. It contains additional layers and features required for captioning.

  2. Requirements for Caffe and pycaffe (see: Caffe installation instructions)

    Note: Caffe must be built with support for Python layers and NCCL!

    # In your Makefile.config, make sure to have these lines uncommented
    WITH_PYTHON_LAYER := 1
    USE_NCCL := 1
    # Unrelatedly, it's also recommended that you use CUDNN
    USE_CUDNN := 1
  3. NVIDIA's NCCL library, which is used for multi-GPU training
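Before compiling, it can help to confirm that the required switches are really uncommented in Makefile.config. The following is a minimal sketch rather than part of the repository: the helper names are hypothetical, and WITH_PYTHON_LAYER is the switch stock Caffe uses for Python layers, assumed here to apply to the bundled submodule as well.

```python
import re

# USE_NCCL and USE_CUDNN come from the snippet above. WITH_PYTHON_LAYER is the
# stock Caffe switch for Python layers (an assumption for this submodule).
REQUIRED_FLAGS = ("WITH_PYTHON_LAYER", "USE_NCCL")
RECOMMENDED_FLAGS = ("USE_CUDNN",)

# Matches "NAME := 1" with an optional trailing comment. A line that starts
# with '#' cannot match, so commented-out flags are ignored.
_ENABLED = re.compile(r"^\s*([A-Z_0-9]+)\s*:=\s*1\s*(?:#.*)?$")


def enabled_flags(makefile_text):
    """Return the set of flag names that are set to 1 and not commented out."""
    found = set()
    for line in makefile_text.splitlines():
        match = _ENABLED.match(line)
        if match:
            found.add(match.group(1))
    return found


def missing_flags(makefile_text, required=REQUIRED_FLAGS):
    """List the required flags that are absent or still commented out."""
    enabled = enabled_flags(makefile_text)
    return [name for name in required if name not in enabled]
```

Reading the contents of your Makefile.config into a string and passing it to `missing_flags` should return an empty list once the build is configured as described; passing `RECOMMENDED_FLAGS` as the second argument checks the optional CUDNN switch in the same way.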

Requirements: hardware

By default, the provided training scripts assume that two GPUs are available, with indices 0 and 1. Training on two GPUs takes around 9 hours. Any NVIDIA GPU with 8GB or more memory should be OK. To train on a single GPU, the training scripts and prototxt files require minor modifications (e.g. set iter_size to 2).
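The iter_size suggestion follows from how Caffe accumulates gradients: with iter_size set to N, the solver accumulates gradients over N forward/backward passes before each weight update, and in multi-GPU training each GPU processes its own batch. The arithmetic below is only a sketch of that reasoning; the per-GPU batch size of 100 is a made-up placeholder, not a value taken from the provided prototxt files.

```python
def effective_batch_size(batch_size, num_gpus, iter_size=1):
    """Number of training examples that contribute to one weight update."""
    return batch_size * num_gpus * iter_size


# Hypothetical per-GPU batch size, for illustration only.
BATCH_SIZE = 100

two_gpu = effective_batch_size(BATCH_SIZE, num_gpus=2, iter_size=1)
one_gpu = effective_batch_size(BATCH_SIZE, num_gpus=1, iter_size=2)

# One GPU with iter_size 2 sees as many examples per update as two GPUs.
assert two_gpu == one_gpu
```

Keeping the effective batch size unchanged means the remaining solver settings should not need retuning, at the cost of roughly doubling the training time.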

Demo - Using the model to predict on new images

Run install instructions 1-4 below, then use the notebook at scripts/demo.ipynb


All instructions are run from the top-level directory. Only steps 1-4 are required to run the demo (the remaining steps are for training a model).

  1. Clone the Up-Down-Captioner repository:

    # Make sure to clone with --recursive
    git clone --recursive

    If you forget to clone with the --recursive flag, then you'll need to manually clone the submodules:

    git submodule update --init --recursive
  2. Build Caffe and pycaffe:

    cd ./external/caffe
    # If you're experienced with Caffe and have all of the requirements installed
    # and your Makefile.config in place, then simply do:
    make -j8 && make pycaffe
  3. Build the COCO tools:

    cd ./external/coco/PythonAPI
  4. Add python layers and caffe build to PYTHONPATH:

    cd $REPO_ROOT
    export PYTHONPATH=${PYTHONPATH}:$(pwd)/layers:$(pwd)/lib:$(pwd)/external/caffe/python
  5. Build Ross Girshick's Cython modules (to run the demo on new images)

    cd $REPO_ROOT/lib
  6. Download Stanford CoreNLP (required by the evaluation code):

    cd ./external/coco-caption
  7. Download the MS COCO train/val image caption annotations. Extract all the json files into one folder $COCOdata, then create a symlink to this location:

    cd $REPO_ROOT/data
    ln -s $COCOdata coco
  8. Pre-process the caption annotations for training (building vocabs etc).

    cd $REPO_ROOT
    python scripts/
  9. Download or generate pretrained image features following the instructions below.
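The PYTHONPATH export in step 4 only affects the shell it is run in. When working from a notebook or an IDE that was started separately, the same three directories can be appended to sys.path instead. The helper below is a hypothetical sketch (it is not shipped with the repository) and assumes the directory layout implied by step 4.

```python
import os
import sys

# Sub-directories listed in the PYTHONPATH export of step 4.
PYTHONPATH_ENTRIES = (
    "layers",
    "lib",
    os.path.join("external", "caffe", "python"),
)


def add_repo_paths(repo_root, path_list=None):
    """Append the step 4 directories to path_list (sys.path by default).

    Returns the entries that were newly added, so repeated calls are harmless.
    """
    if path_list is None:
        path_list = sys.path
    added = []
    for entry in PYTHONPATH_ENTRIES:
        full_path = os.path.join(os.path.abspath(repo_root), entry)
        if full_path not in path_list:
            path_list.append(full_path)
            added.append(full_path)
    return added
```

Calling `add_repo_paths(os.getcwd())` from the top-level directory before `import caffe` should have the same effect as the export, provided Caffe and pycaffe were built in step 2.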

Pretrained image features


The captioner takes pretrained image features as input (and does not fine-tune them). For best performance, bottom-up attention features should be used. Code for generating these features can be found here. For ease of use, we provide pretrained features for the MSCOCO dataset. Manually download the following tsv file and unzip to data/tsv/:

To make a test server submission, you would also need these features:

Alternatively, to generate conventional pretrained features from the ResNet-101 CNN:

  • Download the pretrained ResNet-101 model and save it in baseline/ResNet-101-model.caffemodel
  • Download the MS COCO train/val images, and extract them into data/images.
  • Run:


To train the model on the Karpathy training set, and then generate and evaluate captions on the Karpathy test set (using bottom-up attention features):


Trained snapshots are saved under: snapshots/caption_lstm/

Logging outputs are saved under: logs/caption_lstm/

Generated caption outputs are saved under: outputs/caption_lstm/

Scores for the generated captions (on the Karpathy test set) are saved under: scores/caption_lstm/

To train and evaluate the baseline using conventional pretrained features, follow the instructions above but replace caption_lstm with caption_lstm_baseline_resnet.
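The four output locations share one pattern: a top-level folder per output kind, with one sub-directory per experiment. The sketch below captures that pattern so the same lookup works for both experiments. The function name is hypothetical, and it assumes the baseline writes to matching caption_lstm_baseline_resnet folders, as the replacement rule above suggests.

```python
import os

# Top-level output folders named in the text above.
OUTPUT_ROOTS = {
    "snapshots": "snapshots",
    "logs": "logs",
    "captions": "outputs",
    "scores": "scores",
}


def experiment_dirs(experiment="caption_lstm", repo_root="."):
    """Map each output kind to its directory for the given experiment."""
    return {
        kind: os.path.join(repo_root, root, experiment)
        for kind, root in OUTPUT_ROOTS.items()
    }


# Print the locations for both experiments mentioned in this section.
for name in ("caption_lstm", "caption_lstm_baseline_resnet"):
    for kind, path in sorted(experiment_dirs(name).items()):
        print("%-10s %s" % (kind, path))
```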


Results (using bottom-up attention features) should be similar to the numbers below (as reported in Table 1 of the paper).

                     BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
Cross-Entropy Loss   77.2    36.2    27.0    56.4     113.5  20.3
CIDEr Optimization   79.8    36.3    27.7    56.9     120.1  21.4
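To judge whether a training run is "similar" to these numbers, the reference scores can be compared against your own metric by metric. This is a rough sketch, not part of the evaluation code: the metric order follows Table 1 of the paper, the function name and the tolerance of 1.0 point are arbitrary choices, and it assumes your scores are expressed on the same 0-100 scale.

```python
# Reference scores copied from the two rows above.
METRICS = ("BLEU-1", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr", "SPICE")
REFERENCE = {
    "Cross-Entropy Loss": (77.2, 36.2, 27.0, 56.4, 113.5, 20.3),
    "CIDEr Optimization": (79.8, 36.3, 27.7, 56.9, 120.1, 21.4),
}


def compare_to_reference(scores, objective="Cross-Entropy Loss", tolerance=1.0):
    """Return {metric: difference} for metrics off by more than tolerance.

    scores maps metric names to values on the 0-100 scale used in the table.
    The default tolerance is an arbitrary choice, not a documented threshold.
    """
    outliers = {}
    for name, expected in zip(METRICS, REFERENCE[objective]):
        difference = round(scores[name] - expected, 1)
        if abs(difference) > tolerance:
            outliers[name] = difference
    return outliers
```

An empty result means every metric is within the chosen tolerance of the reference row; otherwise the returned differences show which metrics deviate and in which direction.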

Other useful scripts

  1. scripts/ The version of Caffe provided as a submodule with this repo includes (among other things) a custom LSTMNode layer that enables sampling and beam search through LSTM layers. However, the resulting network architecture prototxt files are quite complicated. The file scripts/ scaffolds out network structures, such as those in experiments.

  2. layers/ The provided net.prototxt file uses a Python data layer (layers/) that loads all training data (including image features) into memory. If you have insufficient system memory, use this Python data layer instead by replacing module: "rcnn_layers" with module: "efficient_rcnn_layers" in experiments/caption_lstm/net.prototxt.

  3. scripts/ Basic script for plotting validation set scores during training.
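The module swap described in item 2 is a plain text substitution, so it can be scripted rather than edited by hand. The helper below is a hypothetical sketch that works on the prototxt contents as a string; it raises an error if the expected entry is missing, so a file that has already been converted is not silently rewritten.

```python
OLD_MODULE = 'module: "rcnn_layers"'
NEW_MODULE = 'module: "efficient_rcnn_layers"'


def use_efficient_data_layer(prototxt_text):
    """Swap the in-memory data layer module for the memory-efficient one."""
    if OLD_MODULE not in prototxt_text:
        raise ValueError("no %s entry found" % OLD_MODULE)
    return prototxt_text.replace(OLD_MODULE, NEW_MODULE)


# Illustrative fragment only; the layer name below is a placeholder and is
# not copied from the real net.prototxt.
sample = 'python_param {\n  module: "rcnn_layers"\n  layer: "PlaceholderLayer"\n}\n'
print(use_efficient_data_layer(sample))
```

To apply it, read experiments/caption_lstm/net.prototxt, pass the contents through the function, and write the result back, ideally after making a backup copy of the original file.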