Automatic image captioning model based on Caffe, using features from bottom-up attention.


Simple yet high-performing image captioning model using Caffe and Python. Using image features from bottom-up attention, in July 2017 this model achieved state-of-the-art performance on all metrics of the COCO captions test leaderboard (SPICE 21.5, CIDEr 117.9, BLEU_4 36.9). The architecture (2-layer LSTM with attention) is described in Section 3.2 of the paper cited below.
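The two-layer decoder can be sketched in plain numpy: an attention LSTM attends over the set of bottom-up region features, and a language LSTM predicts the next word from the attended feature. The sketch below uses random weights and assumed dimensions (e.g. 36 regions, 512 hidden units) purely for illustration; it is not the repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_cell(x, h, c, W):
    """One LSTM step; W maps the concatenated [input; hidden] to the four gates."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)
    return o * np.tanh(c), c

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative dimensions: feature dim, hidden size, word embedding, num regions
D, H, E, K = 2048, 512, 300, 36

V = rng.normal(size=(K, D))      # bottom-up region features, one row per region
v_mean = V.mean(axis=0)          # mean-pooled image feature

# Random weights stand in for trained parameters
W1 = rng.normal(scale=0.01, size=(4 * H, E + D + H + H))  # attention LSTM
W2 = rng.normal(scale=0.01, size=(4 * H, D + H + H))      # language LSTM
Wv = rng.normal(scale=0.01, size=(H, D))
Wh = rng.normal(scale=0.01, size=(H, H))
wa = rng.normal(scale=0.01, size=H)

h1 = c1 = h2 = c2 = np.zeros(H)
x_word = rng.normal(size=E)      # embedding of the previous word

def step(x_word, h1, c1, h2, c2):
    # Attention LSTM: sees the word, the mean feature, and the language LSTM state
    h1, c1 = lstm_cell(np.concatenate([x_word, v_mean, h2]), h1, c1, W1)
    # Soft attention over the K region features
    alpha = softmax(np.tanh(V @ Wv.T + h1 @ Wh.T) @ wa)
    v_hat = alpha @ V
    # Language LSTM: sees the attended feature and the attention LSTM state
    h2, c2 = lstm_cell(np.concatenate([v_hat, h1]), h2, c2, W2)
    return h1, c1, h2, c2, alpha

h1, c1, h2, c2, alpha = step(x_word, h1, c1, h2, c2)
```

At each decoding step `alpha` is a distribution over the image regions, which is what makes the generated captions interpretable.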


If you use this code in your research, please cite our paper:

  @inproceedings{Anderson2017up-down,
    author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
    title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
    booktitle = {CVPR},
    year = {2018}
  }


This code is released under the MIT License (refer to the LICENSE file for details).

Requirements: software

  1. Important: Please use the version of Caffe provided as a submodule within this repository. It contains additional layers and features required for captioning.

  2. Requirements for Caffe and pycaffe (see: Caffe installation instructions)

    Note: Caffe must be built with support for Python layers and NCCL!

    # In your Makefile.config, make sure to have these lines uncommented
    USE_NCCL := 1
    # Unrelatedly, it's also recommended that you use CUDNN
    USE_CUDNN := 1
  3. Nvidia's NCCL library which is used for multi-GPU training
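Before building, it's worth confirming that both required flags are actually uncommented in your Makefile.config. A small illustrative check (the parsing is a deliberate simplification of Makefile syntax; `WITH_PYTHON_LAYER` is Caffe's flag for Python layer support):

```python
def required_flags_enabled(config_text):
    """Return True if USE_NCCL and WITH_PYTHON_LAYER are both set to 1."""
    flags = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or ":=" not in line:
            continue  # skip comments and non-assignment lines
        key, _, value = line.partition(":=")
        flags[key.strip()] = value.strip()
    return flags.get("USE_NCCL") == "1" and flags.get("WITH_PYTHON_LAYER") == "1"

sample = """
# CPU_ONLY := 1
USE_NCCL := 1
WITH_PYTHON_LAYER := 1
USE_CUDNN := 1
"""
```

If either flag is still commented out, `required_flags_enabled` returns False and the resulting build will fail at runtime when the captioning layers are loaded.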

Requirements: hardware

By default, the provided training scripts assume that two GPUs are available, with indices 0 and 1. Training on two GPUs takes around 9 hours. Any NVIDIA GPU with 8GB or more memory should be OK. Training scripts and prototxt files will require minor modifications to train on a single GPU (e.g. set iter_size to 2).
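The iter_size suggestion keeps the effective batch size unchanged: Caffe accumulates gradients over iter_size forward/backward passes on each GPU, so the effective batch is the per-GPU batch size times iter_size times the number of GPUs. A sketch, using an assumed per-GPU batch of 100 (illustrative only, not necessarily the repo's value):

```python
def effective_batch(batch_per_gpu, iter_size, num_gpus):
    """Caffe's effective batch size: gradients are averaged over iter_size
    accumulation steps on each of num_gpus GPUs."""
    return batch_per_gpu * iter_size * num_gpus

# Two GPUs without accumulation vs. one GPU with iter_size doubled:
two_gpus = effective_batch(100, iter_size=1, num_gpus=2)
one_gpu = effective_batch(100, iter_size=2, num_gpus=1)
assert two_gpus == one_gpu == 200  # same gradient statistics, half the throughput
```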

Demo - Using the model to predict on new images

Run install instructions 1-4 below, then use the notebook at scripts/demo.ipynb


All instructions are from the top level directory. To run the demo, only steps 1-4 are required (the remaining steps are for training a model).

  1. Clone the Up-Down-Captioner repository:

    # Make sure to clone with --recursive
    git clone --recursive

    If you forget to clone with the --recursive flag, then you'll need to manually clone the submodules:

    git submodule update --init --recursive
  2. Build Caffe and pycaffe:

    cd ./external/caffe
    # If you're experienced with Caffe and have all of the requirements installed
    # and your Makefile.config in place, then simply do:
    make -j8 && make pycaffe
  3. Build the COCO tools:

    cd ./external/coco/PythonAPI
    make
  4. Add python layers and caffe build to PYTHONPATH:

    cd $REPO_ROOT
    export PYTHONPATH=${PYTHONPATH}:$(pwd)/layers:$(pwd)/lib:$(pwd)/external/caffe/python
  5. Build Ross Girshick's Cython modules (to run the demo on new images)

    cd $REPO_ROOT/lib
    make
  6. Download Stanford CoreNLP (required by the evaluation code):

    cd ./external/coco-caption
    ./get_stanford_models.sh
  7. Download the MS COCO train/val image caption annotations. Extract all the json files into one folder $COCOdata, then create a symlink to this location:

    cd $REPO_ROOT/data
    ln -s $COCOdata coco
  8. Pre-process the caption annotations for training (building vocabs etc).

    cd $REPO_ROOT
    python scripts/
  9. Download or generate pretrained image features following the instructions below.
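As an illustration of what the vocabulary-building part of step 8 involves, here is a minimal sketch (the function name, token handling, and frequency threshold are assumptions for illustration, not the repo's actual preprocessing script):

```python
from collections import Counter

def build_vocab(annotations, min_count=5):
    """Count words over all captions and keep those above a frequency threshold."""
    counts = Counter()
    for ann in annotations:
        counts.update(ann["caption"].lower().replace(".", "").split())
    # Reserve index 0 for an end-of-sentence token, as caption models typically do
    words = ["<eos>"] + sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

# Toy captions in place of the real MS COCO annotation json
anns = [{"caption": "a man riding a horse"}] * 5 + [{"caption": "a rare zebra"}]
vocab = build_vocab(anns)
# "rare" and "zebra" appear only once and fall below min_count
```

Rare words fall below the threshold and would be mapped to an unknown-word token at training time.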

Pretrained image features


The captioner takes pretrained image features as input (and does not finetune them). For best performance, bottom-up attention features should be used. Code for generating these features is available in the bottom-up-attention repository. For ease of use, we provide pretrained features for the MSCOCO dataset. Manually download the following tsv file and unzip to data/tsv/:
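Each row of these tsv files packs one image's region features. A reader sketch, assuming the field layout used by the bottom-up-attention feature release (image id, width, height, box count, then base64-encoded float32 boxes and 2048-d features); the exact file name and layout should be checked against the download:

```python
import base64
import csv
import sys

import numpy as np

# tsv rows are long, so lift csv's default field size limit
csv.field_size_limit(sys.maxsize)

# Assumed field layout of the bottom-up-attention feature tsv files
FIELDS = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

def read_tsv(path):
    """Yield (image_id, boxes, features) for each image row in the tsv file."""
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t", fieldnames=FIELDS):
            num_boxes = int(row["num_boxes"])
            # boxes: (num_boxes, 4) float32; features: (num_boxes, 2048) float32
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(num_boxes, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(num_boxes, 2048)
            yield int(row["image_id"]), boxes, feats
```

Reading lazily with a generator keeps memory bounded, since the full feature files are tens of gigabytes uncompressed.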

To make a test server submission, you would also need these features:

Alternatively, to generate conventional pretrained features from the ResNet-101 CNN:

  • Download the pretrained ResNet-101 model and save it in baseline/ResNet-101-model.caffemodel
  • Download the MS COCO train/val images, and extract them into data/images.
  • Run:


To train the model on the karpathy training set, and then generate and evaluate captions on the karpathy testing set (using bottom-up attention features):


Trained snapshots are saved under: snapshots/caption_lstm/

Logging outputs are saved under: logs/caption_lstm/

Generated caption outputs are saved under: outputs/caption_lstm/

Scores for the generated captions (on the karpathy test set) are saved under: scores/caption_lstm/

To train and evaluate the baseline using conventional pretrained features, follow the instructions above but replace caption_lstm with caption_lstm_baseline_resnet.


Results (using bottom-up attention features) should be similar to the numbers below (as reported in Table 1 of the paper).

| | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|---|---|---|
| Cross-Entropy Loss | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 |
| CIDEr Optimization | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |

Other useful scripts

  1. scripts/ The version of Caffe provided as a submodule with this repo includes (amongst other things) a custom LSTMNode layer that enables sampling and beam search through LSTM layers. However, the resulting network architecture prototxt files are quite complicated. This script generates those network structures, such as the ones found in experiments.

  2. layers/ The provided net.prototxt file uses a python data layer (layers/ that loads all training data (including image features) into memory. If you have insufficient system memory, use the more memory-efficient (but slower) data layer instead, by replacing module: "rcnn_layers" with module: "efficient_rcnn_layers" in experiments/caption_lstm/net.prototxt.

  3. scripts/ Basic script for plotting validation set scores during training.
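The data layers mentioned above follow Caffe's Python-layer interface (setup, reshape, forward, backward). A stand-alone sketch of that structure, using stub classes in place of caffe.Layer and Caffe blobs so it runs without pycaffe, and with illustrative shapes rather than the repo's actual values:

```python
import numpy as np

class Layer(object):
    """Stand-in for caffe.Layer; a real layer would subclass caffe.Layer."""
    pass

class Blob(object):
    """Minimal stand-in for a Caffe blob."""
    def __init__(self):
        self.data = None
    def reshape(self, *shape):
        self.data = np.zeros(shape, dtype=np.float32)

class FeatureDataLayer(Layer):
    """Sketch of a Python data layer serving pre-extracted image features."""
    def setup(self, bottom, top):
        self.batch_size, self.num_boxes, self.feat_dim = 2, 36, 2048
        # The in-memory variant loads everything up front; the memory-efficient
        # variant would instead read features from disk inside forward()
        self.features = np.random.rand(10, self.num_boxes, self.feat_dim)
        self.cursor = 0

    def reshape(self, bottom, top):
        top[0].reshape(self.batch_size, self.num_boxes, self.feat_dim)

    def forward(self, bottom, top):
        for i in range(self.batch_size):
            top[0].data[i] = self.features[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.features)

    def backward(self, top, propagate_down, bottom):
        pass  # data layers produce no gradients

# Drive one batch through the layer, as Caffe's solver would
layer, top = FeatureDataLayer(), [Blob()]
layer.setup([], top)
layer.reshape([], top)
layer.forward([], top)
```

The net.prototxt hooks such a layer in via its module and layer names, which is why switching data layers only requires editing the module: line.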