
UpDown Captioner Baseline for nocaps

Baseline model for nocaps benchmark, a re-implementation based on the UpDown image captioning model trained on the COCO dataset (only).

Check out our package documentation!

[Image: predictions generated by the UpDown model]

If you find this code useful, please consider citing:

@article{nocaps,
  author  = {Harsh Agrawal* and Karan Desai* and Yufei Wang and Xinlei Chen and Rishabh Jain and
             Mark Johnson and Dhruv Batra and Devi Parikh and Stefan Lee and Peter Anderson},
  title   = {{nocaps}: {n}ovel {o}bject {c}aptioning {a}t {s}cale},
  journal = {arXiv preprint arXiv:1812.08658},
  year    = {2018},
}

As well as the paper that proposed this model:

@inproceedings{Anderson2017UpDown,
  author    = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson
               and Stephen Gould and Lei Zhang},
  title     = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  booktitle = {CVPR},
  year      = {2018}
}

How to set up this codebase?

This codebase requires Python 3.6 or higher. It uses PyTorch v1.1 and supports CUDA 9 and cuDNN 7 out of the box. The recommended way to set up this codebase is through Anaconda or Miniconda, although it should work just as well with VirtualEnv.

Install Dependencies

  1. Install the Anaconda or Miniconda distribution based on Python 3 from their downloads site.

  2. Clone the repository.

git clone
cd updown-baseline
  3. Create a conda environment, install all the dependencies, and install this codebase as a package in development mode.
conda create -n updown python=3.6
conda activate updown
pip install -r requirements.txt
python develop

Note: If installation of the evalai package fails, install these system packages and try again:

sudo apt-get install libxml2-dev libxslt1-dev

Now you can import updown from anywhere in your filesystem as long as you have this conda environment activated.

Download Image Features

We provide pre-extracted bottom-up features for the COCO and nocaps splits, extracted using a Faster R-CNN detector pretrained on Visual Genome (Anderson et al. 2017). We extract features from up to 100 region proposals per image and select them based on a confidence threshold of 0.2, which yields 10-100 features per image (adaptive).
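Since the number of regions varies per image, batching typically requires padding the features to a fixed size. A minimal sketch, assuming 2048-dimensional features and a hypothetical pad_features helper (not the codebase's actual API):

```python
import numpy as np

# Illustrative sketch: each image yields a float32 array of shape
# (num_boxes, 2048) with 10-100 boxes; we zero-pad to a fixed size so
# images can be stacked into a batch. Names and shapes are assumptions.
MAX_BOXES, FEATURE_SIZE = 100, 2048

def pad_features(features: np.ndarray, max_boxes: int = MAX_BOXES) -> np.ndarray:
    """Zero-pad a (num_boxes, feature_size) array to (max_boxes, feature_size)."""
    padded = np.zeros((max_boxes, features.shape[1]), dtype=features.dtype)
    padded[: features.shape[0]] = features
    return padded

# Example: an image with only 37 high-confidence regions.
feats = np.random.rand(37, FEATURE_SIZE).astype(np.float32)
padded = pad_features(feats)
print(padded.shape)  # (100, 2048)
```

A boolean mask over the padded rows (not shown) would let the attention mechanism ignore the zero padding.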

Download (or symlink) the image features under $PROJECT_ROOT/data directory:

coco_train2017, coco_val2017, nocaps_val, nocaps_test.

Download Annotations

Download COCO captions and nocaps val/test image info and arrange in a directory structure as follows:

    |-- coco
    |   +-- annotations
    |       |-- captions_train2017.json
    |       +-- captions_val2017.json
    +-- nocaps
        +-- annotations
            |-- nocaps_val_image_info.json
            +-- nocaps_test_image_info.json
  1. COCO captions:
  2. nocaps val image info:
  3. nocaps test image info:


Build the caption vocabulary using COCO train2017 captions:

python scripts/ -c data/coco/captions_train2017.json -o data/vocabulary
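The vocabulary-building step can be sketched roughly as follows; the min_count threshold, special-token names, and the tiny in-memory caption list are illustrative assumptions, not the script's actual values:

```python
from collections import Counter

def build_vocabulary(captions, min_count=2):
    """Keep lowercase tokens above a minimum frequency, plus special tokens."""
    counts = Counter(
        token for caption in captions for token in caption.lower().split()
    )
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return ["<unk>", "<s>", "</s>"] + words

# Tiny made-up caption set standing in for captions_train2017.json.
captions = [
    "a man riding a horse",
    "a man on a horse",
    "two dogs playing",
]
vocab = build_vocabulary(captions)
print(vocab)  # ['<unk>', '<s>', '</s>', 'a', 'horse', 'man']
```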

Evaluation Server

nocaps val and test annotations are held private behind EvalAI. To evaluate on nocaps, create an account on EvalAI and copy your auth token from your profile details. Set the token through the EvalAI CLI as follows:

evalai set_token <your_token_here>

You are all set to use this codebase!


Training

We manage experiments through config files -- a config file should contain arguments specific to a particular experiment, such as those defining model architecture or optimization hyperparameters. Other arguments, such as GPU IDs or the number of CPU workers, should be declared in the script and passed in as argparse-style arguments. Train a baseline UpDown Captioner with all the default hyperparameters as follows; this reproduces the results of the first row in the nocaps val/test tables from our paper.

python scripts/ \
    --config-yml configs/updown_nocaps_val.yaml \
    --gpu-ids 0 --serialization-dir checkpoints/updown-baseline

Refer to updown/ for default hyperparameters. For other configurations, pass a path to the config file through the --config-yml argument, and/or a set of key-value pairs through the --config-override argument. For example:

python scripts/ \
    --config-yml configs/updown_nocaps_val.yaml \
    --config-override OPTIM.BATCH_SIZE 250 \
    --gpu-ids 0 --serialization-dir checkpoints/updown-baseline
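Conceptually, such key-value overrides update a nested dictionary of defaults. A minimal sketch, where the default values and the apply_overrides helper are hypothetical rather than the codebase's actual config API:

```python
import copy

def apply_overrides(config, overrides):
    """Apply a flat [key1, value1, key2, value2, ...] list of overrides,
    casting each value to the type of the default it replaces."""
    config = copy.deepcopy(config)
    for key, value in zip(overrides[::2], overrides[1::2]):
        section, name = key.split(".")
        config[section][name] = type(config[section][name])(value)
    return config

# Made-up defaults standing in for the real config.
defaults = {"OPTIM": {"BATCH_SIZE": 150, "LR": 0.015}}
updated = apply_overrides(defaults, ["OPTIM.BATCH_SIZE", "250"])
print(updated["OPTIM"]["BATCH_SIZE"])  # 250
```

Casting to the default's type keeps command-line strings like "250" from silently replacing integers with strings.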

Multi-GPU Training

Multi-GPU training is fully supported; pass GPU IDs as --gpu-ids 0 1 2 3.
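Conceptually, data-parallel training gives each listed GPU an equal slice of every batch (PyTorch's DataParallel handles this internally); a toy sketch of the split:

```python
def split_batch(batch, gpu_ids):
    """Divide a batch into near-equal contiguous shards, one per GPU."""
    chunk = (len(batch) + len(gpu_ids) - 1) // len(gpu_ids)
    return {
        gpu: batch[i * chunk : (i + 1) * chunk]
        for i, gpu in enumerate(gpu_ids)
    }

# Eight samples split across four GPUs -> two samples per GPU.
shards = split_batch(list(range(8)), gpu_ids=[0, 1, 2, 3])
print(shards)  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```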

Saving Model Checkpoints

This script serializes model checkpoints every few iterations and keeps track of the best-performing checkpoint based on overall CIDEr score. Refer to updown/utils/ for more details on how checkpointing is managed. A copy of the configuration file used for a particular experiment is also saved under --serialization-dir.
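The best-checkpoint logic can be sketched as follows; this CheckpointManager is a simplified stand-in for illustration, not the actual class in updown/utils/:

```python
class CheckpointManager:
    """Track the best checkpoint by overall CIDEr (simplified sketch)."""

    def __init__(self):
        self.best_cider = float("-inf")
        self.best_step = None
        self.best_state = None

    def step(self, iteration, cider, state_dict):
        # A real manager would torch.save(...) every call; here we only
        # remember the best-scoring state in memory.
        if cider > self.best_cider:
            self.best_cider, self.best_step = cider, iteration
            self.best_state = state_dict

# Made-up validation scores over three checkpoints.
manager = CheckpointManager()
for it, cider in [(1000, 48.2), (2000, 55.3), (3000, 53.9)]:
    manager.step(it, cider, state_dict={"iteration": it})
print(manager.best_step, manager.best_cider)  # 2000 55.3
```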

Logging

This script logs loss curves and metrics to Tensorboard; log files are written under --serialization-dir. Execute tensorboard --logdir /path/to/serialization_dir --port 8008 and visit localhost:8008 in your browser.

Evaluation and Inference

Generate predictions for nocaps val or nocaps test using a pretrained checkpoint:

python scripts/ \
    --config-yml /path/to/config.yaml \
    --checkpoint-path /path/to/checkpoint.pth \
    --output-path /path/to/save/predictions.json \
    --gpu-ids 0

Add the --evalai-submit flag if you wish to submit the predictions directly to EvalAI and get results.
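For reference, a predictions file is a JSON list of image-id/caption records, following the COCO results convention (assumed here; the records below are made up):

```python
import json

# Illustrative predictions file for submission: a JSON list of
# {"image_id", "caption"} records. The ids and captions are invented.
predictions = [
    {"image_id": 1, "caption": "a dog chasing a frisbee in a park"},
    {"image_id": 2, "caption": "a plate of food on a wooden table"},
]
with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```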


Results

A pre-trained checkpoint with the provided config is available to download here:

  1. Checkpoint (.pth file):
  2. Predictions on nocaps val:
  3. Predictions on nocaps test:
         in-domain        near-domain      out-of-domain    overall
         CIDEr  SPICE     CIDEr  SPICE     CIDEr  SPICE     B-1   B-4   METEOR  ROUGE-L  CIDEr  SPICE
  val    78.1   11.6      57.7   10.3      31.3   8.3       73.7  18.3  22.7    50.4     55.3   10.1
  test   74.3   11.5      56.9   10.3      30.1   8.1       74.0  19.2  23.0    51.0     54.3   10.1