
Dense Captioning with Joint Inference and Visual Context

This repo contains the released code for the dense image captioning models described in the CVPR 2017 paper:

@inproceedings{yang2017dense,
  author       = "Linjie Yang and Kevin Tang and Jianchao Yang and Li-Jia Li",
  title        = "Dense Captioning with Joint Inference and Visual Context",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)",
  month        = "Jul",
  year         = "2017"
}

All code is provided for research purposes only and without any warranty. Any commercial use requires our consent. When using the code in your research work, please cite the above paper. Our code is adapted from the popular Faster-RCNN repo written by Ross Girshick, which is based on the open-source deep learning framework Caffe. The evaluation code is adapted from the COCO captioning evaluation code.


Compile Caffe

Please follow the official Caffe installation guide. CUDA 7.5+ and cuDNN 5.0+ are supported. Tested on Ubuntu 14.04.
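As a rough sketch, the stock Make-based Caffe build looks like the following; this reflects the standard upstream Caffe procedure, not commands specific to this repo, and Makefile.config must first be edited for your local CUDA/cuDNN/BLAS paths:

```shell
# Standard upstream Caffe build (adjust Makefile.config first)
cp Makefile.config.example Makefile.config
make all -j8
make pycaffe          # Python bindings used by the training/demo scripts
make test && make runtest   # optional: build and run the unit tests
```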

Compile local libraries

Build the Cython modules and other local libraries under lib/. As in the py-faster-rcnn codebase this repo is adapted from, this presumably amounts to:

cd lib
make


Demo

Download the official sample model here. This model is the Twin-LSTM with late context fusion (fused by summation) described in the paper. To test the model, run the following command from the library root folder.

python ./lib/tools/ --image [IMAGE_PATH] --gpu [GPU_ID] --net [MODEL_PATH]

It will generate a folder named "demo" in the library root. Inside the "demo" folder, there will be an HTML page showing the predicted results.
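The page is just static HTML. As a minimal sketch of the kind of visualization it contains (function names and the prediction format here are illustrative, not the repo's actual API):

```python
from html import escape

def render_demo_page(image_path, predictions):
    """Render predicted regions as a simple HTML table.

    `predictions` is a list of (caption, score, (x, y, w, h)) tuples --
    an illustrative format, not the repo's actual output schema.
    """
    rows = "\n".join(
        "<tr><td>{}</td><td>{:.3f}</td><td>({}, {}, {}, {})</td></tr>".format(
            escape(cap), score, x, y, w, h
        )
        for cap, score, (x, y, w, h) in predictions
    )
    return (
        "<html><body>"
        + "<img src='{}' width='600'>".format(escape(image_path))
        + "<table><tr><th>caption</th><th>score</th><th>box</th></tr>"
        + rows
        + "</table></body></html>"
    )

page = render_demo_page(
    "demo/input.jpg",
    [("a dog on the grass", 0.91, (10, 20, 120, 80))],
)
```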


Data preparation

For model training you will need to download the Visual Genome dataset from the Visual Genome website; either version 1.0 or 1.2 is fine. Download the pre-trained VGG16 model from the link provided. Modify the data paths in models/dense_cap/ and run it from the library root to generate the training/validation/testing data.
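For orientation, each Visual Genome image is annotated with regions (a bounding box plus a free-form phrase), and the preprocessing step turns these into training records of roughly this shape. The field names, padding scheme, and vocabulary convention below are illustrative assumptions, not the repo's exact format:

```python
from collections import Counter

def build_vocab(region_phrases, min_count=1):
    """Build a word-to-id map over region phrases.
    Index 0 is reserved for an <EOS>/padding token, as is common in
    captioning pipelines (an assumption, not this repo's exact scheme)."""
    counts = Counter(w for phrase in region_phrases
                     for w in phrase.lower().split())
    vocab = {"<EOS>": 0}
    for word, c in counts.most_common():
        if c >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode_region(box, phrase, vocab, max_len=10):
    """Encode one annotated region as a bbox plus padded token ids."""
    ids = [vocab.get(w, 0) for w in phrase.lower().split()][:max_len]
    ids += [0] * (max_len - len(ids))  # pad with <EOS>
    return {"bbox": box, "tokens": ids}

vocab = build_vocab(["a man riding a horse", "a red car"])
rec = encode_region((5, 5, 100, 60), "a red car", vocab)
```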

Start training

Run models/dense_cap/ to start training. For example, to train a model with joint inference and visual context (late fusion, feature summation) on Visual Genome 1.0:

./models/dense_cap/ [GPU_ID] visual_genome late_fusion_sum [VGG_MODEL_PATH] 

Training typically takes about three days. Note that, due to a limitation of the Python implementation, multi-GPU training is not available in this library. Only the Twin-LSTM structure for joint inference and late fusion (with three fusion operators: summation, multiplication, and concatenation) for context fusion are provided; the other structures described in the paper can be implemented by adapting the existing code.
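The three late-fusion operators amount to elementwise summation, elementwise multiplication, or concatenation of the local (region) feature and the global (context) feature. A plain-Python sketch, with small illustrative vectors rather than the real 512-d Caffe blobs:

```python
def late_fusion(local_feat, global_feat, mode="sum"):
    """Fuse a region feature with its image-level context feature.

    mode: "sum"    -> elementwise summation (the released sample model)
          "mul"    -> elementwise multiplication
          "concat" -> concatenation (doubles the feature dimension)
    """
    if mode == "sum":
        return [l + g for l, g in zip(local_feat, global_feat)]
    if mode == "mul":
        return [l * g for l, g in zip(local_feat, global_feat)]
    if mode == "concat":
        return list(local_feat) + list(global_feat)
    raise ValueError("unknown fusion mode: %s" % mode)

local_feat = [1.0, 1.0, 1.0, 1.0]    # region feature (illustrative)
global_feat = [2.0, 2.0, 2.0, 2.0]   # image context feature (illustrative)
```

Note that the concatenation variant changes the fused feature dimension, which is why its prototxt differs downstream.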


Testing

Modify models/dense_cap/ according to the model you want to test. For example, to test the provided sample model, the command will look like this:

time ./lib/tools/ --gpu ${GPU_ID} \
  --def_feature models/${PT_DIR}/vgg_region_global_feature.prototxt \
  --def_recurrent models/${PT_DIR}/test_cap_pred_context.prototxt \
  --def_embed models/${PT_DIR}/test_word_embedding.prototxt \
  --net ${NET_FINAL} \
  --imdb ${TEST_IMDB} \
  --cfg models/${PT_DIR}/dense_cap.yml

The sample model should get an mAP of around 9.05. Apart from the model path (NET_FINAL), the only thing you should change is def_recurrent: use models/${PT_DIR}/test_cap_pred_no_context.prototxt for models without context information and models/${PT_DIR}/test_cap_pred_context.prototxt for models with context fusion. To test late fusion models with other fusion operators, modify test_cap_pred_context.prototxt: change the "local_global_fusion" layer to elementwise multiplication or concatenation accordingly. To visualize the results, add --vis to the end of the above command; it will generate an HTML page for each image visualizing the results under output/dense_cap/${TEST_IMDB}/vis.
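For reference, switching the fusion operator means editing the fusion layer in the prototxt roughly like this. Caffe's Eltwise layer supports SUM and PROD, while concatenation uses the separate Concat layer type; the bottom/top blob names below are illustrative, so check them against the actual test_cap_pred_context.prototxt:

```protobuf
# Summation (the released sample model); use operation: PROD for
# elementwise multiplication.
layer {
  name: "local_global_fusion"
  type: "Eltwise"
  bottom: "local_feature"
  bottom: "global_feature"
  top: "fused_feature"
  eltwise_param { operation: SUM }
}

# Concatenation variant. The fused dimension doubles, so the shapes of
# downstream layers must be adjusted to match.
layer {
  name: "local_global_fusion"
  type: "Concat"
  bottom: "local_feature"
  bottom: "global_feature"
  top: "fused_feature"
}
```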


Contact

If you have any questions regarding the repo, please send email to Linjie Yang.
