Image Captioning Transformer
The purpose of this project is to experiment with several approaches to image captioning using Transformers and multi-modal pre-trained models like ViLBERT. It is an extension of the pytorch/fairseq sequence modeling toolkit and is inspired by elements of the following papers:
- Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. Multimodal transformer with multi-view visual representation for image captioning. arXiv preprint arXiv:1905.07841, 2019.
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural image caption generation with visual attention. In ICML, 2015.
This project is still early work in progress. Basic training and usage of transformer-based image captioning models are possible, but hyper-parameters have not been tuned yet. ViLBERT integration is still pending.
- Install NCCL for multi-GPU training.
- Install apex with the `--cuda_ext` option for faster training.
- Create a conda environment with `conda env create -f environment.yml`.
- Activate the conda environment with `conda activate fairseq-image-captioning`.
Models are currently trained with the MS-COCO dataset. To set up the dataset for training, create an `ms-coco` directory in the project's root directory, download the MS-COCO 2014 training images, validation images and annotations into the created `ms-coco` directory and extract the archives there. The resulting directory structure should look like

```
ms-coco
  annotations
  images
    train2014
    val2014
```
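If you prefer to script this step, here is a minimal, hedged sketch that downloads and extracts the 2014 archives into the layout above; the cocodataset.org URLs are assumptions based on the official MS-COCO download page and should be verified before use.

```python
# Hypothetical helper for fetching MS-COCO 2014 into ./ms-coco (sketch only).
import urllib.request
import zipfile
from pathlib import Path

# Assumed official download URLs; verify against https://cocodataset.org/#download.
ARCHIVES = {
    "images": "http://images.cocodataset.org/zips/train2014.zip",
    "images ": "http://images.cocodataset.org/zips/val2014.zip",
    ".": "http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
}

root = Path("ms-coco")
root.mkdir(exist_ok=True)

for target, url in ARCHIVES.items():
    archive = root / Path(url).name
    if not archive.exists():
        print(f"downloading {url} ...")
        urllib.request.urlretrieve(url, archive)
    # The image zips contain train2014/ and val2014/ top-level folders and the
    # annotation zip contains annotations/, so extracting into the right parent
    # directory reproduces the layout shown above.
    parent = root / target.strip() if target.strip() != "." else root
    parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(parent)
```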
MS-COCO images are needed when training with the `--features grid` command line option. Image features are then extracted from a fixed 8 x 8 grid on the image. When using the `--features obj` command line option, image features are instead extracted from detected objects (see also bottom-up attention in this paper).
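To illustrate the grid variant, the sketch below extracts an 8 x 8 grid of features from a single image with torchvision's Inception v3 (the backbone mentioned under Extensions); hooking `Mixed_7c` and the 2048-dimensional feature size are assumptions about one plausible extraction, not necessarily this project's implementation.

```python
# Sketch: extract an 8 x 8 grid of image features with Inception v3 (assumed
# approach, not necessarily identical to this project's feature extraction).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.inception_v3(pretrained=True)  # newer torchvision: weights=...
model.eval()

features = {}

def save_features(module, inputs, output):
    # Output of Mixed_7c for a 299 x 299 input has shape [1, 2048, 8, 8].
    features["grid"] = output

model.Mixed_7c.register_forward_hook(save_features)

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("ms-coco/images/val2014/COCO_val2014_000000105537.jpg").convert("RGB")
with torch.no_grad():
    model(preprocess(image).unsqueeze(0))

# Flatten the spatial map into 64 feature vectors (one per grid tile), which is
# why --max-source-positions 64 is recommended for --features grid.
grid = features["grid"].squeeze(0).flatten(1).t()  # shape: [64, 2048]
print(grid.shape)
```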
Pre-computed features of detected objects (10-100 per image) are available in this repository. You can also use this link for downloading them directly (22 GB). After downloading, extract the `trainval.zip` file, rename the `trainval` directory to `features` and move it to the `ms-coco` directory. The `ms-coco/features` directory should contain 4 `.tsv` files and the resulting directory structure should look like

```
ms-coco
  annotations
  features
    karpathy_test_resnet101_faster_rcnn_genome.tsv
    karpathy_train_resnet101_faster_rcnn_genome.tsv.0
    karpathy_train_resnet101_faster_rcnn_genome.tsv.1
    karpathy_val_resnet101_faster_rcnn_genome.tsv
  images
    train2014
    val2014
```
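These files follow the TSV layout of the bottom-up attention feature release (image id, image size, number of boxes, base64-encoded box coordinates and features). The field names and decoding in the sketch below are assumptions based on that format; adjust them if the downloaded files differ.

```python
# Sketch: read one of the karpathy_*_resnet101_faster_rcnn_genome.tsv files.
# Field names and base64 encoding are assumed from the bottom-up attention release.
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows contain large base64 blobs

FIELDS = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

def read_tsv(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t", fieldnames=FIELDS):
            num_boxes = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(num_boxes, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(num_boxes, -1)
            yield int(row["image_id"]), int(row["image_w"]), int(row["image_h"]), boxes, feats

for image_id, w, h, boxes, feats in read_tsv(
        "ms-coco/features/karpathy_val_resnet101_faster_rcnn_genome.tsv"):
    # boxes: (N, 4), feats: (N, 2048) with N between 10 and 100 per image.
    print(image_id, boxes.shape, feats.shape)
    break
```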
For splitting the downloaded MS-COCO data into training, validation and test sets, the Karpathy splits are used.
Split files have been copied from this location.
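For reference, the splits originate from Karpathy's `dataset_coco.json`, where every image carries a `split` field. The sketch below shows how image ids could be grouped by split, assuming that original JSON file rather than the split files shipped with this project.

```python
# Sketch: group MS-COCO image ids by Karpathy split, assuming the original
# dataset_coco.json from Karpathy's caption-split release.
import json
from collections import defaultdict

with open("dataset_coco.json") as f:
    dataset = json.load(f)

splits = defaultdict(list)
for image in dataset["images"]:
    # "restval" images are commonly folded into the training split.
    split = "train" if image["split"] == "restval" else image["split"]
    splits[split].append(image["cocoid"])

for name, ids in splits.items():
    print(name, len(ids))  # typically 113287 train / 5000 val / 5000 test
```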
All pre-processing commands in the following sub-sections write their results to the `output` directory.
Pre-process captions
Converts MS-COCO captions into a format required for model training.
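The exact format produced by this step is not documented here. As a hedged illustration only, the sketch below reads the standard COCO `captions_train2014.json` and writes one caption per line next to its image id; the `output/train-captions.tsv` file name is hypothetical.

```python
# Sketch: flatten COCO caption annotations into "<image_id>\t<caption>" lines.
# This is an illustrative format, not necessarily the one this project produces.
import json
from pathlib import Path

with open("ms-coco/annotations/captions_train2014.json") as f:
    annotations = json.load(f)["annotations"]

Path("output").mkdir(exist_ok=True)
with open("output/train-captions.tsv", "w") as out:
    for ann in annotations:
        caption = " ".join(ann["caption"].split())  # collapse whitespace/newlines
        out.write(f"{ann['image_id']}\t{caption}\n")

print(f"wrote {len(annotations)} captions")
```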
Pre-process images
Converts MS-COCO images into a format required for model training. Only needed when training with the `--features grid` command line option.
Pre-process object features
Converts pre-computed object features into a format required for model training. Only needed when training with the `--features obj` command line option.
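How the converted features are stored is likewise not documented here. Purely as an assumption, the sketch below (reusing the `read_tsv` helper from the earlier sketch) writes one `.npy` file per image, a common intermediate format for per-image features; the directory layout and file naming are hypothetical.

```python
# Sketch: save one features array per image as output/features-obj/<image_id>.npy.
# Assumes the read_tsv helper from the earlier TSV-reading sketch is in scope.
from pathlib import Path

import numpy as np

out_dir = Path("output/features-obj")
out_dir.mkdir(parents=True, exist_ok=True)

for image_id, w, h, boxes, feats in read_tsv(
        "ms-coco/features/karpathy_val_resnet101_faster_rcnn_genome.tsv"):
    np.save(out_dir / f"{image_id}.npy", feats)        # (N, 2048) object features
    np.save(out_dir / f"{image_id}-boxes.npy", boxes)  # (N, 4) box coordinates
```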
Extensions
In addition to all fairseq command line options, this project implements the following extensions:
- `--task captioning`. Enables the image captioning functionality implemented by this project.
- `--features grid`. Use image features extracted from an 8 x 8 grid on the image. Inception v3 is used for extracting these features. Additionally use `--max-source-positions 64` when using this option.
- `--features obj`. Use image features extracted from detected objects as described in this paper. Additionally use `--max-source-positions 100` when using this option.
- `--arch default-captioning-arch`. Uses a transformer encoder to process image features (2 layers by default) and a transformer decoder to process image captions and encoder output (6 layers by default). The number of encoder and decoder layers can be adjusted with `--encoder-layers` and `--decoder-layers`, respectively.
- `--arch simplistic-captioning-arch`. Uses the same decoder as in `default-captioning-arch` but no transformer encoder. Image features are processed directly by the decoder after projecting them into a lower-dimensional space, which can be controlled with `--encoder-embed-dim`. Projection into a lower-dimensional space can optionally be skipped.
- `--feature-spatial-embeddings`. Learns positional (spatial) embeddings of bounding boxes or grid tiles. Disabled by default. Positional embeddings are learned from the top-left and bottom-right coordinates of boxes/tiles and their relative sizes (see the sketch after this list).
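One plausible realization of such spatial embeddings, given only as an assumption and not necessarily what this project implements, is to normalize each box's corner coordinates and relative area by the image size and project the resulting 5-dimensional geometry vector to the feature dimension with a learned linear layer:

```python
# Sketch: learned spatial embeddings from box geometry (assumed realization).
import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    """Projects (x1, y1, x2, y2, relative area) to the model's embedding size."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(5, embed_dim)

    def forward(self, boxes: torch.Tensor, image_w: float, image_h: float) -> torch.Tensor:
        # boxes: [num_boxes, 4] with absolute (x1, y1, x2, y2) coordinates.
        x1, y1, x2, y2 = boxes.unbind(dim=-1)
        rel_area = ((x2 - x1) * (y2 - y1)) / (image_w * image_h)
        geometry = torch.stack(
            [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h, rel_area], dim=-1
        )
        return self.proj(geometry)  # [num_boxes, embed_dim]

# Example: embed 36 detected boxes of a 640 x 480 image into a 512-dim space,
# matching --encoder-embed-dim 512 from the training example below.
spatial = SpatialEmbedding(embed_dim=512)
boxes = torch.tensor([[10.0, 20.0, 200.0, 180.0]]).repeat(36, 1)
print(spatial(boxes, image_w=640.0, image_h=480.0).shape)  # torch.Size([36, 512])
```

For grid tiles the same idea applies, with tile boundaries taking the place of detected boxes.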
An example command for training a simple captioning model is:
```
python -m fairseq_cli.train \
  --task captioning \
  --arch simplistic-captioning-arch \
  --features grid \
  --features-dir output \
  --captions-dir output \
  --user-dir task \
  --save-dir .checkpoints \
  --optimizer nag \
  --lr 0.001 \
  --criterion cross_entropy \
  --max-epoch 50 \
  --max-tokens 1024 \
  --max-source-positions 64 \
  --encoder-embed-dim 512 \
  --log-interval 10 \
  --save-interval-updates 1000 \
  --keep-interval-updates 3 \
  --num-workers 2 \
  --no-epoch-checkpoints \
  --no-progress-bar
```
See Extensions for captioning-specific command line options. Checkpoints are written to the `.checkpoints` directory, and the best checkpoint `.checkpoints/checkpoint_best.pt` should be used for testing. Please note that the hyper-parameters used here are just examples; they have not been tuned yet.
Scripts for detailed model evaluation are not available yet but will come soon, together with an application that reads images from a directory to caption them. At the moment you can use the following simple demo application for captioning images of the validation dataset.
```
python demo.py \
  --features grid \
  --features-dir output \
  --captions-dir output \
  --user-dir task \
  --tokenizer moses \
  --bpe subword_nmt \
  --bpe-codes output/codes.txt \
  --beam 5 \
  --path .checkpoints/checkpoint_best.pt \
  --input demo/val-images.txt
```
Validation image IDs are read from `demo/val-images.txt`. This should produce an output containing something like

```
105537: A street sign hanging from the side of a metal pole.
130599: A man standing next to a giraffe statue.
...
```
Alternatively, the same demo can be run with a separately obtained demo checkpoint `checkpoint_demo.pt`:

```
python demo.py \
  --features grid \
  --features-dir output \
  --captions-dir output \
  --user-dir task \
  --tokenizer moses \
  --bpe subword_nmt \
  --bpe-codes output/codes.txt \
  --beam 5 \
  --path checkpoint_demo.pt \
  --input demo/val-images.txt
```
The captions produced for two sample validation images are:
- "A man standing next to a giraffe statue."
- "A street sign hanging from the side of a metal pole."