# Image Captioners Are Scalable Vision Learners Too

by Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer [arxiv]
This directory contains a config for training a CapPa model from scratch. Note that most models in the paper were trained on a proprietary dataset (WebLI), but similar results can be obtained by training on LAION.
By default, this config trains on COCO captions, since that data set is readily available in TFDS without manual steps. This is not meant to produce a meaningful model, but provides a way to run the config out of the box. Please update the config with a TFDS-wrapped variant of your favorite image/text data set to train capable models.
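As a hypothetical sketch of such an update, the snippet below swaps the dataset entry in a config dict. The field names (`input.data`, `name`, `split`) follow common big_vision config conventions but may differ in `pretrain.py`; check the actual config before adapting this.

```python
# Hypothetical sketch: point a big_vision-style config at a different
# TFDS image/text dataset. Field names are illustrative assumptions.
def update_dataset(config, tfds_name, split="train"):
    """Replace the training dataset entry with another TFDS dataset."""
    config["input"]["data"] = dict(name=tfds_name, split=split)
    return config

# Default-like config training on COCO captions.
config = {"input": {"data": dict(name="coco_captions", split="train")}}
# Swap in a (placeholder) TFDS-wrapped image/text dataset.
update_dataset(config, "my_tfds_image_text_dataset")
```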
After setting up big_vision as described in the main README, training can be launched as follows:

```shell
python -m big_vision.trainers.proj.cappa.generative \
  --config big_vision/configs/proj/cappa/pretrain.py \
  --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```
To run the Cap baseline (autoregressive captioning without parallel prediction), set `config.model.masked_pred_prob = 0.0`.
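To illustrate what this flag controls, here is a minimal sketch, assuming the trainer picks between parallel (masked) prediction and standard autoregressive captioning per training step with probability `masked_pred_prob`. The function name is illustrative, not the actual trainer's API.

```python
import random

# Hypothetical sketch: with probability masked_pred_prob a step trains
# with parallel (masked) prediction, otherwise autoregressively.
def pick_mode(masked_pred_prob, rng):
    return "parallel" if rng.random() < masked_pred_prob else "autoregressive"

rng = random.Random(0)
# With masked_pred_prob = 0.0, every step is autoregressive: the Cap baseline.
modes = {pick_mode(0.0, rng) for _ in range(100)}
```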
```
@inproceedings{tschannen2023image,
  title={Image Captioners Are Scalable Vision Learners Too},
  author={Tschannen, Michael and Kumar, Manoj and Steiner, Andreas and Zhai, Xiaohua and Houlsby, Neil and Beyer, Lucas},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2023}
}
```