# Image Captioners Are Scalable Vision Learners Too

by Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer [arxiv]
This directory contains a config for training a CapPa model from scratch. Note that most models in the paper were trained on a proprietary dataset (WebLI), but similar results can be obtained by training on LAION.
By default, this config trains on COCO captions, since that data set is readily available in TFDS without manual steps. This is not meant to produce a meaningful model, but provides a way to run the config out of the box. Please update the config with a TFDS-wrapped variant of your favorite image/text data set to train capable models.
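As a hypothetical sketch of such an update, the snippet below swaps the dataset entry in a config dict. The field names (`input.data`, `name`, `split`) follow common big_vision config conventions but may differ in `pretrain.py`; check the actual config before adapting this.

```python
# Hypothetical sketch: point a big_vision-style config at a different
# TFDS image/text dataset. Field names are illustrative assumptions.
def update_dataset(config, tfds_name, split="train"):
    """Replace the training dataset entry with another TFDS dataset."""
    config["input"]["data"] = dict(name=tfds_name, split=split)
    return config

# Default-like config training on COCO captions.
config = {"input": {"data": dict(name="coco_captions", split="train")}}
# Swap in a (placeholder) TFDS-wrapped image/text dataset.
update_dataset(config, "my_tfds_image_text_dataset")
```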
After setting up big_vision as described in the main README, training can be launched as follows:

```shell
python -m big_vision.trainers.proj.cappa.generative \
  --config big_vision/configs/proj/cappa/pretrain.py \
  --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```
To run the Cap baseline (autoregressive captioning without parallel prediction), set `config.model.masked_pred_prob = 0.0`.
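To illustrate what this flag controls, here is a minimal sketch, assuming the trainer picks between parallel (masked) prediction and standard autoregressive captioning per training step with probability `masked_pred_prob`. The function name is illustrative, not the actual trainer's API.

```python
import random

# Hypothetical sketch: with probability masked_pred_prob a step trains
# with parallel (masked) prediction, otherwise autoregressively.
def pick_mode(masked_pred_prob, rng):
    return "parallel" if rng.random() < masked_pred_prob else "autoregressive"

rng = random.Random(0)
# With masked_pred_prob = 0.0, every step is autoregressive: the Cap baseline.
modes = {pick_mode(0.0, rng) for _ in range(100)}
```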
```
@inproceedings{tschannen2023image,
  title={Image Captioners Are Scalable Vision Learners Too},
  author={Tschannen, Michael and Kumar, Manoj and Steiner, Andreas and Zhai, Xiaohua and Houlsby, Neil and Beyer, Lucas},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2023}
}
```