This is the official code of the CVPR 2023 paper "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model" (arXiv). The old version of the code has been merged into the official OFA repository. This repo will be continuously maintained.
- Python 3.7.4
- PyTorch 1.8.1
- torchvision 0.9.1
- Java 1.8 (for COCO evaluation)
```bash
git clone https://github.com/OFA-Sys/OFA
pip install -r requirements.txt
```
See datasets.md and checkpoints.md.
Below we provide methods for training and inference on different tasks. We provide both pretrained OFA-Large and OFA-Base in checkpoints.md. The scripts in this section are prepared for OFA-Large; to reproduce the downstream results of OFA-Base, use the corresponding finetuning and inference scripts provided in the run_scripts/ folder.
We recommend organizing your workspace directory as follows:
```
OFA/
├── checkpoints/
│   ├── ofa_base.pt
│   ├── ofa_large.pt
│   ├── caption_large_best_clean.pt
│   └── ...
├── criterions/
├── data/
├── dataset/
│   ├── caption_data/
│   ├── gigaword_data/
│   └── ...
├── fairseq/
├── models/
├── run_scripts/
├── tasks/
├── train.py
├── trainer.py
└── utils/
```
For efficient data processing, we do not store images as many small files; instead, we encode them as base64 strings. Converting an image file to a base64 string is simple. Run the following code:
```python
from io import BytesIO
import base64

from PIL import Image

img = Image.open(file_name)              # file_name: path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)  # keep the original image format
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
base64_str = base64_str.decode("utf-8")   # str
```
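To sanity-check the encoding, you can decode the base64 string back into a PIL image. A minimal, self-contained round-trip sketch (it builds a tiny in-memory image instead of reading one from disk, so `file_name` is not needed):

```python
from io import BytesIO
import base64

from PIL import Image

# Create a small in-memory image so the example is self-contained.
img = Image.new("RGB", (4, 4), color=(255, 0, 0))
buf = BytesIO()
img.save(buf, format="PNG")
base64_str = base64.b64encode(buf.getvalue()).decode("utf-8")

# Decode the string back into a PIL image.
decoded = Image.open(BytesIO(base64.b64decode(base64_str)))
```

Because PNG is lossless, the decoded image matches the original pixel for pixel.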
Below we provide procedures to reproduce the image captioning results reported in our paper.
1. Prepare the Dataset & Checkpoints
Download the data (see datasets.md) and models (see checkpoints.md) and put them in the correct directories. The dataset zipfile caption_data.zip contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv, and caption_test.tsv. Each image corresponds to only one caption in caption_stage1_train.tsv and to multiple captions (about 5 per image) in the other TSV files. Each line of the dataset represents one caption sample in the following format: the fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), and image base64 string are separated by tabs.
```
162365  12455  the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==
```
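Since the fields are tab-separated, one line can be parsed with a plain `split`. A minimal sketch (the variable names are our own illustrative labels, not identifiers from the codebase; the base64 field is truncated as in the example above):

```python
# Hypothetical parser for one caption TSV line; field names are illustrative.
sample = (
    "162365\t12455\tthe sun sets over the trees beyond some docks.\t"
    "sky&&water&&dock&&pole\t/9j/4AAQSkZJ....UCP/2Q=="
)
uniq_id, image_id, caption, object_labels, image_b64 = sample.split("\t")
labels = object_labels.split("&&")  # VinVL object labels, not used by OFA
```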
2. Finetuning
Following previous standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA V100 GPUs with 32GB memory (expect ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best stage-1 checkpoint and train with CIDEr optimization on 8 NVIDIA V100 GPUs. Note that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors during stage-2 finetuning, you can increase the batch size or reduce the learning rate. If neither works, you can set --freeze-resnet to freeze the inner states of batch normalization.
```bash
cd run_scripts/caption
nohup sh train_caption_stage1.sh > train_stage1.out &  # stage 1: train with cross-entropy loss
nohup sh train_caption_stage2.sh > train_stage2.out &  # stage 2: load the best stage-1 ckpt and train with CIDEr optimization
# To finetune the MuE model, use the following script instead (stage 2 uses the same script as above)
nohup sh train_caption_stage1_base_MuE.sh > train_stage1.out &
```
3. Inference
Run the following commands to get your results and evaluate your model.
```bash
cd run_scripts/caption
sh evaluate_caption.sh  # inference & evaluation
# To evaluate your MuE model:
sh evaluate_caption_base_MuE.sh
# Adjust img_thres, txt_thres, and decoder_thres for a better performance/speed trade-off.
```
We provide steps for you to reproduce our results in visual entailment. See the details below.
1. Prepare the Dataset & Checkpoints
Download the data (see datasets.md) and models (see checkpoints.md) and put them in the correct directories. Each line of the processed dataset represents one sample in the following format: the fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise), and label are separated by tabs.
```
252244149.jpg#1r1n  252244149  /9j/4AAQ...MD/2Q==  a man in pink and gold is chewing on a wooden toothpick.  a man in pink is chewing a toothpick on the subway.  neutral
```
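As with the captioning data, one SNLI-VE line splits on tabs into six fields. A minimal sketch (variable names are our own illustrative labels; the base64 field is truncated as in the example above):

```python
# Hypothetical parser for one SNLI-VE TSV line; field names are illustrative.
line = (
    "252244149.jpg#1r1n\t252244149\t/9j/4AAQ...MD/2Q==\t"
    "a man in pink and gold is chewing on a wooden toothpick.\t"
    "a man in pink is chewing a toothpick on the subway.\t"
    "neutral"
)
uniq_id, image_id, image_b64, hypothesis, premise, label = line.split("\t")
```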
2. Finetuning
In our experiments, SNLI-VE finetuning is performed on 8 NVIDIA V100 GPUs with 32GB memory. We experimented with only a few hyperparameter settings for this task; proper hyperparameter tuning may lead to further accuracy improvements.
```bash
cd run_scripts/snli_ve
nohup sh train_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve
# To finetune the MuE model, use the following script instead:
nohup sh train_snli_ve_base_MuE.sh > train_snli_ve_MuE.out &
```
3. Inference
Run the following command to obtain the results.
```bash
cd run_scripts/snli_ve
sh evaluate_snli_ve.sh dev  # specify 'dev' or 'test'
# To evaluate your MuE model:
sh evaluate_snli_ve_base_MuE.sh
# Adjust img_thres, txt_thres, and decoder_thres for a better performance/speed trade-off.
```
Feel free to submit GitHub issues or pull requests; contributions to our project are welcome!
To contact us, do not hesitate to send an email to shengkuntangwork@gmail.com!
Please cite our paper if you find it helpful :)
```bibtex
@InProceedings{Tang_2023_CVPR,
    author    = {Tang, Shengkun and Wang, Yaqing and Kong, Zhenglun and Zhang, Tianchi and Li, Yao and Ding, Caiwen and Wang, Yanzhi and Liang, Yi and Xu, Dongkuan},
    title     = {You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10781-10791}
}
```