This is the official code of the CVPR 2023 paper "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model" (arXiv). The old version of the code has been merged into the official OFA repository. This repo will be continuously maintained.
- Python 3.7.4
- PyTorch 1.8.1
- torchvision 0.9.1
- Java 1.8 (for COCO evaluation)
```bash
git clone https://github.com/OFA-Sys/OFA
pip install -r requirements.txt
```
See datasets.md and checkpoints.md.
Below we provide methods for training and inference on different tasks. We provide both pretrained OFA-Large and OFA-Base in checkpoints.md. The scripts in this section are prepared for OFA-Large; to reproduce the downstream results of OFA-Base, use the corresponding finetuning and inference scripts provided in the run_scripts/ folder.
We recommend organizing your workspace directory as follows:
```
OFA/
├── checkpoints/
│   ├── ofa_base.pt
│   ├── ofa_large.pt
│   ├── caption_large_best_clean.pt
│   └── ...
├── criterions/
├── data/
├── dataset/
│   ├── caption_data/
│   ├── gigaword_data/
│   └── ...
├── fairseq/
├── models/
├── run_scripts/
├── tasks/
├── train.py
├── trainer.py
└── utils/
```
For efficient data processing, we do not store images as many small files; instead, we encode them as base64 strings. Converting an image file to a base64 string is simple. Run the following code:
```python
from io import BytesIO
import base64

from PIL import Image

img = Image.open(file_name)              # file_name: path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)  # keep the original image format
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
base64_str = base64_str.decode("utf-8")   # str
```
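To sanity-check the encoding, you can decode the base64 string back into a PIL image. A minimal, self-contained round-trip sketch (it builds a tiny in-memory image instead of reading one from disk, so `file_name` is not needed):

```python
from io import BytesIO
import base64

from PIL import Image

# Create a small in-memory image so the example is self-contained.
img = Image.new("RGB", (4, 4), color=(255, 0, 0))
buf = BytesIO()
img.save(buf, format="PNG")
base64_str = base64.b64encode(buf.getvalue()).decode("utf-8")

# Decode the string back into a PIL image.
decoded = Image.open(BytesIO(base64.b64decode(base64_str)))
```

Because PNG is lossless, the decoded image matches the original pixel for pixel.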
Below we provide procedures to reproduce the image captioning results reported in our paper.
1. Prepare the Dataset & Checkpoints
Download the data (see datasets.md) and models (see checkpoints.md) and put them in the correct directories. The dataset zipfile caption_data.zip contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv, and caption_test.tsv. Each image corresponds to only one caption in caption_stage1_train.tsv and to multiple captions (about 5 per image) in the other TSV files. Each line of the dataset represents one caption sample in the following format: the fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), and image base64 string are separated by tabs.
```
162365  12455  the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==
```
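Since the fields are tab-separated, one line can be parsed with a plain `split`. A minimal sketch (the variable names are our own illustrative labels, not identifiers from the codebase; the base64 field is truncated as in the example above):

```python
# Hypothetical parser for one caption TSV line; field names are illustrative.
sample = (
    "162365\t12455\tthe sun sets over the trees beyond some docks.\t"
    "sky&&water&&dock&&pole\t/9j/4AAQSkZJ....UCP/2Q=="
)
uniq_id, image_id, caption, object_labels, image_b64 = sample.split("\t")
labels = object_labels.split("&&")  # VinVL object labels, not used by OFA
```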
2. Finetuning
Following previous standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA V100 GPUs with 32GB memory (expect ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best stage-1 checkpoint and train with CIDEr optimization on 8 NVIDIA V100 GPUs. Note that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors during stage-2 finetuning, you can increase the batch size or reduce the learning rate. If neither works, you can set --freeze-resnet to freeze the inner states of batch normalization.
```bash
cd run_scripts/caption
nohup sh train_caption_stage1.sh > train_stage1.out &  # stage 1: train with cross-entropy loss
nohup sh train_caption_stage2.sh > train_stage2.out &  # stage 2: load the best stage-1 ckpt and train with CIDEr optimization
# To finetune the MuE model, use the following script instead (stage 2 uses the same script as above)
nohup sh train_caption_stage1_base_MuE.sh > train_stage1.out &
```
3. Inference
Run the following commands to get your results and evaluate your model.
```bash
cd run_scripts/caption
sh evaluate_caption.sh  # inference & evaluation
# To evaluate your MuE model:
sh evaluate_caption_base_MuE.sh
# Adjust img_thres, txt_thres, and decoder_thres for a better performance/speed trade-off.
```
We provide steps for you to reproduce our results in visual entailment. See the details below.
1. Prepare the Dataset & Checkpoints
Download the data (see datasets.md) and models (see checkpoints.md) and put them in the correct directories. Each line of the processed dataset represents one sample in the following format: the fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise), and label are separated by tabs.
```
252244149.jpg#1r1n  252244149  /9j/4AAQ...MD/2Q==  a man in pink and gold is chewing on a wooden toothpick.  a man in pink is chewing a toothpick on the subway.  neutral
```
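As with the captioning data, one SNLI-VE line splits on tabs into six fields. A minimal sketch (variable names are our own illustrative labels; the base64 field is truncated as in the example above):

```python
# Hypothetical parser for one SNLI-VE TSV line; field names are illustrative.
line = (
    "252244149.jpg#1r1n\t252244149\t/9j/4AAQ...MD/2Q==\t"
    "a man in pink and gold is chewing on a wooden toothpick.\t"
    "a man in pink is chewing a toothpick on the subway.\t"
    "neutral"
)
uniq_id, image_id, image_b64, hypothesis, premise, label = line.split("\t")
```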
2. Finetuning
In our experiments, SNLI-VE finetuning is performed on 8 NVIDIA V100 GPUs with 32GB memory. We experimented with only a few hyperparameter settings for this task; proper hyperparameter tuning may lead to further accuracy improvements.
```bash
cd run_scripts/snli_ve
nohup sh train_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve
# To finetune the MuE model, use the following script instead:
nohup sh train_snli_ve_base_MuE.sh > train_snli_ve_MuE.out &
```
3. Inference
Run the following command to obtain the results.
```bash
cd run_scripts/snli_ve
sh evaluate_snli_ve.sh dev  # specify 'dev' or 'test'
# To evaluate your MuE model:
sh evaluate_snli_ve_base_MuE.sh
# Adjust img_thres, txt_thres, and decoder_thres for a better performance/speed trade-off.
```
Feel free to submit GitHub issues or pull requests; contributions to our project are welcome!
To contact us, do not hesitate to send an email to shengkuntangwork@gmail.com!
Please cite our paper if you find it helpful :)
```bibtex
@InProceedings{Tang_2023_CVPR,
    author    = {Tang, Shengkun and Wang, Yaqing and Kong, Zhenglun and Zhang, Tianchi and Li, Yao and Ding, Caiwen and Wang, Yanzhi and Liang, Yi and Xu, Dongkuan},
    title     = {You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10781-10791}
}
```