A Unified Framework for Video-Language Understanding

microsoft/LAVENDER

[CVPR 2023] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Paper | Slide | Poster | Video

This repo is the official implementation of the CVPR 2023 paper
"LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling"
Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu and Lijuan Wang

We explore a unified video-language framework, LAVENDER, in which Masked Language Modeling (MLM) is the common interface for all pre-training and downstream tasks. This unification simplifies the model architecture: only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate that LAVENDER can:

  • Seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned
  • Generalize to various downstream tasks with limited training samples
  • Enable zero-shot evaluation on video question answering tasks
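
To make the "MLM as common interface" idea more concrete, the sketch below shows one way the text input of each task family could be written as a masked sequence. It is an illustration based on the description in the paper, not code from this repo: the function names, the literal `[MASK]` string, and the `true`/`false` and option-index answer vocabularies are assumptions chosen for readability.

```python
MASK = "[MASK]"  # assumed mask token string, as used by BERT-style tokenizers


def vtm_as_mlm(caption):
    """Video-text matching: the mask is filled with 'true' or 'false'."""
    return f"{caption} {MASK}"


def open_ended_qa_as_mlm(question):
    """Open-ended QA: the mask is filled with the answer word."""
    return f"{question} {MASK}"


def multiple_choice_qa_as_mlm(question, options):
    """Multiple-choice QA: every option is listed with its index and the
    mask is filled with the index of the correct option."""
    listed = " ".join(f"{i} {opt}" for i, opt in enumerate(options))
    return f"{question} {listed} {MASK}"


def captioning_step_as_mlm(generated_so_far):
    """Captioning: one decoding step appends a mask after the words
    generated so far; the mask is filled with the next word."""
    return " ".join(list(generated_so_far) + [MASK])


if __name__ == "__main__":
    print(vtm_as_mlm("a dog runs on the beach"))
    print(open_ended_qa_as_mlm("what is the dog doing?"))
    print(multiple_choice_qa_as_mlm("what does the dog do?", ["run", "sleep"]))
    print(captioning_step_as_mlm(["a", "dog"]))
```

In every case the same MLM head predicts a vocabulary token at the mask position, which is why no task-specific output layer is needed in this formulation.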

Requirements

This code is largely based on the official PyTorch implementation of VIOLET, and was implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Data preprocessing

Copied from VIOLET

Because we use external datasets that we cannot redistribute, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 5 frames for both pre-training and downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

Partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) are provided to help format the input data.
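
The comment above mentions using file.seek() instead of loading the entire data. The sketch below illustrates how a `.tsv` / `.lineidx` pair can support that kind of random access. It assumes the `.lineidx` file stores one byte offset per row, which is not confirmed by this README; the helper names are hypothetical and this is not the code in `extract_tsv.py`.

```python
import os
import tempfile


def build_lineidx(tsv_path, lineidx_path):
    """Write the byte offset at which each TSV row starts, one per line."""
    with open(tsv_path, "rb") as tsv, open(lineidx_path, "w") as idx:
        offset = tsv.tell()
        line = tsv.readline()
        while line:
            idx.write(f"{offset}\n")
            offset = tsv.tell()
            line = tsv.readline()


def load_offsets(lineidx_path):
    """Load all row offsets; this list is small compared to the TSV itself."""
    with open(lineidx_path) as idx:
        return [int(line) for line in idx if line.strip()]


def read_row(tsv_path, offsets, row):
    """Seek directly to one row and split it into its tab-separated fields.

    A real data loader would likely keep the file handle open per worker
    instead of reopening the file for every row.
    """
    with open(tsv_path, "rb") as tsv:
        tsv.seek(offsets[row])
        return tsv.readline().decode("utf-8").rstrip("\r\n").split("\t")


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    tsv_path = os.path.join(workdir, "toy.tsv")
    idx_path = os.path.join(workdir, "toy.lineidx")
    with open(tsv_path, "wb") as f:
        f.write(b"vid0\tframes0\nvid1\tframes1\nvid2\tframes2\n")
    build_lineidx(tsv_path, idx_path)
    print(read_row(tsv_path, load_offsets(idx_path), 2))
```

Only the offsets are held in memory, so each distributed worker can fetch an arbitrary row without reading the rows before it.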

Pretraining

  • Visit Video Swin Transformer to download the pre-trained weights. Place swin_base_patch244_window877_kinetics*_22k.pth under the ${REPO_DIR}/_models/video_swin_transformer directory. The directory structure should follow the hierarchy below.
    ${REPO_DIR}  
    |-- _models  
    |   |-- video_swin_transformer
    |   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
    |   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
    |-- _args 
    |-- _datasets
    |-- _imgs 
    |-- ... 
    |-- ... 
    
  • Download pretraining datasets (WebVid2.5M & CC3M) provided by VIOLET to ./_datasets. The data structure should follow the hierarchy below.
    ${REPO_DIR}  
    |-- _models 
    |-- _args 
    |-- _datasets
    |   |-- txt_webvid2.5.json
    |   |-- webvid2.5_val.tsv
    |   |-- webvid2.5_val.lineidx
    |   |-- webvid2.5_train_1.tsv
    |   |-- webvid2.5_train_1.lineidx
    |   |-- ...
    |   |-- webvid2.5_train_9.tsv
    |   |-- webvid2.5_train_9.lineidx
    |   |-- txt_cc3m.json
    |   |-- cc3m_val.tsv
    |   |-- cc3m_val.lineidx
    |   |-- cc3m_train_1.tsv
    |   |-- cc3m_train_1.lineidx
    |   |-- ...
    |   |-- cc3m_train_9.tsv
    |   |-- cc3m_train_9.lineidx
    |-- _imgs 
    |-- ... 
    |-- ... 
    
  • Pretrain via single-node multi-GPU distributed training.
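
Before launching, it can help to confirm that the files listed in the hierarchies above are in place. The following is a minimal sketch of such a check, not a tool shipped with this repo. The function names are hypothetical, and the assumption that the training shards are numbered 1 through 9 is read off the listing above.

```python
from pathlib import Path


def expected_pretrain_files(num_shards=9):
    """Relative paths named in the hierarchies above (shards 1..num_shards assumed)."""
    files = [
        f"_models/video_swin_transformer/swin_base_patch244_window877_kinetics{k}_22k.pth"
        for k in (400, 600)
    ]
    for name in ("webvid2.5", "cc3m"):
        files.append(f"_datasets/txt_{name}.json")
        splits = ["val"] + [f"train_{i}" for i in range(1, num_shards + 1)]
        for split in splits:
            files.append(f"_datasets/{name}_{split}.tsv")
            files.append(f"_datasets/{name}_{split}.lineidx")
    return files


def missing_files(repo_dir, num_shards=9):
    """Return the expected files that do not exist under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in expected_pretrain_files(num_shards)
            if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files("."):
        print("missing:", rel)
```

Running it from ${REPO_DIR} prints one line per missing file and nothing when the layout is complete.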

Task-specific Baseline: Pre-training with Video-Text Matching (VTM) + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
  • Pretrained checkpoint on WebVid2.5M+CC3M: link

LAVENDER: Unified Pre-training with VTM as MLM + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
  • Pretrained checkpoint on WebVid2.5M+CC3M: link
  • Scale-up pre-trained checkpoint with 14M videos + 16M images: link

Downstream

Download downstream datasets to ./_datasets.

Multiple-Choice Question Answering

  • TGIF-Action
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-action.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • TGIF-Transition
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-transition.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSRVTT-MC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_msrvtt-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • LSMDC-MC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_lsmdc-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, change the main script to main_qamc_task_specific.py or main_retmc_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.
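
The multiple-choice commands above differ only in the script and config names, so they can be generated from a small table. The sketch below is one way to do that, assuming each config is named `_args/args_<task>.json` as in the commands above and that the `qamc` and `retmc` scripts pair with the task-specific scripts of the same prefix (inferred from the file names, not stated in this README). `build_command` is a hypothetical helper.

```python
import shlex

# Script names copied from the commands above.
UNIFIED_SCRIPTS = {
    "tgif-action": "main_qamc_mlm_gen_ans_idx.py",
    "tgif-transition": "main_qamc_mlm_gen_ans_idx.py",
    "msrvtt-mc": "main_retmc_mlm_head.py",
    "lsmdc-mc": "main_retmc_mlm_head.py",
}
# Assumed pairing between unified and task-specific scripts.
TASK_SPECIFIC_SCRIPTS = {
    "main_qamc_mlm_gen_ans_idx.py": "main_qamc_task_specific.py",
    "main_retmc_mlm_head.py": "main_retmc_task_specific.py",
}


def build_command(task, path_ckpt, task_specific=False,
                  gpus=(0, 1, 2, 3), master_port=5566):
    """Return (env, argv) for finetuning one multiple-choice task."""
    script = UNIFIED_SCRIPTS[task]
    if task_specific:
        script = TASK_SPECIFIC_SCRIPTS[script]
    argv = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpus)}", f"--master_port={master_port}",
        script,
        "--config", f"_args/args_{task}.json",
        "--path_output", "_snapshot",
        "--path_ckpt", path_ckpt,
    ]
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    return env, argv


if __name__ == "__main__":
    env, argv = build_command("msrvtt-mc", "ckpt.pt", task_specific=True)
    print(env, shlex.join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env={**os.environ, **env})`; deriving `--nproc_per_node` from the GPU list keeps the two values in sync.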

Open-Ended Question Answering

  • TGIF-Frame
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_tgif-frame.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSRVTT-QA
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msrvtt-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSVD-QA
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msvd-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • LSMDC-FiB
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm_lsmdc_fib.py --config _args/args_lsmdc-fib.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, change the main script to main_qaoe_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.

Text-to-Video Retrieval

  • MSRVTT
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_ckpt <path to the finetuned msrvtt-retrieval model ckpt>
    
  • DiDeMo
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_ckpt <path to the finetuned didemo-retrieval model ckpt>
    
  • MSVD
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_ckpt <path to the finetuned msvd-retrieval model ckpt>
    
  • LSMDC
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_ckpt <path to the finetuned lsmdc-retrieval model ckpt>
    

For the task-specific baseline, change the training and evaluation scripts to main_retrieval_task_specific.py and eval_retrieval_task_specific.py, respectively, and point --path_ckpt to the task-specific checkpoints.

Video Captioning (MSRVTT, MSVD)

Finetuning on video captioning requires additional environment and dataset setup. We closely follow the instructions from SwinBERT. Please check their repo for more details.

Note that the data folder should have the following structure:

${REPO_DIR}  
    |-- _datasets  
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |-- ... 
    |-- ... 

Once the Docker environment and the dataset have been set up correctly, run the following command for training.

  • MSRVTT
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msrvtt-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    
  • MSVD
    CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msvd-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
    

For the task-specific baseline, simply point --path_ckpt to the task-specific pre-trained weights.

Multi-task Training

Data Filtering

As mentioned in our paper, the test split of one task may overlap with the training splits of other tasks. We therefore first run a data filtering step that removes the test data of each task from the training data of the other tasks.

python _tools/multi_task_vid_filter.py --dataset lsmdc

python _tools/multi_task_vid_filter.py --dataset msrvtt

python _tools/multi_task_vid_filter.py --dataset msvd 

python _tools/multi_task_vid_filter.py --dataset tgif
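
The idea behind this filtering step can be sketched as follows. This is an illustration only, not the logic of `_tools/multi_task_vid_filter.py`: the function name, the `"video"` key, and the task names in the example are hypothetical.

```python
def filter_training_items(train_items, test_video_ids_by_task, own_task):
    """Drop training items whose video appears in another task's test split.

    train_items: list of dicts, each with a "video" id (hypothetical schema).
    test_video_ids_by_task: mapping from task name to a set of test video ids.
    own_task: the task whose training data is being filtered.
    """
    blocked = set()
    for task, video_ids in test_video_ids_by_task.items():
        if task != own_task:
            blocked.update(video_ids)
    return [item for item in train_items if item["video"] not in blocked]


if __name__ == "__main__":
    train = [{"video": "v1"}, {"video": "v2"}, {"video": "v3"}]
    test_ids = {"msrvtt-qa": {"v9"}, "msrvtt-retrieval": {"v2"}}
    # v2 is a test video for retrieval, so it is removed from the QA training data.
    print(filter_training_items(train, test_ids, own_task="msrvtt-qa"))
```

Without such a step, a multi-task model could see a test video of one task during training on another task that shares the same underlying video source.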

Training

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_multi_task_mlm.py --config _args/args_multi-task_all.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, change the main script to main_multi_task_multi_head.py and point --path_ckpt to the task-specific pre-trained weights.
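
The training command launches 8 processes on one node, and with torch.distributed.launch each process typically binds to the GPU matching its local rank, so CUDA_VISIBLE_DEVICES needs to expose at least as many devices as --nproc_per_node. The following sketch checks that before a job is submitted; `check_launch` is a hypothetical helper, not part of this repo.

```python
def check_launch(cuda_visible_devices, nproc_per_node):
    """Validate a CUDA_VISIBLE_DEVICES string against --nproc_per_node.

    Returns the parsed device ids, or raises ValueError if ids repeat or
    there are fewer visible devices than processes.
    """
    devices = [d.strip() for d in cuda_visible_devices.split(",") if d.strip()]
    if len(set(devices)) != len(devices):
        raise ValueError(f"duplicate device ids in {cuda_visible_devices!r}")
    if len(devices) < nproc_per_node:
        raise ValueError(
            f"{nproc_per_node} processes requested but only "
            f"{len(devices)} devices are visible"
        )
    return devices


if __name__ == "__main__":
    print(check_launch("0,1,2,3,4,5,6,7", 8))
```

A device list that skips an id, leaving seven entries for eight processes, would be rejected by this check instead of failing later inside the launcher.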

Citation

If you find this code useful, please consider citing the following papers:

@inproceedings{li2023lavender, 
  author = {Linjie Li and Zhe Gan and Kevin Lin and Chung-Ching Lin and Ce Liu and Zicheng Liu and Lijuan Wang}, 
  title = {LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}

License

Our research code is released under the MIT license.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
