The official implementation of the AAAI 2024 paper DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. By training only 0.83 MB of parameters, we can surpass full fine-tuning and PEFT methods in text-to-video retrieval.
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@inproceedings{yang2024dgl,
title={DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval},
author={Yang, Xiangpeng and Zhu, Linchao and Wang, Xiaohan and Yang, Yi},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={7},
pages={6540--6548},
year={2024}
}
- Oct 14, 2024: Updated code for QB-Norm and visualization.
- Feb 15, 2024: Released the code of DGL.
Since the visualization code needs to cache the global prompt's attention weights over frames, we provide it as a separate project; the full code is available at visualization code.
# unzip the code
# then replace the pretrained_weight path (model_dir in msrvtt.sh)
python main.py
conda env create -f environment.yml
Download CLIP pre-trained weights and place them in ${HOME}/models/pretrained.
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
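As a quick sanity check, the downloaded checkpoint is a TorchScript archive and can be opened directly with PyTorch. A minimal sketch (the path below assumes the ${HOME}/models/pretrained layout above):

```python
import os
import torch

# Path follows the suggested layout above; adjust if you stored the weight elsewhere.
weight_path = os.path.expanduser("~/models/pretrained/ViT-B-32.pt")

# OpenAI releases CLIP checkpoints as TorchScript archives, so torch.jit.load can open them.
model = torch.jit.load(weight_path, map_location="cpu")
n_params = sum(p.numel() for p in model.parameters())
print(f"loaded ViT-B/32 with {n_params:,} parameters")
```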
MSR-VTT: Download the splits and captions from CLIP4Clip:
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
Download the videos from Frozen-in-Time:
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
Video preprocessing can be done by preprocess/compress_video.py.
python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]
This script compresses each video to 3 fps with width 224 (or height 224). Modify the variables to customize the output.
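For a rough idea of what the preprocessing does, here is a minimal sketch that re-encodes one video with ffmpeg (assuming ffmpeg is installed; the actual flags and defaults live in preprocess/compress_video.py and may differ):

```python
import subprocess

def compress_video(input_path, output_path, fps=3, size=224):
    """Re-encode a video at `fps` frames/s with its shorter side scaled to `size` pixels."""
    # Scale so the shorter side becomes `size` (keeping aspect ratio), then resample the frame rate.
    vf = f"scale='if(gt(iw,ih),-2,{size})':'if(gt(iw,ih),{size},-2)',fps={fps}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path, "-vf", vf, "-an", output_path],  # -an: drop audio
        check=True,
    )

# Hypothetical paths (output directory must exist); the repo script walks input_root/output_root for you.
compress_video("MSRVTT/videos/all/video0.mp4", "MSRVTT_compressed/video0.mp4")
```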
Note that, due to hardware differences, results may differ slightly. We tested performance on an A100 GPU, where T2V/V2T R@1 is 45.8/43.5 (log), and on an A6000 GPU, where T2V/V2T R@1 is 45.4/44.1 (log).
You can also adapt only the global-local video attention with BLIP; following the implementation of tokenmix, you can reach a T2V/V2T R@1 of 48.9/49.0 (log).
| Checkpoint | CLIP | Shared Latent Space | Google Cloud |
|---|---|---|---|
| MSR-VTT | ViT-B/32 | Transformer | Download |
| MSR-VTT | ViT-B/16 | Transformer | Download |
| VATEX | ViT-B/32 | Linear | Download |
| LSMDC | ViT-B/32 | Linear | Download |
| ActivityNet | ViT-B/32 | Transformer | Download |
# eval on MSR-VTT
# set the following options
do_train=0
do_eval=1
shared_latent_space=transformer  # or linear, matching the checkpoint
resume='path of ckpt.best.pth.tar'
bash scripts/msrvtt.sh
Prepare the similarity matrix and the train_test t2v/v2t matrices, then search for your best T2V/V2T R@1!
# search for the best performance using QB-Norm
# first place the prepared sim matrices in the folder, i.e., msrvtt_vit16_sim_matrix.npy, msrvtt_vit16_train_test_t2v.npy, msrvtt_vit16_train_test_v2t.npy
python search_for_best_r1_with_qb_norm.py
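For reference, below is a minimal sketch of querybank normalization (the dynamic inverted softmax from the QB-Norm paper) applied to the matrices above. The hyperparameters `beta` and `k` are typical defaults, the similarities are assumed to be cosine similarities, and search_for_best_r1_with_qb_norm.py may sweep or implement things differently:

```python
import numpy as np

def qb_norm(bank_sims, test_sims, beta=20.0, k=1):
    """Querybank normalization (dynamic inverted softmax).

    bank_sims: [num_bank_queries, num_gallery] similarities of querybank (train) queries vs. the test gallery.
    test_sims: [num_test_queries, num_gallery] raw test similarities to be re-normalized.
    """
    # Gallery items retrieved (top-k) by any querybank query form the "activated" set.
    activated = np.unique(np.argpartition(bank_sims, -k, axis=1)[:, -k:])
    normalizing_sum = np.exp(beta * bank_sims).sum(axis=0)
    normalized = test_sims.copy()
    top1 = test_sims.argmax(axis=1)
    for i in range(test_sims.shape[0]):
        # Only re-normalize queries whose raw top-1 result falls in the activated set.
        if top1[i] in activated:
            normalized[i] = np.exp(beta * test_sims[i]) / normalizing_sum
    return normalized

sim_t2v = np.load("msrvtt_vit16_sim_matrix.npy")        # test text queries vs. test videos
bank_t2v = np.load("msrvtt_vit16_train_test_t2v.npy")   # train text queries vs. test videos
sim_t2v_qb = qb_norm(bank_t2v, sim_t2v)

# R@1, assuming query i's ground-truth video is gallery item i (as in the MSR-VTT 1k-A split).
gt = np.diag(sim_t2v_qb)
print("T2V R@1:", 100.0 * np.mean((sim_t2v_qb > gt[:, None]).sum(axis=1) == 0))
```

The same helper can be applied for V2T by loading msrvtt_vit16_train_test_v2t.npy as the bank and transposing the test similarity matrix.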
# set
shared_latent_space=transformer  # or linear
# For DGL-Linear, you only need to train 0.83 MB of parameters.
# MSR-VTT
bash scripts/msrvtt.sh
# VATEX
bash scripts/vatex.sh
# LSMDC
bash scripts/lsmdc.sh
# ActivityNet
bash scripts/activitynet.sh
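To double-check the trainable-parameter budget (e.g., the 0.83 MB figure for DGL-Linear), here is a minimal sketch that counts trainable parameters of a PyTorch module, assuming fp32 storage (4 bytes per parameter) and a `model` object built by the training code:

```python
def count_trainable(model):
    """Print the number and size of trainable parameters, assuming fp32 (4 bytes each)."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {n:,} (~{n * 4 / 1024 ** 2:.2f} MB)")

count_trainable(model)  # `model` is whatever DGL model object the training script builds
```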
This repo is built upon these previous works.