CLIP4VLA

The official code base of "Accommodating Audio Modality in CLIP for Multimodal Processing" (CLIP4VLA).

Setup

conda create -n clip4vla python=3.7
conda activate clip4vla
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install tqdm transformers soundfile opencv-python boto3 ftfy pandas
pip install h5py librosa dominate
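
As an optional sanity check (the PyTorch/CUDA versions above are simply the ones listed by the authors), you can verify that PyTorch was installed and can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"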

Preparation

Dataset

Download MSR-VTT from Baiduyun (password: qhq7), unzip it with tar -zxvf msrvtt.tar.gz, and place it in ./data. Then process the dataset with the following commands:

python data_processor.py --extract_audios --load_video_into_frames
cd data/msrvtt
mv softlink.sh audios_16k/
cd audios_16k
bash softlink.sh
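
To get a rough confirmation that preprocessing finished, you can count the extracted audio files from the same directory (this assumes the 16 kHz audios are stored as .wav files, which is not stated explicitly in the original instructions):

ls *.wav | wc -l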

Pre-train

First download the CLIP model (ViT-B-32.pt) from the official CLIP repository and put it in ./weight.
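For example (a minimal sketch; it assumes the checkpoint has already been downloaded to the current directory via the link given in the openai/CLIP repository):

mkdir -p ./weight
mv ViT-B-32.pt ./weight/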
To pre-train from scratch, first prepare the HowTo100M and AudioSet datasets (following the same steps as for MSR-VTT above), then run the following command:

bash ./scripts/pretrain_howto100m_s1.sh

Fine-tune

Prepare the MSR-VTT or VATEX dataset with data_processor.py and then run the following command:

bash ./scripts/<dataset_task>/finetune_retrieval_vatex_pre_video.sh
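
Here <dataset_task> is a placeholder for the per-dataset/per-task script directory; the available choices and fine-tuning scripts can be listed directly:

ls ./scripts/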
