The official code base of "Accommodating Audio Modality in CLIP for Multimodal Processing" (CLIP4VLA).
conda create -n clip4vla python=3.7
conda activate clip4vla
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install tqdm transformers soundfile opencv-python boto3 ftfy pandas
pip install h5py librosa dominate
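As an optional sanity check, the following command verifies that PyTorch was installed with CUDA support (it should print the PyTorch version and True):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"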
Download MSR-VTT from Baiduyun (password: qhq7), extract it with tar -zxvf msrvtt.tar.gz, and place it in ./data.
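For example, assuming the archive has been downloaded to the repository root (the paths here are illustrative):

mkdir -p data
tar -zxvf msrvtt.tar.gz -C data/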
Process the dataset with the following commands:
python data_processor.py --extract_audios --load_video_into_frames
cd data/msrvtt
mv softlink.sh audios_16k/
cd audios_16k
bash softlink.sh
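After processing, you can optionally check that the extracted audio is sampled at 16 kHz with soundfile (this assumes the audio files are stored as .wav under data/msrvtt/audios_16k; the exact layout may differ):

python -c "import glob, soundfile as sf; f = sorted(glob.glob('data/msrvtt/audios_16k/*.wav'))[0]; print(f, sf.info(f).samplerate)"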
First download the CLIP model (ViT-B-32.pt) from the official CLIP repository and put it in ./weight.
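A minimal sketch, assuming ViT-B-32.pt has already been downloaded from the CLIP repository to the repository root:

mkdir -p weight
mv ViT-B-32.pt ./weight/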
To pretrain from scratch, first prepare the HowTo100M and AudioSet datasets (following the same steps as for MSR-VTT above). Then run the following command:
bash ./scripts/pretrain_howto100m_s1.sh
Prepare the MSR-VTT or VATEX dataset with data_processor.py and then run the following command, replacing <dataset_task> with the directory for your dataset and task:
bash ./scripts/<dataset_task>/finetune_retrieval_vatex_pre_video.sh
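For example, for VATEX retrieval the two steps look like this (the data_processor.py flags shown are the ones used for MSR-VTT above; VATEX may require different options):

python data_processor.py --extract_audios --load_video_into_frames
bash ./scripts/<dataset_task>/finetune_retrieval_vatex_pre_video.sh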