
CLIP4VLA

The official code base of Accommodating Audio Modality in CLIP for Multimodal Processing (CLIP4VLA).

Setup

conda create -n clip4vla python=3.7
conda activate clip4vla
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install tqdm transformers soundfile opencv-python boto3 ftfy pandas
pip install h5py librosa dominate
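
If the environment was set up correctly, the imports below should succeed. This is only a quick sanity check, not part of the repo; the printed versions are whatever conda/pip resolved.

# Environment sanity check (illustrative only, not part of CLIP4VLA).
import torch
import torchvision
import torchaudio
import soundfile
import librosa
import cv2

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
# cudatoolkit=10.2 needs a compatible NVIDIA driver; if this prints False,
# check the driver rather than the conda packages.
print("CUDA available:", torch.cuda.is_available())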

Preparation

Dataset

Download MSR-VTT from Baiduyun (password: qhq7), extract it with tar -zxvf msrvtt.tar.gz, and place it in ./data. Then process the dataset with the following commands:

python data_processor.py --extract_audios --load_video_into_frames
cd data/msrvtt
mv softlink.sh audios_16k/
cd audios_16k
bash softlink.sh
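
After the commands above, data/msrvtt/audios_16k should contain the extracted (and symlinked) 16 kHz audio files. A quick way to confirm is sketched below; the .wav extension is an assumption, since the output format is not stated here.

# Count extracted audio files (illustrative check, not part of the repo).
from pathlib import Path

audio_dir = Path("data/msrvtt/audios_16k")
wavs = sorted(audio_dir.glob("*.wav"))  # assumes 16 kHz audio is saved as .wav
print(f"{len(wavs)} audio files under {audio_dir}")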

Pre-train

First download the CLIP model (ViT-B-32.pt) from CLIP and put it in ./weight.
To pre-train from scratch, first prepare the HowTo100M and AudioSet datasets (following the same steps as for MSR-VTT above), then run the following command:

bash ./scripts/pretrain_howto100m_s1.sh
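
Before launching the script, it can help to confirm that the checkpoint in ./weight is readable. OpenAI distributes ViT-B-32.pt as a TorchScript archive, so torch.jit.load is enough to inspect it; this is only a sanity check, not how CLIP4VLA itself loads the weights.

# Verify the downloaded CLIP checkpoint (illustrative only).
import torch

clip_model = torch.jit.load("weight/ViT-B-32.pt", map_location="cpu")
state_dict = clip_model.state_dict()
print("parameter tensors in ViT-B-32.pt:", len(state_dict))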

Fine-tune

Prepare the MSR-VTT or VATEX dataset with data_processor.py, then run the following command:

bash ./scripts/<dataset_task>/finetune_retrieval_vatex_pre_video.sh
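
The fine-tuning scripts here target retrieval. For reference, retrieval quality is usually reported as recall@K over a query-candidate similarity matrix; the sketch below shows that metric in generic form and is not the repo's evaluation code.

# Generic recall@K over a similarity matrix (illustrative only).
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity of query i to candidate j; the ground-truth match
    # for query i is assumed to be candidate i (the diagonal).
    ranks = (-sim).argsort(axis=1)
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

sim = np.random.randn(100, 100)  # placeholder scores from a fine-tuned model
print("R@1:", recall_at_k(sim, 1), "R@5:", recall_at_k(sim, 5))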
