
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction (CVPR 2024)

Official implementation of DiffSal, a diffusion-based, generalized audio-visual saliency prediction framework that uses a simple MSE objective.

arXiv | Project page | Checkpoints

Junwen Xiong, Peng Zhang, Tao You, Chuanyue Li, Wei Huang, Yufei Zha



🔥 News

  • May 26, 2024: Training code is now released. It's time to train DiffSal! 🚀🚀

📌 TODOs

  • Release pretrained weights.

🔧 Setup

Environment Setup

Create a conda environment and install the dependencies from requirements.txt:

conda create -n diff-sal python==3.10
conda activate diff-sal
pip install -r requirements.txt
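
As a quick sanity check before training (assuming PyTorch is among the pinned requirements, which the training code needs), you can confirm that CUDA and the expected GPUs are visible. The script below is ours, not part of the repository:

# check_env.py -- hypothetical sanity check, not part of this repository
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    # training was done on four RTX 4090s; list whatever is visible here
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")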

🚅 How To Train

🎉 To make our method reproducible, we have released all of our training code. Four RTX 4090 GPUs are enough :)

Data Structure

The DiffSal model is first pre-trained on the DHF1K dataset, arranged as follows:

./data/dhf1k
    ├── frames
    │   ├── 1
    │   │   ├── 1.png
    │   │   ├── 2.png
    │   │   ...
    │   │   ├── 100.png
    ├── maps
    │   ├── 1
    │   │   ├── 0001.png 
    │   │   ├── 0002.png
    │   │   ...
    │   │   ├── 0100.png
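
Note that frame files are unpadded (1.png) while map files are zero-padded (0001.png). A minimal sketch that checks this pairing before training (the helper and its name are ours, not part of the repository):

# check_dhf1k.py -- hypothetical layout check, not part of this repository
from pathlib import Path

root = Path("./data/dhf1k")
for video_dir in sorted((root / "frames").iterdir()):
    maps_dir = root / "maps" / video_dir.name
    # frames are named 1.png, 2.png, ...; maps are named 0001.png, 0002.png, ...
    missing = [f.name for f in video_dir.glob("*.png")
               if not (maps_dir / f"{int(f.stem):04d}.png").exists()]
    if missing:
        print(f"video {video_dir.name}: {len(missing)} frames lack saliency maps")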

The DiffSal model is then fine-tuned on the audio-visual datasets (e.g. AVAD), arranged as follows:

./data/video_frames
    ├── AVAD
    │   ├── V1_Speech1
    │   │   ├── img_00001.jpg
    │   │   ├── img_00002.jpg
    │   │   ...
    │   │   ├── img_00100.jpg
./data/video_audio
    ├── AVAD
    │   ├── V1_Speech1
    │   │   ├── V1_Speech1.wav
./data/annotations
    ├── AVAD
    │   ├── V1_Speech1
    │   │   ├── maps
    │   │   │   ├── eyeMap_00001.jpg
./data/fold_lists/
    ├── AVAD_list_test_1_fps.txt
    ├── AVAD_list_test_2_fps.txt
    ├── AVAD_list_test_3_fps.txt
    ├── AVAD_list_train_1_fps.txt
    │    ...
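
Before fine-tuning, it is worth verifying that every clip has frames, an audio track, and annotations. A minimal sketch under the layout above (the helper is ours, not part of the repository):

# check_av_data.py -- hypothetical consistency check, not part of this repository
from pathlib import Path

data = Path("./data")
dataset = "AVAD"
for clip_dir in sorted((data / "video_frames" / dataset).iterdir()):
    clip = clip_dir.name  # e.g. V1_Speech1
    if not (data / "video_audio" / dataset / clip / f"{clip}.wav").exists():
        print(f"{clip}: missing audio track")
    maps = data / "annotations" / dataset / clip / "maps"
    if not maps.is_dir() or not any(maps.glob("eyeMap_*.jpg")):
        print(f"{clip}: missing eye-fixation maps")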

Running Scripts

The following is the pre-training command:

sh scripts/train.sh

Then, use the following command to fine-tune the model:

sh scripts/train_av.sh

To run inference, remove the --train flag, set --test instead, and leave the rest of the configuration unchanged.

Inference

We provide the pretrained weights in this share link. First create an exp directory, then extract the pretrained weights into it. Finally, set the root_path field in the training command to the path where the pretrained weights are saved, e.g. --root_path=experiments_on_av_data/audio_visual.

BibTeX

🌟 If you find our project useful in your research or application development, citing our paper would be the best support for us!

@inproceedings{xiong2024diffsal,
    title={DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction},
    author={Xiong, Junwen and Zhang, Peng and You, Tao and Li, Chuanyue and Huang, Wei and Zha, Yufei},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2024}
}
