V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

For benchmarking purposes, this repo hosts the generated test samples of "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models", AAAI 2024. ([arXiv] [project])

Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai from University of Sydney and Dolby Laboratories.

Main Results

Compared to the previous methods Im2Wav and CLIPSonic, our V2A-Mapper is trained with 86% fewer parameters yet achieves 53% and 19% improvements in Fréchet Distance (FD, fidelity) and CLIP-Score (CS, relevance), respectively.
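For readers who want to benchmark against the hosted samples, the following is a minimal sketch of how a Fréchet distance between two sets of audio embeddings is commonly computed. It is not the evaluation code used for the paper: the function name `frechet_distance` is hypothetical, the choice of embedding extractor is left open, and the inputs are assumed to be arrays of shape (n_samples, dim) with dim of at least 2.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Hypothetical helper: each input is assumed to be (n_samples, dim),
    produced by the same (unspecified) audio embedding model.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    fd = diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt
    # Clamp tiny negative values caused by floating-point error.
    return float(max(fd, 0.0))
```

Lower values indicate that the generated audio embeddings are statistically closer to the reference embeddings. Reported FD numbers are only comparable when the same embedding model and preprocessing are used on both sides.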

VGGSound

VGGSound contains 199,176 10-second video clips with audio-visual correspondence, extracted from videos uploaded to YouTube. Following the original train/test split, we evaluate performance on 15,446 test samples. Our generated test samples (~5 GB) for VGGSound can be downloaded from here.

ImageHear

To test the generalization ability of our V2A-Mapper, we also evaluate on the out-of-distribution dataset ImageHear, which contains 101 images from 30 visual classes (2-8 images per class). Our generated test samples (~33 MB) for ImageHear can be downloaded from here.

Custom Datasets

If you need V2A-Mapper sample results for your own datasets, we are happy to generate them for you. Please send your request to heng.wang@sydney.edu.au and jianbo.ma@dolby.com.

Citation

If you find our work helpful in your research, please kindly cite our paper via:

@inproceedings{v2a-mapper,
  title     = {V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models},
  author    = {Wang, Heng and Ma, Jianbo and Pascual, Santiago and Cartwright, Richard and Cai, Weidong},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2024},
}

Contact

If you have any questions or suggestions about this repo, please feel free to contact me! (heng.wang@sydney.edu.au)
