Official PyTorch implementation of the following papers:
Sound Source Localization is All About Cross-Modal Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
ICCV 2023
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
arXiv 2024
- Overview
- Interactive Synthetic Sound Source (IS3) Dataset
- Environment
- Model Checkpoints
- Inference
- Training
- Citation
The IS3 dataset is available here.
The IS3 data is organized as follows:
Note that in the IS3 dataset, each annotation is saved as a separate file. For example, the sample `accordion_baby_10467` image contains two annotations, one for the accordion and one for the baby. These annotations are saved as `accordion_baby_10467_accordion` and `accordion_baby_10467_baby` for straightforward use. You can always project the bounding boxes or segmentation maps onto the original image to see them all at once.
- The `images` and `audio_waw` folders contain all the image and audio files, respectively.
- `IS3_annotation.json` contains the ground-truth bounding box and category information for each annotation.
- The `gt_segmentation` folder contains a segmentation map, stored as a binary image, for each annotation. You can query the file name in `IS3_annotation.json` to get the semantic category of each segmentation map.
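
For quick inspection, the sketch below overlays one annotation's bounding box and segmentation map on its image. It assumes `IS3_annotation.json` is a list of records with fields such as `image`, `bbox` (in `[x0, y0, x1, y1]` format), `file_name`, and `category`; check the released file for the actual schema before relying on these names.

```python
# Minimal sketch of overlaying one IS3 annotation on its image.
# The JSON field names ("image", "bbox", "file_name", "category"), the list layout,
# and the bbox format [x0, y0, x1, y1] are assumptions; inspect IS3_annotation.json first.
import json
import numpy as np
from PIL import Image, ImageDraw

with open("IS3_annotation.json") as f:
    annotations = json.load(f)

ann = annotations[0]  # e.g. the accordion_baby_10467_accordion annotation
image = Image.open(f"images/{ann['image']}").convert("RGB")
draw = ImageDraw.Draw(image)
draw.rectangle(ann["bbox"], outline="red", width=3)  # ground-truth bounding box

# Each annotation has its own binary segmentation map in gt_segmentation/.
seg = np.array(Image.open(f"gt_segmentation/{ann['file_name']}.png").convert("L")) > 0
overlay = np.array(image, dtype=np.float32)
overlay[seg] = 0.5 * overlay[seg] + 0.5 * np.array([255.0, 0.0, 0.0])  # red tint on the mask

Image.fromarray(overlay.astype(np.uint8)).save("overlay_example.png")
print(ann["category"])
```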
The model checkpoints are available for the following experiments:
Training Set | Test Set | Model Type | Performance (cIoU) | Checkpoint |
---|---|---|---|---|
VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. | 39.94 | Link |
VGGSound-144K | VGG-SS | NN w/ Self-Sup. Pre. Enc. | 39.16 | Link |
VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. Pre-trained Vision | 41.42 | Link |
Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. | 85.20 | Link |
Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Self-Sup. Pre. Enc. | 84.80 | Link |
Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. Pre-trained Vision | 86.00 | Link |
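
For context, the cIoU figures above follow the convention common in recent sound source localization benchmarks: the fraction of test samples whose predicted localization map overlaps the ground truth with IoU above 0.5. The sketch below only illustrates that convention; `test.py` is the authoritative implementation and may binarize the maps differently.

```python
# Illustrative sketch of cIoU as the fraction of samples whose IoU with the
# ground truth exceeds 0.5. Thresholding details are assumptions; see test.py.
import numpy as np

def ciou(pred_maps, gt_masks, map_threshold=0.5, iou_threshold=0.5):
    """pred_maps: iterable of HxW maps in [0, 1]; gt_masks: iterable of HxW binary masks."""
    hits, total = 0, 0
    for pred, gt in zip(pred_maps, gt_masks):
        pred_mask = pred >= map_threshold              # binarize the localization map
        inter = np.logical_and(pred_mask, gt).sum()
        union = np.logical_or(pred_mask, gt).sum()
        hits += (inter / max(union, 1)) >= iou_threshold
        total += 1
    return hits / total
```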
Put the checkpoint files into the `checkpoints` directory:
```
inference
│
└───checkpoints
│       ours_sup_previs.pth.tar
│       ours_sup.pth.tar
│       ours_selfsup.pth.tar
│
│   test.py
│   datasets.py
│   model.py
```
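
To sanity-check a downloaded checkpoint before running inference, a plain `torch.load` is enough. The internal key layout (e.g. whether the weights sit under a `state_dict` entry) is an assumption here; print the top-level keys to see what the file actually contains.

```python
# Minimal sketch: inspect a downloaded checkpoint. The "state_dict" key is an
# assumption about the file layout; printing the top-level keys verifies it.
import torch

ckpt = torch.load("checkpoints/ours_sup.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state_dict = ckpt.get("state_dict", ckpt)
else:
    state_dict = ckpt

for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(name, shape)
```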
To evaluate a trained model, run:

```
python test.py --testset {testset_name} --pth_name {pth_name}
```
Test Set | testset_name |
---|---|
VGG-SS | vggss |
Flickr-SoundNet | flickr |
IS3 | is3 |
Simply save the checkpoint files from these methods as `{method_name}_{put_your_own_message}.pth`, e.g. `ezvsl_flickr.pth`; the rest of the method-specific settings are already handled.
Paper title | pth_name must contain |
---|---|
Localizing Visual Sounds the Hard Way (CVPR 21) [Paper] | lvs |
Localizing Visual Sounds the Easy Way (ECCV 22) [Paper] | ezvsl |
A Closer Look at Weakly-Supervised Audio-Visual Source Localization (NeurIPS 22) [Paper] | slavc |
Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation (ACMMM 22) [Paper] | ssltie |
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning (CVPR 23) [Paper] | fnac |
Example:

```
python test.py --testset flickr --pth_name ezvsl_flickr.pth
```
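
For reference, this naming convention is what lets the evaluation script infer which method's wrapper to use from `--pth_name`; conceptually it reduces to a substring check like the sketch below (the actual selection logic lives in `test.py` and may differ).

```python
# Illustrative sketch only: infer the method from a substring of --pth_name.
# The real dispatch lives in test.py and may be implemented differently.
KNOWN_METHODS = ["lvs", "ezvsl", "slavc", "ssltie", "fnac", "ours"]

def method_from_pth_name(pth_name: str) -> str:
    for method in KNOWN_METHODS:
        if method in pth_name.lower():
            return method
    raise ValueError(f"Could not infer method from checkpoint name: {pth_name}")

print(method_from_pth_name("ezvsl_flickr.pth"))  # -> "ezvsl"
```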
Training code is coming soon!
If you find this code useful, please consider giving a star ⭐ and citing us:
```bibtex
@inproceedings{senocak2023sound,
  title={Sound source localization is all about cross-modal alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7777--7787},
  year={2023}
}
```
If you use this dataset, please consider giving a star ⭐ and citing us:
```bibtex
@article{senocak2024align,
  title={Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  journal={arXiv preprint arXiv:2407.13676},
  year={2024}
}
```