We used this model in Track #4 (Sound Source Localization) of the [2024 ICME Grand Challenge] Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC). Our results are cIoU 47.70 / AUC 44.14 (sspl_w_pcm) and cIoU 41.12 / AUC 42.60 (sspl_wo_pcm).
We have tested the code in the following environment:
Python 3.8.0 | torch 1.10.0+cu113 | torchaudio 0.10.0+cu113 | torchvision 0.11.1+cu113 | CUDA 11.4
[sspl_w_pcm]: epoch: 350 | devices: RTX 3070 * 1 | batch_size_per_gpu: 64 | img_size: 224
[sspl_wo_pcm]: epoch: 40 | devices: RTX 3070 * 1 | batch_size_per_gpu: 128 | img_size: 224
Please refer to the SSPL/metadata/Pre-data.md file.
We utilize [VGG16] and [VGGish] as backbones to extract visual and audio features, respectively. Before training, place the pre-trained VGGish weights, i.e., vggish-10086976.pth, in models/torchvggish/torchvggish/vggish_pretrained/.
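Training will fail if the weights file is missing, so a quick pre-flight check can help. The following is a minimal sketch, not part of the repository: the directory and file name come from the text above, while the helper names `vggish_weights_path` and `check_vggish_weights` and the `repo_root` parameter are hypothetical.

```python
from pathlib import Path

# Directory and file name as stated in this README.
VGGISH_DIR = Path("models/torchvggish/torchvggish/vggish_pretrained")
VGGISH_WEIGHTS = "vggish-10086976.pth"


def vggish_weights_path(repo_root="."):
    """Return the location where the VGGish weights are expected (hypothetical helper)."""
    return Path(repo_root) / VGGISH_DIR / VGGISH_WEIGHTS


def check_vggish_weights(repo_root="."):
    """Raise FileNotFoundError if the weights file is missing, else return its path."""
    path = vggish_weights_path(repo_root)
    if not path.is_file():
        raise FileNotFoundError(
            f"Pre-trained VGGish weights not found at {path}; "
            "place vggish-10086976.pth there before training."
        )
    return path
```

Calling `check_vggish_weights()` from the repository root before `python main.py` surfaces a misplaced file early.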
To train SSPL on SoundNet-Flickr10k with the default settings, simply run:
python main.py
Remember to specify your own MASTER_ADDR and MASTER_PORT in main.py and the path to the metadata in arguments_train.py.
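For a single-machine run, the address is typically the loopback interface and the port any free TCP port. The sketch below assumes that main.py passes these values through the MASTER_ADDR and MASTER_PORT environment variables, which is what PyTorch's `env://` initialization reads; how main.py actually stores them is not confirmed here, and the helper names `find_free_port` and `set_master_env` are hypothetical.

```python
import os
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def set_master_env(addr="127.0.0.1", port=None):
    """Export MASTER_ADDR / MASTER_PORT for the current process (hypothetical helper)."""
    if port is None:
        port = find_free_port()
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = str(port)
    return addr, port
```

Note that a port found this way can in principle be taken by another process before training starts; a fixed, known-free port is the simpler choice on a shared machine.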
Note: We found that the learning rate has a strong influence on SSPL's performance, so we suggest using early stopping to select hyper-parameters and avoid overfitting.
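One common way to implement early stopping is to track a validation score and stop once it has not improved for a fixed number of epochs. The following is a generic sketch, not the logic used in this repository: the `EarlyStopping` class, the `patience` and `min_delta` parameters, and the choice of a higher-is-better score (for example validation cIoU) are all assumptions.

```python
class EarlyStopping:
    """Stop training when a higher-is-better score stops improving (illustrative)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum gain that counts as improvement
        self.best = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record a validation score; return True when training should stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    # Placeholder scores for illustration only, not measured results.
    stopper = EarlyStopping(patience=3)
    for epoch, score in enumerate([0.30, 0.35, 0.34, 0.33, 0.32]):
        if stopper.step(epoch, score):
            print(f"stop at epoch {epoch}, best epoch {stopper.best_epoch}")
            break
```

Running one such loop per candidate learning rate and comparing the `best` scores gives a simple hyper-parameter selection procedure.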
After training, frame_best.pth, sound_best.pth, ssl_head_best.pth (and pcm_best.pth for SSPL (w/ PCM)) can be obtained.
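Before testing, it can be useful to confirm that all expected weight files exist. This is a minimal sketch under stated assumptions: the file names are those listed above, but the output directory is not specified in this README, so `ckpt_dir` is a hypothetical argument, as are the helper names.

```python
from pathlib import Path

# File names as listed in this README.
BASE_CHECKPOINTS = ("frame_best.pth", "sound_best.pth", "ssl_head_best.pth")
PCM_CHECKPOINT = "pcm_best.pth"  # only produced for SSPL (w/ PCM)


def expected_checkpoints(with_pcm):
    """Return the checkpoint file names expected after training."""
    names = list(BASE_CHECKPOINTS)
    if with_pcm:
        names.append(PCM_CHECKPOINT)
    return names


def missing_checkpoints(ckpt_dir, with_pcm):
    """Return the expected files that are absent from ckpt_dir (hypothetical location)."""
    ckpt_dir = Path(ckpt_dir)
    return [n for n in expected_checkpoints(with_pcm) if not (ckpt_dir / n).is_file()]
```

An empty list from `missing_checkpoints(...)` means the directory holds every file the chosen variant needs.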
To test SSPL on Chaotic World with the default settings, simply run:
python test.py
Remember to specify your own MASTER_ADDR and MASTER_PORT in test.py and the path to the metadata in arguments_test.py.
You can download our checkpoints and best weights:
sspl_w_pcm_ChaoticWorld.zip
sspl_wo_pcm_ChaoticWorld.zip