Zihan Zhang1,*,
Xize Cheng1,*,
Zhennan Jiang2,*,
Dongjie Fu1,
Jingyuan Chen1,
Zhou Zhao1,
Tao Jin1,†
1Zhejiang University
2CASIA
†Corresponding author.
*Equal contribution
ICLR 2026
For the MUSIC dataset, please refer to the script under dataset/music.
For the VGGSound dataset, please refer to the script under dataset/vggsound.
Clone the repository and set up the environment:
git clone https://github.com/mars-sep/ImageBind.git
cd ImageBind
pip install .
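To confirm the editable install succeeded before moving on, you can check that the package is importable (assuming the distribution exposes an `imagebind` module, as the upstream repo does):

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if `pkg` is importable in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# Should print True after the `pip install .` step above.
print("imagebind available:", installed("imagebind"))
```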
git clone https://github.com/mars-sep/MARS-Sep.git
cd MARS-Sep/
conda create -n marssep python=3.10
conda activate marssep
pip install -r requirements.txt

Train on VGGSound:

python train.py \
-o exp/vggsound/marssep \
-c conf/mars.yaml \
-t data/vggsound/train.csv \
-v data/vggsound/val.csv \
--batch_size 128 \
--workers 20 \
--emb_dim 1024 \
--train_mode image text audio \
--is_feature \
--feature_mode imagebind

Evaluate on MUSIC and VGGSound:

OMP_NUM_THREADS=1 python evaluate.py -o exp/vggsound/marssep/ -c conf/mars.yaml -l exp/vggsound/marssep/eval_MUSIC_VGGS.txt -t data/MUSIC/solo/test.csv -t2 data/vggsound/test-good-no-music.csv --no-pit --prompt_ens --audio_source ./MUSIC-aq.npy

Evaluate on VGGSound-Clean+ and VGGSound:

OMP_NUM_THREADS=1 python evaluate.py -o exp/vggsound/marssep/ -c conf/mars.yaml -l exp/vggsound/marssep/eval_VGGS_VGGSN.txt -t data/vggsound/test-good.csv -t2 data/vggsound/test-no-music.csv --no-pit --prompt_ens --audio_source ./VGGSOUND-aq.npy

Run text-queried inference on a single mixture:

OMP_NUM_THREADS=1 python infer3.py -o exp/vggsound/marssep/ -i "demo/audio/hvCj8Dk0Su4.wav" --text_query "playing bagpipes" -f "exp/vggsound/marssep/hvCj8Dk0Su4/playing bagpipes.wav"

If you find our work useful for your research, please feel free to cite our paper:
@misc{zhang2025marssepmultimodalalignedreinforcedsound,
title={MARS-Sep: Multimodal-Aligned Reinforced Sound Separation},
author={Zihan Zhang and Xize Cheng and Zhennan Jiang and Dongjie Fu and Jingyuan Chen and Zhou Zhao and Tao Jin},
year={2025},
eprint={2510.10509},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2510.10509},
}
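As a reference point for interpreting the evaluation logs: source-separation quality is commonly reported as SI-SDR (scale-invariant signal-to-distortion ratio, in dB, higher is better). This NumPy sketch of the standard SI-SDR formula is illustrative only, not the repository's evaluation code:

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to isolate the target component;
    # everything orthogonal to the reference counts as distortion.
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant.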