Given a video clip of a falling object, the goal of this task is to generate the corresponding sound based on the visual appearance and motion of the object. The generated sound must match the object's intrinsic properties (e.g., material type) and be temporally aligned with the object's movement in the video. This task is related to prior work on sound generation from in-the-wild videos, but here we focus on predicting soundtracks that closely match the object's dynamics.
The dataset used to train the baseline models can be downloaded from here
Start the training process, and test the best model on the test set after training:
python main.py --batch_size 32 --weight_decay 1e-2 --lr 1e-3 \
--model RegNet --exp RegNet \
--config_location ./configs/regnet_aux_4.yml
Evaluate the best RegNet model:
python main.py --batch_size 32 --weight_decay 1e-2 --lr 1e-3 \
--model RegNet --exp RegNet \
--config_location ./configs/regnet_aux_4.yml \
--eval
To train and test your new model on the ObjectFolder Sound Generation of Dynamic Objects benchmark, you only need to modify several files in models. Follow these simple steps:
- Create a new model directory
mkdir models/my_model
- Design the new model
cd models/my_model
touch my_model.py
- Build the new model and its optimizer
Add the following code into models/build.py:
elif args.model == 'my_model':
    from my_model import my_model
    model = my_model.my_model(args)
    optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                            weight_decay=args.weight_decay)
- Add the new model to the pipeline
Once the new model is built, it can be trained and evaluated similarly:
python main.py --batch_size 32 --weight_decay 1e-2 --lr 1e-3 \
--model my_model --exp my_model \
--config_location ./configs/my_model.yml
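As an illustration, a minimal my_model.py might look like the sketch below. The class interface, feature dimensions, and output format here are assumptions for illustration only; the actual signatures expected by this codebase (and the way visual features are produced) should be taken from the existing models such as RegNet.

```python
# Hypothetical sketch of models/my_model/my_model.py.
# Feature sizes (2048-d visual features, 80 mel bins) are placeholders.
import torch
import torch.nn as nn

class my_model(nn.Module):
    def __init__(self, args):
        super().__init__()
        # Map per-frame visual features to mel-spectrogram frames.
        self.rnn = nn.GRU(input_size=2048, hidden_size=512,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(512, 80)

    def forward(self, visual_feats):
        # visual_feats: (batch, time, 2048)
        h, _ = self.rnn(visual_feats)
        # Predicted spectrogram: (batch, time, 80)
        return self.head(h)
```

The constructor takes the parsed `args` so that it matches the `my_model.my_model(args)` call registered in models/build.py above.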
In our experiments, we select 500 objects with reasonable scales and generate 10 videos for each object. The 10 videos of each object are split 8/1/1 into train/val/test sets.
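The per-object 8/1/1 split can be sketched as follows; the video naming scheme here is an illustrative assumption, not the dataset's actual file layout:

```python
# Illustrative per-object 8/1/1 split over 500 objects x 10 videos.
def split_videos(num_objects=500, videos_per_object=10):
    train, val, test = [], [], []
    for obj in range(num_objects):
        vids = [f"obj{obj:03d}_vid{v}" for v in range(videos_per_object)]
        train += vids[:8]   # 8 videos per object for training
        val += vids[8:9]    # 1 for validation
        test += vids[9:]    # 1 for testing
    return train, val, test

train, val, test = split_videos()
print(len(train), len(val), len(test))  # 4000 500 500
```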
| Method | STFT  | Envelope | CDPAM     |
|--------|-------|----------|-----------|
| RegNet | 0.010 | 0.036    | 0.0000565 |
| MCR    | 0.034 | 0.042    | 0.0000592 |
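All three metrics are distances between generated and ground-truth waveforms, so lower is better. As a rough sketch of what the STFT and envelope distances measure (the benchmark's exact window sizes and normalization may differ from the choices below; CDPAM is a learned perceptual metric and is not sketched here):

```python
# Hedged sketch of spectrogram and envelope distances between waveforms.
import numpy as np

def stft_distance(a, b, n_fft=512, hop=128):
    # Mean absolute difference between magnitude spectrograms.
    def mag(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.mean(np.abs(mag(a) - mag(b)))

def envelope_distance(a, b, win=256):
    # Mean absolute difference between amplitude envelopes
    # (max |x| over non-overlapping windows).
    def env(x):
        n = len(x) // win
        return np.abs(x[:n * win]).reshape(n, win).max(axis=1)
    return np.mean(np.abs(env(a) - env(b)))

x = np.random.randn(8000)
print(stft_distance(x, x), envelope_distance(x, x))  # 0.0 0.0
```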