Objects-that-Sound

An implementation of the paper "Objects that Sound".

We referred to the project "Learning intra-modal and cross-modal embeddings".

Objective

Our objective is to create a group of networks that can embed audio and visual inputs into a common space suitable for cross-modal retrieval. We achieve this by training on unlabelled video, using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video.

The paper shows that audio and visual embeddings enabling both within-modal (e.g. audio-to-audio) and cross-modal retrieval can be learnt. Due to limited time, however, we evaluate our network only on the latter (cross-modal retrieval).
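
As a concrete illustration of the AVC objective, the sketch below shows one way corresponding and mismatched (frame, audio) pairs could be sampled for the binary correspondence task. The `Clip` container and its `frame`/`audio` fields are hypothetical stand-ins for whatever the data pipeline actually yields; the network is then trained with an ordinary two-class cross-entropy loss on the returned labels.

```python
import random
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Clip:
    # Hypothetical container: one video frame plus the ~1 s of audio around it.
    frame: Any
    audio: Any

def sample_avc_pair(clips: List[Clip], rng: random.Random) -> Tuple[Any, Any, int]:
    """Return (frame, audio, label) for the audio-visual correspondence task.

    label 1: frame and audio come from the same clip (corresponding pair).
    label 0: the audio is taken from a different clip (mismatched pair).
    """
    clip = rng.choice(clips)
    if rng.random() < 0.5:
        return clip.frame, clip.audio, 1
    other = rng.choice([c for c in clips if c is not clip])
    return clip.frame, other.audio, 0
```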

Dataset

AudioSet dataset

Architecture

(Network architecture diagram)
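
The sketch below illustrates the two-stream layout described in the paper: a vision subnetwork and an audio subnetwork each produce a 128-D embedding, both embeddings are L2-normalised, and their Euclidean distance is passed to a tiny classifier that predicts correspond vs. mismatch. The convolutional backbones here are minimal placeholders, not the exact layers used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVENet(nn.Module):
    """Minimal sketch of the two-stream correspondence network."""

    def __init__(self, dim=128):
        super().__init__()
        # Placeholder backbones: the real ones are deeper CNNs over the
        # RGB frame and the log-mel spectrogram respectively.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.classify = nn.Linear(1, 2)  # distance -> {correspond, mismatch}

    def forward(self, image, spectrogram):
        v = F.normalize(self.vision(image), dim=1)
        a = F.normalize(self.audio(spectrogram), dim=1)
        dist = torch.norm(v - a, dim=1, keepdim=True)
        return self.classify(dist), v, a
```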

Problems encountered and solved

  • Data preprocessing: we originally obtained frames by downloading the videos from YouTube as MP4, choosing 10-second segments, and then selecting frames from those segments on the fly. The whole process was quite slow, and we realized that the bottleneck was data processing rather than network training. Since the data can be preprocessed into the network's input format and saved to disk beforehand, we did exactly that; with multiprocessing this step became very fast, and training was also greatly accelerated (see the preprocessing sketch after this list).
  • Preventing shortcuts: after training for one epoch, we found that the network had discovered a shortcut in our insufficiently randomized data. Because we did not shuffle the data over a large enough range, consecutive samples were correlated, and the network exploited the most recent samples to inflate its training accuracy while only pretending to have learned. We therefore increased the shuffle range so that data pairs are well scattered (see the buffered-shuffle sketch below).
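
A rough sketch of the offline preprocessing step is shown below: each worker extracts one frame with ffmpeg and one log-mel spectrogram with librosa, then saves both to disk so that training only loads ready-made inputs. The paths, sample rate and spectrogram parameters are illustrative assumptions, not the repository's actual settings.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

import librosa
import numpy as np

OUT_DIR = Path("preprocessed")  # assumed output directory

def preprocess(job):
    """Extract one frame and one log-mel spectrogram from a video segment."""
    video_path, start_sec, clip_id = job
    OUT_DIR.mkdir(exist_ok=True)

    # Grab a single frame at the requested timestamp.
    frame_path = OUT_DIR / f"{clip_id}.jpg"
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-i", str(video_path),
         "-frames:v", "1", str(frame_path)],
        check=True, capture_output=True)

    # Load ~1 s of audio around the same timestamp and convert to log-mel.
    wav, sr = librosa.load(video_path, sr=48000, offset=start_sec, duration=1.0)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=64)
    np.save(OUT_DIR / f"{clip_id}_logmel.npy", np.log(mel + 1e-6))

if __name__ == "__main__":
    jobs = [("videos/example.mp4", 3.0, "example_0003")]  # hypothetical job list
    with Pool(processes=8) as pool:
        pool.map(preprocess, jobs)
```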

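The shuffling fix can be illustrated with a bounded shuffle buffer (a minimal, framework-agnostic sketch, not the repository's actual data loader): if the buffer is smaller than the run of consecutive samples coming from one video, correlated pairs still reach the network together, so the buffer must be large enough to scatter them.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in randomised order using a bounded shuffle buffer.

    A buffer_size smaller than the number of consecutive samples drawn from
    one video leaves correlated pairs adjacent, which is the shortcut the
    network exploited; a sufficiently large buffer scatters them.
    """
    rng = random.Random(seed)
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) > buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf
```
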
Results

(Retrieval results figure)

To-Do

  • Visualizing the results: display the retrieved results that match each query best.
  • Localizing objects that sound: implement the localization part of the paper.
  • Extension to other evaluations: include the other three types of queries (image-to-audio, image-to-image and audio-to-audio); see the retrieval sketch after this list.
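
All four query types reduce to nearest-neighbour search in the shared embedding space, as in the sketch below (assuming the embeddings have already been computed and L2-normalised by the trained subnetworks):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery items closest to the query.

    Because both modalities are embedded into the same L2-normalised space,
    the same function covers image-to-audio, audio-to-image, image-to-image
    and audio-to-audio retrieval; only the choice of query/gallery changes.
    """
    dists = np.linalg.norm(gallery_embs - query_emb[None, :], axis=1)
    return np.argsort(dists)[:k]
```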
