Objects-that-Sound

An implementation of the paper "Objects that Sound".

We referred to the project "Learning intra-modal and cross-modal embeddings".

Objective

Our objective is to create a group of networks that can embed audio and visual inputs into a common space suitable for cross-modal retrieval. We achieve this by training on unlabelled video, using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video.

The paper shows that audio and visual embeddings enabling both within-modal (e.g. audio-to-audio) and cross-modal retrieval can be learnt. Due to limited time, however, we evaluate our network only on the latter (cross-modal retrieval).
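
As a concrete illustration of the AVC objective, the sketch below shows one way corresponding and mismatched (frame, audio) pairs could be sampled for the binary correspondence task. The `Clip` container and its `frame`/`audio` fields are hypothetical stand-ins for whatever the data pipeline actually yields; the network is then trained with an ordinary two-class cross-entropy loss on the returned labels.

```python
import random
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Clip:
    # Hypothetical container: one video frame plus the ~1 s of audio around it.
    frame: Any
    audio: Any

def sample_avc_pair(clips: List[Clip], rng: random.Random) -> Tuple[Any, Any, int]:
    """Return (frame, audio, label) for the audio-visual correspondence task.

    label 1: frame and audio come from the same clip (corresponding pair).
    label 0: the audio is taken from a different clip (mismatched pair).
    """
    clip = rng.choice(clips)
    if rng.random() < 0.5:
        return clip.frame, clip.audio, 1
    other = rng.choice([c for c in clips if c is not clip])
    return clip.frame, other.audio, 0
```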

Dataset

AudioSet dataset

Architecture

(Network architecture diagram)
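
The sketch below illustrates the two-stream layout described in the paper: a vision subnetwork and an audio subnetwork each produce a 128-D embedding, both embeddings are L2-normalised, and their Euclidean distance is passed to a tiny classifier that predicts correspond vs. mismatch. The convolutional backbones here are minimal placeholders, not the exact layers used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVENet(nn.Module):
    """Minimal sketch of the two-stream correspondence network."""

    def __init__(self, dim=128):
        super().__init__()
        # Placeholder backbones: the real ones are deeper CNNs over the
        # RGB frame and the log-mel spectrogram respectively.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.classify = nn.Linear(1, 2)  # distance -> {correspond, mismatch}

    def forward(self, image, spectrogram):
        v = F.normalize(self.vision(image), dim=1)
        a = F.normalize(self.audio(spectrogram), dim=1)
        dist = torch.norm(v - a, dim=1, keepdim=True)
        return self.classify(dist), v, a
```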

Problems encountered and solved

  • Data preprocessing: we originally obtained frames by downloading the videos from YouTube as MP4, choosing 10-second segments, and then selecting frames from those segments on the fly. The whole process was quite slow, and we realized that the bottleneck was data processing rather than network training. Since the data can be preprocessed into the network's input format and saved to disk beforehand, we did exactly that; with multiprocessing this step became very fast, and training was also greatly accelerated (see the preprocessing sketch after this list).
  • Preventing shortcuts: after training for one epoch, we found that the network had discovered a shortcut in our insufficiently randomized data. Because we did not shuffle the data over a large enough range, consecutive samples were correlated, and the network exploited the most recent samples to inflate its training accuracy while only pretending to have learned. We therefore increased the shuffle range so that data pairs are well scattered (see the buffered-shuffle sketch below).
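
A rough sketch of the offline preprocessing step is shown below: each worker extracts one frame with ffmpeg and one log-mel spectrogram with librosa, then saves both to disk so that training only loads ready-made inputs. The paths, sample rate and spectrogram parameters are illustrative assumptions, not the repository's actual settings.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

import librosa
import numpy as np

OUT_DIR = Path("preprocessed")  # assumed output directory

def preprocess(job):
    """Extract one frame and one log-mel spectrogram from a video segment."""
    video_path, start_sec, clip_id = job
    OUT_DIR.mkdir(exist_ok=True)

    # Grab a single frame at the requested timestamp.
    frame_path = OUT_DIR / f"{clip_id}.jpg"
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-i", str(video_path),
         "-frames:v", "1", str(frame_path)],
        check=True, capture_output=True)

    # Load ~1 s of audio around the same timestamp and convert to log-mel.
    wav, sr = librosa.load(video_path, sr=48000, offset=start_sec, duration=1.0)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=64)
    np.save(OUT_DIR / f"{clip_id}_logmel.npy", np.log(mel + 1e-6))

if __name__ == "__main__":
    jobs = [("videos/example.mp4", 3.0, "example_0003")]  # hypothetical job list
    with Pool(processes=8) as pool:
        pool.map(preprocess, jobs)
```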

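The shuffling fix can be illustrated with a bounded shuffle buffer (a minimal, framework-agnostic sketch, not the repository's actual data loader): if the buffer is smaller than the run of consecutive samples coming from one video, correlated pairs still reach the network together, so the buffer must be large enough to scatter them.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in randomised order using a bounded shuffle buffer.

    A buffer_size smaller than the number of consecutive samples drawn from
    one video leaves correlated pairs adjacent, which is the shortcut the
    network exploited; a sufficiently large buffer scatters them.
    """
    rng = random.Random(seed)
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) > buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf
```
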
Results

(Retrieval results figure)

To-Do

  • Visualizing the results: display the retrieved results that match each query best.
  • Localizing objects that sound: implement the localization part of the paper.
  • Extension to other evaluations: include the other three types of queries (image-to-audio, image-to-image and audio-to-audio); see the retrieval sketch after this list.
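
All four query types reduce to nearest-neighbour search in the shared embedding space, as in the sketch below (assuming the embeddings have already been computed and L2-normalised by the trained subnetworks):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery items closest to the query.

    Because both modalities are embedded into the same L2-normalised space,
    the same function covers image-to-audio, audio-to-image, image-to-image
    and audio-to-audio retrieval; only the choice of query/gallery changes.
    """
    dists = np.linalg.norm(gallery_embs - query_emb[None, :], axis=1)
    return np.argsort(dists)[:k]
```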
