Dataset for Visually Indicated Sound Generation by Perceptually Optimized Classification

VIG Dataset

This repository contains the Visually Indicated sound Generation (VIG) dataset introduced in Visually Indicated Sound Generation by Perceptually Optimized Classification (best paper at the ECCV MULA Workshop 2018).

Introduction

Visually Indicated Sound Generation aims to predict sound that is consistent with the visual content of a video. Previous methods in Visually Indicated Sounds addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. State-of-the-art sound classification networks can now capture semantic-level information in the audio modality, which can also serve the purpose of visually indicated sound generation.

Framework

We explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve the audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is calculated via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. The framework of POCAN is shown below.

VIG dataset download

Data processing is based on Python 2.7.

We provide the YouTube ID of each video in the file vig_dl.lst, where each line maps one YouTube video to a file ID. You can download the videos with a tool such as youtube-dl (a sample download script, download_vig.py, is provided in this repository). The files vig_train.lst and vig_test.lst list the file IDs of the training and test videos, respectively.

The annotation file vig_annotation.pkl provides the start time (key start_time), end time (key end_time) and sound class label (key vig_label). The sound class label is stored as a class ID; the mapping between class names and class IDs is provided in the file vig_class_map.pkl.
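The annotation keys described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' code: it is written for Python 3 (the repository itself targets Python 2.7), and it assumes that vig_annotation.pkl is a dictionary keyed by file ID, that the times are numbers in seconds, and that vig_class_map.pkl maps class names to class IDs. None of these layout details are confirmed here, so inspect the real files before relying on them. The helper names and the sample values are made up for illustration.

```python
import io
import pickle


def load_py2_pickle(fileobj):
    """Load a pickle from a binary file object, including Python 2.7 pickles."""
    # encoding="latin1" lets Python 3 decode byte strings pickled by Python 2.
    return pickle.load(fileobj, encoding="latin1")


def clip_duration(entry):
    """Clip length, assuming start_time and end_time are numbers in seconds."""
    return float(entry["end_time"]) - float(entry["start_time"])


def invert_class_map(class_map):
    """Turn an assumed {class name: class ID} map into {class ID: class name}."""
    return {class_id: name for name, class_id in class_map.items()}


# Synthetic stand-ins for vig_annotation.pkl and vig_class_map.pkl.
# The file ID, times, label and class name below are made-up values.
fake_annotations = {"000001": {"start_time": 3.0, "end_time": 8.5, "vig_label": 2}}
fake_class_map = {"dog": 2}

# Protocol 2 is the highest pickle protocol that Python 2.7 can write.
packed = pickle.dumps(fake_annotations, protocol=2)
annotations = load_py2_pickle(io.BytesIO(packed))
id_to_name = invert_class_map(fake_class_map)

entry = annotations["000001"]
print(id_to_name[entry["vig_label"]], clip_duration(entry))
```

With the real files, the synthetic data would be replaced by something like `with open("vig_annotation.pkl", "rb") as f: annotations = load_py2_pickle(f)`. If the class map turns out to be stored as class ID to class name, the inversion step is unnecessary.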

Some demo video clips, together with their sound waveforms and spectrograms, are shown in the figure below.

Performance of POCAN on VIG

We use recall at top K (R@K) as the metric for sound retrieval on the VIG test set. The performance of POCAN is listed in the table below.

| Model | K = 1 | K = 5 | K = 10 |
| --- | --- | --- | --- |
| Owens et al. [1] | 0.0997 | 0.2888 | 0.4640 |
| POCAN | 0.1223 | 0.3625 | 0.4802 |

More details can be found in the paper.
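For readers unfamiliar with the R@K metric, the sketch below shows one common way to compute it. It is an illustration under assumptions, not the evaluation code used in the paper: it assumes a square score matrix in which `scores[i][j]` is the similarity between query i and candidate j, and the correct candidate for query i is candidate i. The function name and the toy scores are hypothetical.

```python
def recall_at_k(scores, k):
    """Fraction of queries whose correct candidate is ranked in the top k.

    scores[i][j] is the similarity between query i and candidate j.
    The correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix with three queries and three candidates (made-up values).
toy_scores = [
    [0.9, 0.1, 0.0],  # correct candidate 0 is ranked first
    [0.8, 0.5, 0.1],  # correct candidate 1 is ranked second
    [0.3, 0.2, 0.1],  # correct candidate 2 is ranked third
]

for k in (1, 2, 3):
    print(k, recall_at_k(toy_scores, k))
```

A higher R@K means the correct sound appears among the top K retrieved results more often, which is how the K = 1, 5 and 10 columns in the table should be read.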

Citation

If you find this repository useful for your research, please consider citing the following work:

@inproceedings{chen2018visually,
  title={Visually Indicated Sound Generation by Perceptually Optimized Classification},
  author={Chen*, Kan and Zhang*, Chuanxi and Fang, Chen and Wang, Zhaowen and Bui, Trung and Nevatia, Ram},
  booktitle={ECCV MULA Workshop},
  year={2018}
}

Reference

[1] Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." In CVPR, 2016.
