[Paper] [Project page]

Code for the paper:

Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018

Contents

This release includes:

  • On/off-screen source separation
  • Blind source separation
  • Sound source localization
  • Self-supervised features for the audio-visual network

Setup

  • Install TensorFlow
pip install tensorflow     # for CPU evaluation only
pip install tensorflow-gpu # for GPU support
  • Install other Python dependencies
pip install numpy matplotlib pillow scipy
  • Download the pretrained models and sample data
./download_models.sh
./download_sample_data.sh

Pretrained audio-visual features

We provide pretrained features from our fused audio-visual network, which were learned through self-supervised training. See shift_example.py for a simple example that uses these pretrained features.
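
As a rough illustration only (shift_example.py is the authoritative reference), the sketch below shows how pretrained TensorFlow 1.x weights of this kind can be restored from a checkpoint. The checkpoint path and tensor name are placeholders, not the repository's actual ones.

import tensorflow as tf

# Hypothetical checkpoint path -- see shift_example.py and download_models.sh
# for where the real pretrained weights live.
checkpoint = '../results/nets/shift/net.tf'

with tf.Session() as sess:
    # Load the saved graph definition, then restore the pretrained variables.
    saver = tf.train.import_meta_graph(checkpoint + '.meta')
    saver.restore(sess, checkpoint)
    # Fused audio-visual features could then be fetched by tensor name, e.g.:
    # feats = sess.run('some/fused_feature_tensor:0', feed_dict={...})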

Audio-visual source separation

To try the on/off-screen source separation model, run:

python sep_video.py ../data/translator.mp4 --model full --duration_mult 4 --out ../results/

This will separate a speaker's voice from that of an off-screen speaker. It will write the separated video files to ../results/ and also display them in a local web page for easier viewing. This produces the following videos (click to watch):

[Videos: input, on-screen, off-screen]

We can visually mask out one of the two on-screen speakers, thereby removing their voice:

python sep_video.py ../data/crossfire.mp4 --model full --mask l --out ../results/
python sep_video.py ../data/crossfire.mp4 --model full --mask r --out ../results/

This produces the following videos (click to watch):

[Videos: source, left, right]
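
For convenience, the separation commands above can also be scripted. The sketch below (not part of the release) simply loops sep_video.py over a few clips using the same flags shown in this README; the clip list is only an example.

import subprocess

clips = ['../data/translator.mp4', '../data/crossfire.mp4']  # example inputs
for clip in clips:
    # Mirrors: python sep_video.py <clip> --model full --out ../results/
    subprocess.run(['python', 'sep_video.py', clip,
                    '--model', 'full',
                    '--out', '../results/'],
                   check=True)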

Blind (audio-only) source separation

This baseline trains a u-net model to minimize a permutation-invariant loss.

python sep_video.py ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/

The model will write the two separated streams in an arbitrary order.
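
For reference, permutation-invariant training scores every assignment of estimated streams to reference streams and trains on the best one, which is why the output order is arbitrary. Below is a minimal NumPy sketch of such a loss for two sources; it illustrates the idea and is not the repository's implementation.

import numpy as np

def pit_l2_loss(est, ref):
    # est, ref: arrays of shape (2, T) holding two estimated and two
    # reference streams (e.g. waveforms or flattened spectrograms).
    # Loss under the identity assignment: est[0]->ref[0], est[1]->ref[1].
    loss_id = np.mean((est[0] - ref[0]) ** 2) + np.mean((est[1] - ref[1]) ** 2)
    # Loss under the swapped assignment: est[0]->ref[1], est[1]->ref[0].
    loss_swap = np.mean((est[0] - ref[1]) ** 2) + np.mean((est[1] - ref[0]) ** 2)
    # The network is trained on whichever assignment fits best.
    return min(loss_id, loss_swap)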

Visualizing the locations of sound sources

To view the self-supervised network's class activation map (CAM), use the --cam flag:

python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/

This produces a video in which the CAM is overlaid as a heat map.
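
As background, a class activation map is formed by weighting the network's last convolutional feature maps with the weights of the class of interest and upsampling the result to frame size. The sketch below illustrates this generic recipe with NumPy/SciPy; the array shapes are assumptions, not the network's actual ones.

import numpy as np
from scipy.ndimage import zoom

def class_activation_map(features, class_weights, frame_hw):
    # features:      (H, W, C) activations from the last conv layer (assumed shape).
    # class_weights: (C,) final-layer weights for the class of interest.
    # frame_hw:      (frame_height, frame_width) of the video frame.
    cam = np.tensordot(features, class_weights, axes=([2], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)       # keep positive evidence only
    cam /= cam.max() + 1e-8          # normalize to [0, 1]
    fh, fw = frame_hw
    # Upsample to frame resolution, ready to be overlaid as a heat map.
    return zoom(cam, (fh / cam.shape[0], fw / cam.shape[1]), order=1)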

Action recognition

Coming soon!

Citation

@article{multisensory2018,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},
  year={2018}
}

Acknowledgements

Our u-net implementation draws from this implementation of pix2pix.
