Multimodal Modelling of Flickr Vision and Speech Data

Update

A more recent version of this recipe accompanying a new paper is available here.

Overview

This is a recipe for grounding untranscribed speech using paired images. Details are given in Kamper et al., 2017:

  • H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, "Visually grounded learning of keyword prediction from untranscribed speech," Proc. Interspeech, 2017.

Please cite this paper if you use this code.

Disclaimer

The code provided here is not pretty. But I believe research should be reproducible, and I hope that this repository is sufficient to make this possible for the above paper. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.

Datasets

The following datasets need to be obtained:

  • Flickr30k
  • Flickr8k audio
  • Flickr8k images
  • Flickr8k text

Flickr30k is used for training a vision tagging system. The Flickr8k audio and image datasets give paired images with spoken captions; we do not use the labels from either of these. The Flickr8k text corpus is purely for reference. The Flickr8k dataset can also be browsed directly here.

Preliminary

Install all the standalone dependencies (below). Then clone the required GitHub repositories into ../src/ as follows:

mkdir ../src/
git clone https://github.com/kamperh/tflego.git ../src/tflego/

Download all the required datasets (above), and then update paths.py to point to the corresponding directories.
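
As a rough illustration only, paths.py might look something like the sketch below. The variable names and placeholder directories are hypothetical and are not taken from the repository, so check the actual paths.py for the names it expects. The helper missing_paths is also an assumption, added just to show a quick sanity check after editing the file.

```python
"""Hypothetical sketch of a paths.py; all names and locations are assumptions."""
from os import path

# Placeholder locations: replace with wherever the datasets were downloaded.
flickr30k_dir = "/path/to/flickr30k"
flickr8k_audio_dir = "/path/to/flickr8k_audio"
flickr8k_image_dir = "/path/to/flickr8k_images"
flickr8k_text_dir = "/path/to/flickr8k_text"


def missing_paths(directories):
    """Return the entries of `directories` that are not existing directories."""
    return [d for d in directories if not path.isdir(d)]


if __name__ == "__main__":
    all_dirs = [
        flickr30k_dir,
        flickr8k_audio_dir,
        flickr8k_image_dir,
        flickr8k_text_dir,
    ]
    for d in missing_paths(all_dirs):
        print("Missing dataset directory:", d)
```

Running the file directly prints any directory that does not exist yet, which can catch a mistyped path before feature extraction starts.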

Feature extraction

Extract filterbank and MFCC features by running the steps in kaldi_features/readme.md.

Neural network training

Train the multi-label visual classifier by running the steps in vision_nn_flickr30k/readme.md. Note the final model directory.
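
For readers unfamiliar with the term, a multi-label classifier scores every tag independently, so one image can receive several tags at once. The NumPy sketch below only illustrates that idea; the four-word vocabulary, the helper names and the 0.5 threshold are all assumptions and are not taken from vision_nn_flickr30k/.

```python
import numpy as np

# Hypothetical tag vocabulary; the real label set is defined by the recipe.
vocabulary = ["dog", "ball", "grass", "water"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def multi_hot(tags):
    """Encode a collection of tags as a multi-hot vector over `vocabulary`."""
    target = np.zeros(len(vocabulary), dtype=np.float32)
    for tag in tags:
        if tag in word_to_index:  # out-of-vocabulary tags are ignored
            target[word_to_index[tag]] = 1.0
    return target


def sigmoid(logits):
    """Map each logit independently to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))


# Training target for an image tagged with three words, one of them unknown.
target = multi_hot(["dog", "grass", "frisbee"])

# Made-up network outputs (logits), one per vocabulary entry.
probabilities = sigmoid(np.array([2.0, -1.0, 0.5, -3.0]))

# Every tag whose probability exceeds the (assumed) threshold is predicted.
predicted = [w for w, p in zip(vocabulary, probabilities) if p > 0.5]
```

Unlike a softmax classifier, the per-tag probabilities here need not sum to one.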

Train the various visually grounded speech models by running the steps in speech_nn/readme.md.
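
In the paper cited above, the visual tagger's soft outputs for an image serve as training targets for a speech network that sees the paired spoken caption. One common way to train against such soft targets is a per-label sigmoid cross-entropy, sketched below in NumPy. This is an assumed formulation with made-up numbers, not the code in speech_nn/; refer to that directory for the losses actually used.

```python
import numpy as np


def sigmoid_cross_entropy(logits, targets):
    """Mean per-label cross-entropy between sigmoid(logits) and targets in [0, 1].

    Uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)).
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    loss = (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(loss.mean())


# Made-up numbers: tag probabilities from the vision model for one image,
# and the speech model's logits for the paired spoken caption.
vision_soft_targets = np.array([0.9, 0.1, 0.7, 0.05])
speech_logits = np.array([1.5, -2.0, 0.3, -2.5])

loss = sigmoid_cross_entropy(speech_logits, vision_soft_targets)
```

Because the targets are probabilities rather than hard 0/1 labels, no transcriptions of the speech are needed at any point.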

Dependencies

Standalone packages:

Repositories from GitHub:

  • tflego: A wrapper for building neural networks. Should be cloned into the directory ../src/tflego/.
