Semantic Speech Retrieval using the Flickr Audio Captions Corpus


This is a recipe for training a model on images paired with untranscribed speech, and using this model for semantic keyword spotting. The model and this new task are described in the following publications:

  • H. Kamper, G. Shakhnarovich, and K. Livescu, "Semantic speech retrieval with a visually grounded model of untranscribed speech," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 1, pp. 89-98, 2019. [arXiv]
  • H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, "Visually grounded learning of keyword prediction from untranscribed speech," in Proc. Interspeech, 2017. [arXiv]

Please cite these papers if you use the code.
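
To make the task concrete, here is a minimal sketch of semantic keyword spotting viewed as a ranking problem. It assumes, in line with the papers above, that a trained model assigns each utterance one score per vocabulary word. The function name `rank_utterances`, the utterance keys and all the scores are invented for illustration and are not taken from this repository.

```python
def rank_utterances(scores, vocab, keyword):
    """Return utterance keys sorted from most to least relevant for `keyword`.

    scores: dict mapping an utterance key to a list of per-word scores
            (hypothetical model output, one score per entry in `vocab`).
    vocab:  list of words, in the same order as the score vectors.
    """
    word_index = vocab.index(keyword)  # raises ValueError for unknown words
    return sorted(scores, key=lambda utt: scores[utt][word_index], reverse=True)


# Toy example with invented scores for three utterances.
vocab = ["dog", "beach", "child"]
scores = {
    "utt_a": [0.9, 0.1, 0.2],
    "utt_b": [0.2, 0.8, 0.1],
    "utt_c": [0.6, 0.3, 0.7],
}
ranking = rank_utterances(scores, vocab, "dog")
```

Retrieval quality would then be measured by comparing such a ranking against the semantic labels; the evaluation itself is not shown here.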

Related repositories and datasets

A related recipe is also available, but this is the most recent one.

The semantic labels used here are also available separately in the semantic_flickraudio repository. Here we directly use processed versions of this dataset: all the pickled files in data/ starting with 06-16-23h59 were obtained directly from the semantic annotations.
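
As a rough sketch of how these files could be inspected, the helpers below list everything in data/ with the 06-16-23h59 prefix and load one file. The function names are hypothetical, and the snippet is written for Python 3, although the recipe itself was run with Python 2.7. The `encoding="latin1"` argument is the usual way to read Python 2 pickles from Python 3; whether these particular files need it is an assumption.

```python
import os
import pickle


def list_annotation_files(data_dir="data", prefix="06-16-23h59"):
    """Return sorted paths of files in `data_dir` whose names start with `prefix`."""
    return sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.startswith(prefix)
    )


def load_pickle(path):
    """Load one pickled file.

    encoding="latin1" lets Python 3 read pickles written by Python 2;
    that these files need it is an assumption, not something stated here.
    """
    with open(path, "rb") as f:
        return pickle.load(f, encoding="latin1")
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.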

The output of the multilabel visual classifier described below (also see vision_nn_1k/) can be downloaded directly here. We released these visual tags as part of the JSALT Rosetta project.


Disclaimer

The code provided here is not pretty. But I believe research should be reproducible, and I hope that this repository is sufficient to make this possible for the above papers. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.


Datasets

The following datasets need to be obtained:

MSCOCO and Flickr30k are used for training a vision tagging system. The Flickr8k audio and image datasets give paired images with spoken captions; we do not use the labels from either of these. The Flickr8k text corpus is purely for reference. The Flickr8k dataset can also be browsed directly here.

Directory structure

  • data/ - Contains permanent data (file lists, annotations) that are used elsewhere.
  • speech_nn/ - Speech systems trained on the Flickr Audio Captions Corpus.
  • vision_nn_1k/ - Vision systems trained on Flickr30k, MSCOCO and Flickr30k+MSCOCO, but with the vocabulary given by the 1k most common words in Flickr30k+MSCOCO. Evaluation is also only for those 1k words.
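
The idea of restricting the vocabulary to the most common caption words can be sketched as follows. This is an illustration only: the whitespace tokenizer, the function names `build_vocab` and `multi_hot`, and the toy captions are assumptions, and the actual preprocessing in vision_nn_1k/ may differ (for example in how tokens are normalised or filtered).

```python
from collections import Counter


def build_vocab(captions, size=1000):
    """Return the `size` most frequent words over all captions."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    return [word for word, _ in counts.most_common(size)]


def multi_hot(caption, vocab):
    """Return a 0/1 list marking which vocabulary words occur in `caption`."""
    words = set(caption.lower().split())
    return [1 if word in words else 0 for word in vocab]


# Toy example: a vocabulary of size 2 instead of 1000.
captions = ["a dog runs on the beach", "a child plays with a dog"]
vocab = build_vocab(captions, size=2)
target = multi_hot("the dog sleeps", vocab)
```

A multi-label classifier can then be trained to predict such 0/1 targets for each image, one output per vocabulary word.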


Install all the standalone dependencies (below). Then clone the required GitHub repositories into ../src/ as follows:

mkdir ../src/
git clone ../src/tflego/

Download all the required datasets (above), and then update to point to the corresponding directories.

Feature extraction

Extract filterbank and MFCC features by running the steps in kaldi_features/.

Neural network training

Train the multi-label visual classifier by running the steps in vision_nn_1k/. Note the final model directory.

Train the various visually grounded speech models by running the steps in speech_nn/.


Dependencies

Standalone packages:

  • Python: I used Python 2.7.
  • NumPy and SciPy.
  • TensorFlow: Required by the tflego repository below. I used TensorFlow v0.10.
  • Kaldi: Used for feature extraction.

Repositories from GitHub:

  • tflego: A wrapper for building neural networks. Should be cloned into the directory ../src/tflego/.



License

The code is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).

