A data set of semantic keyword spotting labels for the Flickr Audio Captions Corpus.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


Semantic Keyword Spotting Labels for the Flickr Audio Captions Corpus


This corpus was collected to investigate how a model trained on images paired with untranscribed speech could be used to perform semantic keyword spotting (a new speech retrieval task). The data set and associated task is described in:

  • H. Kamper, G. Shakhnarovich, and K. Livescu, "Semantic keyword spotting by learning from images and speech," arXiv preprint arXiv:1710.01949, 2017. [arXiv]

If you use this data, please cite the above work, as well as the ASRU'15 paper from the group at MIT that collected the original Flickr Audio Captions Corpus (see below).

Semantic keyword spotting

In standard keyword spotting, the task is to retrieve all utterances in a search collection containing spoken instances of a written query keyword. In semantic keyword spotting, instead of matching keywords exactly, the task is to retrieve all utterances that would be relevant if you searched for that keyword, irrespective of whether that keyword occurs in the utterance or not. For instance, given they query 'sidewalk', a model should return not only utterances containing the word exactly, but also speech like 'an old couple crossing a street'.

The corpus

The file keywords.txt lists the 67 keywords used in the task. The semantic labelling of utterances using these keywords are given in two files.

The CSV file semantic_flickraudio_labels.csv gives hard annotations for keywords that are relevant for specific utterances. The CSV file semantic_flickraudio_counts.csv gives the actual annotator counts. In this file, white=3|grass=1|dogs=5 for instance indicates that three out of five annotators marked as relevant the keyword 'white', one 'grass', and all five 'dogs'.

We only provide the semantic labels, without the associated audio or images. The original Flickr Audio Captions Corpus can be obtained here, while the original Flickr8k image corpus can be obtained here. Please cite these studies as well when using our corpus.

Semantic labels were collected only for 1000 test utterances in the corpus, one for each unique test image in Flickr8k.


The data is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).