Skip to content

A data set of semantic keyword spotting labels for the Flickr Audio Captions Corpus.

Notifications You must be signed in to change notification settings

kamperh/semantic_flickraudio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 

Repository files navigation

Semantic Keyword Spotting Labels for the Flickr Audio Captions Corpus

Overview

This corpus was collected to investigate how a model trained on images paired with untranscribed speech could be used to perform semantic keyword spotting (a new speech retrieval task). The data set and associated task is described in:

  • H. Kamper, G. Shakhnarovich, and K. Livescu, "Semantic keyword spotting by learning from images and speech," arXiv preprint arXiv:1710.01949, 2017. [arXiv]

If you use this data, please cite the above work, as well as the ASRU'15 paper from the group at MIT that collected the original Flickr Audio Captions Corpus (see below).

Semantic keyword spotting

In standard keyword spotting, the task is to retrieve all utterances in a search collection containing spoken instances of a written query keyword. In semantic keyword spotting, instead of matching keywords exactly, the task is to retrieve all utterances that would be relevant if you searched for that keyword, irrespective of whether that keyword occurs in the utterance or not. For instance, given they query 'sidewalk', a model should return not only utterances containing the word exactly, but also speech like 'an old couple crossing a street'.

The corpus

The file keywords.txt lists the 67 keywords used in the task. The semantic labelling of utterances using these keywords are given in two files.

The CSV file semantic_flickraudio_labels.csv gives hard annotations for keywords that are relevant for specific utterances. The CSV file semantic_flickraudio_counts.csv gives the actual annotator counts. In this file, white=3|grass=1|dogs=5 for instance indicates that three out of five annotators marked as relevant the keyword 'white', one 'grass', and all five 'dogs'.

We only provide the semantic labels, without the associated audio or images. The original Flickr Audio Captions Corpus can be obtained here, while the original Flickr8k image corpus can be obtained here. Please cite these studies as well when using our corpus.

Semantic labels were collected only for 1000 test utterances in the corpus, one for each unique test image in Flickr8k.

License

The data is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).

About

A data set of semantic keyword spotting labels for the Flickr Audio Captions Corpus.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published