
2021 09 21

Dan Oneață edited this page Sep 20, 2021 · 11 revisions

Herman's definitions of keyword spotting and keyword detection:

  • In keyword detection, you are given one utterance and asked whether a particular keyword occurs in that utterance. This normally requires a decision threshold on the model's score.
  • In keyword spotting, you are given a whole set of utterances and a keyword. You are then asked to rank all these utterances from most to least likely to contain the keyword. This doesn't require a threshold. A typical metric looks at, for instance, the top 10 ranked utterances and asks how many of them actually contain the keyword (P@10).
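To make the difference in inputs and outputs concrete, here is a minimal sketch of the two tasks and of P@10. It assumes we already have one score per keyword–utterance pair (higher means "more likely to contain the keyword"); the function names and the default threshold of 0.5 are hypothetical and chosen only for illustration.

```python
def detect(score, threshold=0.5):
    """Keyword detection: one utterance in, one binary decision out.

    The threshold value is an assumption made for this sketch.
    """
    return score >= threshold


def spot(scores):
    """Keyword spotting: many utterances in, a ranking out (no threshold).

    Returns utterance indices sorted from most to least likely.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(scores, labels, k=10):
    """Fraction of the top-k ranked utterances that contain the keyword.

    labels[i] is 1 if the keyword occurs in utterance i, else 0.
    Assumes at least one utterance is given.
    """
    top = spot(scores)[:k]
    return sum(labels[i] for i in top) / len(top)


# Toy example: four utterances scored for one keyword.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 0, 1]
decisions = [detect(s) for s in scores]   # detection: per-utterance decisions
ranking = spot(scores)                    # spotting: a ranking of utterances
p_at_2 = precision_at_k(scores, labels, k=2)
```

Both tasks consume the same scores and the same ground truth; only the output (a thresholded decision versus a ranking) and hence the metric differ.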

Based on Herman's description, maybe a better depiction of the two tasks (based on their inputs and outputs) would be the following:

(I believe I had conflated the two tasks because the ground truth used for evaluation is the same for both: for every keyword–utterance pair we know whether the keyword occurs in the utterance or not.)

  • Q I'm not a big fan of using the name "detection" since its use is different in the computer vision community. However, if this is standard in the speech processing community and if we are consistent about it, we can use it.
  • Q Why is it important to evaluate in terms of detection? Is it because it is coupled to the metrics on localisation?

I believe these two tasks are extended to keyword localisation by augmenting the output with one additional piece of information: the time step where the input keyword occurs. Schematically, this could be illustrated as follows:

  • Q There are two features that are somewhat unconventional and not immediately obvious:
    1. we assume that the keyword appears exactly once in the utterance (as we predict a single location);
    2. the probability of occurrence (or binary decision) is computed for the entire utterance and not the predicted location.
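A minimal sketch of how these two features might show up in code. It assumes a model that returns an utterance-level score plus one score per time step; that interface, the function name `localise`, and the threshold are hypothetical and not taken from any existing implementation.

```python
def localise(utterance_score, frame_scores, threshold=0.5):
    """Keyword localisation on top of detection (illustrative sketch).

    Feature 1: a single location is predicted (the highest-scoring time step),
    so the keyword is implicitly assumed to occur exactly once.
    Feature 2: the decision uses the utterance-level score, not the score
    at the predicted location.
    """
    is_present = utterance_score >= threshold
    location = max(range(len(frame_scores)), key=lambda t: frame_scores[t])
    return is_present, location


# The same per-frame scores give the same location, whatever the decision is.
hit = localise(0.8, [0.1, 0.6, 0.3])
miss = localise(0.2, [0.1, 0.6, 0.3])
```

Note that in this sketch a location is returned even when the utterance-level decision is negative, which is one way the decoupling in the second point becomes visible.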
