2021 09 21
Herman's definitions of keyword spotting and keyword detection:
- In keyword detection, you are given one utterance and asked whether a particular keyword occurs in that utterance. This normally requires some threshold somewhere.
- In keyword spotting, you are given a whole set of utterances. You are also given a keyword. You are then asked to rank all these utterances from most to least likely to contain the indicated keyword. This doesn't require a threshold. The metrics would look, for instance, at the top 10 ranked utterances, and ask how many of these actually contain the keyword (P@10).
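The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the per-utterance `scores`, and the choice to count a score exactly equal to the threshold as a detection are hypothetical and not taken from the draft paper.

```python
def detect(score, tau):
    """Keyword detection: one utterance, one thresholded yes/no decision.

    Assumption: a score exactly equal to tau counts as "keyword present".
    """
    return score >= tau


def precision_at_k(scores, contains_keyword, k=10):
    """Keyword spotting: rank all utterances for one keyword, no threshold.

    scores[i] is the model's score for utterance i and contains_keyword[i]
    is the groundtruth label. Returns the fraction of the top-k ranked
    utterances that actually contain the keyword (P@k).
    """
    # Indices of the utterances, sorted from most to least likely.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if contains_keyword[i])
    # Assumption: divide by k even when fewer than k utterances are available.
    return hits / k
```

For example, with `scores = [0.9, 0.1, 0.8, 0.4]` and `contains_keyword = [True, False, False, True]`, the top two utterances are the first and the third, so `precision_at_k(scores, contains_keyword, k=2)` is one hit out of two.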
From Herman's description, maybe a better depiction of the two tasks (based on their inputs and outputs) would be the following (instead of what I have currently drawn in the draft paper):

(In my mind, I believe I conflated the two tasks since the groundtruth used to perform the evaluation is the same for both: for all keyword–utterance pairs we know whether the keyword occurs in the utterance or not.)
- Q I'm not a big fan of using the name "detection" since its use is different in the computer vision community. However, if this is standard in the speech processing community and if we are consistent about it, we can use it.
- Q Why is it important to evaluate in terms of detection? Because it is coupled to the metrics on localisation?
I believe these two tasks are extended to the keyword localisation scenario by having an additional output: the time step where the input keyword occurs.
Schematically, this could be illustrated as follows:

- Q The methodology has some features that are somewhat unconventional and not immediately obvious:
- the probability of occurrence (and, consequently, the binary decision) is computed for the entire utterance and not the predicted location;
- we assume that the keyword appears exactly once in the utterance (as we predict a single location);
- the localisation doesn't involve duration.
Keyword localisation: Actual setup. For a given utterance and keyword, the prediction of the keyword localisation task has one of two possible values:
- not found, if the probability is under the threshold τ;
- found at time step t, if the probability is over the threshold τ.
This output is compared to the groundtruth, which also has a similar shape:
- not occurring, if the keyword is missing from the utterance;
- occurring from s to e, if the keyword is present in the utterance and starts at time s and ends at time e.
To evaluate in terms of the desired metrics (precision, recall, F1 score), we use the following mapping to compute the intermediary quantities (true negatives and positives; false negatives and positives):
| groundtruth | prediction | category |
|---|---|---|
| not occurring | not found | TN |
| not occurring | found at time step t | FP |
| occurring from s to e | not found | FN |
| occurring from s to e | found at time step t | TP if s ≤ t < e else FP |
Q That's not precise! In fact the groundtruth should be a list of segments, since a keyword can appear multiple times in an utterance. Maybe something along these lines, although it doesn't look pretty.
| groundtruth | prediction | category |
|---|---|---|
| not occurring | not found | TN |
| not occurring | found at time step t | FP |
| occurring in segments [(s₁, e₁), ...] | not found | FN |
| occurring in segments [(s₁, e₁), ...] | found at time step t | TP if any {s ≤ t < e ⋮ (s, e) ← segments} else FP |
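The mapping with a list of groundtruth segments can be written down as a small Python sketch. The names `categorise` and `precision_recall_f1` are hypothetical, as are the encoding of "not found" as `None` and of "not occurring" as an empty list, and the convention of returning 0.0 when a denominator is zero.

```python
from collections import Counter


def categorise(segments, predicted_t):
    """Map one (groundtruth, prediction) pair to TN, FP, FN or TP.

    segments: list of (s, e) pairs, one per occurrence of the keyword;
        an empty list means the keyword does not occur (assumed encoding).
    predicted_t: predicted time step, or None for "not found" (assumed encoding).
    """
    if predicted_t is None:
        return "FN" if segments else "TN"
    # With an empty list any() is False, so a "found" prediction is an FP.
    if any(s <= predicted_t < e for s, e in segments):
        return "TP"
    return "FP"


def precision_recall_f1(categories):
    """Compute precision, recall and F1 from a sequence of category labels."""
    counts = Counter(categories)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    # Assumption: report 0.0 whenever a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Note that, following the table, a prediction that falls outside every segment counts only as an FP, even though the keyword does occur in the utterance.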
Keyword localisation: Oracle setup. Note that for the "oracle" case the first two rows of the table are not relevant, since we ignore those cases in which the keyword does not occur in the groundtruth.