2021 11 03

Dan Oneață edited this page Nov 2, 2021 · 4 revisions

Minor points:

I've also been working on an alternative image for the attention-based architecture, but I'm not sure it's an improvement over the current version. I'm happy to take any feedback to improve it.
Qualitative results: Most common confusions. Maybe I should show the "detected" utterances (detection score > 0.5) instead of the top 20?
Figure 4: Ask Kayode to add text captions since I do not know which sample was selected.
Table 1: I am at fault for using the letters a, b, c, d to denote the architectures. At first, I envisioned some use for them, but in the end I believe it's better to stick to more meaningful names. Since they are not currently used anywhere, we can safely drop them.
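
To make the two options in the qualitative-results point concrete, here is a minimal sketch of both selection rules. The function names, the `(utterance_id, score)` format and the toy scores are assumptions for illustration only; they are not taken from our code.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (utterance id, detection score); hypothetical format


def select_top_k(scored: List[Scored], k: int = 20) -> List[Scored]:
    """Keep the k highest-scoring utterances, regardless of their scores."""
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


def select_detected(scored: List[Scored], threshold: float = 0.5) -> List[Scored]:
    """Keep only utterances whose detection score exceeds the threshold."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    return [p for p in ranked if p[1] > threshold]


# Toy scores for a single keyword (made-up numbers).
toy = [("utt1", 0.9), ("utt2", 0.6), ("utt3", 0.4), ("utt4", 0.1)]
top3 = select_top_k(toy, k=3)
detected = select_detected(toy, threshold=0.5)
```

With the top-k rule the list always has the same length, so it can include utterances the model did not actually flag; with the threshold rule the list length varies per keyword.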

Some ideas on what we could include in the experimental section on further results:

Keyword detection or spotting
- allows us to compare the architectures, regardless of the localisation techniques
- we can also include the visual teacher performance
- scatter plot of per keyword speech versus visual performance (see previous results)
Localisation results per keyword
- gives us an idea of the variance in performance across the vocabulary
- we can also try to optimise the threshold per keyword
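
As a rough illustration of the per-keyword threshold idea, the sketch below picks, for each keyword, the score threshold that maximises F1. The choice of F1, the data layout and all names are assumptions; in practice the thresholds would presumably be tuned on a held-out set and then applied to the test set.

```python
from typing import Dict, List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 when utterances with score >= threshold are predicted as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, F1), using the observed scores as candidate thresholds.

    Ties are resolved in favour of the lowest threshold.
    """
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def tune_per_keyword(
    data: Dict[str, Tuple[List[float], List[int]]]
) -> Dict[str, Tuple[float, float]]:
    """Tune one threshold per keyword; data maps keyword -> (scores, labels)."""
    return {kw: best_threshold(scores, labels) for kw, (scores, labels) in data.items()}


# Toy data (made-up numbers): two keywords with differently calibrated scores.
toy_data = {
    "dog": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "ball": ([0.45, 0.4, 0.1], [1, 1, 0]),
}
tuned = tune_per_keyword(toy_data)
```

In the toy data, a global threshold of 0.5 would miss every positive for "ball", whereas the per-keyword threshold recovers them. This is the kind of variance across the vocabulary that the per-keyword results would expose.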