2022 03 09

Jump to bottom

Dan Oneață edited this page Mar 9, 2022 · 4 revisions

Updated results in the Google Sheets, see sheet 2021-08-31 · train-embedding-space.
The average precision of the models trained on wav2vec2 is a bit better than previous models (15.9% versus 14.1%), but still below the one trained on labels from the image tagger (26.5%); note that the performance of the CLIP embeddings is around 22.6%.
Would it be worth it, to have a model that combines best of both worlds?
- Predict intermediate CLIP features (allowing open-vocabulary retrieval)
- Predict soft labels based on the CNN image tagger (allow good in-vocabulary performance)
- Q Do we have out-of-vocabulary annotations?
- Q Would such a model fit in the story, or does it diverge too much?
- Q How important is it to have consistent architectures to previous models? For example, currently, I'm using a small Transformer on top of the CNN audio encoder.