
2022 03 09

Dan Oneață edited this page Mar 9, 2022 · 4 revisions
  • Updated results in the Google Sheets, see sheet 2021-08-31 · train-embedding-space.
  • The average precision of the models trained on wav2vec2 is slightly better than that of the previous models (15.9% versus 14.1%), but still below that of the model trained on labels from the image tagger (26.5%); note that the CLIP embeddings reach around 22.6%.
  • Would it be worth having a model that combines the best of both worlds?
    • Predict intermediate CLIP features (allowing open-vocabulary retrieval)
    • Predict soft labels based on the CNN image tagger (allowing good in-vocabulary performance)
    • Q Do we have out-of-vocabulary annotations?
    • Q Would such a model fit in the story, or does it diverge too much?
    • Q How important is it to keep the architecture consistent with previous models? For example, I'm currently using a small Transformer on top of the CNN audio encoder.
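
One way the combined model proposed above could be trained is with a weighted sum of two losses, one per prediction target. The sketch below is only an illustration under assumptions that are not in these notes: the CLIP target is matched with a cosine distance, the image-tagger soft labels are treated as independent per-tag probabilities (binary cross-entropy), and the function names and the mixing weight `alpha` are hypothetical.

```python
import numpy as np


def cosine_distance(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between predicted and target embeddings."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))


def soft_label_bce(logits, soft_labels):
    """Mean binary cross-entropy between logits and soft targets in [0, 1].

    Uses the numerically stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    loss = (
        np.maximum(logits, 0.0)
        - logits * soft_labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(np.mean(loss))


def combined_loss(pred_clip, target_clip, tag_logits, soft_labels, alpha=0.5):
    """Hypothetical objective: alpha weights the CLIP-feature term,
    (1 - alpha) weights the soft-label term (alpha is an assumed knob)."""
    embed_term = cosine_distance(pred_clip, target_clip)
    label_term = soft_label_bce(tag_logits, soft_labels)
    return alpha * embed_term + (1.0 - alpha) * label_term


# Toy batch: 2 utterances, 4-dim "CLIP" features, 3 tags (all sizes made up).
rng = np.random.default_rng(0)
target_clip = rng.normal(size=(2, 4))
pred_clip = target_clip + 0.1 * rng.normal(size=(2, 4))
tag_logits = rng.normal(size=(2, 3))
soft_labels = rng.uniform(size=(2, 3))
print(combined_loss(pred_clip, target_clip, tag_logits, soft_labels))
```

With `alpha=1.0` this reduces to pure CLIP-feature regression (open-vocabulary only), and with `alpha=0.0` to pure soft-label prediction, so the two existing setups would be the endpoints of the same objective.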
