2022 03 09
Dan Oneață edited this page Mar 9, 2022
- Updated the results in Google Sheets; see the sheet "2021-08-31 · train-embedding-space".
- The average precision of the models trained on wav2vec2 is a bit better than that of the previous models (15.9% versus 14.1%), but still below that of the model trained on labels from the image tagger (26.5%). For reference, the CLIP embeddings reach around 22.6%.
- Would it be worth having a model that combines the best of both worlds?
  - Predict intermediate CLIP features (allowing open-vocabulary retrieval)
  - Predict soft labels based on the CNN image tagger (allowing good in-vocabulary performance)
- Q: Do we have out-of-vocabulary annotations?
- Q: Would such a model fit in the story, or does it diverge too much?
- Q: How important is it to keep the architecture consistent with previous models? For example, I'm currently using a small Transformer on top of the CNN audio encoder.
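
To make the "best of both worlds" idea above more concrete, here is a minimal sketch of what a joint training objective could look like. Everything in it is an assumption for discussion, not the current implementation: the function name `combined_loss`, the mixing weight `alpha`, the use of cosine distance for the CLIP features, and the treatment of the tagger outputs as independent per-word probabilities (sigmoid cross-entropy) are all hypothetical choices.

```python
import numpy as np


def combined_loss(clip_pred, clip_target, tag_logits, tag_soft_labels, alpha=0.5):
    """Weighted sum of a CLIP-feature loss and a soft-label loss (sketch).

    clip_pred, clip_target: (batch, d) arrays with the features predicted
        from audio and the image CLIP features, respectively.
    tag_logits: (batch, n_tags) logits predicted from audio, one per word.
    tag_soft_labels: (batch, n_tags) image-tagger probabilities in [0, 1].
    alpha: hypothetical mixing weight between the two terms.
    """

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Open-vocabulary head: cosine distance to the CLIP image features.
    cos = np.sum(l2_normalize(clip_pred) * l2_normalize(clip_target), axis=1)
    loss_clip = np.mean(1.0 - cos)

    # In-vocabulary head: binary cross-entropy against soft targets,
    # written in the numerically stable form
    #   max(z, 0) - z * y + log(1 + exp(-|z|)).
    z, y = tag_logits, tag_soft_labels
    loss_tags = np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

    return alpha * loss_clip + (1.0 - alpha) * loss_tags
```

In this sketch both heads would sit on top of the same audio encoder, so `alpha` controls how much the shared representation is pulled towards open-vocabulary retrieval versus in-vocabulary detection; whether that trade-off helps is exactly the open question above.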
