
2021 09 08

Dan Oneață edited this page Sep 7, 2021 · 2 revisions

I have extended the previous evaluation of the audio-to-CLIP model by paying more attention to the hyperparameters and the evaluation metrics. For the hyperparameters, I experimented with larger batch sizes (64, 128, 256) and two learning rates (4e-4 and 2e-4). Note that the batch size determines the number of negative samples in the contrastive loss, which seems to be an important factor in attaining good performance (see the SimCLR paper?). The evaluation now includes ⅰ. keyword spotting on the 67 selected words and ⅱ. retrieval performance between all pairs of modalities in which at least one is audio (that is, audio → image, audio → text, image → audio, text → audio).
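To make the retrieval evaluation concrete, here is a minimal NumPy sketch of how a direction such as audio → image could be scored. It assumes the metric is recall@k over cosine similarities and that row i of the query matrix is paired with row i of the target matrix; the function name `recall_at_k` and the toy data are hypothetical and not taken from the project code.

```python
import numpy as np


def recall_at_k(query_emb, target_emb, k):
    """Fraction of queries whose paired target (same row index) is in the top-k."""
    # L2-normalise so that the dot product equals the cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = q @ t.T  # shape: (num_queries, num_targets)
    # Indices of the k most similar targets for each query.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    correct = np.arange(len(q))[:, None]
    return float(np.mean(np.any(top_k == correct, axis=1)))


# Toy paired embeddings (assumption: stand-ins for real audio/image features).
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 16))
image = audio + 0.1 * rng.normal(size=(100, 16))

print("audio -> image R@1:", recall_at_k(audio, image, k=1))
print("image -> audio R@1:", recall_at_k(image, audio, k=1))
```

Swapping the two arguments gives the reverse direction, so the four audio-related directions listed above only need the appropriate pair of embedding matrices.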

I have also made some implementation improvements to the code (although I don't think any of these is crucial for achieving good performance):

  • learning rate scheduler (warm-up to the desired learning rate, followed by cosine annealing over a single period)
  • optimizer: AdamW
  • use weight decay, but not for biases or gains (normalizations)
  • use symmetric loss: that is, match both audio to images and images to audio
  • log results to wandb
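Two of the items above can be sketched in a few lines. The following is an illustrative NumPy version, not the training code itself: it assumes a linear warm-up, annealing down to zero, and a fixed temperature of 0.07, none of which are stated on this page. The names `lr_at_step` and `symmetric_contrastive_loss` are hypothetical.

```python
import math

import numpy as np


def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warm-up to base_lr, then cosine annealing to zero over one period."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Average of the audio -> image and image -> audio cross-entropy terms.

    Row i of audio_emb is paired with row i of image_emb; every other row
    in the batch acts as a negative, so a batch of B gives B - 1 negatives.
    The temperature value is an assumption.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # shape: (B, B)
    idx = np.arange(len(a))
    loss_a2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # softmax over images
    loss_i2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # softmax over audio
    return float(0.5 * (loss_a2i + loss_i2a))


# Learning rate at a few points of a 1000-step run with 100 warm-up steps.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, base_lr=2e-4))
```

When the embeddings carry no information (all rows identical), the loss equals log B, which shows how the batch size sets the difficulty of the contrastive task.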

The results are available in the Google Sheets document; see the sheet entitled 2021-08-31 · train-embedding-space.

tl;dr

  • performance improves with batch size (best results for batch size of 256)
  • somewhat better performance for the lower learning rate (2e-4)
  • audio → image and image → audio performance is similar to that of recent approaches, but still behind the most recent work of Sanabria et al. (2021)

Next steps:

  • I would like to train on a larger dataset, and I was eyeing David Harwath's Spoken COCO (which consists of 600K utterances)
  • I will try the Transformer architecture, but in the same setting as what Kayode used for the journal paper

P.S.: Loosely related: last week I also spent some time going through the InfoNCE paper, and I have written a short note that offers an alternative proof (to the one in the paper) that the InfoNCE loss gives a lower bound on the mutual information.
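For reference, the bound in question, written in the notation of the InfoNCE paper (van den Oord et al., 2018) as I recall it, where N is the number of samples in the contrastive set (one positive and N − 1 negatives) and the loss is the InfoNCE loss over that set:

```latex
I(x_{t+k};\, c_t) \;\ge\; \log N - \mathcal{L}_N
```

Minimizing the loss therefore maximizes a lower bound on the mutual information, and the bound can be at most log N, which is consistent with larger batches (more negatives) helping.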
