
2021 08 25

Dan Oneață edited this page Sep 7, 2021 · 9 revisions

Overview. I have trained an audio model to learn embeddings that are correlated with the visual embeddings extracted by the CLIP model (an image–text model). This idea was discussed previously and is illustrated in the following diagram:

Training. The loss is inspired by CLIP: within each batch the images play the role of classes, and the model has to assign each utterance to the correct class (image) based on the cosine similarity between the image and audio embeddings. Concretely, the loss is the cross entropy, with the target ids being the indices of the corresponding images, that is, [1, 2, 3, ..., B], where B is the batch size. The larger the batch size, the harder the task, since it is more difficult to pick the correct image among more candidates. The CLIP paper uses a batch size of a whopping 32K samples! In my initial experiments, however, I have used the more down-to-earth value of 32.
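The loss described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the code in train_emb.py: the function names are hypothetical, the `scale` multiplier on the similarities is an assumption (the text above does not say whether a temperature is used), and indices start at 0 rather than 1.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(audio_embs, image_embs, scale=1.0):
    """Mean cross entropy of assigning each utterance to its paired image.

    audio_embs, image_embs: lists of B vectors; row i of each forms a pair.
    scale: multiplier applied to the cosine similarities (an assumption).
    """
    audio = [l2_normalize(a) for a in audio_embs]
    image = [l2_normalize(v) for v in image_embs]
    total = 0.0
    for i, a in enumerate(audio):
        # Logits of utterance i against every image in the batch.
        logits = [scale * sum(x * y for x, y in zip(a, v)) for v in image]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        # Cross entropy with target class i (the paired image).
        total += log_norm - logits[i]
    return total / len(audio)
```

If all embeddings are identical, every image is equally likely and the function returns log(B), which is the chance level for a batch of size B.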

Architecture. I started from Kayode's CNN-Attend network, but, somewhat surprisingly, the model was not able to learn: the loss was stuck at log(32) ≈ 3.47, the value corresponding to random chance. When I trained on a single batch, the error did decrease, suggesting that the network may not have had enough capacity. For this reason, I took the CNN front-end and combined it with a Transformer network (similar to what is done in the CLIP paper for the text branch, but smaller: 128 width, 8 layers, 4 heads); to keep the memory in check, I decreased the temporal dimension of the MFCC features by using a stride of two in four layers (which reduces the sequence length by a factor of 16 = 2⁴). Using the CNN-Transformer, I did manage to learn an initial model which is
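The two numbers in this paragraph can be checked with a short calculation. The sketch below computes the cross entropy of a uniform guess over a batch of 32 and the sequence length left after four stride-2 layers. The helper `downsampled_length` is hypothetical, and it assumes each strided layer rounds the length up, which depends on padding choices not stated here.

```python
import math

batch_size = 32
# A uniform guess over the batch gives cross entropy -log(1 / B) = log(B).
chance_loss = math.log(batch_size)
print(f"chance-level loss: {chance_loss:.2f}")  # prints "chance-level loss: 3.47"


def downsampled_length(num_frames, num_strided_layers=4, stride=2):
    """Sequence length after the strided layers, assuming each rounds up."""
    for _ in range(num_strided_layers):
        num_frames = math.ceil(num_frames / stride)
    return num_frames
```

For example, `downsampled_length(1024)` gives 64 frames, a 16-fold reduction in the sequence length seen by the Transformer.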

Implementation. Since modifying Kayode's original code resulted in too many changes, I decided to rewrite the training and evaluation scripts from scratch. See train_emb.py and predict_emb.py.

Evaluation: Retrieval. The model projects three modalities (audio, image, text) into a common space, so we can evaluate it by retrieving items of one modality given a query from another. Below are quantitative results for all pairs.

query → retrieved    R@1 (%)   R@5 (%)   R@10 (%)
audio → image            8.3      26.4       38.0
audio → text             3.6      11.8       18.1
image → audio            6.1      19.5       30.7
image → text            72.0      92.0       96.2
text → audio             2.3       7.8       11.5
text → image            57.1      82.9       90.7

When using the top 10 retrieved samples (R@10):

  • image-text and text-image are around 90–100%
  • audio-image and image-audio are around 30–40%
  • audio-text and text-audio are around 10–20%
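For reference, R@K can be computed from a query-by-gallery similarity matrix as sketched below. This is a minimal illustration rather than the code in predict_emb.py: the function name is hypothetical, and it assumes a one-to-one pairing in which the correct match for query i is gallery item i.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose paired item is among the top-k results.

    similarity[i][j] is the score between query i and gallery item j;
    the correct match for query i is assumed to be gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of gallery items, best match first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)
```

Swapping the roles of the two modalities (transposing the matrix) gives the opposite retrieval direction, which is why each pair of modalities appears twice in the results.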

For reference, here are results from prior work; this table was taken from the paper of Sanabria et al. (2021):

For qualitative examples, see scripts/show_embedding_model_results.py. The retrieval involving audio is not unreasonable, but it mostly focuses on the main words in the utterance (e.g., the subject of the sentence).

Evaluation: Keyword spotting. Each of the 67 words in the vocabulary is inserted into the sentence "this is a photo of a {word}", which is then embedded. For each embedded word, we rank the audio utterances by decreasing cosine similarity and compute the average precision. The current results are around 10–11%, while the CNN-Attend obtains 26.5%.
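The average precision of one keyword's ranking can be sketched as follows. The function name and the label convention (1 if the utterance contains the keyword, 0 otherwise) are assumptions made for illustration; averaging the per-keyword values into a single figure is likewise an assumption about how the number above is aggregated.

```python
def average_precision(scores, labels):
    """Average precision of the ranking induced by `scores`.

    scores: similarity of each utterance to the keyword embedding.
    labels: 1 if the utterance contains the keyword, else 0.
    """
    # Utterance indices sorted from most to least similar.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    num_hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            num_hits += 1
            # Precision at the rank of each relevant utterance.
            precision_sum += num_hits / rank
    return precision_sum / num_hits if num_hits else 0.0
```

With scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the relevant utterances sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2 ≈ 0.83.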

TODO. Some possible things to do next:

  • hyperparameter tuning
  • anneal batch size from smaller values to larger ones
  • validation in terms of keyword spotting
  • train on a larger dataset

Questions.

  • should I use positional embeddings for speech?
  • improve audio encoder? maybe use wav2vec embeddings?
