Dan Oneață edited this page Aug 14, 2021 · 6 revisions
  • Finetuning. Should we finetune the CLIP models on the same data as the VGG model (that is, Flickr30K and MS COCO)? This step will almost certainly improve the results on Flickr8K, but it makes for a less realistic scenario.
  • Open vocabulary keyword spotting. Leverage pretrained visual–language models: by aligning the audio modality to the visual branch, we also implicitly align audio to language. Since the language branch can accept arbitrary text, we can in principle retrieve any word (or even sentence) in the speech signal. Unlike the current methodology, we will now need to train models in the embedding space. Note that this setup allows the two languages (the one from the text branch and the one from the audio branch) to be different, as they are coupled through the visual channel, so the approach is suitable for language documentation.
  • Generalization. Evaluate how well our models work on a different dataset, such as Places Audio or SpeechCOCO.
  • Multi-task learning. Train on both audio–image pairs and audio–text pairs (the two sets can potentially be unaligned). Quantify the performance as a function of the quantity of audio–image pairs and audio–text pairs.
  • Last layer: sigmoids vs softmax. The current implementation uses sigmoids as the last layer (one for each word label), but maybe a reasonable alternative would be to use the softmax activation. Mahajan et al. (2018) give an argument for the softmax layer:

We have also experimented with per-hashtag sigmoid outputs and binary logistic loss, but obtained significantly worse results. While counter-intuitive given the multi-label data, these findings match similar observations in [16].
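
To make the two options concrete, here is a minimal sketch of both losses for a single utterance, written in plain Python. It is not the project's implementation: the function names are hypothetical, and the softmax variant assumes the target convention that, as far as I recall, Mahajan et al. use for multi-label data (each of the k positive labels gets target mass 1/k).

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_bce(logits, targets):
    """Mean binary cross-entropy with one independent sigmoid per word label."""
    total = 0.0
    for z, t in zip(logits, targets):
        log_p = -_softplus(-z)      # log sigmoid(z)
        log_not_p = -_softplus(z)   # log (1 - sigmoid(z))
        total -= t * log_p + (1.0 - t) * log_not_p
    return total / len(logits)


def softmax_ce(logits, targets):
    """Cross-entropy against a softmax over all word labels.

    Assumption: binary targets are renormalized so that each of the
    k positive labels receives target mass 1/k.
    """
    k = sum(targets)
    if k == 0:
        raise ValueError("softmax targets need at least one positive label")
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum((t / k) * (z - log_z) for z, t in zip(logits, targets))


# Hypothetical example: four word labels, two of them positive.
logits = [2.0, -1.0, 0.5, -3.0]
targets = [1, 0, 1, 0]
print(sigmoid_bce(logits, targets), softmax_ce(logits, targets))
```

One practical difference is visible in the code: the softmax loss is undefined for an utterance with no positive labels, whereas the sigmoid loss handles that case naturally.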

  • Related tasks. There are two closely related tasks: (i) multiple instance learning (in the machine learning community) and (ii) weakly supervised object localization (in the computer vision community). I have recently learned that in computer vision there is also the task of weakly supervised object detection, which is subtly different from localization; a friend of mine, who has worked along these directions, summarized the difference as follows:
  • WSOL = in ImageNet-like images (~1 central object instance per image) where you know the image category, find the corresponding bounding box. Since these papers essentially aim to predict the ‘missing’ bounding boxes in ImageNet-like images, they use just bbox-overlap based evaluation, instead of full-fledged detector evaluation.
  • WSOD = learn a detector based on image-wise labels, in complex scenes. The output is a typical object detector, and the main evaluation is based on mAP, etc. (i.e., standard detection metrics).
  • Noisy labels. The visual teacher is inherently noisy (not everything that is visible in the image is uttered and, conversely, not all the uttered words are present in the image). Maybe we should take some explicit steps to model the label uncertainty. A popular approach seems to be to allow updates of the teacher model. Here is a quote from Li et al. (2021), which I was reading recently:

[O]nline distillation [31, 32] simultaneously trains multiple models and use their ensemble as the teacher. Our momentum distillation can be interpreted as a form of online self-distillation, where a temporal ensemble of the student model is used as the teacher. Similar ideas have been explored in semi-supervised learning [33], label noise learning [34], and very recently in contrastive learning [35].
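
As a rough illustration of how the momentum idea from the quote could carry over to the noisy-labels setting, the sketch below keeps an exponential moving average (EMA) of the student's parameters and blends the visual teacher's labels with the momentum model's predictions. This is an assumption about a possible adaptation, not something the project implements; the function names, the flat parameter lists, and the values of `momentum` and `alpha` are all illustrative.

```python
def ema_update(momentum_params, student_params, momentum=0.995):
    """Temporal ensemble: move each momentum parameter slightly toward the student."""
    return [
        momentum * m + (1.0 - momentum) * s
        for m, s in zip(momentum_params, student_params)
    ]


def mix_targets(visual_targets, momentum_probs, alpha=0.4):
    """Blend the (noisy) visual-teacher labels with the momentum model's predictions."""
    return [
        (1.0 - alpha) * v + alpha * p
        for v, p in zip(visual_targets, momentum_probs)
    ]


# Hypothetical values: two parameters and three word labels.
momentum_params = ema_update([1.0, -2.0], [0.0, 0.0], momentum=0.9)
soft_targets = mix_targets([1.0, 0.0, 1.0], [0.6, 0.3, 0.1], alpha=0.5)
print(momentum_params, soft_targets)
```

With `alpha=0` the student is trained on the visual teacher's labels alone, so the blend can be introduced gradually once the momentum model's predictions become informative.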

References

  • Li, Junnan, et al. "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation." arXiv preprint arXiv:2107.07651 (2021).
  • Mahajan, Dhruv, et al. "Exploring the limits of weakly supervised pretraining." ECCV. 2018.
