2021 08 11

Visual models versus textual labels. I have performed a quantitative and qualitative evaluation of the CNN (VGG) network and multiple CLIP variants by comparing the outputs of the visual networks to the extracted text labels; see the Google sheet for more details on the experimental set-up. The main conclusion is that the CNN has a slight edge over CLIP, but CLIP has the benefit of being able to predict new words and arguably generalizes better to other datasets. In a sense the better performance of the CNN is not surprising, since this network was trained on data (Flickr30K) that is very similar to the data we are evaluating on. However, the evaluation is tricky and the performance numbers should be taken with a grain of salt, since the labels are noisy (see the qualitative examples). At first I suggested that the visual models provide an upper bound on performance; that is in fact not true and, indeed, the speech model performs better! In retrospect this makes sense, since speech contains information that is more relevant to the uttered text than the image does. (I see now that a similar evaluation and similar conclusions appear in Herman's journal paper.)

On related tasks and evaluation. There are two closely related tasks: ⅰ. multiple instance learning (in the machine learning community) and ⅱ. weakly-supervised object localization (in the computer vision community). (However, our task is even more difficult than those problems, as it is weakly supervised in two senses: the labels are weak and the localization is unknown.) I have recently learned that in computer vision there is also the task of weakly supervised object detection, which is subtly different from localization; a friend of mine, who has worked in this area, summarized the difference as follows:

WSOL = in ImageNet-like images (~1 central object instance per image) where you know the image category, find the corresponding bounding box. Since these papers essentially aim to predict the ‘missing’ bounding boxes in imagenet-like images, they use just bbox-overlap based evaluation, instead of full-fledged detector evaluation.

WSOD = learn a detector based on image-wise labels, in complex scenes. The output is a typical object detector, and, the main evaluation is based on mAP, etc (ie. standard detection metrics)
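
To make the bbox-overlap evaluation mentioned for WSOL concrete, here is a minimal sketch in plain Python. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples and counts a prediction as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. That threshold is a common convention, not something fixed by the notes above, and the function names are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of images whose predicted box overlaps the ground truth enough.

    Assumes one predicted and one ground-truth box per image, in the same order.
    """
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


if __name__ == "__main__":
    preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
    gts = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(localization_accuracy(preds, gts))
```

WSOD, in contrast, would be scored with the usual detection metrics (mAP), which also account for multiple instances per image and for detection confidence; that is not covered by this sketch.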

Last layer and loss function. Have you considered using softmax as the last-layer activation? I have noticed that in Ankita's work you use softmax, but the rest of the papers use only sigmoids. I have also found an argument for softmax in this article:

We have also experimented with per-hashtag sigmoid outputs and binary logistic loss, but obtained significantly worse results. While counter-intuitive given the multi-label data, these findings match similar observations in [16].

— Mahajan, Dhruv, et al. "Exploring the limits of weakly supervised pretraining." Proceedings of the European conference on computer vision (ECCV). 2018.
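
The two options can be contrasted with a small sketch in plain Python. It assumes a single example with a list of per-word logits and a list of indices of the words that are present. For the softmax variant it follows the target construction described by Mahajan et al., where each of the k positive labels receives target probability 1/k; whether that is the best choice for our keyword setting is an open question. The function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, positive_indices):
    """Cross-entropy between softmax(logits) and a target that puts 1/k on each positive."""
    k = len(positive_indices)
    # Numerically stable log of the softmax normalizer (log-sum-exp).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum((logits[i] - log_z) / k for i in positive_indices)


def sigmoid_binary_cross_entropy(logits, positive_indices):
    """Sum over labels of the binary logistic loss with independent sigmoid outputs."""
    positives = set(positive_indices)
    total = 0.0
    for i, x in enumerate(logits):
        # softplus(x) = log(1 + exp(x)), computed in a numerically stable way.
        softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
        # -log(sigmoid(x)) = softplus(x) - x for positives; -log(1 - sigmoid(x)) = softplus(x) otherwise.
        total += softplus - (x if i in positives else 0.0)
    return total


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5, -3.0]
    present = [0, 2]
    print(softmax_cross_entropy(logits, present))
    print(sigmoid_binary_cross_entropy(logits, present))
```

One practical difference is that the softmax outputs compete with each other (they sum to one), whereas the sigmoid outputs are independent, so the two losses are not directly comparable in magnitude.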

Next steps. I would like to discuss possible next steps:
1. Fine-tune the CLIP models on the same data as the VGG (that is, Flickr30K and MS COCO)? This step will almost certainly improve the results (on Flickr8K), but it will make for a less realistic scenario.
2. Train in the embedding space?
3. Evaluate on a separate dataset: Places Audio, SpeechCOCO?
4. Vary the quantity of audio–image pairs when audio–text pairs are also available?
