
2022 03 23

Dan Oneață edited this page Mar 23, 2022 · 7 revisions

Base architecture. We build upon a base architecture that consists of three components:

  1. CNN encoder: similar to the one proposed by Kayode, but with more subsampling layers (that is, convolutional layers with stride two) and a smaller embedding dimension (128 versus 1000)
  2. Transformer: eight layers and four heads
  3. CLIP projection: layer normalization followed by a linear layer

| Teacher | AP (%) | AP OOV (%) |
|---|---|---|
| CLIP features | 14.1 | 8.85 |
| image tagger | 26.6 | 1.38† |
| CLIP features + image labels | ??? | ??? |

Observations:

  • We use the same architecture, but vary the teacher and its loss.
  • Results for the image tagger teacher are in the same ballpark as the JSTSP numbers (CNN-Attend).
  • † For the model supervised by the image tagger, we estimate the performance on unseen (OOV) keywords by the performance of a random classifier. For comparison, the random performance on the subset of 67 seen words is 2.91%. These results indicate that the keywords in the seen subset (i) are more frequent and (ii) have better visual grounding.

Image tagger supervision. Train the student audio network on the image labels provided by the image tagger:

python train_emb.py --config labels-image-vgg
python predict_emb.py --config labels-image-vgg --to-eval-keyword-spotting

Unseen keywords. We start by selecting a random subset of 67 words from the 1000 most common words of Kamper et al. that do not overlap with the preselected 67 keywords:

comm -23 data/flickr8k/vocab-1000.txt data/flickr8k/vocab-67-seen.txt | shuf | head -n67 | sort > data/flickr8k/vocab-67-unseen.txt
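The same selection can be sketched in Python, which makes the three steps explicit: set difference, random sample, alphabetical sort. Note that `comm` expects both input files to be sorted, whereas the set difference below does not. The function name `sample_unseen`, the seed, and the toy word lists are assumptions made for illustration; they are not part of the repository.

```python
import random


def sample_unseen(vocab, seen, k, seed=0):
    """Return k words from `vocab` that are not in `seen`, sorted alphabetically.

    Hypothetical helper mirroring `comm -23 ... | shuf | head -n k | sort`.
    """
    # Words that appear only in the full vocabulary (what `comm -23` keeps).
    candidates = sorted(set(vocab) - set(seen))
    if k > len(candidates):
        raise ValueError("not enough candidate words to sample from")
    # Fixed seed so that the sampled subset is reproducible.
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


# Toy data standing in for vocab-1000.txt and vocab-67-seen.txt.
vocab = ["ball", "beach", "dog", "grass", "sandwich", "water"]
seen = ["ball", "water"]
unseen = sample_unseen(vocab, seen, k=3)
print(unseen)
```

Fixing the seed is a design choice that `shuf` does not make by default: rerunning the shell pipeline yields a different subset each time.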

However, I ran into the following issue: I get NaNs when evaluating the keyword spotting performance. This happens because some of the words are rare and appear neither in the test set nor, in some cases, anywhere in the Flickr8k dataset (e.g., "sandwich"). This makes sense in retrospect, since the keywords are selected based on the captions of two other datasets: Flickr30K and MSCOCO. To fix this, we define a list of keywords based on their frequency in the Flickr8k test set:

python select_vocab_unseen.py

and select the top 67 words that are most likely to have visual grounding (according to my own judgement). The 67 unseen keywords are available in:

cat data/flickr8k/vocab-67-unseen-3.txt
Unseen keywords
dog
blue
playing
child
jumping
green
crowd
basketball
swing
wall
sunglasses
head
dress
table
grassy
hill
lake
wave
run
frisbee
bicycle
river
snowboarder
city
surfer
bench
dark
team
rocks
wooden
rocky
path
track
purple
ramp
skateboarder
sidewalk
boat
kids
helmet
uniform
smiles
fence
lady
couple
baseball
sign
dancing
coat
motorcycle
wet
skier
mountains
shore
rope
rider
suit
pants
ice
waves
glasses
toddler
sweater
horse
brick
sky
trees
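The frequency-based selection described above could look roughly like the sketch below. It is not the content of `select_vocab_unseen.py`, which is not shown on this page; the function name `rank_candidates`, the whitespace tokenization, and the toy captions are all assumptions. The sketch only ranks candidates, so the manual screening for visual grounding still has to follow.

```python
from collections import Counter


def rank_candidates(captions, seen, min_count=1):
    """Rank words by frequency in `captions`, skipping the seen keywords.

    Hypothetical helper: returns (word, count) pairs, most frequent first.
    """
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    return [
        (word, count)
        for word, count in counts.most_common()
        if word not in seen and count >= min_count
    ]


# Toy captions standing in for the Flickr8k test-set captions.
captions = [
    "a dog runs on the grass",
    "a brown dog jumps",
    "children play in the water",
]
seen = {"water", "children"}
ranked = rank_candidates(captions, seen)
frequent = rank_candidates(captions, seen, min_count=2)
print(ranked[:5])
```

Because every returned word occurs at least `min_count` times in the test captions, each keyword has at least one positive example, which avoids the undefined scores mentioned earlier. Function words such as "a" and "the" rank highly, which is one reason a manual pass is still needed.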

Compute performance of the CLIP features:

python predict_emb.py --to-eval-keyword-spotting --config batch-size-256 --vocab vocab-67-unseen-3

Compute performance of a random classifier:

python eval_keyword_spotting_random.py --vocab vocab-67-unseen-3
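As an illustration of what a random baseline measures, the sketch below scores utterances with random numbers and computes average precision (AP) for a single keyword. It is not the code of `eval_keyword_spotting_random.py`; the function names, the number of trials, and the synthetic labels are assumptions. Two properties are worth noting: AP is undefined (NaN) for a keyword with no positive utterances, and the chance-level AP roughly tracks the fraction of positive utterances (it lies slightly above that fraction on small test sets), so more frequent keywords get a higher random score.

```python
import math
import random


def average_precision(labels, scores):
    """AP of a ranking: mean of the precision at the rank of each positive."""
    num_pos = sum(labels)
    if num_pos == 0:
        return math.nan  # undefined: there is no positive to retrieve
    # Sort utterances by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / num_pos


def random_ap(labels, num_trials=200, seed=0):
    """Mean AP over `num_trials` rankings produced by uniform random scores."""
    rng = random.Random(seed)
    aps = [
        average_precision(labels, [rng.random() for _ in labels])
        for _ in range(num_trials)
    ]
    return sum(aps) / len(aps)


# Synthetic keyword that occurs in 20 of 1000 utterances (2%).
labels = [1] * 20 + [0] * 980
chance = random_ap(labels)
print(f"chance-level AP: {100 * chance:.2f}%")
```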

TODO.

  • Evaluate base architecture with image-soft labels
  • Evaluate performance on out-of-vocabulary data
    • Select a random subset of 67 words from the 1000 most common words; ensure they don't overlap with the preselected 67.
    • Compute performance of the CLIP features
    • Compute performance of a random classifier
  • Joint training with two teachers: CLIP features and image labels
    • Implement a multi-task approach: common trunk followed by two mapping networks, one for each of the teachers
    • Parameterize by the capacity of the mapping networks
    • Parameterize by λ to control the weight of the two losses
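One possible way to combine the two teacher losses is sketched below. The notes do not specify how λ enters the objective or which per-teacher losses are used, so everything here is an assumption: a convex combination `lam * loss_feat + (1 - lam) * loss_label`, mean squared error for the CLIP-feature head, and binary cross-entropy for the image-label head. The inputs stand for the outputs of the two mapping networks on top of the common trunk.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy from logits; targets may be soft labels in [0, 1]."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -y*log(s) - (1-y)*log(1-s), s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def joint_loss(feat_pred, feat_target, label_logits, label_targets, lam):
    """Assumed objective: lam weights the CLIP-feature loss, 1 - lam the label loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    loss_feat = mse(feat_pred, feat_target)
    loss_label = binary_cross_entropy(label_logits, label_targets)
    return lam * loss_feat + (1.0 - lam) * loss_label


# Toy outputs of the two mapping networks for one utterance.
feat_pred, feat_target = [0.2, -0.1, 0.4], [0.0, 0.0, 0.5]
label_logits, label_targets = [2.0, -1.0], [0.9, 0.1]
loss = joint_loss(feat_pred, feat_target, label_logits, label_targets, lam=0.5)
print(loss)
```

With this form, `lam=1` recovers training with the CLIP features alone and `lam=0` training with the image labels alone, which gives two reference points when sweeping λ.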
