2022 03 23
Base architecture. We build upon a base architecture that consists of three components:
- CNN encoder: similar to the one proposed by Kayode, but with more subsampling layers (that is, convolutional layers with stride two) and smaller embedding dimension (128 versus 1000)
- Transformer: eight layers and four heads
- CLIP projection: layer normalization followed by a linear layer
| teacher | AP | AP OOV |
|---|---|---|
| CLIP features | 14.1 | 8.85 |
| image tagger | 26.6 | 1.38† |
| CLIP features + image labels | ??? | ??? |
Observations:
- We use the same architecture, but vary the teacher and its loss.
- Results for the image tagger teacher are in the same ballpark as the JSTSP numbers (CNN-Attend).
- † We use random performance to estimate the performance of the model supervised by the image tagger when applied to unseen (OOV) keywords. For comparison, the random performance on the subset of 67 seen words is 2.91%. These results indicate that the keywords in the seen subset (i) are more frequent and (ii) have better visual grounding.
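
To make the link between random performance and keyword frequency concrete, here is a minimal sketch of how the AP of a random scorer could be estimated. It is not the code of `eval_keyword_spotting_random.py`; the function names, the number of trials, and the synthetic labels are assumptions made for illustration.

```python
import random


def average_precision(labels, scores):
    """AP of a ranking: mean of precision@k over the ranks k that hold a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # AP is undefined when the keyword never occurs; NaN mirrors that.
    return sum(precisions) / len(precisions) if precisions else float("nan")


def random_ap(labels, num_trials=200, seed=0):
    """Average the AP obtained with uniformly random scores over several trials."""
    rng = random.Random(seed)
    aps = [
        average_precision(labels, [rng.random() for _ in labels])
        for _ in range(num_trials)
    ]
    return sum(aps) / len(aps)


# Synthetic labels (assumption): 1000 utterances, keyword present in 3% vs. 1% of them.
frequent = [1] * 30 + [0] * 970
rare = [1] * 10 + [0] * 990
print(random_ap(frequent), random_ap(rare))
```

Under random scores the AP stays close to the fraction of positive utterances, so a higher random AP on a vocabulary mostly reflects more frequent keywords.
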
Image tagger supervision. Train the student audio network on the image labels provided by the image tagger:
python train_emb.py --config labels-image-vgg
python predict_emb.py --config labels-image-vgg --to-eval-keyword-spotting

Unseen keywords. We start by selecting a random subset of 67 words from the 1000 most common words of Kamper et al. that do not overlap with the preselected 67 keywords:
comm -23 data/flickr8k/vocab-1000.txt data/flickr8k/vocab-67-seen.txt | shuf | head -n67 | sort > data/flickr8k/vocab-67-unseen.txt

However, I ran into the following issue: I get NaNs when evaluating the keyword spotting performance.
This happens because some of the words are rare and don't appear in the test set or in the entire Flickr8k dataset (e.g., "sandwich").
This makes sense in retrospect since the keywords are selected based on the captions from two other datasets: Flickr30K and MSCOCO.
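
A minimal sketch of the failure mode, assuming the metric is averaged over keywords: a keyword with no positive utterance has an undefined AP, and a single NaN propagates to the mean. The helper names, the toy captions, and the AP values below are made up for illustration and are not taken from the repository.

```python
import math
from collections import Counter


def count_keyword_utterances(captions, vocab):
    """Number of captions in which each keyword occurs at least once."""
    counts = Counter()
    for caption in captions:
        words = set(caption.lower().split())
        counts.update(word for word in vocab if word in words)
    return {word: counts[word] for word in vocab}


def split_by_support(captions, vocab, min_count=1):
    """Split the vocabulary into keywords that can and cannot be evaluated."""
    counts = count_keyword_utterances(captions, vocab)
    kept = [word for word in vocab if counts[word] >= min_count]
    dropped = [word for word in vocab if counts[word] < min_count]
    return kept, dropped


# Toy captions standing in for the test transcripts (assumption).
captions = [
    "a dog runs on the grassy hill",
    "two kids playing near a lake",
    "a dog jumping over a fence",
]
vocab = ["dog", "lake", "sandwich"]
kept, dropped = split_by_support(captions, vocab)
print(kept, dropped)

# Made-up per-keyword APs: one undefined value turns the mean into NaN.
per_keyword_ap = {"dog": 0.5, "sandwich": float("nan")}
mean_ap = sum(per_keyword_ap.values()) / len(per_keyword_ap)
print(math.isnan(mean_ap))
```

Checking the support of every candidate keyword in the test set before evaluation catches this problem early.
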
To fix this, we define a list of candidate keywords based on their frequency in the Flickr8k test set:
python select_vocab_unseen.py

and select the top 67 words that are likely to have visual grounding (according to my own judgement). The 67 unseen keywords are available at
cat data/flickr8k/vocab-67-unseen-3.txt

Unseen keywords
dog
blue
playing
child
jumping
green
crowd
basketball
swing
wall
sunglasses
head
dress
table
grassy
hill
lake
wave
run
frisbee
bicycle
river
snowboarder
city
surfer
bench
dark
team
rocks
wooden
rocky
path
track
purple
ramp
skateboarder
sidewalk
boat
kids
helmet
uniform
smiles
fence
lady
couple
baseball
sign
dancing
coat
motorcycle
wet
skier
mountains
shore
rope
rider
suit
pants
ice
waves
glasses
toddler
sweater
horse
brick
sky
trees

Compute performance of the CLIP features:
python predict_emb.py --to-eval-keyword-spotting --config batch-size-256 --vocab vocab-67-unseen-3

Compute performance of a random classifier:
python eval_keyword_spotting_random.py --vocab vocab-67-unseen-3

TODO.
- Evaluate base architecture with image-soft labels
- Evaluate performance on out-of-vocabulary data
- Select a random subset of 67 words from the top 1000 most common words; ensure they don't overlap with the preselected 67.
- Compute performance of the CLIP features
- Compute performance of a random classifier
- Joint training with two teachers: CLIP features and image labels
- Implement a multi-task approach: common trunk followed by two mapping networks, one for each of the teachers
- Parameterize by mapping networks capacity
- Parameterize by λ to control the weight of the two losses
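
One possible way to weight the two losses with λ is sketched below. The notes do not fix the form of the combination or the per-teacher losses, so the convex combination, the mean squared error for the CLIP features, and the binary cross-entropy for the image labels are all assumptions, as are the function names.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and teacher feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy on logits; targets may be soft labels in [0, 1]."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z * y + log(1 + exp(-|z|)).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def joint_loss(feat_pred, feat_target, label_logits, label_targets, lam):
    """Assumed combination: lam * L_clip + (1 - lam) * L_labels."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie in [0, 1]")
    loss_clip = mse(feat_pred, feat_target)
    loss_labels = binary_cross_entropy(label_logits, label_targets)
    return lam * loss_clip + (1.0 - lam) * loss_labels


# Toy outputs of the two mapping networks for a single utterance (made-up values).
print(joint_loss([1.0, 2.0], [1.0, 0.0], [0.0], [1.0], lam=0.5))
```

With this form, λ = 1 recovers training with the CLIP features only and λ = 0 recovers training with the image labels only, which makes the two single-teacher rows of the table the end points of the sweep.
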