2021 10 06
Task: for each query keyword, find the most common localised words. I generated the results below with the following steps:
- Given a keyword, sort utterances in decreasing order of their detection score (i.e., per-utterance score);
- Select the localised word based on the maximum of the localisation scores (what we refer to by α in the paper);
- Show the top 5 most common words.
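The steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the input layout (a list of dicts with `score`, `words` and `alphas`) is hypothetical, it assumes the localisation scores α have already been pooled to one value per aligned word, and the cut-off of 20 utterances is a guess based on the fact that the counts in the results never exceed 20.

```python
from collections import Counter


def most_common_localised_words(utterances, top_utterances=20, top_words=5):
    """Count which word receives the maximum localisation score.

    utterances: list of dicts (hypothetical layout) with
        "score"  -- per-utterance detection score for the keyword
        "words"  -- words of the utterance, e.g. from a forced alignment
        "alphas" -- one localisation score per word (same length as "words")
    """
    # Step 1: sort utterances by decreasing detection score.
    ranked = sorted(utterances, key=lambda u: u["score"], reverse=True)

    counts = Counter()
    for utt in ranked[:top_utterances]:
        if not utt["words"]:
            # Assumption: an utterance without aligned words is counted as None.
            counts[None] += 1
            continue
        # Step 2: pick the word with the maximum localisation score.
        best = max(range(len(utt["words"])), key=lambda i: utt["alphas"][i])
        counts[utt["words"][best]] += 1

    # Step 3: keep the most common words.
    return counts.most_common(top_words)


if __name__ == "__main__":
    # Toy input, made up purely to exercise the function.
    toy = [
        {"score": 0.9, "words": ["a", "dog", "runs"], "alphas": [0.1, 0.8, 0.3]},
        {"score": 0.8, "words": ["two", "dogs"], "alphas": [0.7, 0.2]},
        {"score": 0.7, "words": ["the", "dog"], "alphas": [0.2, 0.5]},
    ]
    print(most_common_localised_words(toy))
```

Ties in the maximum are resolved in favour of the earliest word here; the original script may handle them differently.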
Some remarks:
- semantic clusters enforced by the bias in the data and the lack of visual grounding of some of the words (air → snowboarder, bike; ocean → surfer; ball → soccer; swimming → pool; face → rock, climbing);
- I find the performance for car surprising: since car is a visually grounded word, I would have expected much better localisation and detection, but the word seems to be confused with race and red;
- dogs are often associated with numerals (two, three), I assume because we distinguish between singular and plural words (dog is treated as a negative for dogs);
- very accurate results for some colors (black, white, yellow), but interesting confusions for pink (pink → girl) and orange (orange → red).
AIR → SNOWBOARDER ( 8) · BIKE ( 2) · SNOWBOARD ( 1) · <SPOKEN_NOISE> ( 1) · A ( 1)
BABY → BABY (16) · BABIES ( 2) · TODDLER ( 1) · BATHING ( 1)
BALL → SOCCER ( 8) · BALL ( 6) · BALLS ( 1) · A ( 1) · YELLOW ( 1)
BEACH → BEACH (15) · THE ( 2) · A ( 1) · BENCH ( 1) · FRISBEE ( 1)
BIKE → BIKE ( 7) · BIKER ( 4) · BICYCLE ( 4) · BIKES ( 2) · MOTORCYCLE ( 1)
BLACK → BLACK (19) · DOG ( 1)
BOY → BOY (11) · YOUNG ( 2) · IN ( 1) · COWBOYS ( 1) · DRAGS ( 1)
BROWN → BROWN (15) · DOGS ( 2) · YELLOW ( 1) · DOG ( 1) · A ( 1)
BUILDING → BUILDING (10) · BUILDINGS ( 2) · PEOPLE ( 2) · WALL ( 2) · WALKS ( 1)
CAMERA → CAMERA (10) · SUNGLASSES ( 1) · WOMAN ( 1) · SWIMSUIT ( 1) · SMILING ( 1)
CAR → RACE ( 6) · RED ( 5) · RACETRACK ( 2) · RAISES ( 2) · CAR ( 2)
CARRYING → BLACK ( 5) · DOG ( 5) · A ( 3) · MOUTH ( 3) · DOGS ( 1)
CHILDREN → CHILDREN (10) · PIRATES ( 1) · BOY ( 1) · SHOULDERS ( 1) · KIDS ( 1)
CLIMBING → CLIMBING ( 8) · CLIMBS ( 3) · ROCK ( 2) · CLIMBER ( 2) · MOUNTAIN ( 1)
DIRT → BIKE (11) · BIKER ( 5) · BIKES ( 2) · MOTORCYCLE ( 1) · DIRT ( 1)
DOGS → DOGS ( 6) · THREE ( 4) · TWO ( 4) · FIELD ( 1) · PUPPY ( 1)
FACE → ROCK ( 7) · CLIMBING ( 5) · ROCKS ( 1) · BLONDE ( 1) · GIRL ( 1)
FIELD → FIELD ( 6) · GRASS ( 4) · SOCCER ( 3) · GRASSY ( 2) · HILL ( 1)
FOOTBALL → FOOTBALL (12) · PLAYER ( 6) · PLAYERS ( 2)
GRASS → GRASS (15) · FIELD ( 2) · DRESS ( 1) · HILL ( 1) · GRASSY ( 1)
HAIR → GIRL ( 4) · SMILING ( 2) · HOLDING ( 2) · BLONDE ( 1) · A ( 1)
HAT → MAN ( 6) · HAT ( 2) · None ( 2) · PINK ( 1) · AND ( 1)
HOLDING → MOUTH ( 3) · BOY ( 3) · None ( 2) · CASTING ( 1) · BLACK ( 1)
JACKET → SNOW ( 5) · RED ( 3) · MAN ( 2) · <SPOKEN_NOISE> ( 1) · SMOKING ( 1)
JUMPS → JUMPS ( 8) · AIR ( 4) · JUMPING ( 3) · BRICK ( 1) · JUMP ( 1)
LARGE → ROCK ( 7) · WATER ( 4) · ROCKY ( 2) · ROCKS ( 2) · LARGE ( 1)
LITTLE → GIRL ( 4) · LITTLE ( 2) · BOY ( 1) · THE ( 1) · IS ( 1)
MOUNTAIN → MOUNTAIN ( 8) · MOUNTAINS ( 7) · HORSE ( 1) · MOUNTAINEER ( 1) · SNOWY ( 1)
MOUTH → BLACK ( 6) · DOG ( 6) · A ( 2) · THE ( 1) · GROUND ( 1)
OCEAN → SURFER (13) · SURFBOARD ( 3) · WAVE ( 2) · SURFING ( 1) · SURFACE ( 1)
ORANGE → RED (12) · SMALL ( 2) · RUNNING ( 1) · DARK ( 1) · PLAYER ( 1)
PARK → SKATEBOARD ( 4) · SWING ( 3) · BOY ( 2) · <SPOKEN_NOISE> ( 2) · PLAYING ( 1)
PINK → GIRL (11) · GIRLS ( 3) · LITTLE ( 2) · YOUNG ( 1) · WELL ( 1)
POOL → POOL (20)
RACE → RACE (10) · RACING ( 4) · RACES ( 1) · RED ( 1) · AROUND ( 1)
RED → RED (14) · CLIMBING ( 2) · FIRE ( 1) · A ( 1) · DROPPING ( 1)
RIDES → BIKE ( 5) · RIDING ( 3) · WAVES ( 2) · WAVE ( 2) · SURFER ( 1)
RIDING → SURFER ( 6) · SURFBOARD ( 4) · SURFING ( 4) · WAVE ( 3) · WAVES ( 2)
ROAD → RACE ( 5) · RIDER ( 3) · BICYCLE ( 3) · RED ( 2) · RACETRACK ( 1)
ROCK → ROCK ( 7) · MOUNTAIN ( 2) · CLIMBING ( 2) · A ( 2) · ROCKY ( 2)
RUNNING → BLACK ( 7) · RUNNING ( 5) · DOG ( 3) · A ( 1) · WHITE ( 1)
SAND → BEACH (10) · SAND ( 3) · PEOPLE ( 1) · SHORE ( 1) · ROCKY ( 1)
SHIRT → SHIRT ( 4) · ROCK ( 3) · BOY ( 2) · BASEBALL ( 2) · BLUE ( 1)
SITS → SITTING (13) · SITS ( 3) · SIT ( 2) · SIPS ( 1) · READING ( 1)
SITTING → SITTING (10) · SITS ( 2) · SIT ( 2) · SLIDING ( 1) · TWO ( 1)
SKATEBOARD → SKATEBOARD ( 8) · <SPOKEN_NOISE> ( 7) · SKATEBOARDING ( 3) · SKATES ( 1) · SKATING ( 1)
SMALL → BABY ( 4) · DOG ( 3) · BUBBLES ( 1) · BOAT ( 1) · YOUNG ( 1)
SMILING → SMILING ( 5) · HOLDING ( 2) · SWINGING ( 1) · SMILES ( 1) · GREEN ( 1)
SNOW → SNOW (10) · <SPOKEN_NOISE> ( 3) · SNOWBOARDER ( 2) · SNOWY ( 1) · SNOWBOARD ( 1)
SNOWY → SNOW ( 8) · MOUNTAIN ( 4) · SNOWY ( 4) · <SPOKEN_NOISE> ( 1) · SNOWBOARDER ( 1)
SOCCER → SOCCER (20)
STANDS → STANDING ( 5) · STANDS ( 3) · MAN ( 2) · ROCKY ( 2) · EXTENDED ( 1)
STICK → HOCKEY ( 5) · ROCK ( 5) · ROCKY ( 3) · BROWN ( 2) · TALKING ( 1)
STREET → STREET (15) · WALKING ( 1) · ROAD ( 1) · PROTESTERS ( 1) · STREAKED ( 1)
SWIMMING → POOL (18) · SWIMMING ( 1) · SWIM ( 1)
TENNIS → PLAYER ( 3) · PLAYING ( 3) · RACKET ( 2) · PLAYERS ( 1) · BASKETBALL ( 1)
THREE → THREE ( 3) · CHILDREN ( 3) · TWO ( 3) · PLAY ( 1) · TREE ( 1)
TOP → SURFER ( 8) · SURFBOARD ( 4) · ROCK ( 4) · SURFING ( 2) · ROCKS ( 1)
TOY → DOG ( 8) · BLACK ( 5) · MOUTH ( 3) · None ( 1) · GRASS ( 1)
TREE → TREE ( 9) · TREES ( 7) · TRAINING ( 1) · <SPOKEN_NOISE> ( 1) · CLIMBS ( 1)
WALKS → WALKS ( 4) · STREET ( 4) · WALKING ( 3) · SIDEWALK ( 2) · STANDING ( 2)
WATER → WATER ( 9) · RIVER ( 5) · SWIMMING ( 2) · PEOPLE ( 1) · FISHERMAN ( 1)
WEARING → MAN ( 2) · BLUE ( 2) · RED ( 1) · CAMERA ( 1) · YOUNG ( 1)
WHITE → WHITE (18) · TWO ( 1) · DOG ( 1)
WOMEN → WOMAN ( 4) · WOMEN ( 3) · TWO ( 3) · SCARVES ( 1) · STREET ( 1)
YELLOW → YELLOW (17) · BIKES ( 1) · BIKE ( 1) · SLIGHTLY ( 1)
YOUNG → BOY ( 9) · GIRL ( 3) · YOUNG ( 2) · LITTLE ( 1) · A ( 1)
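To scan the listing above for confusions programmatically, the lines can be parsed back into (word, count) pairs. The sketch below assumes the plain-text layout used above (`KEYWORD → WORD (n) · WORD (n) · …`); the function names are hypothetical and not part of the original script.

```python
import re

# One entry looks like "SNOWBOARDER ( 8)" or "POOL (20)".
ENTRY = re.compile(r"(.+?)\s*\(\s*(\d+)\)$")


def parse_result_line(line):
    """Split 'KEYWORD → W1 (n1) · W2 (n2)' into (keyword, [(word, count), ...])."""
    keyword, _, rest = line.partition("→")
    entries = []
    for item in rest.split("·"):
        match = ENTRY.match(item.strip())
        if match:
            entries.append((match.group(1), int(match.group(2))))
    return keyword.strip(), entries


def top_word_matches(line):
    """True if the most frequent localised word is the keyword itself."""
    keyword, entries = parse_result_line(line)
    return bool(entries) and entries[0][0] == keyword


if __name__ == "__main__":
    for line in [
        "CAR → RACE ( 6) · RED ( 5) · RACETRACK ( 2) · RAISES ( 2) · CAR ( 2)",
        "POOL → POOL (20)",
    ]:
        print(parse_result_line(line)[0], top_word_matches(line))
```

Note that this is an exact string comparison, so morphological variants (for example SITTING for the keyword SITS) count as mismatches.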
I have prepared a small script to display the results as shown here. Below is a subset of the figures; I have focused on the erroneous cases, since I found those more interesting:
air gets a high score, but it is nearly surpassed by snowboarder
jumps seems to (reasonably) trigger large scores
localisation is slightly off, but this might be due to imprecise forced alignment
boy is included in cowboys and is reasonably localised
correct localisation
bikes is treated as negative
biker is also not a bike
red car?
bikes is treated as negative
race is preferred even if car appears in the utterance!
correct localisation, but interestingly the localisation score for swimming is rather low (compare to pool in next sample)
pool is often classified as swimming
Abandoned? Qualitative results based on sorting the utterances by the largest localisation score:
- "ball":
  - M · 2370481277_a3085614c9_0: soccer
  - M · 2021613437_d99731f986_0: basketball
  - – · 3015863181_92ff43f4d8_1: bowling gets a high score
- "bike":
  - M · 2764178773_d63b502812_1: bicycle
  - M · 2229179070_dc8ea8582e_3: by
I believe it would be interesting to see how tightly the performance of the speech network is coupled to that of the visual network.
For this we could draw a scatter plot in which each keyword is a point described by its visual and speech performance (for example, on keyword spotting).
We could also draw the diagonal line: points above the diagonal would indicate keywords for which the speech network outperforms the visual teacher network.
Here is a sample plot, but it should be updated with the latest data (and also it would be instructive to label some of the points with the keyword):
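An updated version of such a plot, with labelled points, could be produced along the following lines. This is only a sketch: it assumes matplotlib is available, that both scores lie in [0, 1] (for instance an average precision per keyword), and that they are stored in a hypothetical dict mapping keyword to (visual score, speech score). The numbers in the usage example are placeholders, not measured results.

```python
def split_by_diagonal(performance):
    """Separate keywords lying above the diagonal (speech > visual) from the rest.

    performance: dict keyword -> (visual_score, speech_score); hypothetical layout.
    """
    above = sorted(k for k, (visual, speech) in performance.items() if speech > visual)
    below = sorted(k for k, (visual, speech) in performance.items() if speech <= visual)
    return above, below


def plot_speech_vs_visual(performance, path="speech_vs_visual.png"):
    """Scatter plot of speech vs. visual performance with labelled points."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 5))
    for keyword, (visual, speech) in performance.items():
        ax.scatter(visual, speech, color="C0")
        ax.annotate(keyword, (visual, speech),
                    textcoords="offset points", xytext=(4, 4))
    # Diagonal: points above it are keywords where speech beats the visual teacher.
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_xlabel("visual network performance")
    ax.set_ylabel("speech network performance")
    fig.savefig(path)
    plt.close(fig)


if __name__ == "__main__":
    # Placeholder values for illustration only; replace with the real scores.
    placeholder = {"pool": (0.9, 0.8), "car": (0.5, 0.6)}
    print(split_by_diagonal(placeholder))
    plot_speech_vs_visual(placeholder)
```

Keeping the above/below split in a separate function makes it easy to list the keywords on each side of the diagonal without rendering the figure.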
