2021 10 06

Dan Oneață edited this page Oct 5, 2021 · 11 revisions

Most common confusions

Task: for each query keyword, find the most common localised words. To generate the results below, I took the following steps:

  1. Given a keyword, sort utterances in decreasing order of their detection score (i.e., per-utterance score);
  2. Select the localised word based on the maximum of the localisation scores (what we refer to by α in the paper);
  3. Show the top 5 most common words.
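
The three steps above can be sketched as follows. This is a minimal illustration, not the script used to produce the table: the `Utterance` container, its field names, and the assumption that localisation scores are already available per word (rather than per frame, mapped to words through the forced alignment) are all hypothetical. The cut-off of 20 utterances is likewise a guess, based on rows in the table such as POOL and SOCCER whose counts sum to exactly 20.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Utterance(NamedTuple):
    """Hypothetical container; the field names are assumptions."""
    detection_score: float            # per-utterance score for the query keyword
    words: List[str]                  # transcript words, e.g. from forced alignment
    localisation_scores: List[float]  # one α value per word, for the query keyword


def localised_word(utt: Utterance) -> Optional[str]:
    """Step 2: return the word with the maximum localisation score."""
    if not utt.words:
        return None
    best = max(range(len(utt.words)), key=lambda i: utt.localisation_scores[i])
    return utt.words[best]


def most_common_localised(
    utterances: List[Utterance],
    num_utterances: int = 20,  # assumed cut-off, not stated in the notes
    num_words: int = 5,
) -> List[Tuple[Optional[str], int]]:
    # Step 1: sort utterances by decreasing detection score and keep the top ones.
    ranked = sorted(utterances, key=lambda u: u.detection_score, reverse=True)
    top = ranked[:num_utterances]
    # Step 2: pick the localised word in each selected utterance.
    words = [localised_word(u) for u in top]
    # Step 3: count the localised words and keep the most common ones.
    return Counter(words).most_common(num_words)
```

Calling `most_common_localised` once per keyword, with that keyword's scores, would give one row of the table below.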

Some remarks:

  • semantic clusters enforced by the bias in the data and the lack of visual grounding of some of the words (air → snowboarder, bike; ocean → surfer; ball → soccer; swimming → pool; face → rock, climbing);
  • I find the performance for car surprising: given that car is a visually grounded word, I would have expected much better localisation and detection, but the word seems to be confused with race and red;
  • dogs are often associated with numerals (two, three), I assume because we distinguish between singular and plural words (dog is treated as a negative for dogs);
  • very accurate results for some colors (black, white, yellow), but interesting confusions for pink (pink → girl) and orange (orange → red).
AIR        → SNOWBOARDER     ( 8) · BIKE            ( 2) · SNOWBOARD       ( 1) · <SPOKEN_NOISE>  ( 1) · A               ( 1)
BABY       → BABY            (16) · BABIES          ( 2) · TODDLER         ( 1) · BATHING         ( 1)
BALL       → SOCCER          ( 8) · BALL            ( 6) · BALLS           ( 1) · A               ( 1) · YELLOW          ( 1)
BEACH      → BEACH           (15) · THE             ( 2) · A               ( 1) · BENCH           ( 1) · FRISBEE         ( 1)
BIKE       → BIKE            ( 7) · BIKER           ( 4) · BICYCLE         ( 4) · BIKES           ( 2) · MOTORCYCLE      ( 1)
BLACK      → BLACK           (19) · DOG             ( 1)
BOY        → BOY             (11) · YOUNG           ( 2) · IN              ( 1) · COWBOYS         ( 1) · DRAGS           ( 1)
BROWN      → BROWN           (15) · DOGS            ( 2) · YELLOW          ( 1) · DOG             ( 1) · A               ( 1)
BUILDING   → BUILDING        (10) · BUILDINGS       ( 2) · PEOPLE          ( 2) · WALL            ( 2) · WALKS           ( 1)
CAMERA     → CAMERA          (10) · SUNGLASSES      ( 1) · WOMAN           ( 1) · SWIMSUIT        ( 1) · SMILING         ( 1)
CAR        → RACE            ( 6) · RED             ( 5) · RACETRACK       ( 2) · RAISES          ( 2) · CAR             ( 2)
CARRYING   → BLACK           ( 5) · DOG             ( 5) · A               ( 3) · MOUTH           ( 3) · DOGS            ( 1)
CHILDREN   → CHILDREN        (10) · PIRATES         ( 1) · BOY             ( 1) · SHOULDERS       ( 1) · KIDS            ( 1)
CLIMBING   → CLIMBING        ( 8) · CLIMBS          ( 3) · ROCK            ( 2) · CLIMBER         ( 2) · MOUNTAIN        ( 1)
DIRT       → BIKE            (11) · BIKER           ( 5) · BIKES           ( 2) · MOTORCYCLE      ( 1) · DIRT            ( 1)
DOGS       → DOGS            ( 6) · THREE           ( 4) · TWO             ( 4) · FIELD           ( 1) · PUPPY           ( 1)
FACE       → ROCK            ( 7) · CLIMBING        ( 5) · ROCKS           ( 1) · BLONDE          ( 1) · GIRL            ( 1)
FIELD      → FIELD           ( 6) · GRASS           ( 4) · SOCCER          ( 3) · GRASSY          ( 2) · HILL            ( 1)
FOOTBALL   → FOOTBALL        (12) · PLAYER          ( 6) · PLAYERS         ( 2)
GRASS      → GRASS           (15) · FIELD           ( 2) · DRESS           ( 1) · HILL            ( 1) · GRASSY          ( 1)
HAIR       → GIRL            ( 4) · SMILING         ( 2) · HOLDING         ( 2) · BLONDE          ( 1) · A               ( 1)
HAT        → MAN             ( 6) · HAT             ( 2) · None            ( 2) · PINK            ( 1) · AND             ( 1)
HOLDING    → MOUTH           ( 3) · BOY             ( 3) · None            ( 2) · CASTING         ( 1) · BLACK           ( 1)
JACKET     → SNOW            ( 5) · RED             ( 3) · MAN             ( 2) · <SPOKEN_NOISE>  ( 1) · SMOKING         ( 1)
JUMPS      → JUMPS           ( 8) · AIR             ( 4) · JUMPING         ( 3) · BRICK           ( 1) · JUMP            ( 1)
LARGE      → ROCK            ( 7) · WATER           ( 4) · ROCKY           ( 2) · ROCKS           ( 2) · LARGE           ( 1)
LITTLE     → GIRL            ( 4) · LITTLE          ( 2) · BOY             ( 1) · THE             ( 1) · IS              ( 1)
MOUNTAIN   → MOUNTAIN        ( 8) · MOUNTAINS       ( 7) · HORSE           ( 1) · MOUNTAINEER     ( 1) · SNOWY           ( 1)
MOUTH      → BLACK           ( 6) · DOG             ( 6) · A               ( 2) · THE             ( 1) · GROUND          ( 1)
OCEAN      → SURFER          (13) · SURFBOARD       ( 3) · WAVE            ( 2) · SURFING         ( 1) · SURFACE         ( 1)
ORANGE     → RED             (12) · SMALL           ( 2) · RUNNING         ( 1) · DARK            ( 1) · PLAYER          ( 1)
PARK       → SKATEBOARD      ( 4) · SWING           ( 3) · BOY             ( 2) · <SPOKEN_NOISE>  ( 2) · PLAYING         ( 1)
PINK       → GIRL            (11) · GIRLS           ( 3) · LITTLE          ( 2) · YOUNG           ( 1) · WELL            ( 1)
POOL       → POOL            (20)
RACE       → RACE            (10) · RACING          ( 4) · RACES           ( 1) · RED             ( 1) · AROUND          ( 1)
RED        → RED             (14) · CLIMBING        ( 2) · FIRE            ( 1) · A               ( 1) · DROPPING        ( 1)
RIDES      → BIKE            ( 5) · RIDING          ( 3) · WAVES           ( 2) · WAVE            ( 2) · SURFER          ( 1)
RIDING     → SURFER          ( 6) · SURFBOARD       ( 4) · SURFING         ( 4) · WAVE            ( 3) · WAVES           ( 2)
ROAD       → RACE            ( 5) · RIDER           ( 3) · BICYCLE         ( 3) · RED             ( 2) · RACETRACK       ( 1)
ROCK       → ROCK            ( 7) · MOUNTAIN        ( 2) · CLIMBING        ( 2) · A               ( 2) · ROCKY           ( 2)
RUNNING    → BLACK           ( 7) · RUNNING         ( 5) · DOG             ( 3) · A               ( 1) · WHITE           ( 1)
SAND       → BEACH           (10) · SAND            ( 3) · PEOPLE          ( 1) · SHORE           ( 1) · ROCKY           ( 1)
SHIRT      → SHIRT           ( 4) · ROCK            ( 3) · BOY             ( 2) · BASEBALL        ( 2) · BLUE            ( 1)
SITS       → SITTING         (13) · SITS            ( 3) · SIT             ( 2) · SIPS            ( 1) · READING         ( 1)
SITTING    → SITTING         (10) · SITS            ( 2) · SIT             ( 2) · SLIDING         ( 1) · TWO             ( 1)
SKATEBOARD → SKATEBOARD      ( 8) · <SPOKEN_NOISE>  ( 7) · SKATEBOARDING   ( 3) · SKATES          ( 1) · SKATING         ( 1)
SMALL      → BABY            ( 4) · DOG             ( 3) · BUBBLES         ( 1) · BOAT            ( 1) · YOUNG           ( 1)
SMILING    → SMILING         ( 5) · HOLDING         ( 2) · SWINGING        ( 1) · SMILES          ( 1) · GREEN           ( 1)
SNOW       → SNOW            (10) · <SPOKEN_NOISE>  ( 3) · SNOWBOARDER     ( 2) · SNOWY           ( 1) · SNOWBOARD       ( 1)
SNOWY      → SNOW            ( 8) · MOUNTAIN        ( 4) · SNOWY           ( 4) · <SPOKEN_NOISE>  ( 1) · SNOWBOARDER     ( 1)
SOCCER     → SOCCER          (20)
STANDS     → STANDING        ( 5) · STANDS          ( 3) · MAN             ( 2) · ROCKY           ( 2) · EXTENDED        ( 1)
STICK      → HOCKEY          ( 5) · ROCK            ( 5) · ROCKY           ( 3) · BROWN           ( 2) · TALKING         ( 1)
STREET     → STREET          (15) · WALKING         ( 1) · ROAD            ( 1) · PROTESTERS      ( 1) · STREAKED        ( 1)
SWIMMING   → POOL            (18) · SWIMMING        ( 1) · SWIM            ( 1)
TENNIS     → PLAYER          ( 3) · PLAYING         ( 3) · RACKET          ( 2) · PLAYERS         ( 1) · BASKETBALL      ( 1)
THREE      → THREE           ( 3) · CHILDREN        ( 3) · TWO             ( 3) · PLAY            ( 1) · TREE            ( 1)
TOP        → SURFER          ( 8) · SURFBOARD       ( 4) · ROCK            ( 4) · SURFING         ( 2) · ROCKS           ( 1)
TOY        → DOG             ( 8) · BLACK           ( 5) · MOUTH           ( 3) · None            ( 1) · GRASS           ( 1)
TREE       → TREE            ( 9) · TREES           ( 7) · TRAINING        ( 1) · <SPOKEN_NOISE>  ( 1) · CLIMBS          ( 1)
WALKS      → WALKS           ( 4) · STREET          ( 4) · WALKING         ( 3) · SIDEWALK        ( 2) · STANDING        ( 2)
WATER      → WATER           ( 9) · RIVER           ( 5) · SWIMMING        ( 2) · PEOPLE          ( 1) · FISHERMAN       ( 1)
WEARING    → MAN             ( 2) · BLUE            ( 2) · RED             ( 1) · CAMERA          ( 1) · YOUNG           ( 1)
WHITE      → WHITE           (18) · TWO             ( 1) · DOG             ( 1)
WOMEN      → WOMAN           ( 4) · WOMEN           ( 3) · TWO             ( 3) · SCARVES         ( 1) · STREET          ( 1)
YELLOW     → YELLOW          (17) · BIKES           ( 1) · BIKE            ( 1) · SLIGHTLY        ( 1)
YOUNG      → BOY             ( 9) · GIRL            ( 3) · YOUNG           ( 2) · LITTLE          ( 1) · A               ( 1)

Selected qualitative examples

I have prepared a small script to show the results, as here. Below is a subset of figures that I found interesting (I have focused on the erroneous cases, since I found those more interesting):

air


air gets a high score, but it is nearly surpassed by snowboarder


jumps seems to (reasonably) trigger large scores

boy


localisation is slightly off, but this might be due to imprecise forced alignment


boy is included in cowboys and is reasonably localised

bike


correct localisation


bikes is treated as negative


biker is also not a bike

car


red car?


bikes is treated as negative


race is preferred even if car appears in the utterance!

swimming


correct localisation, but interestingly the localisation score for swimming is rather low (compare to pool in next sample)


pool is often classified as swimming


Abandoned? Qualitative results based on sorting the utterances by the largest localisation score:

  • "ball":
    • M · 2370481277_a3085614c9_0: soccer
    • M · 2021613437_d99731f986_0: basketball
    • – · 3015863181_92ff43f4d8_1: bowling gets a high score
  • "bike":
    • M · 2764178773_d63b502812_1: bicycle
    • M · 2229179070_dc8ea8582e_3: by

Visual network versus speech network

I believe it would be interesting to see how tightly the performance of the speech network is coupled to that of the visual network. For this we could draw a scatter plot in which each keyword is described by its visual and speech performance (for example, on keyword spotting). We could also draw the diagonal line: points above the diagonal indicate keywords for which the speech network outperforms the visual teacher network. Here is a sample plot, but it should be updated with the latest data (and it would also be instructive to label some of the points with their keywords):
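
A plot of this kind could be produced roughly as in the sketch below. It assumes the per-keyword scores are available as two dictionaries mapping keyword to score, with the visual score on the x-axis and the speech score on the y-axis. The function names, the `to_label` argument and the output path are hypothetical, and matplotlib is an assumed choice of plotting library.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_diagonal(
    visual: Dict[str, float], speech: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Split the shared keywords into those above the diagonal
    (speech score > visual score) and the rest."""
    shared = sorted(set(visual) & set(speech))
    above = [k for k in shared if speech[k] > visual[k]]
    rest = [k for k in shared if speech[k] <= visual[k]]
    return above, rest


def plot_visual_vs_speech(
    visual: Dict[str, float],
    speech: Dict[str, float],
    to_label: Iterable[str] = (),
    path: str = "visual-vs-speech.png",  # hypothetical output file
) -> None:
    # Imported lazily so that split_by_diagonal works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to file; no display needed
    import matplotlib.pyplot as plt

    shared = sorted(set(visual) & set(speech))
    if not shared:
        return
    xs = [visual[k] for k in shared]
    ys = [speech[k] for k in shared]

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(xs, ys)
    # Diagonal: points above it are keywords where speech beats visual.
    lo, hi = min(xs + ys), max(xs + ys)
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")
    # Label a chosen subset of points with their keyword.
    for k in to_label:
        if k in visual and k in speech:
            ax.annotate(k, (visual[k], speech[k]))
    ax.set_xlabel("visual network performance")
    ax.set_ylabel("speech network performance")
    fig.savefig(path)
    plt.close(fig)
```

`split_by_diagonal` gives the same information as the plot in list form, which makes it easy to choose the keywords to pass as `to_label`.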
