
Commonvoice results misleading, complete overlap of train/dev/test sentences #2141

@bmilde

I was quite surprised to see how low the WERs are for the new Common Voice corpus: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/RESULTS (around 4% with the TDNN)

Unfortunately, these results seem to be bogus: there is a near-complete overlap between the train, dev and test sentences, and the LM is trained only on the corpus train sentences (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/local/prepare_lm.sh). To make matters worse, there aren't that many unique sentences in the corpus:

unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
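
For anyone who wants to check counts like these, here is a rough sketch in Python. It is not the script used for the numbers above. It assumes each split CSV has a header row with a `text` column holding the transcript (an assumption based on the rows quoted in this issue), and the function names `unique_sentences` and `overlap_count` are made up for illustration.

```python
import csv

def unique_sentences(lines, text_column="text"):
    """Return the set of distinct transcripts in one split.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    `text_column` is an assumed header name; adjust it if the CSV differs.
    """
    reader = csv.DictReader(lines)
    return {row[text_column].strip() for row in reader}

def overlap_count(train, other):
    """Number of distinct sentences that appear in both sets."""
    return len(train & other)

# Possible usage (file names as they appear in the corpus download):
#   with open("cv-valid-train.csv", newline="", encoding="utf-8") as f:
#       train = unique_sentences(f)
#   with open("cv-valid-dev.csv", newline="", encoding="utf-8") as f:
#       dev = unique_sentences(f)
#   print(len(train), len(dev), overlap_count(train, dev))
```

Comparing sets of distinct transcripts, rather than raw rows, keeps the many repeated recordings of the same sentence from inflating the counts.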

This can also be easily verified, e.g. by grepping for "sadly my dream of becoming a squirrel whisperer may never happen" in the original corpus CSVs:

cv-valid-dev.csv:cv-valid-dev/sample-000070.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,seventies,male,us,
cv-valid-dev.csv:cv-valid-dev/sample-000299.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,2,twenties,female,canada,
cv-valid-dev.csv:cv-valid-dev/sample-002458.mp3,sadly my dream of becoming a squirrel whisperer may never happen,9,1,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003264.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003656.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
grep: cv-valid-test: Is a directory
cv-valid-test.csv:cv-valid-test/sample-000221.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,thirties,male,canada,
cv-valid-test.csv:cv-valid-test/sample-001576.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
cv-valid-test.csv:cv-valid-test/sample-002831.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003705.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003789.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
grep: cv-valid-train: Is a directory
cv-valid-train.csv:cv-valid-train/sample-000324.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,2,,,,
cv-valid-train.csv:cv-valid-train/sample-000373.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,1,,,,
cv-valid-train.csv:cv-valid-train/sample-000382.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-001026.mp3,sadly my dream of becoming a squirrel whisperer may never happen,4,0,,,,
cv-valid-train.csv:cv-valid-train/sample-003106.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,fourties,female,england,
cv-valid-train.csv:cv-valid-train/sample-004591.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-005048.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-007144.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
+ 100s more...
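
The same check can be done for every sentence at once instead of one grep at a time. The sketch below counts how often each transcript is repeated within a split. As before, the `text` column name is an assumption and `sentence_counts` is a hypothetical helper, not something from the Kaldi recipe.

```python
import csv
from collections import Counter

def sentence_counts(lines, text_column="text"):
    """Count how many recordings share each transcript in one split.

    `lines` is any iterable of CSV lines (e.g. an open file object);
    `text_column` is an assumed header name.
    """
    reader = csv.DictReader(lines)
    return Counter(row[text_column].strip() for row in reader)

# Possible usage:
#   with open("cv-valid-train.csv", newline="", encoding="utf-8") as f:
#       counts = sentence_counts(f)
#   for sentence, n in counts.most_common(10):
#       print(n, sentence)
```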

This is pretty much a terrible design for a speech corpus. To get somewhat more realistic results, I suggest excluding the train sentences from the LM completely. I'm currently rerunning the scripts with a Cantab LM that does not include the train sentences and will report back when I have the results.
