
Commonvoice results misleading, complete overlap of train/dev/test sentences #2141

Closed
bmilde opened this issue Jan 10, 2018 · 18 comments

@bmilde

bmilde commented Jan 10, 2018

I was quite surprised to see how low the WERs are for the new Common Voice corpus: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/RESULTS (4ish% TDNN)

Unfortunately, these results seem to be bogus, because there is a near-complete overlap of train/dev/test sentences and the LM is trained only on the corpus's train sentences (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/local/prepare_lm.sh). To make matters worse, there aren't really that many unique sentences in the corpus:

unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355

This can also be easily verified by, e.g., grepping for "sadly my dream of becoming a squirrel whisperer may never happen" in the original corpus CSVs:

cv-valid-dev.csv:cv-valid-dev/sample-000070.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,seventies,male,us,
cv-valid-dev.csv:cv-valid-dev/sample-000299.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,2,twenties,female,canada,
cv-valid-dev.csv:cv-valid-dev/sample-002458.mp3,sadly my dream of becoming a squirrel whisperer may never happen,9,1,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003264.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003656.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
grep: cv-valid-test: Is a directory
cv-valid-test.csv:cv-valid-test/sample-000221.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,thirties,male,canada,
cv-valid-test.csv:cv-valid-test/sample-001576.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
cv-valid-test.csv:cv-valid-test/sample-002831.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003705.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003789.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
grep: cv-valid-train: Is a directory
cv-valid-train.csv:cv-valid-train/sample-000324.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,2,,,,
cv-valid-train.csv:cv-valid-train/sample-000373.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,1,,,,
cv-valid-train.csv:cv-valid-train/sample-000382.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-001026.mp3,sadly my dream of becoming a squirrel whisperer may never happen,4,0,,,,
cv-valid-train.csv:cv-valid-train/sample-003106.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,fourties,female,england,
cv-valid-train.csv:cv-valid-train/sample-004591.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-005048.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-007144.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
+ 100s more...
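
For reference, the overlap counts above can be reproduced with a minimal sketch along these lines (assuming the three cv-valid-*.csv files are in the current directory, that each CSV has a single header line, that the transcript is the second comma-separated field as shown above, and that the sentences contain no commas):

# Extract the unique sentences of each predefined split, then count the overlap.
for split in train dev test; do
  tail -n +2 "cv-valid-${split}.csv" | cut -d, -f2 | sort -u > "${split}.sentences"
  echo "unique sentences in ${split}: $(wc -l < "${split}.sentences")"
done
# comm -12 prints only the lines common to both (sorted) inputs.
echo "common sentences train/dev:  $(comm -12 train.sentences dev.sentences | wc -l)"
echo "common sentences train/test: $(comm -12 train.sentences test.sentences | wc -l)"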

Now, this is pretty much a terrible design for a speech corpus, but I suggest excluding the train sentences from the LM completely to get somewhat more realistic results. I'm currently rerunning the scripts with a Cantab LM without the train sentences and will report back when I have the results.

@entn-at
Contributor

entn-at commented Jan 10, 2018

Well, that's the way the corpus was designed by the people at Mozilla (including the overlap in spoken content in the predefined train/dev/test splits). You should address your concerns to the people running this data collection effort (https://voice.mozilla.org/data, https://github.com/mozilla/voice-web). I believe Voxforge is similar in that there is an overlap in prompts.

This recipe is just using the corpus in its intended way (using the train/dev/test splits as provided). The results are not misleading, and to be sure, nobody is claiming that test results on this corpus somehow generalize to any other datasets/conditions.

EDIT: This isn't really a Kaldi problem; I suggest moving the discussion to kaldi-help: https://groups.google.com/forum/#!forum/kaldi-help

@bmilde
Author

bmilde commented Jan 10, 2018

Thanks for your fast reply! Also, I see that you (@entn-at) wrote the scripts for this corpus. Thank you very much for adapting them so fast!

I wrote similar scripts for Eesen (RNN-CTC) and was puzzled by the WER difference (~16% vs. ~4%). I had used kaldi_lm, which won't let you finish training if it discovers training sentences in dev, so the difference was that I omitted them from LM training.

Despite this not being a Kaldi problem, I'd still suggest the following enhancements to the Kaldi Common Voice recipe:

  • Enhance the recipe by adding a true LM (e.g. trained on Cantab) so that the models can be used on phrases other than these 6000 sentences. This would make the models usable for speech recognition tasks (what Mozilla actually intended) despite the corpus problem. Right now you probably wouldn't be able to decode much else besides these exact sentences, and tuning the recipe is pointless since the best results are probably obtained with a large LM/FST prior.
  • Add a comment about this peculiarity to the scripts and/or results
  • Maybe add an option to train without the train sentences in the LM to get better generalization results (see the sketch below). I'm currently doing this and the GMM-HMM WERs are basically 2x as large.
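
Regarding the last point, here is a minimal sketch of how train sentences could be kept out of an external LM text (cantab_corpus.txt and lm_text_filtered.txt are placeholder names, and any text normalization used during LM preparation would have to be applied to both files first):

# Collect the unique Common Voice training sentences.
tail -n +2 cv-valid-train.csv | cut -d, -f2 | sort -u > cv_train_sentences.txt
# Drop every line of the external LM text that exactly matches one of them,
# so the LM no longer memorizes the (heavily overlapping) dev/test prompts.
grep -v -x -F -f cv_train_sentences.txt cantab_corpus.txt > lm_text_filtered.txt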

Ultimately, I agree that this is a corpus problem. I will raise my concerns on the Mozilla repo and link to this issue (it would be nice to keep the discussion here open as well) - this is surely not what they had in mind.

@jtrmal
Contributor

jtrmal commented Jan 10, 2018 via email

@entn-at
Contributor

entn-at commented Jan 10, 2018

The recipe in its current form was more of a research exercise and not intended to produce a set of models fit for more general use/distribution (like the models published at http://www.kaldi-asr.org/models.html). I fully agree that your suggestions would make this recipe more valuable to other users and would be more in line with what the people at Mozilla had intended. With @danpovey's and @jtrmal's approval, we could create a second version (s5b?) of the recipe that includes your proposed changes. In that case, I would like to encourage you to make PRs with your changes using the Cantab LM.

Ultimately, though, I somewhat share @jtrmal's doubts regarding the corpus (with its current design). The people working on Mozilla's DeepSpeech implementation seem to be using various corpora for model training (including non-free corpora like Fisher, SWBD, see their import scripts). Perhaps a subset of the CV corpus could be added to egs/multi_en?

@danpovey
Contributor

danpovey commented Jan 10, 2018 via email

@bmilde
Author

bmilde commented Jan 11, 2018

Exactly, the overlap is 99.6%. kaldi_lm rightfully refuses to work on that; I think it even gracefully showed an error message.

I've also posted the problem here on the mozilla discourse platform: https://discourse.mozilla.org/t/common-voice-v1-corpus-design-problems-overlapping-train-test-dev-sentences/24288

Chances are, they aren't really aware of it.

@jtrmal
Contributor

jtrmal commented Jan 11, 2018 via email

@jtrmal
Contributor

jtrmal commented Jan 11, 2018 via email

@mikehenrty

Hi Kaldi folks, I'm the maintainer of the Common Voice project, and as such am responsible for both our small text corpus and the improper split.

First of all, let me say that it was great to see Common Voice data integrated into the Kaldi project so quickly. Also, big thanks to @bmilde for finding and reporting this bug, and for suggesting a way to fix this on our repo. For a discussion of our plan for addressing this, you can check out that bug.

One thing I noticed in this thread:

I have certain doubts about the utility or usefulness of the Mozilla corpora anyway.

Both @jtrmal and @entn-at seemed to have this sentiment (perhaps others?). Are there problems other than our small corpus size and train/dev/test split that you are concerned about? We are in the process of updating Common Voice now, and it would be super helpful to get your feedback so we can make our data more useful to your project (and other STT engines).

On the topic of text corpora, we are working with the University of Illinois to get them to release the Flickr30K text corpus under CC0, which would allow us to use its roughly 100K sentences. For an example of what these look like, see here:
https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/not-used/flickr30k.txt

If it's not too much to ask, we would love expert feedback on whether the above corpus would be helpful for speech engines. You'll notice the sentences are all descriptions of images, so the utterances have a lot of repeated words. Unfortunately, I am not a speech technologist, so I don't have any intuition as to whether this is the right kind of data for utterances. Again, any info you could provide would be very helpful.

Thanks again, and thanks for maintaining Kaldi!

@ognjentodic

Here are some suggestions:

  • use wav files (16 kHz, 16-bit), not mp3 (a conversion sketch for the current mp3 release follows this list)
  • if possible, disable OS audio feedback in the app that collects the data; in a number of recordings I've noticed different types of beeps at the beginning, and I later realized this may have been due to OS audio feedback, e.g. when tapping a button on Android
  • keep speaker information (some sort of hash) across different recordings
  • less focus on reading, more on spontaneous speech (this will end up being more costly, since the speech data will need to be transcribed); you can ask people to talk about a number of different topics... have two people talk to each other, or a person talk to a bot; you can also take a look at the LDC website and get inspiration from datasets collected in the past (I would not necessarily blindly follow their approaches, though)
  • additional metadata that could be useful: tags/labels for extraneous speech/noise, etc.; device info; user demographics; headset/bluetooth
  • capture data in different acoustic environments (and when possible, capture metadata about the environment as well)
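
On the first point, until a wav release exists, the current mp3 release can be converted to the 16 kHz, 16-bit mono wav that Kaldi recipes typically expect with something along these lines (a sketch only; paths are placeholders, and sox needs to be built with mp3 support):

# Convert every clip of one split to 16 kHz, 16-bit, mono wav.
for f in cv-valid-train/*.mp3; do
  sox "$f" -r 16000 -b 16 -c 1 "${f%.mp3}.wav"
done
# ffmpeg alternative for a single file:
# ffmpeg -i sample-000324.mp3 -ac 1 -ar 16000 -acodec pcm_s16le sample-000324.wav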

@sikoried
Contributor

sikoried commented Jan 26, 2018 via email

@vdp
Contributor

vdp commented Jan 27, 2018

I completely agree with Ognjen's and Korbinian's suggestions, especially the one about conversational speech. This is indeed the major missing piece when it comes to open speech corpora. At this point, collecting more scripted English is not going to be very useful IMO, except maybe for distant speech or speech in noisy environments; I'm not familiar with those domains, so perhaps someone researching robust recognition will chime in. I'd say that for clean, read English speech the data problem is pretty much solved. For example, the LibriSpeech results for non-accented English (i.e. {dev,test}-clean) are already fairly close to the theoretical minimum; keep in mind that some of those errors are actually due to different spellings of names, etc.
As was already pointed out, a conversational corpus is going to be much more costly in terms of corpus design, data collection effort, and transcription fees, but in contrast to read speech it is going to be very useful. Aside from the cost, conversational speech is hard to come by, as it is rarely released publicly because it tends to be private in nature.

If English conversational speech is not an option, then I would suggest at least concentrating on languages other than English. Coverage there is spotty at best, so even read speech is going to be useful.

@entn-at
Contributor

entn-at commented Jan 28, 2018

The prompted nature of the collected speech was my main concern as well. Perhaps you could collect transcribed conversational speech via crowd sourcing as follows (I'm posting some ideas here and explicitly want to invite constructive criticism and comments from everybody; I know this has nothing to do with Kaldi per se, but people here have a lot of experience with data collection for ASR and can give valuable advice to @mikehenrty):

  • To collect conversational speech, let users call other users via WebRTC. Conversation will be easier if users already know each other. There are ways of recording a WebRTC call, for example using RTCMultiConnection and RecordRTC. AFAIK, WebRTC gateways like the open-source Janus project have plugins for recording calls. Each caller needs to be informed that the call is being recorded (and for what purpose), of course. There are obvious privacy issues for releasing the data and crowd sourcing the transcription effort, and reminders not to disclose private information only go so far.
  • For word-level transcription, use existing LVCSR systems (e.g. based on Kaldi, Mozilla DeepSpeech, or cloud speech APIs) to segment calls (i.e., create relatively short chunks) and transcribe them. Human listeners could then verify these automatically transcribed segments and provide feedback in multiple ways:
    1. Transcription correct: yes/no. This is the least-effort feedback option.
    2. Add buttons (or click-/tap-able areas) for each word and between words. Users can click/tap on words to indicate insertion/substitution or between words to indicate deletions. More effort required by the listener.
    3. In addition to (2), let users correct incorrectly transcribed segments or enter a fully manual transcription. Higher effort required.

Keeping the required effort low is important for crowd sourcing, but as has been pointed out, badly or inconsistently transcribed speech isn't that useful. There is no question that a professionally designed and transcribed speech corpus is preferable in every way, but I'd be interested in feedback/suggestions (also and especially in the form of "This won't work because...").

Disclaimer: I'm not involved in this data collection effort, I'm just trying to be helpful.

@mikehenrty

Wow, great feedback and ideas here, everyone. Thank you for lending us your brains. I agree with @entn-at that the Kaldi GitHub is perhaps not the best place to discuss making Common Voice better (for instance, I would rather see this on our Discourse channel 🤓). That said, part of the goal of Common Voice is to make open source speech technology better, so it's useful for us to come to where the Kaldi folks are. I apologize if it feels like we are hijacking this thread.

Ok, on to the suggestions. I would like to try to comment on all the thoughts and ideas I see here. If I miss anything, I apologize. Also, if anyone thinks of anything else to add, by all means keep the ideas coming!

use wav files (16kHz, 16bits), not mp3

We record the audio from a variety of browsers, OSes, and devices. Sadly, this gives us audio in many formats and bit rates. In addition, we must support audio playback on these devices (for human audio validation). MP3 gave us a good trade-off between browser/device support (so we don't have to transcode on the fly every time a user wants to listen to or validate a clip), file size (for downloading the data), and quality. We spoke with both our internal DeepSpeech team and a speech researcher at SNIPS.ai (a speech start-up), and neither seemed concerned about the file format (artifacts and all) or bitrate. I would love to hear some thoughts about how important this is for Kaldi (or any other speech projects, for that matter).

if possible, disable OS audio feedback in the app that collects the data; in a number of recordings I've noticed that there were different types of beeps in the beginning; later on I realized this may have been due to OS audio feedback, eg. when tapping a button, on Android

Our DeepSpeech team specifically did not want us to remove this from the data, the argument being that it would make the resulting engine more resilient to this kind of thing.

keep speaker information (some sort of hash), across different recordings

We do indeed have this information, but have opted not to release it just yet, as we are still trying to understand the privacy implications. Would these speaker IDs be useful for speech-to-text? We realize they are useful for things like speaker identification and/or speech synthesis, but that is not the focus of Common Voice at this time.

less focus on reading, more on spontaneous speech...

Good suggestion on taking inspiration from the LDC. Indeed, we want to create something like the Fisher corpus using our website, but that requires a rethink of our current (admittedly simplistic) interaction model. Big thanks to @entn-at for the thoughtful comments on how we could make this work. I completely agree that level of effort is something we need to pay close attention to. And if we can make this somehow fun, or useful in a way besides providing data (like talking with a friend), then we are on the right track.

To this end, we are currently in the design process for "Collecting Organic Speech." We started with many big and sometimes crazy ideas (accent trainers, karaoke dating apps, a necklace with a button that submits the last 15 seconds of audio) and narrowed them down to a few we want to explore. Our current thinking is that we want to connect individuals who use the site and have them speak to each other somehow. We also want this to be fun, so we will have prompts and perhaps games (e.g. "Draw Something," but with audio).

That said, the time horizon would be late 2018 at the earliest. Our current engineering focus is on making Common Voice multi-language, and also increasing engagement on the site.

additional metadata that could be useful: tag/labels for extraneous speech/noise, etc; device info; user demographics; headset/bluetooth

Good idea! We have a bug for this:
common-voice/common-voice#814

capture data in different acoustic environments (and when possible, capture metadata about the environment as well)

Right now we know the browser, mobile vs. desktop, and sometimes the OS. Is there any other metadata you'd like to see?

@jtrmal
Contributor

jtrmal commented Jan 31, 2018 via email

@jtrmal
Contributor

jtrmal commented Jan 31, 2018 via email

@galv
Contributor

galv commented Jan 31, 2018 via email

@johnjosephmorgan
Contributor

johnjosephmorgan commented Apr 20, 2018 via email
