Determine train / dev / test split before recording utterances #806

Closed

bmilde opened this issue Jan 15, 2018 · 12 comments


@bmilde

bmilde commented Jan 15, 2018

I’m a speech researcher and I want to use Common Voice speech data for my experiments. Unfortunately, there is a big problem with how the corpus (v1), and specifically the train/test/dev split, is designed. This issue is also being discussed here: https://github.com/kaldi-asr/kaldi/issues/2141 (Commonvoice results misleading, complete overlap of train/dev/test sentences #2141), and I've posted the issue to the Discourse message board, too.

There is a near complete overlap (>99%) of train/dev/test sentences in Common Voice v1:

unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
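
For reference, a minimal sketch of how the overlap counts above can be reproduced from the released per-split files. The file names and the "text" column follow my reading of the v1 CSV layout and should be treated as assumptions; a real run should also use a proper CSV parser, since sentences can contain commas.

```ts
// Sketch only: file names and the "text" column are assumptions about v1;
// the comma split below is naive and ignores quoted fields.
import * as fs from "fs";

function sentences(path: string): Set<string> {
  const lines = fs.readFileSync(path, "utf8").trim().split("\n");
  const textCol = lines[0].split(",").indexOf("text");
  return new Set(
    lines.slice(1).map((l) => l.split(",")[textCol].trim().toLowerCase())
  );
}

const train = sentences("cv-valid-train.csv");
const dev = sentences("cv-valid-dev.csv");
const test = sentences("cv-valid-test.csv");

const overlap = (a: Set<string>, b: Set<string>) =>
  [...a].filter((s) => b.has(s)).length;

console.log("unique sentences in train:", train.size);
console.log("unique sentences in dev:", dev.size);
console.log("unique sentences in test:", test.size);
console.log("common sentences train/dev:", overlap(dev, train));
console.log("common sentences train/test:", overlap(test, train));
```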

The goal should be to get as close as possible to the optimum of no sentence overlap and no speaker overlap in the train/dev/test split. Ideally, the dev and test utterances would also all be unique (no repetitions), to accurately measure an unbiased WER. The train corpus may have repetitions, but having too many, as in the current v1 release, might be detrimental and would lead to overfitting in typical speech recognition models. But if dev/test are correctly set up, then WER numbers can at least be measured accurately and the effect can be taken into account. My guess is that right now utterances to record are selected uniformly at random from a corpus of prespecified text. This random selection process needs to be changed so that some new text inputs are labeled a priori as belonging to dev/test and are recorded just a single time, while the selection process for training utterances remains as is.

Otherwise, any WER numbers reported on this train/dev/test split are unfortunately pretty much meaningless and encourage absolute overfitting on the training data. Also, the best results are obtained when the language model is trained only on the train sentences, without any other sentences. This is pretty much what the Kaldi recipe does now (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice2), and it encourages recognizing only these ~7000 sentences and nothing else.

Solving this critical bug is important if researchers are to use the corpus and publish results with it.

@mikehenrty
Member

Hi @bmilde.

First of all, thank you for reporting this bug. It is indeed very critical. One of the reasons we wanted to release this data so quickly was to get this kind of feedback from people like you, so bravo!

I have spoken about this split with our machine learning group (which is separate from the Common Voice team that I am part of), and there are a couple of solutions we are investigating.

First, we can redo the split between dev/train/test to make sure there are no overlapping speakers or sentences. The problem with this is that the dev and test sets will probably have to become a lot smaller, due to our limited sentence corpus.

Another approach is to modify our Common Voice server (i.e. this repo) to have special sentences that are quarantined to the test/dev sets, and then make sure certain users only get those sentences to read. This is a better approach in the long term, since it means we could grow the test/dev sets larger and wouldn't have to worry about throwing out any training data (again, due to our small corpus size).

We will be investigating the above approaches in the coming weeks, and will definitely fix this in the next release of the data (v2). In the meantime, any advice or information you (or anyone reading this) has regarding this problem would be very welcome.

@bmilde
Author

bmilde commented Jan 26, 2018

I agree: quarantining special sentences with the intent of making dev/test recordings of them is probably the cleanest solution. Also, if you or someone else starts to use voice-web for recording speech in other languages, the train/test/dev split would then be planned before any sentences are recorded, making it less likely to run into the same issue.

@Gregoor
Contributor

Gregoor commented Feb 20, 2018

Alrighty, implementation time. Here's my thinking on how we'll do it:

  • existing data stays train data (as I understand it, we need most data in the train dataset)
  • add an enum column to the sentences table with values ('train', 'dev', 'test')
  • add an enum column to the users table with values ('train', 'dev', 'test')
  • when a new user/sentence comes along, assign it to one of those buckets according to the difference between the ideal recording distribution (e.g. 60% train, 20% dev, 20% test) and the real one (which is 100% train, 0% the rest in the beginning); see the sketch after this list. We should only recalculate the real split every few days
  • only serve sentences to record which are in the same bucket as the user
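
Here is a minimal sketch of the assignment rule described above; the names (Bucket, TARGET, assignBucket) are illustrative, not the actual voice-web schema. A new user or sentence goes to whichever bucket is furthest below its target share, measured against the periodically recomputed real distribution.

```ts
// Sketch only: names and shapes are illustrative, not the voice-web schema.
type Bucket = "train" | "dev" | "test";

// Ideal recording distribution (the 60/20/20 proposal above).
const TARGET: Record<Bucket, number> = { train: 0.6, dev: 0.2, test: 0.2 };

// counts = current number of users/sentences per bucket,
// recomputed from the database every few days.
function assignBucket(counts: Record<Bucket, number>): Bucket {
  const total = counts.train + counts.dev + counts.test;
  let best: Bucket = "train";
  let largestDeficit = -Infinity;
  for (const bucket of Object.keys(TARGET) as Bucket[]) {
    const actualShare = total === 0 ? 0 : counts[bucket] / total;
    const deficit = TARGET[bucket] - actualShare; // how far below target this bucket is
    if (deficit > largestDeficit) {
      largestDeficit = deficit;
      best = bucket;
    }
  }
  return best;
}

// With all existing data in train (100%/0%/0%), dev and test share the
// largest deficit, so new arrivals fill them first:
console.log(assignBucket({ train: 10000, dev: 0, test: 0 })); // "dev"
```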

@bmilde @mikehenrty
Is this the general idea you all had in mind? Also, what is the ideal split? 60/20/20?

@mikehenrty
Member

The implementation plan sounds right to me.

CC @kdavis-mozilla to see if he also has any ideas about the split ratios.

@kdavis-mozilla
Contributor

The ratios seem reasonable.

However, a better split might be 64% train, 16% dev, and 20% test.

For example, the Fisher data set is split 64% train, 16% dev, 20% test, and the Switchboard data set uses the same 64%/16%/20% split.

@bmilde
Author

bmilde commented Feb 20, 2018

TED-LIUM's split is 5% dev, 12% test. At the end of the day, the absolute number of utterances in dev/test will matter more than the percentages. For English, the dataset is already quite large, so it would probably be fine to make dev a bit smaller than test, e.g. 70/10/20.

Another question is whether you want to fix the dev/test sets at some point, so that WER numbers stay comparable when more (train) data is added to the corpus and you're measuring on the same set. On the other hand, I imagine that once dev/test is large enough, adding a bit of data to it will have a negligible effect on WER numbers.

@kdavis-mozilla
Contributor

@bmilde I'd agree that the absolute number is more important, but I'd also note that Fisher and Switchboard are very different sizes; they differ by about a factor of 10. That's why I selected those examples.

I don't think fixing the dev/test sets is a good idea. There are various reasons for this, for example:

  • Consider the case where browsers switch their audio processing pipeline. The old samples are not processed with the new pipeline, so training on the new data will not be as useful for the old dev/test samples.
  • Consider the case where Common Voice suddenly becomes popular in Scotland, while the old dev/test data sets don't have any people with Scottish accents. Training on the new data will not be useful for the old dev/test data sets.
  • Consider the case where Common Voice suddenly becomes unpopular in Scotland: the old dev/test data sets are full of Scottish accents, but the newly collected data contains basically none.

The data sets are, however, fixed at release. In other words, for, say, the 1.1 release of Common Voice there will be some train/dev/test split that everyone can compare against. This split will be different for the 1.2 release and the 1.3 release, but 1.1, 1.2, and 1.3 can each be referred to when quoting WER numbers, for example.

@bmilde
Author

bmilde commented Feb 20, 2018

@kdavis-mozilla Just wanted to point out that speech corpora usually have fixed dev/test sets, but on the other hand, yes, they are usually not continuously extended anyway. I agree, keeping dev/test dynamic (but versioned) could be fitting for an ongoing collection effort. You make a good point, and it actually highlights another problem: should something be done to counter data imbalance in dev/test? Usually the most problematic variable in a speech corpus is gender. A quick glance at the Fisher paper reveals that they controlled for the male:female ratio (at least somewhat): https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf

I looked at the metadata of v1; the current male:female ratio seems to be about 75%/25%, so the imbalance is quite high.

Concerning the % split in Fisher and Switchboard: as far as I know, the train/test/dev split emerged after the data was recorded, and in the case of Switchboard, Hub5'00, which consists of 20 previously unreleased telephone conversations, is usually used to report WER numbers. I am not too familiar with the Fisher dataset, but I was under the impression that there is no official split for it and that people usually train on Fisher+SWBD and also report WER numbers on SWBD/Hub5'00.

@kdavis-mozilla
Contributor

@bmilde Yeah I agree the male:female ratio is not ideal. We'll need to figure out how to address that dynamically, i.e. such that each released version of the corpus has a reasonable balance. Maybe it's culling data before release? Maybe it's some type of dynamic balancing? I'm not sure, but I know we need to address the problem.

On Fisher, interesting. I don't remember if there was/is a pre-existing split or if we made the split in our Fisher importers; something I should check... Yeah, you're right; it's our split rather than an LDC-given split. (LDC dropped the ball there.)

@Gregoor So I guess there's at least one more issue that has to be addressed: gender balance.

We also now have an additional problem: some people do not report demographic info, in particular gender, so for some people we simply don't know what their gender is. There's also the related question: do males and females decline to report demographic info at the same rate?

@kdavis-mozilla
Contributor

kdavis-mozilla commented Feb 21, 2018

@bmilde The more I think about this, other issues are coming to mind.

The data set will have multiple audiences. This will at least include researchers and ASR companies.

Researchers, generally, will be interested in training on the data set and saying "We have SoTA results with WER 0.1% on the Common Voice v1.1 test set."

ASR companies may be interested in doing the same thing. However, ASR companies generally know their demographics: 60% women / 40% men, 25% women / 75% men, 75% Scottish / 15% non-Scottish, or some other ratio. So companies will not necessarily be interested in having a 50% women / 50% men split; they will be interested in having as much data as possible and then creating a split that fits their demographics.

This suggests that we should release the raw data and then release meta-data that describes the official de-biased data set that researchers should benchmark on. For example, the meta-data would say: use samples 127, 187, 400, ... for the training set; 78, 23, 99, ... for the dev set; and 1001, 2003, ... for the test set.
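
To make the proposal concrete, here is a small sketch of what such split meta-data could look like and how a consumer might apply it to the raw release. The JSON shape, file names, and field names are assumptions for illustration, not an agreed format.

```ts
// Sketch only: the file names and JSON shape are illustrative assumptions.
import * as fs from "fs";

interface OfficialSplit {
  version: string; // e.g. "1.1"
  train: string[]; // sample/clip identifiers
  dev: string[];
  test: string[];
}

// Split definition shipped alongside the raw data (hypothetical file name).
const split: OfficialSplit = JSON.parse(
  fs.readFileSync("official-split-1.1.json", "utf8")
);

// All clip ids in the raw release (hypothetical listing file).
const allClips = fs.readFileSync("clips.txt", "utf8").trim().split("\n");

const official = new Set([...split.train, ...split.dev, ...split.test]);

// Researchers benchmark on the official, de-biased subsets; companies can
// ignore the meta-data and cut their own split from the full raw release.
const unassigned = allClips.filter((id) => !official.has(id));
console.log(`official samples: ${official.size}, raw clips outside the official split: ${unassigned.length}`);
```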

@bmilde
Author

bmilde commented Feb 22, 2018

@kdavis-mozilla For the train set, the imbalance is not a problem; practitioners using it for training can decide on their own whether it makes sense to do something about it or to just train as is. But you would like the dev/test sets to be as unbiased as possible, so that you don't overfit, e.g. to male speakers. Companies and researchers alike will want to show that they get good results on the official test sets, and if those are highly imbalanced, that skews the results a bit.

Maybe it is a good idea to require metadata from people who record sentences for dev/test (and reassign them to train if they don't want to provide it)? And also to verify some of these categories in the curation step you have?

If every test and dev entry has metadata, you will be able to do nice experiments and result tables that, e.g., show WER separately by gender, age group, or Scottish vs. American speakers. That also means you could use it to tune your model to a particular setting of interest. You could still do this if only part of the data has metadata... I just checked, and the percentage is better than I thought: on the current v1 test set, 38% disclose their gender. It would still be nicer to have it on every entry in dev/test.
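
As a rough illustration of the kind of result table this enables, here is a sketch that aggregates per-utterance error counts by a demographic field from the metadata; the record shape and function name are assumptions, and per-group WER is computed as total errors over total reference words.

```ts
// Sketch only: the record shape is an assumption about how per-utterance
// scoring output could be joined with the release meta-data.
interface UttResult {
  gender: string | null; // from the meta-data; null when not disclosed
  errors: number;        // substitutions + deletions + insertions
  refWords: number;      // reference word count
}

function werByGroup(results: UttResult[]): Map<string, number> {
  const errors = new Map<string, number>();
  const words = new Map<string, number>();
  for (const r of results) {
    const group = r.gender ?? "undisclosed";
    errors.set(group, (errors.get(group) ?? 0) + r.errors);
    words.set(group, (words.get(group) ?? 0) + r.refWords);
  }
  const wer = new Map<string, number>();
  for (const [group, e] of errors) {
    wer.set(group, e / (words.get(group) ?? 1));
  }
  return wer;
}

// Example: Map { "female" => 0.12, "male" => 0.10, "undisclosed" => 0.15 }
console.log(werByGroup([
  { gender: "female", errors: 3, refWords: 25 },
  { gender: "male", errors: 2, refWords: 20 },
  { gender: null, errors: 3, refWords: 20 },
]));
```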

It might also be possible to do the gender classification automatically (e.g. with LSTMs or CNNs); I have seen this work well on clean corpora (I think I've seen something like a 1-2% error rate). I actually had some students try it out on the Common Voice v1 data set, along with the other metadata, and it didn't work as well, but maybe the network needs to be tuned a bit more (20%+ error rates so far).

@kdavis-mozilla
Contributor

@bmilde I agree we should preferentially select people who provide demographic information for dev and test, as it will make the data set far more useful. Also, I agree we should try to balance the dev and test sets.

However, I think we should strive to balance train too.

As you mention, Researcher A and Researcher B can use whatever training data they see fit, then compute WER on dev and/or test. However, doing so renders subsequent comparisons of their WERs, as a proxy for model comparison, meaningless, as they've trained on different data sets.

Alternatively, Researcher A and Researcher B can train on a biased train set. In this case, subsequent comparison of their WERs expresses which researcher's algorithm had a prior best tuned to dev and test. Such results may look good initially but may not generalize to, say, other languages.

Alternatively, Researcher A and Researcher B can train on a balanced train set. In this case, subsequent comparison of their WERs expresses which researcher's algorithm is better able to learn from the data, and thus better indicates which algorithm has a higher likelihood of generalizing to, say, other languages.

Also, I think we should strive to have as many people as possible in train with demographic information. This will allow us to have something like a "list of ingredients" for train, letting people who train on it know what to expect from their models, i.e. "this should work well on such and such demographics".

PS: I don't doubt that it should be possible to do gender classification automatically.
