Commonvoice results misleading, complete overlap of train/dev/test sentences #2141
Well, that's the way the corpus was designed by the people at Mozilla (including the overlap in spoken content in the predefined train/dev/test splits). You should address your concerns to the people running this data collection effort (https://voice.mozilla.org/data, https://github.com/mozilla/voice-web). I believe Voxforge is similar in that there is an overlap in prompts. This recipe is just using the corpus in its intended way (using the train/dev/test splits as provided). The results are not misleading, and to be sure, nobody is claiming that test results on this corpus somehow generalize to any other datasets/conditions. EDIT: This isn't really a Kaldi problem; I suggest moving the discussion to kaldi-help: https://groups.google.com/forum/#!forum/kaldi-help |
Thanks for your fast reply! Also, I see that you (@entn-at) wrote the scripts for this corpus. Thank you very much for adapting them so fast! I wrote similar scripts for Eesen (RNN-CTC) and was puzzled at the WER difference (~16% vs ~4%), but I used kaldi_lm and it doesn't let you finish training if it discovers the training sentences in dev, so the difference was that I omitted them in LM training. Despite this not being a Kaldi problem, I'd still suggest the following enhancements to the Kaldi Common Voice recipe:
- Enhance the recipe by adding a true LM (e.g. trained on Cantab) so that the models can be used on phrases other than these 6000 sentences. This will make the models usable in speech recognition tasks (what Mozilla actually intended) despite the corpus problem. Right now you probably wouldn't be able to decode much else besides these exact sentences, and tuning the recipe is pointless since the best results are probably obtained with a large LM/FST prior.
- Add a comment about this peculiarity to the scripts and/or results.
- Maybe an option to train without the train sentences in the LM, to get better generalization results. I'm currently doing this and the GMM-HMM WERs are basically 2x as large (see the sketch below).
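For the last point, a rough sketch of what that filtering could look like (the file names below are placeholders, not the recipe's actual outputs; prepare_lm.sh would then be pointed at the filtered text):

```bash
# Sketch: drop every LM-training sentence that also occurs verbatim in the
# acoustic training transcripts, so the LM doesn't memorize the dev/test prompts.
# data/train/text is the Kaldi "utt-id transcript" file; lm_corpus.txt is a
# placeholder for whatever text the LM is trained on (e.g. Cantab).
cut -d' ' -f2- data/train/text | sort -u > train_sentences.txt
grep -vxFf train_sentences.txt lm_corpus.txt > lm_corpus_no_train.txt
```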
Ultimately, I agree that this is a corpus problem. I will raise my concerns on the Mozilla repo and link to this issue (it would be nice to keep the discussion here open as well); this is surely not what they had in mind. |
I would probably suggest doing two different decoding passes. If the corpus was designed a specific way, then for comparability there should be an easy way to get the reference numbers.
I have certain doubts about the utility or usefulness of the Mozilla corpora anyway.
Y.
|
The recipe in its current form was more of a research exercise and not intended to produce a set of models fit for more general use/distribution (like the models published at http://www.kaldi-asr.org/models.html). I fully agree that your suggestions would make this recipe more valuable to other users and would be more in line with what the people at Mozilla had intended. With @danpovey's and @jtrmal's approval, we could create a second version (s5b?) of the recipe that includes your proposed changes. In that case, I would like to encourage you to make PRs with your changes using the Cantab LM. Ultimately, though, I somewhat share @jtrmal's doubts regarding the corpus (with its current design). The people working on Mozilla's DeepSpeech implementation seem to be using various corpora for model training (including non-free corpora like Fisher, SWBD, see their import scripts). Perhaps a subset of the CV corpus could be added to egs/multi_en? |
I don't have super strong opinions about the structure... s5b would be a reasonable approach, I guess.
kaldi_lm may crash if, in the metaparameter optimization, it detects that your dev data was likely taken from the training data. The way this works, it likely wouldn't crash even if only a small proportion of the dev data were distinct from the training data.
Likely this means that there is near-100% overlap.
|
Exactly, the overlap is 99.6%. kaldi_lm rightfully refuses to work on that; I think it even gracefully showed an error message. I've also posted the problem here on the Mozilla Discourse platform: https://discourse.mozilla.org/t/common-voice-v1-corpus-design-problems-overlapping-train-test-dev-sentences/24288 Chances are, they aren't really aware of it. |
My preference would be just adding another decoding script to s5 (and documenting that in the README) to reduce duplication of things.
But it's not worth my time to argue about this, so Ewald, your call.
y.
|
I'm not gonna discuss whether kaldi_lm is rightful or not, but the usual discounting methods (Kneser-Ney, Good-Turing) have issues with artificial-looking data, often artificially generated or grammar-induced data.
Witten-Bell discounting should work in that case (probably not implemented in kaldi_lm).
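For reference, Witten-Bell discounting is available in SRILM's ngram-count; something like the following would do it (just a sketch with a placeholder text file, using SRILM rather than kaldi_lm, and not part of the current recipe):

```bash
# Sketch: train a 3-gram LM with Witten-Bell discounting using SRILM's
# ngram-count (not kaldi_lm); lm_text.txt is a placeholder for the LM text.
ngram-count -order 3 -text lm_text.txt -wbdiscount -lm wb_3gram.arpa
```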
y.
|
Hi Kaldi folks, I'm the maintainer of the Common Voice project, and as such am responsible for both our small text corpus and the improper split. First of all, let me say that it was great to see Common Voice data integrated into the Kaldi project so quickly. Also, big thanks to @bmilde for finding and reporting this bug, and for suggesting a way to fix it on our repo. For a discussion of our plan for addressing it, you can check out that bug. One thing I noticed in this thread is that several of you have doubts about the usefulness of the corpus itself.
Both @jtrmal and @entn-at seemed to have this sentiment (perhaps others?). Are there problems, other than our small corpus size and train/dev/test split, that you are concerned about? We are in the process of updating Common Voice now, and it would be super helpful to get your feedback so we can make our data more useful to your project (and to other STT engines). On the topic of text corpora, we are working with the University of Illinois to get them to release the Flickr30K text corpus under CC-0, which would allow us to use its roughly 100K sentences. For an example of what these look like, you can see here: If it's not too much to ask, we would love expert feedback on whether the above corpus would be helpful for speech engines. You'll notice the sentences are all descriptions of images, so the utterances have a lot of repeated words. Unfortunately, I am not a speech technologist, so I don't have any intuition as to whether this is the right kind of data for utterances. Again, any info you could provide would be very helpful. Thanks again, and thanks for maintaining Kaldi! |
Here are some suggestions:
- Use wav files (16 kHz, 16 bit), not mp3.
- If possible, disable OS audio feedback in the app that collects the data; in a number of recordings I've noticed that there were different types of beeps in the beginning; later on I realized this may have been due to OS audio feedback, e.g. when tapping a button on Android.
- Keep speaker information (some sort of hash) across different recordings.
- Less focus on reading, more on spontaneous speech (this will end up being more costly, since the speech data will need to be transcribed); you can ask people to talk about a number of different topics, have two people talk to each other, or a person talk to a bot; you can also take a look at the LDC website and get some inspiration from datasets collected in the past (I would not necessarily blindly follow their approaches, though).
- Additional metadata that could be useful: tags/labels for extraneous speech/noise, etc.; device info; user demographics; headset/bluetooth.
- Capture data in different acoustic environments (and when possible, capture metadata about the environment as well).
|
As Ognjen said, unless you're looking for models to be used in a command/control scenario (like Echo or Home), and instead want a more general model, you'd be looking for spontaneous speech, ideally between people (of balanced age and gender). The hard part is transcribing it, since badly (or inconsistently) transcribed data is not of much use (which is also the reason why well-transcribed data is so expensive). Here's a comparable real-life analogy: say you're learning a new language; you may understand the news on TV (read out by professional speakers), but have no clue when somebody talks to you on the street (spontaneous, slang, accent, background noise, ...).
Korbinian.
|
I completely agree with Ognjen's and Korbinian's suggestions, especially the one about conversational speech. This is indeed the major missing piece when it comes to open speech corpora. At this point, collecting more scripted English is not going to be very useful, IMO, except maybe for distant speech or speech in noisy environments, but I'm not familiar with these domains; perhaps someone who is researching robust recognition will chime in. I'd say for clean read English speech the data problem is pretty much solved. For example, the LibriSpeech results for non-accented English, i.e. {dev,test}-clean, are already fairly close to the theoretical minimum (keep in mind that some of those errors are actually due to different spellings of some names, etc.). If English conversational speech is not an option, then I would suggest at least concentrating on languages other than English. The coverage there is spotty at best, so even read speech is going to be useful. |
The prompted nature of the collected speech was my main concern as well. Perhaps you can collect transcribed conversational speech via crowd sourcing as follows (I'm posting some ideas here and explicitly want to invite constructive criticism/comments from everybody; I know this has nothing to do with Kaldi per se, but people here have a lot of experience with data collected for ASR and can give valuable advice to @mikehenrty):
- To collect conversational speech, let users call other users via WebRTC. Conversation will be easier if users already know each other. There are ways of recording a WebRTC call, for example using [RTCMultiConnection](https://github.com/muaz-khan/RTCMultiConnection) and [RecordRTC](https://github.com/muaz-khan/WebRTC-Experiment/tree/master/RecordRTC). AFAIK, WebRTC gateways like the open-source [Janus project](https://janus.conf.meetecho.com/index.html) have plugins for recording calls. Each caller needs to be informed that the call is being recorded (and for what purpose), of course. There are obvious privacy issues with releasing the data and crowd sourcing the transcription effort, and reminders not to disclose private information only go so far.
- For word-level transcription, use existing LVCSR systems (e.g. based on Kaldi, Mozilla DeepSpeech, or cloud speech APIs) to segment calls (i.e., create relatively short chunks) and transcribe them. Human listeners verifying these automatically transcribed segments could listen to them and provide feedback in multiple ways:
  1. Transcription correct: yes/no. This is the least-effort feedback option.
  2. Add buttons (or click-/tap-able areas) for each word and between words. Users can click/tap on words to indicate insertions/substitutions, or between words to indicate deletions. More effort is required by the listener.
  3. In addition to (2), let users correct incorrectly transcribed segments or enter a fully manual transcription. Higher effort required.
Keeping the required effort low is important for crowd sourcing, but as has been pointed out, badly or inconsistently transcribed speech isn't that useful. There is no question that a professionally designed and transcribed speech corpus is preferable in every way, but I'd be interested in feedback/suggestions (also and especially in the form of "This won't work because..."). Disclaimer: I'm not involved in this data collection effort, I'm just trying to be helpful. |
Wow, great feedback and ideas here everyone. Thank you for lending us your brains. I agree with @entn-at that perhaps the Kaldi github is not the best place to discuss making Common Voice better (for instance, I would rather see this on our Discourse channel 🤓 ). That said, part of the goals of Common Voice is to make open source speech technology better, so it's useful for us to come to where Kaldi folks are. I apologize if it feels like we are hijacking this thread. Ok, on to the suggestions. I would like to try to comment on all the thoughts and ideas I see here. If I miss anything, I apologize. Also, if anyone thinks of anything else to add, by all means keep the ideas coming!
> use wav files (16kHz, 16bits), not mp3

We record the audio from a variety of browsers, OSes, and devices. Sadly this gives us audio in many formats and bit rates. In addition to this, we must support audio playback on these devices (for human audio validation). MP3 gave us a good trade-off between browser/device support (so we didn't have to transcode on the fly every time a user wanted to listen to/validate a clip), file size (for downloading data), and quality. We spoke with both our internal DeepSpeech team and a speech researcher at SNIPS.ai (a speech start-up), and neither seemed concerned about the file format (artifacts and all) or bitrate. I would love to hear some thoughts about how important this is for Kaldi (or any other speech projects, for that matter).
> if possible, disable OS audio feedback in the app that collects the data...

Our DeepSpeech team specifically did not want us to remove this from the data, the argument being that it would make the resulting engine more resilient to this kind of thing.
> keep speaker information (some sort of hash), across different recordings

We do indeed have this information, but have opted not to release it just yet, as we are still trying to understand the privacy implications. Will these speaker IDs be useful for speech-to-text? We realize they are useful for things like speaker identification and/or speech synthesis, but that is not the focus of Common Voice at this time.
> less focus on reading, more on spontaneous speech...

Good suggestion on taking inspiration from LDC. Indeed, we want to create something like the Fisher Corpus using our website, but that requires a rethink of our current (admittedly simplistic) interaction model. Big thanks to @entn-at for the thoughtful comments on how we could make this work. I completely agree that level of effort is something we need to pay close attention to. And if we can make this somehow fun, or useful in a way besides providing data (like talking with a friend), then we are on the right track. To this end, we are currently in the design process for "Collecting Organic Speech." We started with many big and sometimes crazy ideas (accent trainers, karaoke dating apps, a necklace with a button that can submit the last 15 seconds of audio), and narrowed in on a few ideas we want to explore. Our current thinking is that we want to connect individuals who use the site and have them speak to each other somehow. We also want this to be fun, so we will have prompts and perhaps games (e.g. "Draw Something," but with audio). That said, the time horizon would be late 2018 at the earliest. Our current engineering focus is on making Common Voice multi-language, and also on increasing engagement on the site.
> additional metadata that could be useful: tag/labels for extraneous speech/noise, etc; device info; user demographics; headset/bluetooth

Good idea! We have a bug for this: common-voice/common-voice#814
> capture data in different acoustic environments (and when possible, capture metadata about the environment as well)

Right now we know browser, mobile vs. desktop, and sometimes OS. Is there any other metadata you'd like to see? |
I think the main problem (at least IMO) is that you got the whole thing kinda backwards: IMO you should start with a solid use case and then drive the corpus acquisition with respect to that use case. After that, you can start thinking of expanding, language-wise, use-case-wise, and so on.
For example, in what way wasn't the LibriSpeech corpus and/or the models sufficient, if you actually tested it? Or other English models/corpora? Why did you decide to go for English? Do you know there are a couple of solid AMs freely available, along with scripts (and source corpora, in some cases) for training them?
The use cases are very important, as they will drive the way you gather the corpus. Doing it the other way around, recording speech and hoping that some machine-learning magic (which is/was my impression of the way you did it) will make it useful, can end up in bitter disappointment. Lacking evident use-case and certain design naivety was the reason the corpus does not look very useful.
For example, consider that metadata can be very useful, and it would be worth considering what metadata to record right in the early stages of the corpus design. If this is done right, you can fairly cheaply (storage-wise, computation-wise, ...) provide specific (adapted) models for a given platform (mobile/desktop) or some other "slice" of the hw/sw ecosystem, or speaker-adapted models in the longer term. Yes, the metadata are useful.
Compared to this, whether it's recorded in mp3 or whether you can hear noises in the background is not something I would care too much about. It is better to record lossless, but I don't think it should be a pivot. (My personal opinion.)
y.
|
> Lacking evident use-case and certain design naivety was the reason the corpus does not look very useful.

I should say: lacking evident use-case and certain design naivety was the reason why I said the corpus does not look very useful.
y.
|
I'm not actively participating in this conversation, but I do want to comment that "shorten" is probably the best lossless audio codec I know of for speech, if you're concerned about the size of wav files.
By the way, @jtrmal, from my time at NIPS, there seems to be a trend that having more data is important for future directions of research (this work comes to mind: http://research.baidu.com/deep-learning-scaling-predictable-empirically/; i.e., loss appears to go down logarithmically with your data size). A lot of people do think that larger datasets are important. It would be interesting to understand where Baidu's 10,000-hour dataset comes from. Last I heard (two years ago, mind!), they got a lot via Mechanical Turk jobs where people spoke sentences shown to them. If the large datasets are built using very similar data, this research direction might be less interesting than people are assuming.
Also, I do remember that when I got started in speech recognition, there were no serious open datasets (maybe AMI or TEDLIUM?), HTK didn't release any recipes, and Kaldi was still hosted in an svn repo. So I do think there is educational value to an open dataset, though of course LibriSpeech already serves the purpose of an educational dataset quite well.
Finally, for some reason many decision makers I've met seem to be irrationally opposed to buying corpora from the LDC. I agree the prices are not feasible for individuals, but the prices ($7k for 2000 hours of SWBD?) are fairly reasonable when contrasted with an institution's costs for engineers, scientists, and computer equipment. Of course, I'm not sure it's fair that Mozilla should be footing the bill by doing a lot of this work for free, either... (I'm watching their DeepSpeech repo; it is quite popular.)
I've said far more than I expected. Oh well.
--
Daniel Galvez
http://danielgalvez.me
https://github.com/galv
|
In order to use the stages in the run.sh file, the utils/parse_options.sh file needs to be sourced.
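For anyone running into this, the usual pattern looks roughly like the sketch below (illustrative of the common Kaldi recipe convention, not the exact contents of this recipe's run.sh):

```bash
#!/usr/bin/env bash
# Sketch of the standard Kaldi stage pattern: defaults are set first, then
# utils/parse_options.sh is sourced so that e.g. "--stage 2" overrides them.
stage=0

. ./path.sh
. utils/parse_options.sh   # must be sourced *after* the variables it may override

if [ $stage -le 1 ]; then
  echo "Stage 1: data preparation"
fi

if [ $stage -le 2 ]; then
  echo "Stage 2: feature extraction"
fi

# Resume from a later stage with:  ./run.sh --stage 2
```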
|
I was quite surprised to see how low the WERs are for the new Common Voice corpus: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/RESULTS (4ish% TDNN)
Unfortunately, these results seem to be bogus because there is a near complete overlap of train/dev/test sentences and the LM is only trained on the corpus train sentences (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/local/prepare_lm.sh). To make matters worse, there aren't really that many unique sentences in the corpus:
unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
This can also easily be verified by, e.g., grepping for "sadly my dream of becoming a squirrel whisperer may never happen" in the original corpus CSVs.
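For instance, the overlap counts can be reproduced along these lines (a sketch run against the Kaldi data directories after the recipe's data preparation stage; one could equally work off the sentence column of the CSVs):

```bash
# Sketch: count unique sentences per split and the pairwise overlap,
# using the Kaldi data dirs ("utt-id transcript" per line) prepared by the recipe.
for x in train dev test; do
  cut -d' ' -f2- data/$x/text | sort -u > $x.sentences
  echo "$x: $(wc -l < $x.sentences) unique sentences"
done
echo "train/dev overlap:  $(comm -12 train.sentences dev.sentences | wc -l)"
echo "train/test overlap: $(comm -12 train.sentences test.sentences | wc -l)"
```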
Now this is pretty much a terrible design for a speech corpus, but I suggest excluding the train sentences from the LM completely, to get somewhat more realistic results. I'm currently rerunning the scripts with a Cantab LM without the train sentences and will report back when I have the results.