Issue: Missing initial frames cause DeepSpeech to skip the first word; adding about 5 ms of silence makes it work most of the time. #2443
Comments
@alokprasad For the sake of reproducibility, could you share your trimmed and trimmed+fixed audio samples?
@alokprasad Ping?
@lissyx Yesterday I tested Mozilla DeepSpeech in both offline (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/native_client/python/client.py) and streaming (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/examples/mic_vad_streaming/mic_vad_streaming.py) modes. In my experiments I intentionally did not use the LM / trie. The offline mode (reading audio from a WAV file) works quite well in terms of speech recognition accuracy. However, I was not able to achieve the same quality via my laptop mic: recognition is very bad when I feed audio through it. Then I performed the following experiment:
Thus I inferred that the issue is not with my laptop mic. Then I performed the next experiment: I ran mic_vad_streaming.py and played ORIGINAL_AUDIO. The recognition results were very bad. Then I added the "--savewav" option, ran mic_vad_streaming.py again, and played ORIGINAL_AUDIO. Let's name the saved file SAVED_AUDIO. After that I found this issue, #2443.
As you can see, the letter "o" is missing from the output (it should be "on the way").
As you can see, this time the recognition is 100% correct. I suspect the issue is somewhere in the feature extraction stage (particularly MFCC?), or that the whole system (feature extraction + NN) requires some initial (dummy) set of input samples to start working (filling buffers or something). I attached test_2443.zip with the "4507-16021-0012_on_the_way.wav" and "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" files. I hope it is enough to reproduce the issue and find the cause.
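For anyone wanting to reproduce the experiment, prepending digital silence to a WAV file can be done with Python's stdlib `wave` module alone. This is a minimal sketch, not the tooling used in the issue; the demo synthesizes a short tone instead of using the attached samples, and note that 800 samples at 16 kHz is 50 ms of audio (5 ms would be 80 samples).

```python
import math
import struct
import wave

def prepend_silence(in_path, out_path, n_samples):
    """Copy a WAV file, inserting n_samples of all-zero silence at the start."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        audio = src.readframes(src.getnframes())
    # One frame = sampwidth bytes per channel; zeros are digital silence.
    silence = b"\x00" * (n_samples * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + audio)

# Demo: synthesize a 100 ms, 440 Hz tone at 16 kHz mono, 16-bit.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    frames = [int(3000 * math.sin(2 * math.pi * 440 * i / 16000))
              for i in range(1600)]
    f.writeframes(struct.pack("<%dh" % len(frames), *frames))

prepend_silence("tone.wav", "tone_padded.wav", 800)  # 800 samples = 50 ms @ 16 kHz

with wave.open("tone_padded.wav", "rb") as f:
    print(f.getnframes())  # 1600 + 800 = 2400
```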
Honestly, listening before reading your comment, in the cut version my ears don't get "on" but "n" as well. Once I read your comment, I could only hear "on". I'm unsure how much we are just biased by that sample, but at least that's actionable.
I've just listened to the original (not modified) 4507-16021-0012.wav and compared it with my modified wav files.
Ask someone blindly; I'm not sure you will get the same results.
@a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here? If it's the latter, then the proper solution would not be to work around it in the code but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.
@lissyx
https://soundcloud.com/alok-prasad-213091558/sets/deepspeech-test-files The actual utterance in the speech file is "why should one hold on the way": 1) trimmed, 2) with silence appended. I think this has to be addressed at the training level, especially now that we have augmentation in place.
I suppose some debug / investigation is required to determine the real cause of the issue. As soon as the cause is determined, the appropriate decision could be made.
Yep, that was my point 😊
My use case is wakeword + speech: my system feeds streaming audio to DeepSpeech to detect the wakeword, and as soon as it is detected, from the next frame onwards it feeds the audio to another instance of DeepSpeech. E.g., "Lucifer, why should one hold on the way" => DeepSpeech will recognize it as "should one hold on the way".
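One way to adapt the feeding code for this use case is to keep a small rollback buffer of recent chunks, so the second recognizer is seeded with a little audio from before the handoff instead of starting cold (the missing-initial-frames problem this issue describes). A toy sketch, where the recognizers and wakeword detector are stand-ins, not real DeepSpeech API calls:

```python
from collections import deque

def split_after_wakeword(chunks, detect_wakeword, rollback_chunks=3):
    """Consume audio chunks until detect_wakeword(chunk) fires, then return
    the command audio with a few buffered chunks prepended, so the downstream
    recognizer does not lose the first word.

    detect_wakeword is a stand-in for a real wakeword detector.
    """
    recent = deque(maxlen=rollback_chunks)  # rolling pre-wakeword buffer
    chunks = iter(chunks)
    for chunk in chunks:
        if detect_wakeword(chunk):
            # Seed the command audio with buffered context + the rest.
            return list(recent) + [chunk] + list(chunks)
        recent.append(chunk)
    return []  # wakeword never detected

# Toy demo: "audio" chunks are strings; the detector matches "lucifer".
stream = ["pad1", "pad2", "lucifer", "why", "should", "one", "hold"]
command = split_after_wakeword(stream, lambda c: c == "lucifer")
print(command)  # ['pad1', 'pad2', 'lucifer', 'why', 'should', 'one', 'hold']
```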
@alokprasad I guess in your case it might be better to change your feeding code, yep.
librosa has some silence trimming functionality that could be useful for cleaning up a dataset that has too much silence, if that's what's affecting model performance: https://librosa.github.io/librosa/generated/librosa.effects.trim.html
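The idea behind `librosa.effects.trim` can be illustrated with a simplified pure-Python analogue. Note this is a crude per-sample amplitude threshold; librosa's actual implementation works framewise on dB relative to the signal's peak (its `top_db` parameter), so this only sketches the concept:

```python
def trim_silence(samples, threshold=0.01):
    """Strip leading and trailing samples whose magnitude is below threshold
    (normalized amplitude). Returns the trimmed samples and the kept interval.

    Crude analogue of librosa.effects.trim, which instead thresholds framewise
    in dB relative to peak (top_db).
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end], (start, end)

y, idx = trim_silence([0.0, 0.001, 0.5, -0.3, 0.002, 0.0])
print(y, idx)  # [0.5, -0.3] (2, 4)
```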
@reuben The amount of silence here is very small; I'm not sure removing silence would fix the above issue.
@reuben If the silence is all zeros, DeepSpeech does not work. Audacity does something similar: it adds some sort of dithering, and with that, surprisingly, DeepSpeech works.
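To test the all-zeros-versus-dither observation, one can generate "silence" with tiny random noise instead of exact zeros. This loosely mimics the effect of dither on exported audio; the amplitude and distribution here are arbitrary choices, not Audacity's actual dithering algorithm:

```python
import random
import struct

def dithered_silence(n_samples, amplitude=2, seed=0):
    """Return n_samples of 16-bit little-endian PCM 'silence' whose values are
    tiny random integers in [-amplitude, amplitude] instead of exact zeros.

    Sketch only: the amplitude is an arbitrary small value, and this is not
    Audacity's dither algorithm (which uses shaped/triangular noise).
    """
    rng = random.Random(seed)
    samples = [rng.randint(-amplitude, amplitude) for _ in range(n_samples)]
    return struct.pack("<%dh" % n_samples, *samples)

pcm = dithered_silence(800)
print(len(pcm))  # 1600 bytes = 800 16-bit samples
```

The resulting bytes can be prepended to the PCM payload of a trimmed recording to compare recognition against a zero-filled prefix of the same length.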
Hello. When I downloaded DeepSpeech and ran it on Windows, it unfortunately converted my speech to text badly. For example, when I say "hello" it outputs "halow". How can I increase the accuracy of the speech-to-text conversion? I just want to use the DeepSpeech model as-is and do not want to train on any datasets; is there a way? Also, when I speak a word through a microphone, do I have to configure certain settings in the Windows environment first?
Please stop spamming existing GitHub issues and use Discourse for support after reading the documentation.
How are you adding silence to the mic stream? |
For support and discussions, please use our Discourse forums.
If you've found a bug, or have a feature request, then please create an issue with the following information:
Description:
I downloaded a sample WAV from the release folder of the DeepSpeech client and stripped some audio from the beginning, so that to the human ear it is still recognizable, but when fed to the DeepSpeech client, recognition does not work for the first word,
eg. should an hold on the way
If I add extra silence at the front of this trimmed audio, about 800 samples (5 ms),
then recognition works (or is close) for the first word,
e.g. after adding silence:
what should one hold on the way