
Issue: Missing initial frames cause DeepSpeech to skip the first word; adding about 5 ms of silence makes it work most of the time. #2443

Open
alokprasad opened this issue Oct 16, 2019 · 18 comments

Comments

alokprasad commented Oct 16, 2019

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (our builds, or upstream TensorFlow): upstream
  • TensorFlow version (use command below): TF 1.13
  • Python version: 3.7
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce:
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio t3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67610

should an hold on the way
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio st3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210

what should one hold on the way

Description:
I downloaded a sample wav from the release folder of the DeepSpeech client and stripped some audio from the start, so that to a human ear it is still recognizable, but when fed to the DS client, recognition does not work for the first word,
e.g. "should an hold on the way".
If I add some extra silence at the front of this trimmed audio, about 800 samples (5 ms),
then recognition works for / is close to the first word,
e.g. after adding silence:
"what should one hold on the way"
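For reference, a minimal sketch of how such a pad can be added programmatically (assuming a 16 kHz, 16-bit mono wav; the file names are placeholders and the 800-sample figure is taken from the description above):

```python
# Minimal sketch: prepend N all-zero samples to a 16-bit mono WAV.
# File names and the 800-sample pad length are placeholders from the report above.
import numpy as np
import scipy.io.wavfile as wavfile

rate, audio = wavfile.read("trimmed_4507-16021-0012.wav")   # int16 samples
pad = np.zeros(800, dtype=audio.dtype)                      # pad length from the report above
wavfile.write("silence_added_trimmed.wav", rate, np.concatenate([pad, audio]))
```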

lissyx (Collaborator) commented Oct 18, 2019

@alokprasad For the sake of reproducibility, could you share your trimmed and trimmed+fixed audio samples?

lissyx (Collaborator) commented Oct 24, 2019

@alokprasad Ping?

a-lunev commented Oct 28, 2019

@lissyx Yesterday I tested Mozilla DeepSpeech in both offline (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/native_client/python/client.py) and streaming (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/examples/mic_vad_streaming/mic_vad_streaming.py) modes.

In my experiments I intentionally did not use the LM / trie.

The offline mode (reading audio from a wav file) works quite well in terms of speech recognition accuracy. However, I was not able to achieve the same quality via my laptop mic: the speech recognition is very bad if I feed audio through it.
My initial thought was that my laptop mic drastically changes the frequency response and/or SNR. I also suspected potential issues with the VAD in mic_vad_streaming.py.

Then I performed the following experiment:

  1. I took a wav file, let's name it ORIGINAL_AUDIO.
  2. If I feed it to client.py, the recognition is quite accurate.
  3. Then I played ORIGINAL_AUDIO to my laptop speaker and recorded the sound via my laptop mic. Let's name the new wav file as RECORDED_AUDIO.
  4. Then I fed RECORDED_AUDIO wav file to client.py again, and the recognition was still quite accurate.

Thus I've inferred that the issue is not with my laptop mic.

Then I performed the next experiment: I ran mic_vad_streaming.py and played ORIGINAL_AUDIO. The recognition results were very bad.

Then I added the "--savewav" option, ran mic_vad_streaming.py again, and played ORIGINAL_AUDIO. Let's name the saved file SAVED_AUDIO.
I listened to SAVED_AUDIO. It was good; I was able to hear every word.
Then I fed SAVED_AUDIO to client.py (offline mode), and the recognition results were as bad as the ones I got with mic_vad_streaming.py.

After that I found this current issue #2443.
Unfortunately, I cannot share the specific ORIGINAL_AUDIO and SAVED_AUDIO files from the experiments I described. However, I've just prepared new files that reproduce the same issue, close to what @alokprasad explained.
I did the following:

  1. took 4507-16021-0012.wav (extracted it from https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/audio-0.5.1.tar.gz)

  2. cut "on the way" audio segment in Audacity and saved it as "4507-16021-0012_on_the_way.wav" file

  3. generated 0.1 s of silence at the beginning of the "4507-16021-0012_on_the_way.wav" file in Audacity and saved it as "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav"

  4. fed "4507-16021-0012_on_the_way.wav" file to deepspeech executable:

$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way.wav
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:40:22.269232: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:40:22.269287: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269302: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269431: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00903s.
Running inference.
n the way

As you can see, "o" letter is missing in the output (should be "on the way").

  5. fed the "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" file to the deepspeech executable:
$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav 
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:43:09.572111: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:43:09.572172: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572196: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572361: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00938s.
Running inference.
on the way

As you can see, this time the recognition is 100% correct.
Thus adding silence at the beginning helped.

I suspect the issue is somewhere in the feature extraction stage (MFCC in particular?), or that the whole system (feature extraction + NN) requires some initial (dummy) set of input samples to start working (filling buffers or something).
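As a rough back-of-the-envelope illustration of that suspicion (a sketch only, assuming a 32 ms analysis window with a 20 ms step, which are not necessarily DeepSpeech's actual featurizer settings), a 0.1 s pad simply gives the featurizer a few frames of lead-in before the first spoken samples arrive:

```python
# Sketch: how many complete analysis windows fit into the audio with and
# without a 0.1 s pad at the start. The 32 ms / 20 ms values are assumptions.
SAMPLE_RATE = 16000
WIN = int(0.032 * SAMPLE_RATE)    # 512 samples per window (assumed)
STEP = int(0.020 * SAMPLE_RATE)   # 320 samples hop (assumed)

def num_frames(n_samples: int) -> int:
    # complete windows only; a tail shorter than WIN is dropped
    return 0 if n_samples < WIN else 1 + (n_samples - WIN) // STEP

on_the_way = int(0.9 * SAMPLE_RATE)        # rough length of the clipped segment
print(num_frames(on_the_way))              # frames without the pad
print(num_frames(on_the_way + 1600))       # +0.1 s pad -> 5 extra leading frames
```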
test_2443.zip

I attached test_2443.zip with "4507-16021-0012_on_the_way.wav" and "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" files.

I hope this is enough to reproduce the issue and find the cause.

lissyx (Collaborator) commented Oct 28, 2019

> [quotes @a-lunev's comment above in full]

Honestly, listening before reading your comment, in the cut version my ears don't get "on" but "n" either. Once I read your comment, I could only hear "on". I'm unsure how much we are just biased on that sample, but at least that's actionable.

a-lunev commented Oct 29, 2019

> Honestly, listening before reading your comment, in the cut version my ears don't get "on" but "n" either. Once I read your comment, I could only hear "on". I'm unsure how much we are just biased on that sample, but at least that's actionable.

I've just listened to the original (unmodified) 4507-16021-0012.wav and compared it with my modified wav files.
To my ear, "on the way" sounds exactly the same in all of them.
When I cut the beginning part of the sentence, I tried to keep the "on the way" segment untouched.
You could compare all 3 wav files in e.g. Audacity for details (waveform or spectrogram side by side).

lissyx (Collaborator) commented Oct 29, 2019

> To my ear, "on the way" sounds exactly the same in all of them.

Ask someone blindly; I'm not sure you would get the same results.

lissyx (Collaborator) commented Oct 29, 2019

@a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here?

If it's the latter, then the proper solution would not be to work around it in the code but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.

alokprasad (Author) commented:

@lissyx
Sorry for the late response; here are the samples:

  1. trimmed

  2. silence added to the above trimmed file

https://soundcloud.com/alok-prasad-213091558/sets/deepspeech-test-files

The actual utterance in the speech file is "why should one hold on the way",
but when the original wav is fed to the DeepSpeech native client it gives the output "what should one hold on the way"
(maybe an LM issue).

1> Trimmed
trimmed_4507-16021-0012.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67714
should an hold on the way

2> With silence added at the start
silence_added_at_start_and_trimmed_4507-16021-0012.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210
what should one hold on the way

I think this has to be addressed at the training level, especially now that we have augmentation in place,
as most of these ASR systems will be used in conjunction with some sort of VAD (WebRTC or RNNoise).
Before the VAD detects speech, some frames may be silence or some speech is already lost; even if we buffer previous frames, we have seen issues with ASR recognition (especially if we speak very fast).

a-lunev commented Oct 30, 2019

> @a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here?
>
> If it's the latter, then the proper solution would not be to work around it in the code but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.

I suppose some debugging / investigation is required to determine the real cause of the issue. Once the cause is determined, the appropriate decision can be made.

lissyx (Collaborator) commented Oct 30, 2019

Yep, that was my point 😊

alokprasad (Author) commented:

My use case is wakeword + speech: my system feeds (streaming) audio to DeepSpeech to detect the wakeword, and as soon as it is detected, from the next frame onwards it feeds the audio to another instance of DeepSpeech.
But if the gap between the wakeword and the speech is very small, the initial words are missed.

e.g. "Lucifer, why should one hold on the way" => DS will recognise it as "should one hold on the way"
What is the suggestion: should I change the feeder code, or, for augmentation, remove the silence or in fact trim some audio (a few frames) and retrain?

lissyx (Collaborator) commented Oct 31, 2019

@alokprasad I guess in your case it might be better if you change your feeding code, yep.
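A minimal sketch of what that feeding change could look like (the feed_to_asr callback, chunk size, and buffer length are placeholders, not DeepSpeech's actual API): keep a short rolling pre-roll of recent audio and replay it into the recognizer when the wakeword fires, so the first words after the wakeword are not clipped.

```python
# Sketch of a pre-roll buffer for a wakeword -> ASR handoff.
# feed_to_asr() is a hypothetical callback wrapping whatever streaming call
# the installed DeepSpeech version exposes; chunk/buffer sizes are guesses.
from collections import deque

SAMPLE_RATE = 16000
CHUNK = 320                       # 20 ms of 16 kHz mono audio, in samples
PRE_ROLL_CHUNKS = 15              # keep ~300 ms of history

pre_roll = deque(maxlen=PRE_ROLL_CHUNKS)
asr_active = False

def on_audio_chunk(chunk, wakeword_detected, feed_to_asr):
    """Call once per captured chunk; flush the pre-roll when the wakeword fires."""
    global asr_active
    if asr_active:
        feed_to_asr(chunk)
        return
    pre_roll.append(chunk)
    if wakeword_detected:
        asr_active = True
        for buffered in pre_roll:  # replay the history so nothing is lost
            feed_to_asr(buffered)
        pre_roll.clear()
```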

reuben (Contributor) commented Nov 13, 2019

librosa has some silence trimming functionality that could be useful for cleaning up a dataset that has too much silence, if that's what's affecting model performance: https://librosa.github.io/librosa/generated/librosa.effects.trim.html
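For instance, a quick sketch of how that could be applied to a dataset file (the top_db threshold and file names are placeholder values, not recommendations):

```python
# Sketch: trim leading/trailing silence from a training wav with librosa.
# top_db and the file names are placeholder values.
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
sf.write("clip_trimmed.wav", y_trimmed, sr)
```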

alokprasad (Author) commented:

@reuben The amount of silence here is very small; I'm not sure the above issue would be resolved even after removing silence. Probably, with augmentation, we have to chop a few initial frames of some samples during training, making the ASR robust to this.
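Something along those lines could be approximated offline before training (a sketch only; the trim range is an arbitrary assumption, and this is not an existing DeepSpeech augmentation flag):

```python
# Sketch: randomly chop a few initial milliseconds off training wavs so the
# model also sees utterances that start right at speech onset.
# The 0-50 ms range is an arbitrary assumption.
import random
import scipy.io.wavfile as wavfile

def random_head_trim(in_path, out_path, max_trim_ms=50):
    rate, audio = wavfile.read(in_path)
    trim = random.randint(0, int(rate * max_trim_ms / 1000))
    wavfile.write(out_path, rate, audio[trim:])
```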

alokprasad (Author) commented:

@reuben, if the silence is all zeros, DeepSpeech does not work.
E.g., for the utterance "Go back", DeepSpeech gives "back";
if 100 ms of all-zero silence is added, it again gives the result "back";
but if we add random values from 0 to 255 for the 100 ms of silence, the result comes out perfectly as "go back".

Audacity does a similar thing: it adds some sort of dithering, and with that, surprisingly, DeepSpeech works better.
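A minimal sketch of that workaround (prepending 100 ms of low-level random values instead of all-zero silence; the file names are placeholders):

```python
# Sketch: prepend 100 ms of low-amplitude random noise (values 0-255) instead
# of all-zero silence before feeding the wav to DeepSpeech.
import numpy as np
import scipy.io.wavfile as wavfile

rate, audio = wavfile.read("go_back.wav")                     # 16 kHz int16 mono
noise = np.random.randint(0, 256, size=rate // 10).astype(audio.dtype)
wavfile.write("go_back_dithered.wav", rate, np.concatenate([noise, audio]))
```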

problemSolvingProgramming commented:

Hello. When I downloaded DeepSpeech and ran it on Windows, it unfortunately converted speech to text poorly.

For example, when I say:
HI ----> i
you are -> you are

hello -> halow

How can I increase the accuracy or efficiency of the speech-to-text conversion?

I just want to use the DeepSpeech model as-is; I do not want to train on any datasets. Is there a way or not?

When I speak a word through a microphone, do I already have to make certain settings in the Windows environment?

lissyx (Collaborator) commented Nov 7, 2020

> [quotes @problemSolvingProgramming's comment above in full]

Please stop your spam on existing GitHub issues and use Discourse for support after reading the documentation.

GSchowalter commented:

> [quotes @alokprasad's earlier comment about all-zero vs. random-value silence]

How are you adding silence to the mic stream?
