
Issue: Missing initial frames cause DeepSpeech to skip the first word; adding about 5 ms of silence makes it work most of the time. #2443

Open
alokprasad opened this issue Oct 16, 2019 · 18 comments

Comments

alokprasad commented Oct 16, 2019

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (our builds, or upstream TensorFlow): upstream
  • TensorFlow version (use command below): TF 1.13
  • Python version: 3.7
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce:
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio t3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67610

should an hold on the way
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio st3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210

what should one hold on the way

Description:
I downloaded a sample wav from the release folder of the DeepSpeech client and stripped some audio from the start, so that to a human ear it is still recognizable, but when fed to the DS client, recognition does not work for the first word,
e.g. "should an hold on the way".
If I add some extra silence at the front of this trimmed audio, about 800 samples (5 ms),
then recognition works for / is close to the first word,
e.g. after adding silence:
"what should one hold on the way"
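For reference, a minimal sketch of how such a pad can be added programmatically (assuming a 16 kHz, 16-bit mono wav; the file names are placeholders and the 800-sample figure is taken from the description above):

```python
# Minimal sketch: prepend N all-zero samples to a 16-bit mono WAV.
# File names and the 800-sample pad length are placeholders from the report above.
import numpy as np
import scipy.io.wavfile as wavfile

rate, audio = wavfile.read("trimmed_4507-16021-0012.wav")   # int16 samples
pad = np.zeros(800, dtype=audio.dtype)                      # pad length from the report above
wavfile.write("silence_added_trimmed.wav", rate, np.concatenate([pad, audio]))
```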

lissyx (Collaborator) commented Oct 18, 2019

@alokprasad For the sake of reproducibility, could you share your trimmed and trimmed+fixed audio samples?

lissyx (Collaborator) commented Oct 24, 2019

@alokprasad Ping?

a-lunev commented Oct 28, 2019

@lissyx Yesterday I tested Mozilla DeepSpeech in both offline (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/native_client/python/client.py) and streaming (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/examples/mic_vad_streaming/mic_vad_streaming.py) modes.

In my experiments I intentionally did not use the LM / trie.

The offline mode (reading audio from a wav file) works quite well in terms of speech recognition accuracy. However, I was not able to achieve the same quality via my laptop mic: the speech recognition is very bad if I feed audio through it.
My initial thought was that my laptop mic drastically changes the frequency response and/or SNR. I also suspected potential issues with the VAD in mic_vad_streaming.py.

Then I performed the following experiment:

  1. I took a wav file, let's name it ORIGINAL_AUDIO.
  2. If I feed it to client.py, the recognition is quite accurate.
  3. Then I played ORIGINAL_AUDIO to my laptop speaker and recorded the sound via my laptop mic. Let's name the new wav file as RECORDED_AUDIO.
  4. Then I fed RECORDED_AUDIO wav file to client.py again, and the recognition was still quite accurate.

Thus I've inferred that the issue is not with my laptop mic.

Then I performed the next experiment: I ran mic_vad_streaming.py and played ORIGINAL_AUDIO. The recognition results were very bad.

Then I added the "--savewav" option, ran mic_vad_streaming.py again, and played ORIGINAL_AUDIO. Let's name the saved file SAVED_AUDIO.
I listened to SAVED_AUDIO. It was good; I was able to hear every word.
Then I fed SAVED_AUDIO to client.py (offline mode), and the recognition results were as bad as the ones I got with mic_vad_streaming.py.

After that I found this current issue #2443.
Unfortunately, I cannot share the specific ORIGINAL_AUDIO and SAVED_AUDIO files from the experiments I described. However, I've just prepared new files that reproduce the same issue, close to what @alokprasad explained.
I did the following:

  1. took 4507-16021-0012.wav (extracted it from https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/audio-0.5.1.tar.gz)

  2. cut "on the way" audio segment in Audacity and saved it as "4507-16021-0012_on_the_way.wav" file

  3. generated 0.1 s of silence at the beginning of the "4507-16021-0012_on_the_way.wav" file in Audacity and saved it as "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav"

  4. fed "4507-16021-0012_on_the_way.wav" file to deepspeech executable:

$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way.wav
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:40:22.269232: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:40:22.269287: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269302: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269431: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00903s.
Running inference.
n the way

As you can see, "o" letter is missing in the output (should be "on the way").

  5. fed the "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" file to the deepspeech executable:
$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav 
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:43:09.572111: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:43:09.572172: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572196: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572361: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00938s.
Running inference.
on the way

As you can see, this time the recognition is 100% correct.
Thus adding silence at the beginning helped.

I suspect the issue is somewhere in the feature extraction stage (MFCC in particular?), or that the whole system (feature extraction + NN) requires some initial (dummy) set of input samples to start working (filling buffers or something).
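As a rough back-of-the-envelope illustration of that suspicion (a sketch only, assuming a 32 ms analysis window with a 20 ms step, which are not necessarily DeepSpeech's actual featurizer settings), a 0.1 s pad simply gives the featurizer a few frames of lead-in before the first spoken samples arrive:

```python
# Sketch: how many complete analysis windows fit into the audio with and
# without a 0.1 s pad at the start. The 32 ms / 20 ms values are assumptions.
SAMPLE_RATE = 16000
WIN = int(0.032 * SAMPLE_RATE)    # 512 samples per window (assumed)
STEP = int(0.020 * SAMPLE_RATE)   # 320 samples hop (assumed)

def num_frames(n_samples: int) -> int:
    # complete windows only; a tail shorter than WIN is dropped
    return 0 if n_samples < WIN else 1 + (n_samples - WIN) // STEP

on_the_way = int(0.9 * SAMPLE_RATE)        # rough length of the clipped segment
print(num_frames(on_the_way))              # frames without the pad
print(num_frames(on_the_way + 1600))       # +0.1 s pad -> 5 extra leading frames
```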
test_2443.zip

I attached test_2443.zip with "4507-16021-0012_on_the_way.wav" and "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" files.

I hope this is enough to reproduce the issue and find the cause.

lissyx (Collaborator) commented Oct 28, 2019

> [quotes @a-lunev's comment above in full]

Honestly, listening before reading your comment, in the cut version my ears don't get "on" but "n" either. Once I read your comment, I could only hear "on". I'm unsure how much we are just biased on that sample, but at least that's actionable.

a-lunev commented Oct 29, 2019

> Honestly, listening before reading your comment, in the cut version my ears don't get "on" but "n" either. Once I read your comment, I could only hear "on". I'm unsure how much we are just biased on that sample, but at least that's actionable.

I've just listened to the original (unmodified) 4507-16021-0012.wav and compared it with my modified wav files.
To my ear, "on the way" sounds exactly the same in all of them.
When I cut the beginning part of the sentence, I tried to keep the "on the way" segment untouched.
You could compare all 3 wav files in e.g. Audacity for details (waveform or spectrogram side by side).

lissyx (Collaborator) commented Oct 29, 2019

> To my ear, "on the way" sounds exactly the same in all of them.

Ask someone blindly; I'm not sure you would get the same results.

lissyx (Collaborator) commented Oct 29, 2019

@a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here?

If it's the latter, then the proper solution would not be to work around it in the code but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.

alokprasad (Author) commented:

@lissyx
Sorry for the late response; here are the samples:

  1. trimmed

  2. silence added to the above trimmed file

https://soundcloud.com/alok-prasad-213091558/sets/deepspeech-test-files

The actual utterance in the speech file is "why should one hold on the way",
but when the original wav is fed to the DeepSpeech native client it gives the output "what should one hold on the way"
(maybe an LM issue).

1> Trimmed
trimmed_4507-16021-0012.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67714
should an hold on the way

2> With silence added at the start
silence_added_at_start_and_trimmed_4507-16021-0012.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210
what should one hold on the way

I think this has to be addressed at the training level, especially now that we have augmentation in place,
as most of these ASR systems will be used in conjunction with some sort of VAD (WebRTC or RNNoise).
Before the VAD detects speech, some frames may be silence or some speech is already lost; even if we buffer previous frames, we have seen issues with ASR recognition (especially if we speak very fast).

a-lunev commented Oct 30, 2019

> @a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here?
>
> If it's the latter, then the proper solution would not be to work around it in the code but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.

I suppose some debugging / investigation is required to determine the real cause of the issue. Once the cause is determined, the appropriate decision can be made.

lissyx (Collaborator) commented Oct 30, 2019

Yep, that was my point 😊

alokprasad (Author) commented:

My use case is wakeword + speech: my system feeds (streaming) audio to DeepSpeech to detect the wakeword, and as soon as it is detected, from the next frame onwards it feeds the audio to another instance of DeepSpeech.
But if the gap between the wakeword and the speech is very small, the initial words are missed.

e.g. "Lucifer, why should one hold on the way" => DS will recognise it as "should one hold on the way"
What is the suggestion: should I change the feeder code, or, for augmentation, remove the silence or in fact trim some audio (a few frames) and retrain?

lissyx (Collaborator) commented Oct 31, 2019

@alokprasad I guess in your case it might be better if you change your feeding code, yep.
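A minimal sketch of what that feeding change could look like (the feed_to_asr callback, chunk size, and buffer length are placeholders, not DeepSpeech's actual API): keep a short rolling pre-roll of recent audio and replay it into the recognizer when the wakeword fires, so the first words after the wakeword are not clipped.

```python
# Sketch of a pre-roll buffer for a wakeword -> ASR handoff.
# feed_to_asr() is a hypothetical callback wrapping whatever streaming call
# the installed DeepSpeech version exposes; chunk/buffer sizes are guesses.
from collections import deque

SAMPLE_RATE = 16000
CHUNK = 320                       # 20 ms of 16 kHz mono audio, in samples
PRE_ROLL_CHUNKS = 15              # keep ~300 ms of history

pre_roll = deque(maxlen=PRE_ROLL_CHUNKS)
asr_active = False

def on_audio_chunk(chunk, wakeword_detected, feed_to_asr):
    """Call once per captured chunk; flush the pre-roll when the wakeword fires."""
    global asr_active
    if asr_active:
        feed_to_asr(chunk)
        return
    pre_roll.append(chunk)
    if wakeword_detected:
        asr_active = True
        for buffered in pre_roll:  # replay the history so nothing is lost
            feed_to_asr(buffered)
        pre_roll.clear()
```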

reuben (Contributor) commented Nov 13, 2019

librosa has some silence trimming functionality that could be useful for cleaning up a dataset that has too much silence, if that's what's affecting model performance: https://librosa.github.io/librosa/generated/librosa.effects.trim.html
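For instance, a quick sketch of how that could be applied to a dataset file (the top_db threshold and file names are placeholder values, not recommendations):

```python
# Sketch: trim leading/trailing silence from a training wav with librosa.
# top_db and the file names are placeholder values.
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
sf.write("clip_trimmed.wav", y_trimmed, sr)
```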

alokprasad (Author) commented:

@reuben The amount of silence here is very small; I'm not sure the above issue would be resolved even after removing silence. Probably, with augmentation, we have to chop a few initial frames of some samples during training, making the ASR robust to this.
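Something along those lines could be approximated offline before training (a sketch only; the trim range is an arbitrary assumption, and this is not an existing DeepSpeech augmentation flag):

```python
# Sketch: randomly chop a few initial milliseconds off training wavs so the
# model also sees utterances that start right at speech onset.
# The 0-50 ms range is an arbitrary assumption.
import random
import scipy.io.wavfile as wavfile

def random_head_trim(in_path, out_path, max_trim_ms=50):
    rate, audio = wavfile.read(in_path)
    trim = random.randint(0, int(rate * max_trim_ms / 1000))
    wavfile.write(out_path, rate, audio[trim:])
```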

alokprasad (Author) commented:

@reuben, if the silence is all zeros, DeepSpeech does not work.
E.g., for the utterance "Go back", DeepSpeech gives "back";
if 100 ms of all-zero silence is added, it again gives the result "back";
but if we add random values from 0 to 255 for the 100 ms of silence, the result comes out perfectly as "go back".

Audacity does a similar thing: it adds some sort of dithering, and with that, surprisingly, DeepSpeech works better.
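A minimal sketch of that workaround (prepending 100 ms of low-level random values instead of all-zero silence; the file names are placeholders):

```python
# Sketch: prepend 100 ms of low-amplitude random noise (values 0-255) instead
# of all-zero silence before feeding the wav to DeepSpeech.
import numpy as np
import scipy.io.wavfile as wavfile

rate, audio = wavfile.read("go_back.wav")                     # 16 kHz int16 mono
noise = np.random.randint(0, 256, size=rate // 10).astype(audio.dtype)
wavfile.write("go_back_dithered.wav", rate, np.concatenate([noise, audio]))
```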

problemSolvingProgramming commented:

Hello. When I downloaded DeepSpeech and ran it on Windows, it unfortunately converted speech to text poorly.

For example, when I say:
HI ----> i
you are -> you are

hello -> halow

How can I increase the accuracy or efficiency of the speech-to-text conversion?

I just want to use the DeepSpeech model as-is; I do not want to train on any datasets. Is there a way or not?

When I speak a word through a microphone, do I already have to make certain settings in the Windows environment?

lissyx (Collaborator) commented Nov 7, 2020

> [quotes @problemSolvingProgramming's comment above in full]

Please stop your spam on existing GitHub issues and use Discourse for support after reading the documentation.

GSchowalter commented:

> [quotes @alokprasad's earlier comment about all-zero vs. random-value silence]

How are you adding silence to the mic stream?
