VAD does not handle almost complete silence #74
Comments
Unless I am missing something, there is not so much we can do about it...
I'll check if this can be improved by tuning some parameters of the VAD. |
I have also played with the threshold parameter, but for whatever reason it didn't solve the problem. If it cannot be improved with other Silero VAD parameters, I would measure the energy in each segment that Silero returned as speech and remove those where the level is below a certain threshold. It's almost funny that in order to get rid of whisper hallucinations we now have Silero hallucinations. |
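The energy check suggested above could be sketched roughly as follows. This is only an illustration, not code from whisper_timestamped or Silero: the function names `rms_dbfs` and `drop_quiet_segments`, the `(start, end)` sample-index tuple format, and the -50 dBFS floor are all assumptions made for the example, and the samples are assumed to be floats in [-1, 1].

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale, for float samples in [-1, 1]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def drop_quiet_segments(samples, segments, min_dbfs=-50.0):
    """Keep only the (start, end) sample-index segments whose RMS level
    reaches min_dbfs. The threshold is a hypothetical value to tune."""
    return [
        (start, end)
        for start, end in segments
        if rms_dbfs(samples[start:end]) >= min_dbfs
    ]


# Toy signal: 1000 near-silent samples followed by 1000 clearly audible ones.
quiet = [1e-5 * math.sin(0.1 * n) for n in range(1000)]
loud = [0.3 * math.sin(0.1 * n) for n in range(1000)]
audio = quiet + loud

# Pretend the VAD flagged both halves as speech.
vad_segments = [(0, 1000), (1000, 2000)]
kept = drop_quiet_segments(audio, vad_segments)
```

On this toy input only the second segment survives, since the first one sits far below the floor. On real recordings the threshold would presumably need tuning per microphone and gain setting.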
Haha indeed, all neural nets hallucinate. I looked at the probabilities output by the Silero neural net, and it turns out that it's completely crap on regions where the input audio is almost 0 (see figure below). One solution could be to zero out parts of the signal that are "almost zero for some time". A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using Silero VAD. |
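For the first idea, zeroing out stretches that are "almost zero for some time", a minimal sketch might look like the one below. It is not what the branch discussed in this thread does: `zero_near_silent_runs`, `amplitude_floor`, and `min_run` are hypothetical names and values chosen for illustration (1600 samples would be 0.1 s at 16 kHz), and the samples are assumed to be floats in [-1, 1].

```python
def zero_near_silent_runs(samples, amplitude_floor=1e-3, min_run=1600):
    """Return a copy of samples where every run of at least min_run
    consecutive values with |value| < amplitude_floor is replaced by zeros.
    Shorter quiet runs (e.g. pauses inside speech) are left untouched."""
    out = list(samples)
    run_start = None
    # Iterate one step past the end so a trailing quiet run is also closed.
    for i in range(len(samples) + 1):
        is_quiet = i < len(samples) and abs(samples[i]) < amplitude_floor
        if is_quiet:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                out[run_start:i] = [0.0] * (i - run_start)
            run_start = None
    return out


# Toy signal: long near-silence, a short loud burst, then a short quiet tail.
signal = [1e-4] * 2000 + [0.5] * 10 + [1e-4] * 100
cleaned = zero_near_silent_runs(signal)
```

Here the first 2000 samples become exact zeros, while the 100-sample tail is kept as-is because it is shorter than `min_run`.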
For me it makes perfect sense to have a first pass which detects near-silence. Then the noisy parts can be processed by a model. By the way, what does the no_speech_threshold/logprob_threshold stuff do? It sounds like it would also deal with silence. |
In my experience |
You mentioned that you have something working. If you like I can play with it to see if it works in other samples. |
You probably refer to:
I just meant I manually zeroed out the first 5 minutes of audio, knowing that it was an "almost zero" part. |
First something important I forgot to mention in that thread: you can use option I created a branch with an attempt to integrate |
Thanks, great that you could do it so quickly. I did a first round of testing and it produces correct results where Silero VAD had problems. It seems that this energy-based approach is a better fit for whisper because we don't have a noise problem but a silence problem. I will feed in more samples. |
I also came across a bad VAD prediction on a completely silent recording. Maybe this problem can be partially solved with Pydub's silence detection? I think this could be the first operation before VAD, and its time intervals could eventually be combined with the VAD predictions. |
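Combining the two sets of intervals could be done by intersecting them, so that a region only counts as speech when both the silence detector and the VAD agree. The sketch below assumes both inputs are sorted, non-overlapping `(start, end)` pairs in the same unit. `intersect_intervals` is a hypothetical helper, not part of Pydub or whisper_timestamped, and the interval values are made up. Note that `pydub.silence.detect_nonsilent` returns `[start, end]` pairs in milliseconds, so they would need converting to the VAD's unit first.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Made-up example, in seconds: the VAD wrongly flags 0-3 s of silence.
vad_speech = [(0.0, 4.0), (10.0, 15.0)]
non_silent = [(3.0, 12.0)]
combined = intersect_intervals(vad_speech, non_silent)
```

With these inputs the result keeps only 3-4 s and 10-12 s, dropping the silent stretch that the VAD mislabeled.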
Thanks @traidn for spotting another VAD method. |
Hi @Jeronymous and @freddyertl , I'm getting an error:
Have you come across a similar error by any chance? |
No, I have version 0.2.0 of auditok. |
That was it -- I upgraded to 0.2.0 and it went through. Thanks! |
Is there a way for me to verify which branch my whisper_timestamped is installed from? I believe I have finalised the installation from auditok branch but just need to make sure. Thanks! |
You can call |
This looks to me like the exact same issue I encountered when I upgraded from silero-vad V3 to V4. Went back to V3 and no problems since then. Context: used it to remove non-spoken parts from thousands of different podcasts, stream vods and youtube audio over the last few years to listen on the go. |
@IntendedConsequence how do you go back to Silero v3? Is it by pointing the repo_or_dir in this call:
Thanks. |
Thank you @IntendedConsequence for sharing your experience! |
OK I found a way, with |
@dgoryeo @Jeronymous I addressed your questions in the PR comment link. Copying here for context and so you don't have to pointer-chase it
|
Thanks @IntendedConsequence ! |
Since version 1.14.1, several VAD methods can be used. The same default method is used if vad is True but one can specify:
This is documented in the README |
In the attached sample, there is almost perfect silence at the beginning. Still there are hallucinated words.
whisper_timestamped jon.wav --model medium.en --language en --verbose True --accurate --output_dir . --output_format txt,json --vad True --detect_disfluencies True
jon.zip