
VAD does not handle almost complete silence #74

Closed
freddyertl opened this issue Apr 13, 2023 · 23 comments
Labels
bug Something isn't working

Comments

@freddyertl

In the attached sample, there is almost perfect silence at the beginning. Still, there are hallucinated words.

whisper_timestamped jon.wav --model medium.en --language en --verbose True --accurate --output_dir . --output_format txt,json --vad True --detect_disfluencies True

jon.zip

@Jeronymous
Member

Unless I am missing something, there is not much we can do about it...
Silero VAD is wrong: it returns speech segments in the first 5 minutes, where there is in fact nothing.
Namely these segments (in seconds):

[
        {'start': 63.33, 'end': 71.646},
        {'start': 72.738, 'end': 122.942},
        {'start': 124.258, 'end': 133.502},
        {'start': 136.194, 'end': 157.406},
        {'start': 158.402, 'end': 210.75},
        {'start': 211.81, 'end': 242.558},
        {'start': 244.866, 'end': 263.294},
        {'start': 264.706, 'end': 267.966}
]

I'll check if this can be improved by tuning some parameters of the VAD.

@freddyertl
Author

I have also played with the threshold parameter, but for whatever reason it didn't solve the problem. If it cannot be improved with other parameters for Silero VAD, I would measure the energy in each segment that Silero returned as speech and remove those where the level is below a certain threshold. It's almost funny that in order to get rid of Whisper hallucinations we have some Silero hallucination.
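The energy-based filtering idea above could be sketched like this (a minimal illustration, assuming float audio in [-1, 1] and Silero-style segment dicts; the function name and the -40 dBFS default are mine, not part of whisper-timestamped):

```python
import numpy as np

def filter_low_energy_segments(audio, sr, segments, db_threshold=-40.0):
    """Drop VAD segments whose RMS level is below db_threshold dBFS.

    audio: float waveform in [-1, 1]; segments: dicts with 'start'/'end'
    in seconds (the format Silero VAD returns in the thread above).
    """
    kept = []
    for seg in segments:
        chunk = audio[int(seg["start"] * sr):int(seg["end"] * sr)]
        if len(chunk) == 0:
            continue
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        db = 20 * np.log10(max(rms, 1e-10))  # dB relative to full scale
        if db >= db_threshold:
            kept.append(seg)
    return kept
```

The threshold would need tuning per recording setup; near-digital-silence regions like the one in jon.wav sit far below -40 dBFS, so even a rough cutoff separates them from real speech.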

@Jeronymous
Member

ahah indeed, all neural nets hallucinating.

I looked at the probabilities of the Silero neural net, and it turns out they are completely unreliable in regions where the input audio is almost 0 (see figure below).
There is no local normalization preprocessing, so it looks as if the recurrent neural network is (internally) amplifying tiny variations in the audio signal.

A solution can be to zero out parts of the signal that are "almost zero for some time".
I tested it, and it works. But it's awkward to make that a general solution.
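The zeroing-out idea could be generalized along these lines (an illustrative sketch only; the amplitude threshold and minimum duration are assumptions, not values from the project):

```python
import numpy as np

def zero_near_silent(audio, sr, amp_threshold=1e-3, min_dur=0.5):
    """Zero out stretches where |audio| stays below amp_threshold for at
    least min_dur seconds, so the VAD sees true zeros there instead of
    tiny fluctuations it might amplify.
    """
    out = audio.copy()
    quiet = np.abs(audio) < amp_threshold
    min_len = int(min_dur * sr)
    # Boundaries of runs of consecutive quiet/non-quiet samples.
    edges = np.flatnonzero(np.diff(quiet.astype(np.int8)))
    starts = np.r_[0, edges + 1]
    ends = np.r_[edges + 1, len(audio)]
    for s, e in zip(starts, ends):
        if quiet[s] and (e - s) >= min_len:
            out[s:e] = 0.0
    return out
```

This is essentially the manual fix described above (zeroing the known-silent first minutes), applied automatically to every long near-zero run.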

A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using silero VAD.
Or just using auditok (not silero VAD)... Because I feel that Whisper is more robust to noise/music than to silence.

[figure: Silero VAD speech probabilities on the audio, showing spurious detections over the near-silent region]

@freddyertl
Author

For me it makes perfect sense to have a first pass which detects near-silence. Then the noisy parts can be processed by a model. By the way, what does the no_speech_threshold/logprob_threshold stuff do? It sounds like it would also deal with silence.

@Jeronymous
Member

no_speech_threshold is used to remove segments that the Whisper model detects as silence (it has some learnt VAD capabilities). But it's tricky to use.

In my experience logprob_threshold is not doing much.

@freddyertl
Author

You mentioned that you have something working. If you like I can play with it to see if it works in other samples.

@Jeronymous
Member

You probably refer to:

A solution can be to zero out parts of the signal that are "almost zero for sometime".
I tested, it works. But it's awkward to make that a general solution.

I just meant I manually zeroed out the first 5 minutes of audio, knowing that it was an "almost zero" part.
And it seemed to solve the issue.
I could build a more general solution, to zero out the "almost zero" parts in general, but I find it a bit awkward...

@Jeronymous
Member

First, something important I forgot to mention in this thread: you can use the --plot option to plot the results of the VAD
(it will also plot alignment results segment by segment, so this is for debugging; you might want to run it and stop it at some point).

I created a branch with an attempt to integrate auditok instead of silero VAD.
The branch is called feature/auditok_vad.
@freddyertl You can play with this if you want, and post your comment here, or on this pull request: https://github.com/linto-ai/whisper-timestamped/tree/feature/auditok_vad

@freddyertl
Author

freddyertl commented Apr 14, 2023

Thanks, great that you could do it so quickly. I did a first round of testing and it produces correct results where Silero VAD had problems. It seems that this energy-based approach is a better fit for Whisper, because we have a silence problem rather than a noise problem. I will feed in more samples.

@traidn

traidn commented Apr 17, 2023

I also came across a bad VAD prediction on a completely silent recording. Maybe this problem can be partially solved with Pydub's silence detection? I think this could be the first operation before the VAD, eventually combining the time intervals from this module with the VAD predictions.
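A rough, dependency-free sketch of what such a silence pre-pass could look like, similar in spirit to pydub's detect_silence (the numpy version below, including its parameter defaults, is just an illustration of the idea, not pydub's actual implementation):

```python
import numpy as np

def detect_silence_ms(audio, sr, min_silence_len=300,
                      silence_thresh_db=-50.0, frame_ms=10):
    """Return [start_ms, end_ms] intervals where frame-level RMS stays
    below silence_thresh_db for at least min_silence_len milliseconds.
    audio: float waveform in [-1, 1].
    """
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    silent = []
    for i in range(n):
        chunk = audio[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        silent.append(20 * np.log10(max(rms, 1e-10)) < silence_thresh_db)
    # Collect runs of silent frames that are long enough.
    ranges, start = [], None
    for i, s in enumerate(silent + [False]):  # sentinel closes a final run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame_ms >= min_silence_len:
                ranges.append([start * frame_ms, i * frame_ms])
            start = None
    return ranges
```

The returned intervals could then be subtracted from (or intersected with) whatever the VAD predicts, as suggested above.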

@Jeronymous
Member

Thanks @traidn for spotting another VAD method.
If you have a chance, when you see VAD issues, you can maybe try the feature/auditok_vad branch of whisper-timestamped.

@dgoryeo

dgoryeo commented Apr 19, 2023

Hi @Jeronymous and @freddyertl , I'm getting an error:

AttributeError: module 'auditok' has no attribute 'split'

Have you come across a similar error by any chance?

@Jeronymous
Member

No, I have version 0.2.0 of auditok.
What does pip show auditok say for you?

@dgoryeo

dgoryeo commented Apr 19, 2023

That was it -- I upgraded to 0.2.0 and it went through. Thanks!

@dgoryeo

dgoryeo commented Apr 20, 2023

Is there a way for me to verify which branch my whisper_timestamped is installed from? I believe I have finalised the installation from the auditok branch, but I just need to make sure. Thanks!

@Jeronymous
Member

You can call whisper_timestamped --version (or in python whisper_timestamped.__version__).
If it's 1.12.17, you're on the auditok branch.

Jeronymous added the "bug" label Nov 15, 2023
@IntendedConsequence

(quoting Jeronymous's earlier comment above, about Silero's probabilities being unreliable where the input audio is almost zero)

This looks to me like the exact same issue I encountered when I upgraded from silero-vad V3 to V4. I went back to V3 and have had no problems since. Context: I have used it to remove non-spoken parts from thousands of different podcasts, stream VODs and YouTube audio over the last few years, to listen on the go.

@dgoryeo

dgoryeo commented Nov 18, 2023

@IntendedConsequence how do you go back to Silero v3? Is it by pointing the repo_or_dir in this call:

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

Thanks.

@Jeronymous
Member

Thank you @IntendedConsequence for sharing your experience!
I want to test v3, but then I have the same question as @dgoryeo ... I don't know how to point to the v3.1 model using torch.hub.load

@Jeronymous
Member

OK, I found a way, with torch.hub.load(repo_or_dir='snakers4/silero-vad:v3.1', ...),
but there is a very inconvenient thing happening: #142 (comment)
@IntendedConsequence can you please have a look at that comment in the PR?

@IntendedConsequence

IntendedConsequence commented Nov 19, 2023

@dgoryeo @Jeronymous I addressed your questions in the PR comment link. Copying here for context and so you don't have to pointer-chase it

I found a commit that addresses this issue in the silero repository. But judging from the commit dates, it seems to have been merged after the default switched to v4.0? I don't know what the best option is here. I personally don't use the silero repo anymore. Because I wanted a near-instant inference start on demand (to skip non-speech in my local mpv player from any playback position), I switched to a self-contained minimal C program that calls into onnxruntime's C API in a DLL. I just pipe the audio from ffmpeg and it immediately returns the timestamps. Switching silero versions for me was just a matter of renaming the model file and adjusting the onnxruntime API (V4, IIRC, changed an output tensor dimension).

snakers4/silero-vad@df1d520

@dgoryeo

dgoryeo commented Nov 19, 2023

Thanks @IntendedConsequence !

@Jeronymous
Member

Since version 1.14.1, several VAD methods can be used.

The same default method is used when --vad is True, but one can also specify:

  • --vad="auditok", or
  • former versions of silero, e.g. --vad="silero:3.1"

This is documented in the README
