
VAD does not handle almost complete silence #74

Closed
freddyertl opened this issue Apr 13, 2023 · 23 comments
Labels
bug Something isn't working

Comments

@freddyertl

In the attached sample, there is almost perfect silence at the beginning. Still, there are hallucinated words.

whisper_timestamped jon.wav --model medium.en --language en --verbose True --accurate --output_dir . --output_format txt,json --vad True --detect_disfluencies True

jon.zip

@Jeronymous
Member

Unless I am missing something, there is not much we can do about it...
Silero VAD is wrong: it returns speech segments in the first 5 minutes, where there is in fact nothing.
Namely these segments (in seconds):

[
        {'start': 63.33, 'end': 71.646},
        {'start': 72.738, 'end': 122.942},
        {'start': 124.258, 'end': 133.502},
        {'start': 136.194, 'end': 157.406},
        {'start': 158.402, 'end': 210.75},
        {'start': 211.81, 'end': 242.558},
        {'start': 244.866, 'end': 263.294},
        {'start': 264.706, 'end': 267.966}
]

I'll check if this can be improved by tuning some parameters of the VAD.

@freddyertl
Author

I have also played with the threshold parameter, but for whatever reason it didn't solve the problem. If it cannot be improved with other parameters for Silero VAD, I would measure the energy in each segment that Silero returned as speech and remove those where the level is below a certain threshold. It's almost funny that in order to get rid of Whisper hallucinations we have some Silero hallucination.
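The energy-based filtering idea above could be sketched like this (a minimal illustration, assuming float audio in [-1, 1] and Silero-style segment dicts; the function name and the -40 dBFS default are mine, not part of whisper-timestamped):

```python
import numpy as np

def filter_low_energy_segments(audio, sr, segments, db_threshold=-40.0):
    """Drop VAD segments whose RMS level is below db_threshold dBFS.

    audio: float waveform in [-1, 1]; segments: dicts with 'start'/'end'
    in seconds (the format Silero VAD returns in the thread above).
    """
    kept = []
    for seg in segments:
        chunk = audio[int(seg["start"] * sr):int(seg["end"] * sr)]
        if len(chunk) == 0:
            continue
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        db = 20 * np.log10(max(rms, 1e-10))  # dB relative to full scale
        if db >= db_threshold:
            kept.append(seg)
    return kept
```

The threshold would need tuning per recording setup; near-digital-silence regions like the one in jon.wav sit far below -40 dBFS, so even a rough cutoff separates them from real speech.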

@Jeronymous
Member

ahah indeed, all neural nets hallucinating.

I looked at the probabilities of the Silero neural net, and it turns out they are completely unreliable in regions where the input audio is almost 0 (see figure below).
There is no local normalization preprocessing, so it looks as if the recurrent neural network is (internally) amplifying tiny variations in the audio signal.

A solution can be to zero out parts of the signal that are "almost zero for some time".
I tested it, and it works. But it's awkward to make that a general solution.
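The zeroing-out idea could be generalized along these lines (an illustrative sketch only; the amplitude threshold and minimum duration are assumptions, not values from the project):

```python
import numpy as np

def zero_near_silent(audio, sr, amp_threshold=1e-3, min_dur=0.5):
    """Zero out stretches where |audio| stays below amp_threshold for at
    least min_dur seconds, so the VAD sees true zeros there instead of
    tiny fluctuations it might amplify.
    """
    out = audio.copy()
    quiet = np.abs(audio) < amp_threshold
    min_len = int(min_dur * sr)
    # Boundaries of runs of consecutive quiet/non-quiet samples.
    edges = np.flatnonzero(np.diff(quiet.astype(np.int8)))
    starts = np.r_[0, edges + 1]
    ends = np.r_[edges + 1, len(audio)]
    for s, e in zip(starts, ends):
        if quiet[s] and (e - s) >= min_len:
            out[s:e] = 0.0
    return out
```

This is essentially the manual fix described above (zeroing the known-silent first minutes), applied automatically to every long near-zero run.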

A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using silero VAD.
Or just using auditok (not silero VAD)... Because I feel that Whisper is more robust to noise/music than to silence.

[figure: Silero VAD speech probabilities on the audio, showing spurious detections over the near-silent region]

@freddyertl
Author

For me it makes perfect sense to have a first pass which detects near-silence. Then the noisy parts can be processed by a model. By the way, what does the no_speech_threshold/logprob_threshold stuff do? It sounds like it would also deal with silence.

@Jeronymous
Member

no_speech_threshold is used to remove segments that the Whisper model detects as silence (it has some learnt VAD capabilities). But it's tricky to use.

In my experience logprob_threshold is not doing much.

@freddyertl
Author

You mentioned that you have something working. If you like I can play with it to see if it works in other samples.

@Jeronymous
Member

You probably refer to:

A solution can be to zero out parts of the signal that are "almost zero for sometime".
I tested, it works. But it's awkward to make that a general solution.

I just meant I manually zeroed out the first 5 minutes of audio, knowing that it was an "almost zero" part.
And it seemed to solve the issue.
I could build a more general solution, to zero out the "almost zero" parts in general, but I find it a bit awkward...

@Jeronymous
Member

First, something important I forgot to mention in this thread: you can use the --plot option to plot the results of the VAD
(it will also plot alignment results segment by segment, so this is for debugging; you might want to run it and stop it at some point).

I created a branch with an attempt to integrate auditok instead of silero VAD.
The branch is called feature/auditok_vad.
@freddyertl You can play with this if you want, and post your comment here, or on this pull request: https://github.com/linto-ai/whisper-timestamped/tree/feature/auditok_vad

@freddyertl
Author

freddyertl commented Apr 14, 2023

Thanks, great that you could do it so quickly. I did a first round of testing and it produces correct results where Silero VAD had problems. It seems that this energy-based approach is a better fit for Whisper, because we have a silence problem rather than a noise problem. I will feed in more samples.

@traidn

traidn commented Apr 17, 2023

I also came across a bad VAD prediction on a completely silent recording. Maybe this problem can be partially solved with Pydub's silence detection? I think this could be the first operation before the VAD, eventually combining the time intervals from this module with the VAD predictions.
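A rough, dependency-free sketch of what such a silence pre-pass could look like, similar in spirit to pydub's detect_silence (the numpy version below, including its parameter defaults, is just an illustration of the idea, not pydub's actual implementation):

```python
import numpy as np

def detect_silence_ms(audio, sr, min_silence_len=300,
                      silence_thresh_db=-50.0, frame_ms=10):
    """Return [start_ms, end_ms] intervals where frame-level RMS stays
    below silence_thresh_db for at least min_silence_len milliseconds.
    audio: float waveform in [-1, 1].
    """
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    silent = []
    for i in range(n):
        chunk = audio[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        silent.append(20 * np.log10(max(rms, 1e-10)) < silence_thresh_db)
    # Collect runs of silent frames that are long enough.
    ranges, start = [], None
    for i, s in enumerate(silent + [False]):  # sentinel closes a final run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame_ms >= min_silence_len:
                ranges.append([start * frame_ms, i * frame_ms])
            start = None
    return ranges
```

The returned intervals could then be subtracted from (or intersected with) whatever the VAD predicts, as suggested above.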

@Jeronymous
Member

Thanks @traidn for spotting another VAD method.
If you have a chance, when you see VAD issues, you can maybe try the feature/auditok_vad branch of whisper-timestamped.

@dgoryeo

dgoryeo commented Apr 19, 2023

Hi @Jeronymous and @freddyertl , I'm getting an error:

AttributeError: module 'auditok' has no attribute 'split'

Have you come across a similar error by any chance?

@Jeronymous
Member

No, I have version 0.2.0 of auditok.
What does pip show auditok say for you?

@dgoryeo

dgoryeo commented Apr 19, 2023

That was it -- I upgraded to 0.2.0 and it went through. Thanks!

@dgoryeo

dgoryeo commented Apr 20, 2023

Is there a way for me to verify which branch my whisper_timestamped is installed from? I believe I have finalised the installation from the auditok branch, but I just need to make sure. Thanks!

@Jeronymous
Member

You can call whisper_timestamped --version (or in python whisper_timestamped.__version__).
If it's 1.12.17, you're on the auditok branch.

Jeronymous added the "bug" label Nov 15, 2023
@IntendedConsequence

(quoting Jeronymous's earlier comment above, about Silero's probabilities being unreliable where the input audio is almost zero)

This looks to me like the exact same issue I encountered when I upgraded from silero-vad V3 to V4. I went back to V3 and have had no problems since. Context: I have used it to remove non-spoken parts from thousands of different podcasts, stream VODs and YouTube audio over the last few years, to listen on the go.

@dgoryeo

dgoryeo commented Nov 18, 2023

@IntendedConsequence how do you go back to Silero v3? Is it by pointing the repo_or_dir in this call:

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

Thanks.

@Jeronymous
Member

Thank you @IntendedConsequence for sharing your experience!
I want to test v3, but then I have the same question as @dgoryeo ... I don't know how to point to the v3.1 model using torch.hub.load

@Jeronymous
Member

OK, I found a way, with torch.hub.load(repo_or_dir='snakers4/silero-vad:v3.1', ...),
but there is a very inconvenient thing happening: #142 (comment)
@IntendedConsequence can you please have a look at that comment in the PR?

@IntendedConsequence

IntendedConsequence commented Nov 19, 2023

@dgoryeo @Jeronymous I addressed your questions in the PR comment link. Copying here for context and so you don't have to pointer-chase it

I found a commit that addresses this issue in the silero repository. But judging from the commit dates, it seems to have been merged after the default switched to v4.0? I don't know what the best option is here. I personally don't use the silero repo anymore. Because I wanted a near-instant inference start on demand (to skip non-speech in my local mpv player from any playback position), I switched to a self-contained minimal C program that calls into onnxruntime's C API in a DLL. I just pipe the audio from ffmpeg and it immediately returns the timestamps. Switching silero versions for me was just a matter of renaming the model file and adjusting the onnxruntime API (V4, IIRC, changed an output tensor dimension).

snakers4/silero-vad@df1d520

@dgoryeo

dgoryeo commented Nov 19, 2023

Thanks @IntendedConsequence !

@Jeronymous
Member

Since version 1.14.1, several VAD methods can be used.

The same default method is used when --vad is True, but one can also specify:

  • --vad="auditok", or
  • former versions of silero, e.g. --vad="silero:3.1"

This is documented in the README
