When does Whisper decide to split the input audio? #629
Unanswered
Ca-ressemble-a-du-fake
asked this question in
Q&A
Whisper is context-aware, so feeding it the entire audio works better. Merge the audio into a single file and provide it that way.
Hi,
I need to split a big audio file into small chunks ranging from 1 to 10 s. I first tried applying VAD to the big audio file, normalizing each chunk to -27 dB, and then calling Whisper on each chunk. The transcription was good but had some errors.
I noticed that if I feed Whisper the whole big audio file, the transcription is much better (although the timestamps are rounded to the second, so the chunk generation is not as good, but this can be remedied with the stable_ts patch).
Now I have a speech recording made in a noisy environment that I want to transcribe. So I first denoise it and then pass the denoised audio to Whisper. The results are not as good as with audio that has no background noise. Therefore I would like to test whether amplifying each chunk before providing it to Whisper could improve the overall transcription quality.
That's why I would like to split the denoised big audio file at the same timestamps that Whisper would use, then normalize the resulting chunks and feed them to Whisper.
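To make the intended pipeline concrete, here is a minimal sketch of the "split at given timestamps, then normalize" step. It assumes the timestamps are available as a list of dicts with `start` and `end` in seconds (the shape of `result["segments"]` returned by `model.transcribe(...)` in the openai-whisper package), that the audio is a plain list of float samples in [-1.0, 1.0] at 16 kHz, and that "-27 dB" means an RMS level in dBFS. The function names are hypothetical and not part of Whisper.

```python
import math

def rms_dbfs(samples):
    """Return the RMS level in dBFS of float samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)

def normalize_chunk(samples, target_dbfs=-27.0):
    """Scale a chunk so its RMS level matches target_dbfs (assumed meaning of "-27 dB")."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # empty or silent chunk: nothing to scale
    gain = 10.0 ** ((target_dbfs - level) / 20.0)
    # Clip to the valid range in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

def split_by_segments(samples, segments, sample_rate=16000):
    """Cut the audio at each segment's start/end time (in seconds)."""
    chunks = []
    for seg in segments:
        start = max(0, int(round(seg["start"] * sample_rate)))
        end = min(len(samples), int(round(seg["end"] * sample_rate)))
        if end > start:
            chunks.append(samples[start:end])
    return chunks

if __name__ == "__main__":
    # Synthetic 2 s signal at 16 kHz and hypothetical segment timestamps.
    audio = [0.1, -0.1] * 16000
    segments = [{"start": 0.0, "end": 0.5}, {"start": 0.5, "end": 2.0}]
    for chunk in split_by_segments(audio, segments):
        normalized = normalize_chunk(chunk)
        print(len(normalized), round(rms_dbfs(normalized), 2))
```

Each normalized chunk could then be written to disk or passed to Whisper on its own. Whether this improves the transcription is exactly the open question here.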
I thought Whisper would output timestamps 30 s apart (as it pads or trims the audio input to 30 s chunks), but this is not the case. Nor does it provide sentence-bound timestamps.
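For reference, the pad-or-trim step mentioned above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the library's code: it assumes 16 kHz audio and a 30 s window (480,000 samples), and `fixed_windows` is a hypothetical helper showing what naive fixed 30 s splitting would look like. It only fixes the length of the model's input window and says nothing about where the output segment timestamps fall.

```python
SAMPLE_RATE = 16000      # samples per second (assumed input rate)
CHUNK_SECONDS = 30       # length of one input window in seconds
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window

def pad_or_trim(samples, length=N_SAMPLES):
    """Trim to `length` samples, or pad the end with zeros (silence)."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))

def fixed_windows(samples, length=N_SAMPLES):
    """Hypothetical helper: cut audio into back-to-back fixed-length windows."""
    return [
        pad_or_trim(samples[i:i + length], length)
        for i in range(0, len(samples), length)
    ]

if __name__ == "__main__":
    short_clip = [0.5] * (5 * SAMPLE_RATE)   # 5 s of audio
    long_clip = [0.5] * (70 * SAMPLE_RATE)   # 70 s of audio
    print(len(pad_or_trim(short_clip)))      # padded up to one window
    print(len(fixed_windows(long_clip)))     # number of fixed windows
```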
So my question is: how can I split the input audio the way Whisper would?
Hope my question is clear 😃
Thanks in advance for your help