Getting time offsets of beginning and end of each word #3
-
Hello, I was wondering if it would be possible to get time offsets for the start and end of each word/sentence as they appear in the audio. Motivation: I was exploring Google's https://cloud.google.com/speech-to-text/docs/async-time-offsets and thought it would be great if Whisper could produce a similar dataset.
Replies: 8 comments · 44 replies
-
Looks like it's already supported. See the LibriSpeech notebook (or the Colab example); the option is passed as `options = whisper.DecodingOptions(language="en", without_timestamps=True)`. Setting the flag to False should return what you want.
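For anyone landing here, a minimal sketch of what that looks like end to end, assuming the standard `transcribe()` API and a placeholder `audio.mp3`:

```python
import whisper

model = whisper.load_model("base")

# transcribe() forwards decoding options; timestamps are on by default,
# so without_timestamps=False is just being explicit here
result = model.transcribe("audio.mp3", without_timestamps=False)

# each segment carries phrase-level (not word-level) start/end times in seconds
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} --> {segment['end']:7.2f}] {segment['text']}")
```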
-
We are sampling timestamp tokens mixed with text tokens, which provides phrase-level timestamps.
Word-level timestamps are not directly supported, but they could be obtained using the predicted distribution over the timestamp tokens or the cross-attention weights.
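For anyone exploring the first suggestion: in Whisper's vocabulary, every token at or above `tokenizer.timestamp_begin` encodes a time in 0.02 s increments, so the distribution over that slice of the logits is where finer-grained timing would come from. A rough sketch, assuming you have intercepted the per-step logits yourself:

```python
import torch
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

def timestamp_distribution(step_logits: torch.Tensor) -> torch.Tensor:
    # step_logits: (vocab_size,) logits for a single decoding step;
    # tokens >= timestamp_begin encode times as multiples of 0.02 s
    return step_logits[tokenizer.timestamp_begin:].softmax(dim=-1)

def timestamp_token_to_seconds(token: int) -> float:
    return (token - tokenizer.timestamp_begin) * 0.02
```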
-
I hacked together a script today using Whisper with Wav2Vec2 forced alignment (https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) that generates word-level SRT captions. Feel free to play with it and modify it; I might spend a couple more hours cleaning it up and making it more robust, but I'm leaving it here for now: https://github.com/johnafish/whisperer
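For anyone who wants the gist without reading the whole tutorial, here is a rough sketch of the alignment step. It uses the ready-made `forced_align`/`merge_tokens` helpers that newer torchaudio versions expose (the linked tutorial builds the trellis by hand instead); `audio.wav` and the transcript are placeholders:

```python
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # CTC label set; index 0 is the blank token

waveform, sr = torchaudio.load("audio.wav")
waveform = F.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)  # (1, n_frames, n_labels)
log_probs = torch.log_softmax(emission, dim=-1)

# the transcript (e.g. from whisper) mapped onto wav2vec2's label set;
# '|' is the word separator in this label set
transcript = "HELLO|WORLD"
dictionary = {c: i for i, c in enumerate(labels)}
targets = torch.tensor([[dictionary[c] for c in transcript]])

aligned_tokens, scores = F.forced_align(log_probs, targets, blank=0)
spans = F.merge_tokens(aligned_tokens[0], scores[0], blank=0)

# convert frame indices back to seconds
seconds_per_frame = waveform.size(1) / log_probs.size(1) / bundle.sample_rate
for span in spans:
    print(labels[span.token], span.start * seconds_per_frame, span.end * seconds_per_frame)
```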
-
Update with full script: https://github.com/jianfch/stable-ts

You can actually get the timestamp prediction for each word, because it's part of the predictions, but it's filtered out and reserved for the start-time and end-time tokens. That means you can clone the logits before the filtering, then return them along with the other results. Add the lines marked with "# <----add this" in `decoding.DecodingTask._main_loop`; there are a couple more methods you'll need to return through to get back up to `transcribe()`.
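Roughly, the idea looks like this (a sketch against `whisper/decoding.py` as it was at the time; names and surrounding code vary by version, so treat it as illustrative rather than a drop-in patch):

```python
# inside DecodingTask._main_loop
timestamp_logits = []                                      # <----add this

for i in range(self.sample_len):
    logits = self.inference.logits(tokens, audio_features)
    logits = logits[:, -1]

    # clone before the logit filters suppress timestamp tokens mid-text
    timestamp_logits.append(                               # <----add this
        logits[:, self.tokenizer.timestamp_begin:].clone()
    )

    for logit_filter in self.logit_filters:
        logit_filter.apply(logits, tokens)

    tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
    if completed or tokens.shape[-1] > self.n_ctx:
        break

return tokens, sum_logprobs, no_speech_probs, timestamp_logits  # <----add this
```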
-
Came here to ask this, because the phrase-level timestamps are wildly inaccurate, at least with the medium model I used. I tried to transcribe a podcast with three speakers, each with their own discrete audio track. So, three transcripts, which are then synced together into one script. This is the method I usually use, and it works quite well in the other transcription service I use: you always know who is speaking, and it's transcribing from clean audio with only ever one speaker at a time. But currently, this is unfeasible with Whisper. There are frequent gaps of several seconds before or after the given phrase in a segment, making it impossible to know when any of the words in the segment were actually spoken, so syncing the three tracks cannot be done out of the box yet.
-
@q00u @Jxspa actually it can be done, see here: https://github.com/m-bain/whisperX
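A minimal sketch of the whisperX flow, based on its README at the time (the API may have changed since):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. transcribe with whisper
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. align the output with a phoneme model to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result["word_segments"])
```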
-
In addition to what others have introduced, I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
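The core trick in that demo is aligning decoded tokens to audio frames with dynamic time warping over the cross-attention weights. A stripped-down sketch of that step, assuming you have already extracted and averaged the cross-attention weights into a (tokens × frames) matrix yourself (getting that matrix out of the model is the fiddly part the notebook handles):

```python
import numpy as np
from dtw import dtw  # pip install dtw-python

def token_start_times(attention: np.ndarray) -> np.ndarray:
    """attention: (n_tokens, n_frames) cross-attention weights, averaged
    over the selected heads/layers. Returns approximate start times in
    seconds; each Whisper encoder frame covers 0.02 s of audio."""
    # dtw-python accepts a local cost matrix directly; negate so that
    # high attention means low alignment cost
    alignment = dtw(-attention.astype(np.float64))
    # pick out the frame at which the warping path first reaches each token
    jumps = np.diff(alignment.index1, prepend=-1) > 0
    return alignment.index2[jumps] * 0.02
```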
-
I still get an error: `$ whisper_timestamped videoplayback.mp3 --model tiny --language it --accurate --verbose True` (audio length is around 3 minutes) ...