Getting time offsets of beginning and end of each word #3
-
Hello, I was wondering if it would be possible to get time offsets for the start and end of each word/sentence as they appear in the audio. Motivation: I was exploring Google's https://cloud.google.com/speech-to-text/docs/async-time-offsets and thought it would be great if Whisper could produce a similar dataset.
Replies: 8 comments · 44 replies
-
Looks like it's already supported. See the LibriSpeech notebook (or the Colab example); the option is passed as `options = whisper.DecodingOptions(language="en", without_timestamps=True)`. Setting the flag to False should return what you want.
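For anyone landing here, a minimal sketch of what that looks like end to end, assuming the standard `transcribe()` API and a placeholder `audio.mp3`:

```python
import whisper

model = whisper.load_model("base")

# transcribe() forwards decoding options; timestamps are on by default,
# so without_timestamps=False is just being explicit here
result = model.transcribe("audio.mp3", without_timestamps=False)

# each segment carries phrase-level (not word-level) start/end times in seconds
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} --> {segment['end']:7.2f}] {segment['text']}")
```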
-
We are sampling timestamp tokens mixed with text tokens, which provides phrase-level timestamps.
Word-level timestamps are not directly supported, but they could be obtained using the predicted distribution over the timestamp tokens or the cross-attention weights.
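For anyone exploring the first suggestion: in Whisper's vocabulary, every token at or above `tokenizer.timestamp_begin` encodes a time in 0.02 s increments, so the distribution over that slice of the logits is where finer-grained timing would come from. A rough sketch, assuming you have intercepted the per-step logits yourself:

```python
import torch
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

def timestamp_distribution(step_logits: torch.Tensor) -> torch.Tensor:
    # step_logits: (vocab_size,) logits for a single decoding step;
    # tokens >= timestamp_begin encode times as multiples of 0.02 s
    return step_logits[tokenizer.timestamp_begin:].softmax(dim=-1)

def timestamp_token_to_seconds(token: int) -> float:
    return (token - tokenizer.timestamp_begin) * 0.02
```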
-
I hacked together a script today using Whisper with Wav2Vec2 forced alignment (https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) that generates word-level SRT captions. Feel free to play with it and modify it; I might spend a couple more hours cleaning it up and making it more robust, but I'm leaving it here for now: https://github.com/johnafish/whisperer
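For anyone who wants the gist without reading the whole tutorial, here is a rough sketch of the alignment step. It uses the ready-made `forced_align`/`merge_tokens` helpers that newer torchaudio versions expose (the linked tutorial builds the trellis by hand instead); `audio.wav` and the transcript are placeholders:

```python
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # CTC label set; index 0 is the blank token

waveform, sr = torchaudio.load("audio.wav")
waveform = F.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)  # (1, n_frames, n_labels)
log_probs = torch.log_softmax(emission, dim=-1)

# the transcript (e.g. from whisper) mapped onto wav2vec2's label set;
# '|' is the word separator in this label set
transcript = "HELLO|WORLD"
dictionary = {c: i for i, c in enumerate(labels)}
targets = torch.tensor([[dictionary[c] for c in transcript]])

aligned_tokens, scores = F.forced_align(log_probs, targets, blank=0)
spans = F.merge_tokens(aligned_tokens[0], scores[0], blank=0)

# convert frame indices back to seconds
seconds_per_frame = waveform.size(1) / log_probs.size(1) / bundle.sample_rate
for span in spans:
    print(labels[span.token], span.start * seconds_per_frame, span.end * seconds_per_frame)
```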
-
Update with full script: https://github.com/jianfch/stable-ts

You can actually get the timestamp prediction for each word, because it's part of the predictions, but it's filtered out and reserved for the start-time and end-time tokens. That means you can clone the logits before the filtering, then return them along with the other results. Add the lines marked with "# <----add this" in `decoding.DecodingTask._main_loop`; there are a couple more methods you'll need to return through to get back up to `transcribe()`.
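Roughly, the idea looks like this (a sketch against `whisper/decoding.py` as it was at the time; names and surrounding code vary by version, so treat it as illustrative rather than a drop-in patch):

```python
# inside DecodingTask._main_loop
timestamp_logits = []                                      # <----add this

for i in range(self.sample_len):
    logits = self.inference.logits(tokens, audio_features)
    logits = logits[:, -1]

    # clone before the logit filters suppress timestamp tokens mid-text
    timestamp_logits.append(                               # <----add this
        logits[:, self.tokenizer.timestamp_begin:].clone()
    )

    for logit_filter in self.logit_filters:
        logit_filter.apply(logits, tokens)

    tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
    if completed or tokens.shape[-1] > self.n_ctx:
        break

return tokens, sum_logprobs, no_speech_probs, timestamp_logits  # <----add this
```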
-
Came here to ask this, because the phrase-level timestamps are wildly inaccurate, at least with the medium model I used. I tried to transcribe a podcast with three speakers, each with their own discrete audio track. So, three transcripts, which are then synced together into one script. This is the method I usually use, and it works quite well in the other transcription service I use: you always know who is speaking, and it's transcribing from clean audio with only ever one speaker at a time. But currently, this is unfeasible with Whisper. There are frequent gaps of several seconds before or after the given phrase in a segment, making it impossible to know when any of the words in the segment were actually spoken, so syncing the three tracks cannot be done out of the box yet.
-
@q00u @Jxspa actually it can be done, see here: https://github.com/m-bain/whisperX
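A minimal sketch of the whisperX flow, based on its README at the time (the API may have changed since):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. transcribe with whisper
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. align the output with a phoneme model to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result["word_segments"])
```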
-
In addition to what others have introduced, I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
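The core trick in that demo is aligning decoded tokens to audio frames with dynamic time warping over the cross-attention weights. A stripped-down sketch of that step, assuming you have already extracted and averaged the cross-attention weights into a (tokens × frames) matrix yourself (getting that matrix out of the model is the fiddly part the notebook handles):

```python
import numpy as np
from dtw import dtw  # pip install dtw-python

def token_start_times(attention: np.ndarray) -> np.ndarray:
    """attention: (n_tokens, n_frames) cross-attention weights, averaged
    over the selected heads/layers. Returns approximate start times in
    seconds; each Whisper encoder frame covers 0.02 s of audio."""
    # dtw-python accepts a local cost matrix directly; negate so that
    # high attention means low alignment cost
    alignment = dtw(-attention.astype(np.float64))
    # pick out the frame at which the warping path first reaches each token
    jumps = np.diff(alignment.index1, prepend=-1) > 0
    return alignment.index2[jumps] * 0.02
```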
-
I still get an error: `$ whisper_timestamped videoplayback.mp3 --model tiny --language it --accurate --verbose True` (audio length is around 3 minutes) ...