Replies: 4 comments
-
Line 651 in d18e9ea: I think one of the lines here should have what you are looking for.
-
This was discussed in #3.
-
I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
-
I improved the solution of @jongwook (thanks to him for the very useful notebook!) to be able to get cross-attention weights on the fly while the model is run on segments of audio. Also, I think that in the current version of the notebook by @jongwook there is an undesired shift of one token (the cross-attention weights computed on a given input token are relevant for the prediction of the next token). Please find my code here: https://github.com/Jeronymous/whisper-timestamped
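For readers who just want the word-level times, whisper-timestamped's result follows the usual Whisper output shape with an extra per-word list inside each segment. A minimal sketch of flattening it, assuming the `"words"` / `"start"` / `"end"` field names described in that repository's README (the `demo` dict below is a mocked result, not real model output):

```python
# Sketch: flatten word-level timestamps from a whisper-timestamped-style
# result dict. Field names ("segments", "words", "text", "start", "end")
# are taken from the whisper-timestamped README and should be verified
# against the version you install.
def collect_words(result):
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append((w["text"], w["start"], w["end"]))
    return words

# Mocked result of the documented shape, for illustration only:
demo = {"segments": [{"words": [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]}]}
```

Calling `collect_words(demo)` yields `[("hello", 0.0, 0.4), ("world", 0.5, 0.9)]`.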
-
I was trying a simple transcription to see if I could locate timestamps for individual words/tokens in the result, but I don't see them in the output. Is it possible to get this from the model?
I am reading these lines:
- whisper/whisper/tokenizer.py, lines 150 to 151 in d18e9ea
- whisper/whisper/tokenizer.py, lines 191 to 192 in d18e9ea
So in this example (one segment from a test audio), the numbers marked in red should be timestamps (in samples)? There are 4 words in the segment text, but only 3 parts in between the 4 timestamps. Any help/hint would be much appreciated.
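For context on those tokenizer lines: Whisper's special timestamp tokens encode time as a fixed step per token (0.02 seconds per step in the released models), relative to the first timestamp token (`<|0.00|>`). A minimal sketch of that conversion, where the concrete `timestamp_begin` value is an assumption for the multilingual vocabulary and should be read from `tokenizer.timestamp_begin` in practice:

```python
# Sketch: convert a Whisper timestamp token id to seconds.
# Assumptions: timestamp tokens are consecutive ids starting at
# `timestamp_begin` (<|0.00|>), each step being 0.02 s.
def timestamp_token_to_seconds(token_id, timestamp_begin):
    return (token_id - timestamp_begin) * 0.02

# Illustration with a hypothetical begin id of 50364 (check your
# tokenizer's actual value): id 50414 would be 50 steps in, i.e. 1.0 s.
seconds = timestamp_token_to_seconds(50414, 50364)
```

Note also that segment-level timestamps are in seconds (or token steps), not audio samples, so the "numbers in red" are most likely timestamp token ids rather than sample indices.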