Replies: 4 comments
-
Line 651 in d18e9ea: I think one of the lines here should have what you are looking for.
-
This was discussed in #3.
-
I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
-
I improved the solution of @jongwook (thanks to him for the very useful notebook!) to be able to get cross-attention weights on the fly while the model is run on segments of audio. Also, I think that in the current version of the notebook by @jongwook there is an undesired shift of one token (the cross-attention weights computed on a given input token are relevant for the prediction of the next token). Please find my code here: https://github.com/Jeronymous/whisper-timestamped
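For readers who just want the word-level times, whisper-timestamped's result follows the usual Whisper output shape with an extra per-word list inside each segment. A minimal sketch of flattening it, assuming the `"words"` / `"start"` / `"end"` field names described in that repository's README (the `demo` dict below is a mocked result, not real model output):

```python
# Sketch: flatten word-level timestamps from a whisper-timestamped-style
# result dict. Field names ("segments", "words", "text", "start", "end")
# are taken from the whisper-timestamped README and should be verified
# against the version you install.
def collect_words(result):
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append((w["text"], w["start"], w["end"]))
    return words

# Mocked result of the documented shape, for illustration only:
demo = {"segments": [{"words": [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]}]}
```

Calling `collect_words(demo)` yields `[("hello", 0.0, 0.4), ("world", 0.5, 0.9)]`.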
-
I was trying a simple transcription to see if I could locate timestamps for individual words/tokens in the result, but I don't see them in the output. Is it possible to get this from the model?
I am reading these lines:
- whisper/whisper/tokenizer.py, lines 150 to 151 in d18e9ea
- whisper/whisper/tokenizer.py, lines 191 to 192 in d18e9ea
So in this example (one segment from a test audio), the numbers marked in red should be timestamps (in samples)? There are 4 words in the segment text, but only 3 parts in between the 4 timestamps. Any help/hint would be much appreciated.
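For context on those tokenizer lines: Whisper's special timestamp tokens encode time as a fixed step per token (0.02 seconds per step in the released models), relative to the first timestamp token (`<|0.00|>`). A minimal sketch of that conversion, where the concrete `timestamp_begin` value is an assumption for the multilingual vocabulary and should be read from `tokenizer.timestamp_begin` in practice:

```python
# Sketch: convert a Whisper timestamp token id to seconds.
# Assumptions: timestamp tokens are consecutive ids starting at
# `timestamp_begin` (<|0.00|>), each step being 0.02 s.
def timestamp_token_to_seconds(token_id, timestamp_begin):
    return (token_id - timestamp_begin) * 0.02

# Illustration with a hypothetical begin id of 50364 (check your
# tokenizer's actual value): id 50414 would be 50 steps in, i.e. 1.0 s.
seconds = timestamp_token_to_seconds(50414, 50364)
```

Note also that segment-level timestamps are in seconds (or token steps), not audio samples, so the "numbers in red" are most likely timestamp token ids rather than sample indices.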