-
Right now we get a range printed, e.g.:
-
It would be a killer option to use in many, many cases.
-
(Duplicate of #3) Getting word-level timestamps is not directly supported, but it could be possible using the predicted distribution over the timestamp tokens or the cross-attention weights. Currently, the predicted timestamps tend to be biased towards integers, and there are some failure modes where the timestamps can be constantly shifted, making reliable word-level timestamp prediction difficult. Once this is solved by us or the community, I agree that it'd be a great addition to this repo.
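To make the "predicted distribution over the timestamp tokens" idea concrete, here is a minimal sketch, not the repo's implementation. It assumes you already have the logits for the timestamp tokens at one decoding step, and that consecutive timestamp tokens are 0.02 s apart; the function names and the `step_seconds` parameter are hypothetical. Instead of taking the single most likely token, it takes the probability-weighted mean, which can land between quantization steps.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_timestamp(timestamp_logits, step_seconds=0.02):
    """Time of the single most likely timestamp token (first one on ties)."""
    best = max(range(len(timestamp_logits)), key=timestamp_logits.__getitem__)
    return best * step_seconds


def expected_timestamp(timestamp_logits, step_seconds=0.02):
    """Probability-weighted mean time over all timestamp tokens.

    step_seconds is an assumed spacing between timestamp tokens.
    """
    probs = softmax(timestamp_logits)
    return sum(i * step_seconds * p for i, p in enumerate(probs))


# Toy example: the model is torn between token 50 and token 51.
logits = [-1e9] * 100
logits[50] = 0.0
logits[51] = 0.0
print(argmax_timestamp(logits), expected_timestamp(logits))
```

Whether the expected value is actually more reliable than the argmax depends on how well calibrated the distribution is, and it would not by itself fix the constant-shift failure mode described above.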
-
Could something like this work? https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition/speech-to-text-align.html
-
I agree that such a feature would be tremendously awesome!
-
I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
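For readers who want a feel for how cross-attention can be turned into timestamps, here is a rough sketch of one common approach, dynamic time warping (DTW) over a token-by-frame attention matrix. It is not necessarily what the notebook does. It assumes the attention weights are already extracted, scaled to [0, 1], and averaged into one matrix, and that each audio frame covers 0.02 s; `dtw_path`, `token_start_times`, and `seconds_per_frame` are hypothetical names.

```python
def dtw_path(cost):
    """Return the cheapest monotonic path through cost[token][frame]."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest total cost of aligning the first i tokens to the first j frames
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Walk back from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


def token_start_times(attention, seconds_per_frame=0.02):
    """Start time of each token: the first frame the DTW path assigns to it.

    attention[token][frame] is assumed to lie in [0, 1]; high attention
    means low alignment cost. seconds_per_frame is an assumed frame length.
    """
    cost = [[1.0 - a for a in row] for row in attention]
    starts = {}
    for token, frame in dtw_path(cost):
        starts.setdefault(token, frame * seconds_per_frame)
    return [starts[t] for t in range(len(attention))]


# Toy attention: 3 tokens, 6 frames, each token attends to 2 consecutive frames.
attention = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
print(token_start_times(attention))
```

Real attention maps are much noisier than this toy matrix, so in practice you would likely need to pick which heads to use and smooth or normalize the weights before alignment. Token times would then still have to be merged into word times.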