Possibility of forced alignment support? #52
conncomg123 started this conversation in Ideas
Replies: 2 comments
-
Discussion #3 has some implementations for getting token/word-level timestamps. Getting timestamps for each phoneme would be difficult from Whisper models alone, because the model is trained end-to-end to predict BPE tokens directly, and a token is often a full word or a subword spanning a few graphemes. An alternative would be to run an external forced-alignment tool on the outputs of a Whisper model.
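To make the external-aligner idea concrete, here is a minimal, self-contained sketch of Viterbi forced alignment. Everything in it is illustrative: the per-frame phoneme log-probability matrix, the phoneme inventory, and the 20 ms frame duration are all made up. In a real pipeline the frame-level phoneme posteriors would come from a separate acoustic model, and the target phoneme sequence from a grapheme-to-phoneme pass over the Whisper transcript.

```python
import math

def forced_align(log_probs, targets, frame_dur=0.02):
    """Viterbi forced alignment: assign every frame to one target phoneme,
    monotonically left-to-right, maximizing the total log-probability.

    log_probs: T x P list of per-frame log-probabilities over P phonemes
    targets:   the known phoneme sequence (indices into the P columns)
    Returns a list of (phoneme, start_sec, end_sec) spans.
    """
    T, N = len(log_probs), len(targets)
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(T)]    # dp[t][j]: best score at frame t, target j
    back = [[0] * N for _ in range(T)]    # 0 = stayed on j, 1 = advanced from j-1
    dp[0][0] = log_probs[0][targets[0]]
    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1][j]
            adv = dp[t - 1][j - 1] if j > 0 else NEG
            best, move = (adv, 1) if adv > stay else (stay, 0)
            dp[t][j] = best + log_probs[t][targets[j]]
            back[t][j] = move
    # Backtrack from the final frame on the last target.
    j, path = N - 1, []
    for t in range(T - 1, -1, -1):
        path.append(j)
        j -= back[t][j]
    path.reverse()
    # Collapse consecutive frames on the same target into timed spans.
    spans, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            spans.append((targets[path[start]], start * frame_dur, t * frame_dur))
            start = t
    return spans

# Toy input: 6 frames, 3 phonemes; the true sequence [0, 1, 2], two frames each.
probs = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1],
         [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
spans = forced_align(log_probs, [0, 1, 2])
# Three spans of two 20 ms frames each, in order 0, 1, 2.
```

The key property is that the transcript constrains the search: the aligner never has to decide *which* phonemes occurred, only *when* each one starts and ends.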
-
Update: I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
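The cross-attention idea can be sketched as follows. Given a tokens-by-audio-frames matrix of cross-attention weights (a made-up toy matrix below; in practice these would come from selected heads and layers of the Whisper decoder, and the demo notebook uses a more involved alignment over them), a rough timestamp for each token is its attention-weighted mean frame time:

```python
def token_timestamps(attn, frame_dur=0.02):
    """Estimate one timestamp per decoded token from cross-attention weights.

    attn: list of rows, one per token, each a list of attention weights
          over the audio frames (need not be normalized).
    Returns a list of times in seconds, one per token.
    """
    times = []
    for row in attn:
        total = sum(row)
        center = sum(i * w for i, w in enumerate(row)) / total  # weighted mean frame
        times.append(center * frame_dur)
    return times

# Hypothetical attention over 4 frames for 2 tokens: the first token
# attends to early frames, the second to late frames.
attn = [[0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.7]]
times = token_timestamps(attn)
# times is approximately [0.008, 0.052] seconds, increasing with token order.
```

This is the simplest possible reduction of attention to time; it ignores that attention maps can be noisy, which is why more robust approaches apply a monotonic alignment (e.g. dynamic time warping) over the attention matrix instead.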
-
This already supports timestamps for phrases; would it be possible to add forced alignment support? That is, given an audio file and, optionally, its transcript, could it produce a file containing the timestamp of each phoneme?