Possibility of forced alignment support? #52
conncomg123 started this conversation in Ideas
Replies: 2 comments
-
Discussion #3 has some implementations for getting token/word-level timestamps. Getting timestamps for each phoneme would be difficult from Whisper models alone, because the model is trained end-to-end to predict BPE tokens directly, and a token is often a full word or a subword spanning a few graphemes. An alternative would be to run an external forced-alignment tool on the outputs of a Whisper model.
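To make the external-aligner idea concrete, here is a minimal, self-contained sketch of Viterbi forced alignment. Everything in it is illustrative: the per-frame phoneme log-probability matrix, the phoneme inventory, and the 20 ms frame duration are all made up. In a real pipeline the frame-level phoneme posteriors would come from a separate acoustic model, and the target phoneme sequence from a grapheme-to-phoneme pass over the Whisper transcript.

```python
import math

def forced_align(log_probs, targets, frame_dur=0.02):
    """Viterbi forced alignment: assign every frame to one target phoneme,
    monotonically left-to-right, maximizing the total log-probability.

    log_probs: T x P list of per-frame log-probabilities over P phonemes
    targets:   the known phoneme sequence (indices into the P columns)
    Returns a list of (phoneme, start_sec, end_sec) spans.
    """
    T, N = len(log_probs), len(targets)
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(T)]    # dp[t][j]: best score at frame t, target j
    back = [[0] * N for _ in range(T)]    # 0 = stayed on j, 1 = advanced from j-1
    dp[0][0] = log_probs[0][targets[0]]
    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1][j]
            adv = dp[t - 1][j - 1] if j > 0 else NEG
            best, move = (adv, 1) if adv > stay else (stay, 0)
            dp[t][j] = best + log_probs[t][targets[j]]
            back[t][j] = move
    # Backtrack from the final frame on the last target.
    j, path = N - 1, []
    for t in range(T - 1, -1, -1):
        path.append(j)
        j -= back[t][j]
    path.reverse()
    # Collapse consecutive frames on the same target into timed spans.
    spans, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            spans.append((targets[path[start]], start * frame_dur, t * frame_dur))
            start = t
    return spans

# Toy input: 6 frames, 3 phonemes; the true sequence [0, 1, 2], two frames each.
probs = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1],
         [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
spans = forced_align(log_probs, [0, 1, 2])
# Three spans of two 20 ms frames each, in order 0, 1, 2.
```

The key property is that the transcript constrains the search: the aligner never has to decide *which* phonemes occurred, only *when* each one starts and ends.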
-
Update: I've made a demo of obtaining word-level timestamps using the cross-attention patterns in the multilingual ASR notebook.
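The cross-attention idea can be sketched as follows. Given a tokens-by-audio-frames matrix of cross-attention weights (a made-up toy matrix below; in practice these would come from selected heads and layers of the Whisper decoder, and the demo notebook uses a more involved alignment over them), a rough timestamp for each token is its attention-weighted mean frame time:

```python
def token_timestamps(attn, frame_dur=0.02):
    """Estimate one timestamp per decoded token from cross-attention weights.

    attn: list of rows, one per token, each a list of attention weights
          over the audio frames (need not be normalized).
    Returns a list of times in seconds, one per token.
    """
    times = []
    for row in attn:
        total = sum(row)
        center = sum(i * w for i, w in enumerate(row)) / total  # weighted mean frame
        times.append(center * frame_dur)
    return times

# Hypothetical attention over 4 frames for 2 tokens: the first token
# attends to early frames, the second to late frames.
attn = [[0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.7]]
times = token_timestamps(attn)
# times is approximately [0.008, 0.052] seconds, increasing with token order.
```

This is the simplest possible reduction of attention to time; it ignores that attention maps can be noisy, which is why more robust approaches apply a monotonic alignment (e.g. dynamic time warping) over the attention matrix instead.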
-
This already supports timestamps for phrases; would it be possible to add forced alignment support? That is, given an audio file and, optionally, its transcript, could it produce a file containing the timestamp of each phoneme?