
Misspelled output #266

Answered by jongwook
gipt66 asked this question in Q&A
Oct 7, 2022 · 3 comments · 3 replies

This is unfortunately a limitation of the trained model; our filters may not have been very effective at excluding ASR-generated transcripts for non-English languages, and I suspect incorrectly spelled transcripts in the training data caused this phenomenon. I've seen similar spelling errors in Korean as well, which wouldn't happen if the training data contained only correctly spelled, grammatically valid text.

As you mentioned, integrating with an LM may improve the results. This is not directly supported, but one can extend the TokenDecoder class to select tokens according to a language model:

whisper/whisper/decoding.py, lines 195 to 246 at commit 9e653bd
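To make the idea concrete, here is a minimal sketch of one way such a subclass could rescore tokens via shallow fusion (adding a weighted LM log-probability to the model's log-probability before picking the next token). This is not Whisper's actual API: the real `TokenDecoder.update()` operates on torch tensors, and the names `LMFusedGreedyDecoder`, `lm_logprob_fn`, and `lm_weight` below are hypothetical; plain Python lists are used to keep the sketch self-contained.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities."""
    m = max(logits)
    z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - z for x in logits]


class LMFusedGreedyDecoder:
    """Greedy decoding with shallow LM fusion (hypothetical sketch):
    score(token) = log p_asr(token) + lm_weight * log p_lm(token)."""

    def __init__(self, lm_logprob_fn, lm_weight=0.3, eot=0):
        # lm_logprob_fn maps the token history to a per-token
        # log-probability list over the same vocabulary.
        self.lm_logprob_fn = lm_logprob_fn
        self.lm_weight = lm_weight
        self.eot = eot  # end-of-transcript token id

    def update(self, tokens, logits, sum_logprob):
        """One decoding step: fuse ASR and LM scores, pick greedily."""
        asr = log_softmax(logits)
        lm = self.lm_logprob_fn(tokens)
        fused = [a + self.lm_weight * b for a, b in zip(asr, lm)]
        next_token = max(range(len(fused)), key=fused.__getitem__)
        tokens = tokens + [next_token]
        sum_logprob += asr[next_token]  # track the ASR score only
        completed = next_token == self.eot
        return tokens, sum_logprob, completed
```

With a uniform LM this reduces to plain greedy decoding; an LM that penalizes misspelled continuations can steer the choice toward a correctly spelled token whose ASR score is only slightly lower.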
