How to keep speech dysfluencies in the transcription ('ehm', 'uhm', rapidly repeated or interrupted words) #1174
-
Hello, I am looking for a solution to a problem with using a podcast dataset to train a TTS (text-to-speech) model. Most available datasets come from audiobook readings, which yields TTS models that lack the expressivity of spontaneous speech. To address this, I plan to use Whisper to transcribe the podcasts and obtain word-level timestamps, and to segment the audio in a meaningful way using speaker-overlap detection and VAD (voice activity detection).

The main problem I have encountered is a mismatch between what is said in the audio and the corresponding transcription. Spontaneous speech frequently contains dysfluencies such as 'ehm', 'uhm', repeated words, or abruptly interrupted words (e.g., 'I.. I mean') that are not present in the transcription. ASR (automatic speech recognition) systems silently correct these dysfluencies to produce intelligible output, but for TTS training this means the model is fed "dirty" audio relative to its transcription, which lowers data quality.

To resolve this, I need a way to make Whisper (or another high-quality ASR system) emit the dysfluencies verbatim. Alternatively, if Whisper's word timestamps are accurate enough, I could combine them with VAD and remove segments where VAD is active but no word is aligned. That approach seems brute force, though, and it's just a rough idea. So, does someone know a way to achieve this with Whisper (or any other good-quality ASR, for that matter), or have any idea on how to go about this?
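As a concrete version of the VAD-based cleanup idea above: given word-level timestamps from Whisper and VAD-active regions, you could flag stretches where VAD fires but no word is aligned, and cut or discard those. This is only a minimal sketch of the interval arithmetic; the function name and the segment format (sorted, non-overlapping `(start, end)` tuples in seconds) are my own assumptions:

```python
def uncovered_vad_regions(vad_segments, word_segments, min_gap=0.1):
    """Return portions of VAD-active regions not covered by any word timestamp.

    vad_segments, word_segments: sorted, non-overlapping (start, end) tuples
    in seconds. Gaps shorter than min_gap seconds are ignored as jitter.
    """
    gaps = []
    for v_start, v_end in vad_segments:
        cursor = v_start  # how far into this VAD region words have covered
        for w_start, w_end in word_segments:
            # Skip words entirely outside the current VAD region.
            if w_end <= cursor or w_start >= v_end:
                continue
            # Speech activity before this word starts -> candidate dysfluency.
            if w_start - cursor >= min_gap:
                gaps.append((cursor, w_start))
            cursor = max(cursor, w_end)
        # Trailing activity after the last aligned word.
        if v_end - cursor >= min_gap:
            gaps.append((cursor, v_end))
    return gaps
```

The returned intervals are the "VAD active but no transcribed word" stretches you mentioned; whether to cut them from the audio or drop the whole segment is a separate decision.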
Replies: 2 comments 8 replies
-
Just an idea, coming from https://platform.openai.com/docs/guides/speech-to-text/prompting: that guide notes the model tends to omit filler words, and suggests that if you want to keep them in the transcript, you can pass a prompt that contains them (e.g. "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.").
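Following that prompting idea with the open-source `whisper` package, one option is to seed `initial_prompt` with the dysfluencies you want preserved, so the decoder imitates that style instead of cleaning it up. A sketch, not a guaranteed fix: the first sentence of the prompt is the example from the OpenAI guide, the rest of the wording and the `"medium"` model choice are my own assumptions:

```python
# Prompt containing the dysfluencies we want Whisper to keep. Seeding the
# decoder this way biases it toward transcribing fillers rather than
# silently dropping them; it does not guarantee they are all kept.
DISFLUENT_PROMPT = (
    "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking. "
    "I.. I mean, uh, ehm, you know."
)

# Usage sketch (requires the openai-whisper package and a model download):
#
#   import whisper
#   model = whisper.load_model("medium")
#   result = model.transcribe(
#       "podcast_episode.wav",
#       initial_prompt=DISFLUENT_PROMPT,  # bias decoding toward keeping fillers
#       word_timestamps=True,             # word-level timestamps for segmentation
#   )
```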
-
Found something else in a video about the paper, and had to come back here. You might want to take a look at this line in the normalizers: whisper/whisper/normalizers/english.py, line 467 (commit c09a7ae). Maybe there's something in that section about your undesired repetitions as well. Or @jongwook might be able to comment on that.
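For intuition on what the normalizer does there: it removes filler words from reference text with pattern matching, so the same kind of pattern could be inverted to detect (rather than delete) fillers in your transcripts. A simplified stand-in, not the actual implementation in english.py, with an illustrative filler list:

```python
import re

# Simplified sketch of filler-word removal in the spirit of Whisper's
# EnglishTextNormalizer (not the real pattern set from english.py).
FILLERS = re.compile(r"\b(uh|uhm|um|umm|ehm|hmm|mhm)\b[,.]?\s*", re.IGNORECASE)

def strip_fillers(text):
    """Remove common filler words; FILLERS.findall() would locate them instead."""
    return FILLERS.sub("", text).strip()
```

Running your transcripts through both the raw and filler-stripped forms would at least tell you how often Whisper kept the dysfluencies in the first place.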