How to keep speech dysfluencies in the transcription ('ehm', 'uhm', rapidly repeated or interrupted words) #1174
-
Hello, I am looking for a solution to a problem with using a podcast dataset to train a TTS (text-to-speech) model. Most available datasets come from audiobook readings, which yields TTS models that lack the expressivity of spontaneous speech. To address this, I plan to use Whisper to transcribe the podcasts and obtain word-level timestamps, and to segment the audio in a meaningful way using speaker-overlap detection and VAD (voice activity detection).

The main problem I have encountered is a mismatch between what is said in the audio and the corresponding transcription. Spontaneous speech frequently contains dysfluencies such as 'ehm', 'uhm', repeated words, or abruptly interrupted words (e.g., 'I.. I mean') that are not present in the transcription. ASR (automatic speech recognition) systems silently correct these dysfluencies to produce intelligible output, but for TTS training this means the model is fed "dirty" audio relative to its transcription, which lowers data quality.

To resolve this, I need a way to make Whisper (or another high-quality ASR system) emit the dysfluencies verbatim. Alternatively, if Whisper's word timestamps are accurate enough, I could combine them with VAD and remove segments where VAD is active but no word is aligned. That approach seems brute force, though, and it's just a rough idea. So, does someone know a way to achieve this with Whisper (or any other good-quality ASR, for that matter), or have any idea on how to go about this?
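As a concrete version of the VAD-based cleanup idea above: given word-level timestamps from Whisper and VAD-active regions, you could flag stretches where VAD fires but no word is aligned, and cut or discard those. This is only a minimal sketch of the interval arithmetic; the function name and the segment format (sorted, non-overlapping `(start, end)` tuples in seconds) are my own assumptions:

```python
def uncovered_vad_regions(vad_segments, word_segments, min_gap=0.1):
    """Return portions of VAD-active regions not covered by any word timestamp.

    vad_segments, word_segments: sorted, non-overlapping (start, end) tuples
    in seconds. Gaps shorter than min_gap seconds are ignored as jitter.
    """
    gaps = []
    for v_start, v_end in vad_segments:
        cursor = v_start  # how far into this VAD region words have covered
        for w_start, w_end in word_segments:
            # Skip words entirely outside the current VAD region.
            if w_end <= cursor or w_start >= v_end:
                continue
            # Speech activity before this word starts -> candidate dysfluency.
            if w_start - cursor >= min_gap:
                gaps.append((cursor, w_start))
            cursor = max(cursor, w_end)
        # Trailing activity after the last aligned word.
        if v_end - cursor >= min_gap:
            gaps.append((cursor, v_end))
    return gaps
```

The returned intervals are the "VAD active but no transcribed word" stretches you mentioned; whether to cut them from the audio or drop the whole segment is a separate decision.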
Replies: 2 comments 8 replies
-
Just an idea, coming from https://platform.openai.com/docs/guides/speech-to-text/prompting: that guide notes the model tends to omit filler words, and suggests that if you want to keep them in the transcript, you can pass a prompt that contains them (e.g. "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.").
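Following that prompting idea with the open-source `whisper` package, one option is to seed `initial_prompt` with the dysfluencies you want preserved, so the decoder imitates that style instead of cleaning it up. A sketch, not a guaranteed fix: the first sentence of the prompt is the example from the OpenAI guide, the rest of the wording and the `"medium"` model choice are my own assumptions:

```python
# Prompt containing the dysfluencies we want Whisper to keep. Seeding the
# decoder this way biases it toward transcribing fillers rather than
# silently dropping them; it does not guarantee they are all kept.
DISFLUENT_PROMPT = (
    "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking. "
    "I.. I mean, uh, ehm, you know."
)

# Usage sketch (requires the openai-whisper package and a model download):
#
#   import whisper
#   model = whisper.load_model("medium")
#   result = model.transcribe(
#       "podcast_episode.wav",
#       initial_prompt=DISFLUENT_PROMPT,  # bias decoding toward keeping fillers
#       word_timestamps=True,             # word-level timestamps for segmentation
#   )
```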
-
Found something else in a video about the paper, and had to come back here. You might want to take a look at this line in the normalizers: whisper/whisper/normalizers/english.py, line 467 (commit c09a7ae). Maybe there's something in that section about your undesired repetitions as well. Or @jongwook might be able to comment on that.
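For intuition on what the normalizer does there: it removes filler words from reference text with pattern matching, so the same kind of pattern could be inverted to detect (rather than delete) fillers in your transcripts. A simplified stand-in, not the actual implementation in english.py, with an illustrative filler list:

```python
import re

# Simplified sketch of filler-word removal in the spirit of Whisper's
# EnglishTextNormalizer (not the real pattern set from english.py).
FILLERS = re.compile(r"\b(uh|uhm|um|umm|ehm|hmm|mhm)\b[,.]?\s*", re.IGNORECASE)

def strip_fillers(text):
    """Remove common filler words; FILLERS.findall() would locate them instead."""
    return FILLERS.sub("", text).strip()
```

Running your transcripts through both the raw and filler-stripped forms would at least tell you how often Whisper kept the dysfluencies in the first place.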