Subtitles sometimes go out of sync #89

athu16 · 2022-09-24T05:29:37Z

After generating a VTT file, the subtitles sometimes speed up and scroll by much faster than the actual audio. They usually recover within about 30 seconds (after one chunk of audio is processed), but if the audio is long, they start scrolling faster again after a short while.
I've tried English, Russian and Marathi translation, and this problem usually occurs with Marathi, which has a much higher WER. I can understand the translation accuracy being low, but I can't understand how a higher WER could affect the timestamps.

Answered by jongwook


coder543 · 2022-09-24T14:22:54Z

I've personally observed this behavior more on the large model than on any of the other models.

1 reply

turnkit Oct 19, 2022

I had a 90 mp3 file go several minutes out of sync in large. I switched to small and the issues went away, but I sure would like to return to the large model with confidence when I get a GPU. I had assumed this was an mp3 time code tracking issue, so I hope it isn't somehow a large model problem.

Xlindvain · 2022-09-25T00:12:55Z

This is happening to me as well, using the Medium model, and generating a VTT for a video in Brazilian Portuguese.

0 replies

jongwook · 2022-09-25T11:18:18Z

Maintainer

This is one of the failure modes of the hacky long-form heuristics (in transcribe.py and discussed in Section 4.5 of the paper), where the timestamp offsets sometimes accumulate over time because the transcription from the previous 30-second window, including its timestamps, is fed to the model as conditioning input. This is currently controlled by a hard-coded constant here:

whisper/whisper/transcribe.py

Lines 220 to 222 in 2d3032d

    
    if result.temperature > 0.5:
        # do not feed the prompt tokens if a high temperature was used
        prompt_reset_since = len(all_tokens)

and you can modify this block to always reset the context to mitigate the tendency of going out of sync.

In the paper, we saw a slight improvement with this previous-text conditioning compared to without it:

but it didn't always help (WER increased on TED-LIUM3). It might be worth making previous-text conditioning an optional flag if this failure case is common in practice.

EDIT: just added a --condition_on_previous_text False option. It may repeat short words across the window boundaries, but the timestamps should work better.
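For anyone calling Whisper from Python rather than the command line, the same switch should be available as a keyword argument to transcribe(). A minimal sketch, assuming the openai-whisper package is installed; the helper name transcribe_without_context and the file name talk.mp3 are made up for illustration:

```python
def transcribe_without_context(model, audio_path):
    """Transcribe audio_path with previous-text conditioning disabled.

    `model` is expected to be the object returned by whisper.load_model().
    Each 30-second window is then decoded without the previous window's
    text and timestamps as a prompt.
    """
    return model.transcribe(audio_path, condition_on_previous_text=False)


# Typical use (needs the openai-whisper package and a model download):
#   import whisper
#   model = whisper.load_model("medium")
#   result = transcribe_without_context(model, "talk.mp3")  # hypothetical file
#   for seg in result["segments"]:
#       print(seg["start"], seg["end"], seg["text"])
```

The trade-off is the one described above: short words may repeat across window boundaries, but a timestamp error in one window is no longer carried into the next.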

11 replies

jltchiu Sep 26, 2022

@flesnuk Thanks! This clarified my question! Really appreciate it.

akashmjn Oct 8, 2022

@jongwook - appreciate the detailed response! Just to be super clear, does Table 7 show the impact of setting --condition_on_previous_text=True, or the impact of the heuristic intervention you linked?

whisper/whisper/transcribe.py

Lines 220 to 222 in 2d3032d

    
    if result.temperature > 0.5:
        # do not feed the prompt tokens if a high temperature was used
        prompt_reset_since = len(all_tokens)

A bit confused by the complex semantics 😄

"In the paper, we have seen a slight improvement when using this previous-text conditioning than not"

"Providing the transcribed text from the preceding window as previous-text conditioning when the applied temperature is below 0.5 further improves the performance"

Thanks!

jongwook Oct 9, 2022
Maintainer

Yes, after this thread I added the condition, and the three lines above now look like this:

whisper/whisper/transcribe.py

Lines 235 to 237 in 0b1ba3d

    
    if not condition_on_previous_text or result.temperature > 0.5:
        # do not feed the prompt tokens if a high temperature was used
        prompt_reset_since = len(all_tokens)

The row "Previous-text conditioning" in Table 7 and below used the setting equivalent to --condition_on_previous_text=True (i.e. with the hard-coded 0.5 threshold), and the three rows above it used what's equivalent to --condition_on_previous_text=False.
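To make the semantics concrete, here is a toy model of that condition. It is not the code from transcribe.py: the function name, the fixed token count per window, and the example temperatures are assumptions for illustration, and it ignores any limit on prompt length. It only tracks how many previously decoded tokens would be offered as the prompt for each 30-second window.

```python
from typing import List, Sequence


def prompt_lengths(
    window_temperatures: Sequence[float],
    condition_on_previous_text: bool = True,
    tokens_per_window: int = 3,   # assumed constant, purely for illustration
    threshold: float = 0.5,       # the hard-coded constant discussed above
) -> List[int]:
    """Return, per window, how many earlier tokens would be used as prompt."""
    all_tokens: List[int] = []
    prompt_reset_since = 0
    lengths = []
    for temperature in window_temperatures:
        # The prompt for this window is everything decoded since the last reset.
        lengths.append(len(all_tokens) - prompt_reset_since)
        # Pretend the window produced some tokens.
        all_tokens.extend([0] * tokens_per_window)
        if not condition_on_previous_text or temperature > threshold:
            # do not feed the prompt tokens to the next window
            prompt_reset_since = len(all_tokens)
    return lengths


# The third window needed a high-temperature fallback, so the fourth starts clean.
print(prompt_lengths([0.0, 0.0, 0.8, 0.0]))
# With the flag off, no window ever sees previous text.
print(prompt_lengths([0.0, 0.0, 0.8, 0.0], condition_on_previous_text=False))
```

In this toy model, the flag set to True accumulates context until a window falls back to a temperature above 0.5, while False resets the context after every window.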

coder543 Oct 19, 2022

As an interesting failure mode, attempting to transcribe this with the large model just resulted in the letter a getting printed repeatedly for over an hour of transcript. The medium.en model worked fine.

I found that setting --temperature_increment_on_fallback None sort of fixed the large model, and I wonder if this might have been caused by the 7 minutes of instrumental music at the beginning of the video, which might have caused repeated transcription failures.

Either way, I'm glad the temperature_increment_on_fallback option exists.

EDIT: on the whole, I think medium.en did better on this particular transcription anyways.

EDIT 2: It is amazing how much difference skipping the intro music makes for Whisper.
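For context on why that flag matters: as far as I understand the CLI, the fallback temperatures form a schedule that starts at --temperature and steps up by --temperature_increment_on_fallback until it reaches 1.0, and passing None leaves a single temperature, so there is no retry at a higher temperature. A rough sketch of that behaviour follows; the function name is hypothetical and the defaults of 0.0 and 0.2 are my assumptions, not copied from the source.

```python
from typing import Optional, Tuple


def temperature_schedule(
    start: float = 0.0,
    increment: Optional[float] = 0.2,
) -> Tuple[float, ...]:
    """Temperatures to try in order; later ones are used only on fallback."""
    if increment is None:
        # No fallback: decode once at the starting temperature.
        return (start,)
    if increment <= 0:
        raise ValueError("increment must be positive")
    temps = []
    step = 0
    # Small tolerance so that 1.0 itself is included despite float rounding.
    while start + step * increment <= 1.0 + 1e-6:
        temps.append(round(start + step * increment, 6))
        step += 1
    return tuple(temps)


print(temperature_schedule())                # stepped schedule up to 1.0
print(temperature_schedule(increment=None))  # single temperature, no fallback
```

With a single temperature of 0.0, the "temperature > 0.5" reset discussed earlier in the thread can never trigger, which may be part of why the two options interact on difficult audio such as long music intros.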

dillfrescott Oct 20, 2022

I've also noticed that the models smaller than large sometimes seem to do just as good a job, if not better, in many scenarios.
