-
After generating a VTT file, the subtitles sometimes speed up and scroll by much faster than the actual audio. They usually correct themselves within about 30 seconds (after one chunk of audio is processed), but if the audio is long, they start running ahead again shortly afterwards.
-
I've personally observed this behavior more on the …
-
This is happening to me as well, using the medium model to generate a VTT for a video in Brazilian Portuguese.
-
This is one of the failure modes of the hacky long-form heuristics (in transcribe.py and discussed in Section 4.5), where the timestamp offsets sometimes accumulate over time, because the transcription from the previous 30-second window, including the timestamps, is fed to the model as conditioning input. This is currently controlled by a hard-coded constant here:

whisper/whisper/transcribe.py
Lines 220 to 222 in 2d3032d

and you can modify this block to always reset the context to mitigate the tendency of going out of sync.

In the paper, we saw a slight improvement when using this previous-text conditioning compared to not using it, but it didn't always help (WER increased on TED-LIUM3). It might be worth making this previous-text conditioning an optional flag if the failure case is common in practice.

EDIT: just added …
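
For anyone trying to picture what "always reset the context" means, here is a rough, self-contained sketch of a sliding-window loop. It is not the code from transcribe.py: `transcribe_windows`, `decode_window`, and the `condition_on_previous` switch are hypothetical names made up for illustration, and the "decoder" is just a callback so the example runs without any model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the model: given a window index and the prompt
# tokens (text carried over from earlier windows), return this window's tokens.
DecodeFn = Callable[[int, List[int]], List[int]]


def transcribe_windows(
    num_windows: int,
    decode_window: DecodeFn,
    condition_on_previous: bool = True,  # hypothetical switch, for illustration
) -> Tuple[List[int], List[List[int]]]:
    """Decode windows in order; return all tokens and the prompt each window saw."""
    all_tokens: List[int] = []
    prompt_start = 0  # index in all_tokens where the current prompt begins
    prompts_seen: List[List[int]] = []

    for i in range(num_windows):
        # The prompt is whatever was produced since the last context reset.
        prompt = all_tokens[prompt_start:]
        prompts_seen.append(list(prompt))

        all_tokens.extend(decode_window(i, prompt))

        if not condition_on_previous:
            # Always reset: the next window starts with an empty prompt, so
            # errors in earlier text or timestamps cannot carry forward.
            prompt_start = len(all_tokens)

    return all_tokens, prompts_seen


def fake_decoder(index: int, prompt: List[int]) -> List[int]:
    # Toy decoder that emits one token per window (the window index).
    return [index]


_, carried = transcribe_windows(3, fake_decoder, condition_on_previous=True)
_, reset = transcribe_windows(3, fake_decoder, condition_on_previous=False)
print(carried)  # prompts grow as earlier output is carried forward
print(reset)    # every window starts from an empty prompt
```

With conditioning on, each window sees everything decoded so far, which is how a drifting timestamp can propagate. With it off, every window is decoded independently, trading away the (sometimes helpful) previous-text context.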