The whisper-large-v3 model randomly misses sentences during recognition when return_timestamps="word" #29833
Comments
I've run into this as well. One unblock I found (I haven't tracked down why this is the case) is that if you also include return_language=True in your pipe (so you have both return_language=True and return_timestamps="word"), then the word-level timestamps are correct and make sense. We were seeing some pretty nonsensical timestamps without this. It could be that some intermediate representations needed to properly time-align are only passed through when language info is requested.
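The "nonsensical timestamps" mentioned above can be detected mechanically. Here is a minimal sketch, assuming the standard word-chunk output format `{"text": ..., "timestamp": (start, end)}` that the ASR pipeline returns under `result["chunks"]` when `return_timestamps="word"` is set; the `timestamps_monotonic` helper is hypothetical, not part of transformers:

```python
def timestamps_monotonic(chunks):
    """Return True if word-chunk start times never go backwards.

    `chunks` is the `result["chunks"]` list produced by the ASR pipeline
    with return_timestamps="word": each entry is a dict with a "text" key
    and a "timestamp" key holding a (start, end) pair of seconds.
    """
    starts = [chunk["timestamp"][0] for chunk in chunks]
    return all(a <= b for a, b in zip(starts, starts[1:]))


# Mock chunks: the second sequence jumps backwards in time, which is the
# kind of nonsensical output described above.
good = [{"text": "Elon", "timestamp": (0.0, 0.4)},
        {"text": "said", "timestamp": (0.4, 0.8)}]
bad = [{"text": "Elon", "timestamp": (5.0, 5.4)},
       {"text": "said", "timestamp": (1.2, 1.6)}]
```

Running such a check on the word-level output makes it easy to confirm whether the return_language=True workaround actually fixes the alignment for a given audio file.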
Thank you for your response.
Also cc @ylacombe
Any update?
Gentle ping @sanchit-gandhi @ylacombe
Hi @zxl777, thanks for this issue! The provided audio is longer than 30 seconds. In this case, you can choose between chunked generation and long-form generation.
Long-form generation gives better transcriptions than chunked generation for audios longer than 30 seconds, and not activating it might explain the decrease in performance and the missing sentences you see. If you remove chunk_length_s from your pipeline, long-form generation is used:

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    batch_size=4,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)
```

Hope it will help you!
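To compare the chunked and long-form outputs for the same audio, it helps to flatten a pipeline result back into plain text. A small sketch, assuming the usual `result["chunks"]` structure; `join_chunks` is a hypothetical helper, not a transformers API:

```python
def join_chunks(result):
    """Concatenate the "text" fields of a pipeline result's chunks.

    Falls back to the top-level "text" field when no chunks are present.
    """
    chunks = result.get("chunks")
    if not chunks:
        return result.get("text", "")
    return "".join(chunk["text"] for chunk in chunks)


# Mock result shaped like pipeline output with return_timestamps=True.
mock = {"chunks": [
    {"text": " Elon said,", "timestamp": (0.0, 1.2)},
    {"text": " When you struggle with a problem,", "timestamp": (1.2, 3.0)},
]}
```

Feeding both transcripts through the same flattening step removes formatting differences before diffing them for missing sentences.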
System Info
transformers version: 4.40.0.dev0

Who can help?
@sanchit-gandhi
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
When I searched for "Elon said" in the results, I got "Elon said, understand it." This is incomplete and misses an entire sentence.
If I change the code to result = pipe(sample, return_timestamps=True), the result is "Elon said, When you struggle with a problem, that's when you", which is correct and meets expectations.
If return_timestamps=False is set, sentences are occasionally missing.
If return_timestamps=True is set, sentences are occasionally repeated.
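The manual "search for a phrase" check above can be generalized: split a known-good transcript into sentences and report those absent from the suspect transcript. A rough illustration with mock strings based on the quotes above (splitting on sentence-ending punctuation is a simplification, and `missing_sentences` is a hypothetical helper):

```python
import re


def missing_sentences(reference, candidate):
    """Return sentences from `reference` that do not occur in `candidate`."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", reference)
                 if s.strip()]
    return [s for s in sentences if s not in candidate]


# A correct transcript vs. a truncated return_timestamps="word" output.
full = ("Elon said, When you struggle with a problem, "
        "that's when you understand it. That is the lesson.")
word_level = "Elon said, understand it. That is the lesson."
```

This makes the "randomly missing sentences" symptom reproducible as an automated check across many runs and audio formats, rather than relying on eyeballing the output.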
Expected behavior
In the case of setting return_timestamps="word", the whisper-large-v3 model randomly misses sentences during recognition.
Everything works normally when return_timestamps=True.
The test audio comes from https://www.youtube.com/watch?v=CK_wQEX_yS8 . I downloaded different audio formats, and the missed sentences vary between them.
I'm using the latest version available on GitHub right now, and I believe this is a bug.