
The whisper-large-v3 model randomly misses sentences during recognition when return_timestamps="word" #29833

zxl777 opened this issue Mar 23, 2024 · 6 comments

@zxl777

zxl777 commented Mar 23, 2024

System Info

  • transformers version: 4.40.0.dev0
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download audio from https://www.youtube.com/watch?v=CK_wQEX_yS8
python3 -m pip install -U yt-dlp[default]
yt-dlp -f 'bestaudio[ext=webm]' -o audio.webm "https://www.youtube.com/watch?v=CK_wQEX_yS8"
yt-dlp -f 'bestaudio[ext=m4a]' -o audio.m4a "https://www.youtube.com/watch?v=CK_wQEX_yS8"
  2. Run the following script:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=4,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
# sample = dataset[0]["audio"]

sample = 'audio.webm'
# result = pipe(sample, return_timestamps=True)
result = pipe(sample, return_timestamps="word")

print('== '*10)
print(result)
  3. Search for "Elon said" in the results. I get "Elon said, understand it.", which is incomplete and misses an entire sentence.

  4. Change the call to result = pipe(sample, return_timestamps=True) and the result is "Elon said, When you struggle with a problem, that's when you", which is correct and meets expectations (see the comparison sketch after this list).

  5. With return_timestamps=False, sentences are occasionally missing.

  6. With return_timestamps=True, sentences are occasionally repeated.
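
To compare the three settings side by side, here is a minimal sketch. It reuses the pipe and sample defined in the script above, and it assumes "When you struggle with a problem" is the phrase being checked in the transcript:

# Comparison sketch: reuse `pipe` and `sample` from the reproduction script above
# and check whether the expected phrase survives each return_timestamps setting.
expected_phrase = "When you struggle with a problem"  # assumed target phrase

for setting in ("word", True, False):
    result = pipe(sample, return_timestamps=setting)
    print(f"return_timestamps={setting!r}: phrase present = {expected_phrase in result['text']}")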

Expected behavior

When return_timestamps="word" is set, the whisper-large-v3 model randomly misses sentences during recognition.

Everything works normally when return_timestamps=True.

The test audio comes from https://www.youtube.com/watch?v=CK_wQEX_yS8 . Different audio formats were downloaded, and the sentences that were missed vary.

I'm using the latest version available on GitHub right now, and I believe this is a bug.

@naveen-corpusant

I've run into this as well. One workaround I found (I haven't tracked down why it works) is to also include return_language=True in your pipe, so that you have both return_language=True and return_timestamps="word"; with that, the word-level timestamps are correct and make sense. We were seeing some fairly nonsensical timestamps without it. It could be that some intermediate representations needed for proper time alignment are only passed through when language info is requested.
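
For anyone who wants to try that workaround, a minimal sketch, assuming return_language can be passed at call time alongside return_timestamps, and reusing the pipe and sample from the reproduction script above:

# Workaround sketch: request the detected language together with word-level timestamps.
result = pipe(sample, return_timestamps="word", return_language=True)
print(result["chunks"][:5])  # inspect the first few word-level chunks and their timestamps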

@zxl777
Author

zxl777 commented Mar 23, 2024

Thank you for your response.
Even after I added return_language=True, the issue still persists.
This parameter does not affect the problem I've encountered.


@amyeroberts
Collaborator

also cc @ylacombe

amyeroberts added the Audio and Core: Pipeline labels on Mar 23, 2024
@zxl777
Author

zxl777 commented Apr 9, 2024

Any update?

huggingface deleted a comment from the github-actions bot on May 7, 2024
@amyeroberts
Collaborator

Gentle ping @sanchit-gandhi @ylacombe

kamilakesbi self-assigned this on May 17, 2024
@kamilakesbi
Contributor

kamilakesbi commented May 20, 2024

Hi @zxl777,

Thanks for this issue!

The provided audio is longer than 30 seconds. In this case, you can choose to:

  • Use chunked (batched) inference by splitting the input audio with chunk_length_s. This activates short-form generation on each chunk.
  • Use sequential inference with long-form generation, which is activated automatically if you remove chunk_length_s.

Long-form generation gives better transcriptions than chunked generation for audio longer than 30 seconds, and not activating it may explain the drop in quality and the missing sentences you're seeing.

If you remove the chunk_length_s parameter when instantiating the pipeline, you should get the same output with return_timestamps="word", return_timestamps=False and return_timestamps=True, with no missing sentences. Here's the updated pipeline you should use:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    batch_size=4,
    return_timestamps=True, 
    torch_dtype=torch_dtype,
    device=device,
)
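
With that pipeline, the timestamp granularity can then be chosen per call. A small usage sketch, reusing the audio file downloaded in the reproduction steps above:

# Usage sketch: with chunk_length_s removed, the pipeline falls back to sequential
# long-form generation, and the transcript should be consistent across settings.
result = pipe("audio.webm", return_timestamps="word")
print(result["text"])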

Hope this helps!

cc @sanchit-gandhi
