
AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction #30148

Hubert-Bonisseur opened this issue Apr 9, 2024 · 7 comments

@Hubert-Bonisseur

System Info

At present, the AutomaticSpeechRecognition pipeline can predict timestamps either at the word level through cross-attention or by using the timestamp tokens predicted by Whisper. The problem is that opting for word-level prediction also activates timestamp-token prediction, which cannot be disabled as far as I know. This can reduce timestamp accuracy or make word-timestamp prediction fail entirely, particularly for models fine-tuned without timestamp prediction.

Other frameworks can predict word timestamps correctly given the right arguments, for instance OpenAI's whisper package:

import whisper

whisper_model = whisper.load_model("large-v3")
# timestamp-token prediction is disabled, yet word-level timestamps are still returned
result = whisper_model.transcribe("long_audio.wav", without_timestamps=True, word_timestamps=True)
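
For comparison, the closest call with the transformers pipeline (the full reproduction is in the comments below) exposes no equivalent switch; a minimal sketch, with openai/whisper-large-v3 used only as a placeholder checkpoint:

from transformers import pipeline

# placeholder model id; any Whisper checkpoint fine-tuned without timestamps shows the problem
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
# return_timestamps="word" also forces timestamp-token generation; there is no
# flag comparable to openai-whisper's without_timestamps=True
result = pipe("long_audio.wav", return_timestamps="word")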

Who can help?

@sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Take any model fine-tuned without timestamps and you get bogus word timestamps, because the attention is all over the place (see the full reproduction in the comments below).

Expected behavior

It should be possible to disable timestamp generation when requesting word timestamps.

@Hubert-Bonisseur Hubert-Bonisseur changed the title Can not predict word timestamps with Whisper when generating timestamps is disabled with the pipeline AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction Apr 9, 2024
@amyeroberts amyeroberts added Core: Pipeline Internals of the library; Pipeline. Audio labels Apr 10, 2024
@amyeroberts
Collaborator

cc @ylacombe too

@Hubert-Bonisseur
Author

I investigated a bit and got a more precise idea of the issue. There are actually two bugs with the pipeline, I think:

  1. num_frames is not passed to the generate method, which makes the timestamps wrong:

With the pipeline:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("mozilla-foundation/common_voice_13_0", "fr", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps="word", generate_kwargs={})
print(transcript)

All the word timestamps are set to 29.98s

Without using the pipeline:

features = processor(audio, return_tensors="pt",
                     truncation=False, sampling_rate=sr,
                     return_attention_mask=True)
generated = model.generate(features.input_features,
                           return_timestamps="word",
                           task="transcribe",
                           language="fr",
                           return_token_timestamps=True,
                           num_frames=int(len(audio) / processor.hop_length), # <-- doesn't work without this
                           is_multilingual=True)
print(generated["token_timestamps"])

The word timestamps are now appropriate.

  2. Long-form generation always enables timestamps; I think this is linked to this PR from @patrickvonplaten.
    I added print(decoder_input_ids) to the forward method of Whisper to check the input tokens of the decoder, and these are the first tokens fed to the forward method:
    [50258, 50265, 50360, 50365] --> Notice we don't have the no_timestamps token and we generated the timestamp 0.0
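
To check which special tokens those IDs correspond to without patching forward, you can decode them with the tokenizer; a minimal sketch (the decoded strings in the comment are an assumption based on the large-v3 vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrunoHays/whisper-large-v3-french-illuin")
# decode the forced decoder prompt observed above
print(tokenizer.convert_ids_to_tokens([50258, 50265, 50360, 50365]))
# expected (assumption): ['<|startoftranscript|>', '<|fr|>', '<|transcribe|>', '<|0.00|>']
# i.e. <|notimestamps|> is absent and decoding starts with the 0.0 timestamp token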

Code to reproduce:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("BrunoHays/multilingual-TEDX-fr", "max", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps=False)  # no timestamps requested, yet long-form generation still predicts them
print(transcript)

I think the audio chunking method should be used for long-form inputs when timestamps are deactivated.

@amyeroberts
Collaborator

Gentle ping @ylacombe

@kamilakesbi kamilakesbi self-assigned this May 16, 2024
@kamilakesbi
Contributor

kamilakesbi commented May 17, 2024

Hi @Hubert-Bonisseur,

Thanks for sharing this issue!

  • Remark 1 was solved in PR #30325 (Word-level timestamps broken for short-form audio).

  • Regarding Remark 2: long-form generation indeed requires timestamps to chunk the audio, so I think this is expected behavior. Long-form generation significantly improves the model's performance compared to chunked transcription, as indicated in this PR, so for now I don't think we should fall back to audio chunking for long audios when timestamps are deactivated. WDYT @sanchit-gandhi?

@sanchit-gandhi
Contributor

Agreed @Hubert-Bonisseur and @kamilakesbi! If you don't want to train another model, the only option for long-form transcription is chunked inference:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,  # enables chunked inference instead of the default sequential long-form algorithm
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

If you're happy to train another model but don't have timestamps in your data, you can try training with LoRA. LoRA reduces catastrophic forgetting, so even though we don't have timestamps in our fine-tuning data, the model remembers how to make timestamped predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here.
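
A minimal sketch of what such a LoRA setup could look like with PEFT; the rank, alpha, and target modules below are illustrative assumptions rather than values taken from the guide:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# illustrative hyperparameters; tune for your own dataset
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

Because only the adapters are trained, the base model's ability to predict timestamp tokens is largely preserved while the adapter learns the new domain.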

@Hubert-Bonisseur
Author

I missed the notifications, sorry.
Thanks for your answers!

If I understand correctly, @sanchit-gandhi, chunked transcription is activated once we pass the chunk_length_s argument?
And if it is left at the default, we use the fancy long-form algorithm?
Makes sense!

@kamilakesbi
Contributor

Hi @Hubert-Bonisseur,

You can indeed use chunked transcription by passing the chunk_length_s argument to the pipeline as shown in @sanchit-gandhi's script.

The default behavior is to use the long-form algorithm as it is more efficient :)
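
A minimal sketch of the two call patterns, assuming pipe was built as in @sanchit-gandhi's script but without the chunk_length_s argument:

# chunked transcription: the audio is split into fixed-length windows and batched
chunked = pipe(sample, chunk_length_s=25, return_timestamps="word")

# sequential long-form transcription (the default when chunk_length_s is not set):
# relies on Whisper's timestamp tokens to decide where each segment ends
long_form = pipe(sample)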

Hope this helps!
