
AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction #30148

Hubert-Bonisseur opened this issue Apr 9, 2024 · 7 comments

@Hubert-Bonisseur

System Info

At present, the AutomaticSpeechRecognition pipeline can predict timestamps either at the word level through cross-attention or by using the timestamp tokens predicted by Whisper. The problem is that opting for word-level prediction also activates timestamp-token prediction, which cannot be disabled as far as I know. This can reduce timestamp accuracy or make word-timestamp prediction fail entirely, particularly for models fine-tuned without timestamp prediction.

Other frameworks can predict word timestamps correctly given the right arguments, for instance OpenAI's whisper package:

import whisper

whisper_model = whisper.load_model("large-v3")
# timestamp-token prediction is disabled, yet word-level timestamps are still returned
result = whisper_model.transcribe("long_audio.wav", without_timestamps=True, word_timestamps=True)
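
For comparison, the closest call with the transformers pipeline (the full reproduction is in the comments below) exposes no equivalent switch; a minimal sketch, with openai/whisper-large-v3 used only as a placeholder checkpoint:

from transformers import pipeline

# placeholder model id; any Whisper checkpoint fine-tuned without timestamps shows the problem
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
# return_timestamps="word" also forces timestamp-token generation; there is no
# flag comparable to openai-whisper's without_timestamps=True
result = pipe("long_audio.wav", return_timestamps="word")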

Who can help?

@sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Take any model fine-tuned without timestamps and you get bogus word timestamps, because the attention is all over the place (see the full reproduction in the comments below).

Expected behavior

It should be possible to disable timestamp generation when requesting word timestamps.

@Hubert-Bonisseur Hubert-Bonisseur changed the title Can not predict word timestamps with Whisper when generating timestamps is disabled with the pipeline AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction Apr 9, 2024
@amyeroberts amyeroberts added Core: Pipeline Internals of the library; Pipeline. Audio labels Apr 10, 2024
@amyeroberts
Collaborator

cc @ylacombe too

@Hubert-Bonisseur
Author

I investigated a bit and got a more precise idea of the issue. There are actually two bugs with the pipeline, I think:

  1. num_frames is not passed to the generate method, which makes the timestamps wrong:

With the pipeline:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("mozilla-foundation/common_voice_13_0", "fr", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps="word", generate_kwargs={})
print(transcript)

All the word timestamps are set to 29.98s

Without using the pipeline:

features = processor(audio, return_tensors="pt",
                     truncation=False, sampling_rate=sr,
                     return_attention_mask=True)
generated = model.generate(features.input_features,
                           return_timestamps="word",
                           task="transcribe",
                           language="fr",
                           return_token_timestamps=True,
                           num_frames=int(len(audio) / processor.hop_length), # <-- doesn't work without this
                           is_multilingual=True)
print(generated["token_timestamps"])

The word timestamps are now appropriate.

  2. Long-form generation always enables timestamps; I think this is linked to this PR from @patrickvonplaten.
    I added print(decoder_input_ids) to the forward method of Whisper to check the input tokens of the decoder, and these are the first tokens fed to the forward method:
    [50258, 50265, 50360, 50365] --> Notice we don't have the no_timestamps token and we generated the timestamp 0.0
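
To check which special tokens those IDs correspond to without patching forward, you can decode them with the tokenizer; a minimal sketch (the decoded strings in the comment are an assumption based on the large-v3 vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrunoHays/whisper-large-v3-french-illuin")
# decode the forced decoder prompt observed above
print(tokenizer.convert_ids_to_tokens([50258, 50265, 50360, 50365]))
# expected (assumption): ['<|startoftranscript|>', '<|fr|>', '<|transcribe|>', '<|0.00|>']
# i.e. <|notimestamps|> is absent and decoding starts with the 0.0 timestamp token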

Code to reproduce:

from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("BrunoHays/multilingual-TEDX-fr", "max", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps=False)  # no timestamps requested, yet long-form generation still predicts them
print(transcript)

I think the audio chunking method should be used for long-form inputs when timestamps are deactivated.

@amyeroberts
Collaborator

Gentle ping @ylacombe

@kamilakesbi kamilakesbi self-assigned this May 16, 2024
@kamilakesbi
Contributor

kamilakesbi commented May 17, 2024

Hi @Hubert-Bonisseur,

Thanks for sharing this issue!

  • Remark 1 was solved in PR #30325 (Word-level timestamps broken for short-form audio).

  • Regarding Remark 2: long-form generation indeed requires timestamps to chunk the audio, so I think this is expected behavior. Long-form generation significantly improves the model's performance compared to chunked transcription, as indicated in this PR, so for now I don't think we should fall back to audio chunking for long audios when timestamps are deactivated. WDYT @sanchit-gandhi?

@sanchit-gandhi
Contributor

Agreed @Hubert-Bonisseur and @kamilakesbi! If you don't want to train another model, the only option for long-form transcription is chunked inference:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,  # enables chunked inference instead of the default sequential long-form algorithm
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

If you're happy to train another model but don't have timestamps in your data, you can try training with LoRA. LoRA reduces catastrophic forgetting, so even though we don't have timestamps in our fine-tuning data, the model remembers how to make timestamped predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here.
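
A minimal sketch of what such a LoRA setup could look like with PEFT; the rank, alpha, and target modules below are illustrative assumptions rather than values taken from the guide:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# illustrative hyperparameters; tune for your own dataset
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

Because only the adapters are trained, the base model's ability to predict timestamp tokens is largely preserved while the adapter learns the new domain.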

@Hubert-Bonisseur
Author

I missed the notifications, sorry.
Thanks for your answers!

If I understand correctly, @sanchit-gandhi, chunked transcription is activated once we pass the chunk_length_s argument?
And if it is left at the default, we use the fancy long-form algorithm?
Makes sense!

@kamilakesbi
Contributor

Hi @Hubert-Bonisseur,

You can indeed use chunked transcription by passing the chunk_length_s argument to the pipeline as shown in @sanchit-gandhi's script.

The default behavior is to use the long-form algorithm as it is more efficient :)
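
A minimal sketch of the two call patterns, assuming pipe was built as in @sanchit-gandhi's script but without the chunk_length_s argument:

# chunked transcription: the audio is split into fixed-length windows and batched
chunked = pipe(sample, chunk_length_s=25, return_timestamps="word")

# sequential long-form transcription (the default when chunk_length_s is not set):
# relies on Whisper's timestamp tokens to decide where each segment ends
long_form = pipe(sample)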

Hope this helps!
