
OVModelForWhisper cannot extract word level timestamps #1691

Closed
barolo opened this issue Feb 13, 2024 · 3 comments
Labels
bug Something isn't working

Comments

barolo commented Feb 13, 2024

System Info

optimum-1.17.0.dev0
openvino-2023.3.0
transformers-4.37.2
python 3.11

Who can help?

@philschmid

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Setting the pipeline argument return_timestamps="word" results in failure. I am using the model openai/whisper-small, which is properly configured to output token-level timestamps, with the OpenVINO GPU backend. Without the word-level argument the task finishes normally.

Resulting error:

Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 58, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
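
For context, the failing call is in optimum/intel/openvino/modeling_seq2seq.py: generate() forwards to self._extract_token_timestamps(...), which transformers defines on WhisperForConditionalGeneration but which _OVModelForWhisper neither inherits nor reimplements. A minimal, untested workaround sketch is to bind the transformers implementation onto the OpenVINO model; this assumes the OpenVINO generate() output actually carries the cross-attention tensors the method aligns against, which may not hold:

# Untested workaround sketch: borrow the missing method from transformers.
# Assumes the OpenVINO generate() output includes cross_attentions; if it
# does not, the borrowed method has nothing to align against and still fails.
from transformers import WhisperForConditionalGeneration

ov_model._extract_token_timestamps = (
    WhisperForConditionalGeneration._extract_token_timestamps.__get__(ov_model)
)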

My script

from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from pathlib import Path
import openvino as ov
import json
import whisper  # openai-whisper package, used here only for audio loading

# Load the Whisper model
model_id = "openai/whisper-small"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

# Load audio
audio = whisper.load_audio("./4.wav")
input_features = processor(audio, return_tensors="pt").input_features  # not used by the pipeline below

# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

# Choose device
device = 'GPU'  # OpenVINO target device; use 'CPU' to run on the CPU
ov_model.to(device)
ov_model.compile()

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=[4, 2],
    return_timestamps="word"
)


# Transcribe the audio
result = pipe("./4.wav")
# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)

print(result["text"])
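
For comparison, here is a minimal sketch of the same pipeline on the stock PyTorch backend (reusing the model and processor already loaded above), where return_timestamps="word" is supported; it isolates the failure to the OpenVINO wrapper:

# Sanity check: the plain transformers Whisper model handles word timestamps.
pt_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # the WhisperForConditionalGeneration loaded at the top
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    return_timestamps="word",
)
print(pt_pipe("./4.wav")["chunks"])  # per-word {"text", "timestamp"} entries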

Expected behavior

The pipeline outputs word/token-level timestamps properly.
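
Concretely, the result dict should carry a chunks list alongside the text, shaped roughly like this (all values illustrative, not from a real run):

{
    "text": " hello world ...",
    "chunks": [
        {"text": " hello", "timestamp": (0.0, 0.42)},
        {"text": " world", "timestamp": (0.42, 0.9)},
    ],
}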

barolo added the bug label Feb 13, 2024
barolo changed the title from "OVModelForWhisper cannot extract work level timestamps" to "OVModelForWhisper cannot extract token level timestamps" Feb 13, 2024
barolo changed the title from "OVModelForWhisper cannot extract token level timestamps" to "OVModelForWhisper cannot extract word level timestamps" Feb 13, 2024
fxmarty (Collaborator) commented Feb 16, 2024

cc @echarlaix

barolo (Author) commented Feb 16, 2024

Here's a more concise example huggingface/optimum-intel#561
(I wasn't sure where to report the issue initially)

echarlaix (Collaborator) commented

Closing it to keep discussion in huggingface/optimum-intel#561
