
OVModelForWhisper cannot extract word level timestamps #1691

Closed
barolo opened this issue Feb 13, 2024 · 3 comments
Labels
bug Something isn't working

Comments

barolo commented Feb 13, 2024

System Info

optimum-1.17.0.dev0
openvino-2023.3.0
transformers-4.37.2
python 3.11

Who can help?

@philschmid

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Setting the pipeline argument return_timestamps="word" results in failure. I am using the model openai/whisper-small, which is properly configured to output token-level timestamps, with the OpenVINO GPU backend. Without the word-level argument the task finishes normally.

Resulting error:

Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 58, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
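
For context, the failing call is in optimum/intel/openvino/modeling_seq2seq.py: generate() forwards to self._extract_token_timestamps(...), which transformers defines on WhisperForConditionalGeneration but which _OVModelForWhisper neither inherits nor reimplements. A minimal, untested workaround sketch is to bind the transformers implementation onto the OpenVINO model; this assumes the OpenVINO generate() output actually carries the cross-attention tensors the method aligns against, which may not hold:

# Untested workaround sketch: borrow the missing method from transformers.
# Assumes the OpenVINO generate() output includes cross_attentions; if it
# does not, the borrowed method has nothing to align against and still fails.
from transformers import WhisperForConditionalGeneration

ov_model._extract_token_timestamps = (
    WhisperForConditionalGeneration._extract_token_timestamps.__get__(ov_model)
)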

My script

from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from pathlib import Path
import openvino as ov
import json
import whisper  # openai-whisper package, used here only for audio loading

# Load the Whisper model
model_id = "openai/whisper-small"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

# Load audio
audio = whisper.load_audio("./4.wav")
input_features = processor(audio, return_tensors="pt").input_features  # not used by the pipeline below

# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

# Choose device
device = 'GPU'  # OpenVINO target device; use 'CPU' to run on the CPU
ov_model.to(device)
ov_model.compile()

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=[4, 2],
    return_timestamps="word"
)


# Transcribe the audio
result = pipe("./4.wav")
# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)

print(result["text"])
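
For comparison, here is a minimal sketch of the same pipeline on the stock PyTorch backend (reusing the model and processor already loaded above), where return_timestamps="word" is supported; it isolates the failure to the OpenVINO wrapper:

# Sanity check: the plain transformers Whisper model handles word timestamps.
pt_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # the WhisperForConditionalGeneration loaded at the top
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    return_timestamps="word",
)
print(pt_pipe("./4.wav")["chunks"])  # per-word {"text", "timestamp"} entries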

Expected behavior

The pipeline outputs word/token-level timestamps properly.
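
Concretely, the result dict should carry a chunks list alongside the text, shaped roughly like this (all values illustrative, not from a real run):

{
    "text": " hello world ...",
    "chunks": [
        {"text": " hello", "timestamp": (0.0, 0.42)},
        {"text": " world", "timestamp": (0.42, 0.9)},
    ],
}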

barolo added the bug label Feb 13, 2024
barolo changed the title from "OVModelForWhisper cannot extract work level timestamps" to "OVModelForWhisper cannot extract token level timestamps" Feb 13, 2024
barolo changed the title from "OVModelForWhisper cannot extract token level timestamps" to "OVModelForWhisper cannot extract word level timestamps" Feb 13, 2024
fxmarty (Collaborator) commented Feb 16, 2024

cc @echarlaix

barolo (Author) commented Feb 16, 2024

Here's a more concise example huggingface/optimum-intel#561
(I wasn't sure where to report the issue initially)

echarlaix (Collaborator) commented

Closing it to keep discussion in huggingface/optimum-intel#561
