
Whisper decoding returns exception about outputs.logits shape #21057

Closed

nshmyrev opened this issue Jan 9, 2023 · 6 comments

Comments

nshmyrev commented Jan 9, 2023

System Info

transformers version: 4.26.0.dev0

  • Platform: Linux-5.10.0-20-amd64-x86_64-with-glibc2.31
  • Python version: 3.9.2
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

The same error occurs on CUDA servers.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run simple decoding with Whisper large:

    speech_array, sampling_rate = torchaudio.load(fn)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    sound = resampler(speech_array).squeeze().numpy()
    input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

    with torch.no_grad():
        generated_ids = model.generate(inputs=input_features, max_length=1000)
        transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

The result is an exception:

Traceback (most recent call last):
  File "/home/user/test_whisper_hf.py", line 37, in <module>
    generated_ids = model.generate(inputs=input_features, max_length=1000)
  File "/home/user/.local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/generation/utils.py", line 1352, in generate
    return self.greedy_search(
  File "/home/user/.local/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/generation/utils.py", line 2135, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0

The output on this problematic file is

Seq2SeqLMOutput(loss=None, logits=tensor([], size=(1, 0, 51865)), past_key_values=((tensor([[[[ 1.3006e+00, -4.4066e-02, -2.5518e-02,  ...,  1.6218e-01,

This happens with only a single file out of a dataset of 10k files.
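The empty logits tensor explains the traceback: greedy search takes the last position along the sequence dimension, and with shape (1, 0, 51865) that dimension has no positions to index. A minimal sketch reproducing just the indexing failure, with the vocabulary size taken from the output above:

```python
import torch

# Logits with a size-0 sequence dimension, as in the reported
# Seq2SeqLMOutput of shape (1, 0, 51865).
logits = torch.empty(1, 0, 51865)

try:
    next_token_logits = logits[:, -1, :]  # the slice greedy_search performs
except IndexError as err:
    print(err)  # index -1 is out of bounds for dimension 1 with size 0
```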

Expected behavior

No exception

sgugger (Collaborator) commented Jan 9, 2023

cc @ArthurZucker

ArthurZucker (Collaborator) commented:

Hey! Could you provide a reproducing script with the dataset? The file might be corrupted.

nshmyrev (Author) commented:

To reproduce, you can try this code

#!/usr/bin/env python3

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

processor = WhisperProcessor.from_pretrained("mitchelldehaven/whisper-large-v2-ru")
model = WhisperForConditionalGeneration.from_pretrained("mitchelldehaven/whisper-large-v2-ru")

speech_array, sampling_rate = torchaudio.load("test.wav")
resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
sound = resampler(speech_array).squeeze().numpy()
input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

with torch.no_grad():
    generated_ids = model.generate(inputs=input_features, max_length=1000)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

with the attached file

test.zip

By the way, this happens with fine-tuned models, not the original ones.

RuABraun commented Feb 3, 2023

I have the same issue. The model is not fine-tuned.

Were you able to find a workaround, @nshmyrev?

ArthurZucker (Collaborator) commented:

In this case, using an original model works:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

fn = "/home/arthur_huggingface_co/transformers/Arthur/test.wav"
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
speech_array, sampling_rate = torchaudio.load(fn)
resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
sound = resampler(speech_array).squeeze().numpy()
input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

with torch.no_grad():
    generated_ids = model.generate(inputs=input_features, max_length=1000)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

I get

 Duh duh duh duh uh huh.

When running with your model, however, it seems that the max_length parameter is not taken into account, and the input_ids reach a length of 449, which triggers the error. The model should have stopped by then. This can have various causes, but I recommend setting max_length to 448, since the model should not be fed longer inputs (this is the case for the original models).
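A minimal sketch of the suggested workaround, assuming Whisper's decoder limit of 448 target positions (the max_target_positions value in the config of the original checkpoints); `safe_max_length` is a hypothetical helper, not part of the transformers API:

```python
# Whisper's decoder positional-embedding limit (max_target_positions in
# the model config of the original checkpoints) -- assumed to be 448 here.
WHISPER_MAX_TARGET_POSITIONS = 448

def safe_max_length(requested: int, limit: int = WHISPER_MAX_TARGET_POSITIONS) -> int:
    """Clamp a requested generation length to the decoder's limit."""
    return min(requested, limit)

# The failing script asked for max_length=1000:
print(safe_max_length(1000))  # 448
```

In the reported script this would mean calling `model.generate(inputs=input_features, max_length=safe_max_length(1000))` instead of passing 1000 directly.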

@RuABraun can you share the audio and a reproduction script?

RuABraun commented Feb 3, 2023

I fixed it by lowering max_length. Thanks
