
Whisper decoding returns exception about outputs.logits shape #21057

Closed

nshmyrev opened this issue Jan 9, 2023 · 6 comments

Comments

nshmyrev commented Jan 9, 2023

System Info

transformers version: 4.26.0.dev0

  • Platform: Linux-5.10.0-20-amd64-x86_64-with-glibc2.31
  • Python version: 3.9.2
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

The same error occurs on CUDA servers.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run simple decoding with Whisper large:

    speech_array, sampling_rate = torchaudio.load(fn)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    sound = resampler(speech_array).squeeze().numpy()
    input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

    with torch.no_grad():
        generated_ids = model.generate(inputs=input_features, max_length=1000)
        transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

The result is an exception:

Traceback (most recent call last):
  File "/home/user/test_whisper_hf.py", line 37, in <module>
    generated_ids = model.generate(inputs=input_features, max_length=1000)
  File "/home/user/.local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/generation/utils.py", line 1352, in generate
    return self.greedy_search(
  File "/home/user/.local/lib/python3.9/site-packages/transformers-4.26.0.dev0-py3.9.egg/transformers/generation/utils.py", line 2135, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0

The output on this problematic file is

Seq2SeqLMOutput(loss=None, logits=tensor([], size=(1, 0, 51865)), past_key_values=((tensor([[[[ 1.3006e+00, -4.4066e-02, -2.5518e-02,  ...,  1.6218e-01,

This happens with only a single file out of a dataset of 10k files.
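The empty logits tensor explains the traceback: greedy search takes the last position along the sequence dimension, and with shape (1, 0, 51865) that dimension has no positions to index. A minimal sketch reproducing just the indexing failure, with the vocabulary size taken from the output above:

```python
import torch

# Logits with a size-0 sequence dimension, as in the reported
# Seq2SeqLMOutput of shape (1, 0, 51865).
logits = torch.empty(1, 0, 51865)

try:
    next_token_logits = logits[:, -1, :]  # the slice greedy_search performs
except IndexError as err:
    print(err)  # index -1 is out of bounds for dimension 1 with size 0
```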

Expected behavior

No exception

sgugger (Collaborator) commented Jan 9, 2023

cc @ArthurZucker

ArthurZucker (Collaborator) commented:

Hey! Could you provide a reproducing script with the dataset? The file might be corrupted.

nshmyrev (Author) commented:

To reproduce, you can try this code

#!/usr/bin/env python3

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

processor = WhisperProcessor.from_pretrained("mitchelldehaven/whisper-large-v2-ru")
model = WhisperForConditionalGeneration.from_pretrained("mitchelldehaven/whisper-large-v2-ru")

speech_array, sampling_rate = torchaudio.load("test.wav")
resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
sound = resampler(speech_array).squeeze().numpy()
input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

with torch.no_grad():
    generated_ids = model.generate(inputs=input_features, max_length=1000)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

with the attached file

test.zip

By the way, this happens with fine-tuned models, not the original ones.

RuABraun commented Feb 3, 2023

I have the same issue. The model is not fine-tuned.

Were you able to find a workaround, @nshmyrev?

ArthurZucker (Collaborator) commented:

In this case, using an original model works:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

fn = "/home/arthur_huggingface_co/transformers/Arthur/test.wav"
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
speech_array, sampling_rate = torchaudio.load(fn)
resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
sound = resampler(speech_array).squeeze().numpy()
input_features = processor(sound, return_tensors="pt", sampling_rate=16_000).input_features

with torch.no_grad():
    generated_ids = model.generate(inputs=input_features, max_length=1000)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

I get

 Duh duh duh duh uh huh.

When running with your model, however, it seems that the max_length parameter is not taken into account, and the input_ids reach a length of 449, which triggers the error. The model should have stopped by then. This can have various causes, but I recommend setting max_length to 448, since the model should not be fed longer inputs (this is the case for the original models).
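A minimal sketch of the suggested workaround, assuming Whisper's decoder limit of 448 target positions (the max_target_positions value in the config of the original checkpoints); `safe_max_length` is a hypothetical helper, not part of the transformers API:

```python
# Whisper's decoder positional-embedding limit (max_target_positions in
# the model config of the original checkpoints) -- assumed to be 448 here.
WHISPER_MAX_TARGET_POSITIONS = 448

def safe_max_length(requested: int, limit: int = WHISPER_MAX_TARGET_POSITIONS) -> int:
    """Clamp a requested generation length to the decoder's limit."""
    return min(requested, limit)

# The failing script asked for max_length=1000:
print(safe_max_length(1000))  # 448
```

In the reported script this would mean calling `model.generate(inputs=input_features, max_length=safe_max_length(1000))` instead of passing 1000 directly.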

@RuABraun can you share the audio and a reproduction script?

RuABraun commented Feb 3, 2023

I fixed it by lowering max_length. Thanks
