Whisper language is not returned when using word timestamps #29520

Open
robinderat opened this issue Mar 7, 2024 · 3 comments
Labels: Audio, bug, Feature request


@robinderat

System Info

  • transformers version: 4.38.2
  • Platform: macOS-13.6.1-x86_64-i386-64bit
  • Python version: 3.10.11
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the automatic speech recognition pipeline with Whisper large-v3. I want to get both the language as detected by Whisper and the timestamps for the individual words. Both behaviors are supported individually (return_language=True, return_timestamps="word"); however, when combined, the language is no longer returned.

This is the code I am using:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, use_safetensors=True, torch_dtype=torch.float16, low_cpu_mem_usage=True,
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch.float16,
    return_language=True
)

result = pipe(audio_filepath, return_timestamps="word")
print(result)
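
For reference, with these settings the pipeline currently returns word-level chunks with no language field at all, roughly in this shape (illustrative, not actual output):

{
  "text": "FULL TEXT",
  "chunks": [
    {"text": "WORD", "timestamp": (0.0, 0.5)},
    ...
  ]
}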

I have done some digging in the source code, and I believe I have found the problem in tokenization_whisper.py.

Here, when using word timestamps, the existing chunks containing the language are ignored and only the words in each chunk are returned:

if return_timestamps or return_language:
    for chunk in chunks:
        if not return_timestamps:
            chunk.pop("timestamp")
        else:
            chunk["timestamp"] = tuple(chunk["timestamp"])
        if not return_language:
            chunk.pop("language")

    if return_timestamps == "word":
        # The chunk-level dicts (which carry "language") are dropped here;
        # only the word-level entries are returned.
        new_chunks = []
        for chunk in chunks:
            new_chunks.extend(chunk["words"])
        optional = {"chunks": new_chunks}
    else:
        optional = {"chunks": chunks}
else:
    optional = {}
return full_text, optional

Expected behavior

I expect to get both the language and the word timestamps when using return_language=True and return_timestamps="word".

Considering the source code mentioned above, I see two possible ways of achieving this.

  1. Add an extra list to the chunks when using return_timestamps='word'

This would result in the format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": TIMESTAMP,
      "text": "CHUNK_TEXT",
      "words": [
        {"text": "WORD", "timestamp": WORD_TIMESTAMP}
      ]
    }
  ]
}
  2. Add the language information to the words

This would maintain the current structure and simply add the language of the chunk to its words, giving the following format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": WORD_TIMESTAMP,
      "text": "WORD"
    }
  ]
}

I believe solution 1 could be achieved simply by removing the if return_timestamps == "word": branch completely.
I believe solution 2 could be achieved by replacing new_chunks.extend(chunk["words"]) with:

for word in chunk["words"]:
    word["language"] = chunk["language"]
    new_chunks.append(word)
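
Applied to the snippet from tokenization_whisper.py quoted above, solution 2 would make the word-timestamp branch look roughly like this (a sketch, not a tested patch; the return_language guard is my addition, since "language" is only present on a chunk when return_language=True):

if return_timestamps == "word":
    new_chunks = []
    for chunk in chunks:
        for word in chunk["words"]:
            # Propagate the chunk-level language down to each word entry.
            if return_language:
                word["language"] = chunk["language"]
            new_chunks.append(word)
    optional = {"chunks": new_chunks}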
@ArthurZucker
Collaborator

That is expected IMO: how do you choose which language to return? Do you return a language for each chunk? For each text segment between timestamps? Do you return the language that is used most? I think these were the considerations. Also, getting the language first should be pretty easy.

cc @ylacombe as well

@ArthurZucker added the Feature request label Mar 25, 2024
@robinderat
Author

I have implemented the second approach I outlined in a fork of the project (which is not ideal), and it's working just fine. In my opinion, if you can predict the language for a chunk, there is no harm in saying that all words in that chunk belong to that language.

If there is an easy way to get the language first, that would be fine too, but I don't see how, besides running the model twice, which is slow and wasteful.
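
For what it's worth, here is a rough, untested sketch of getting the language up front with one short generate call, reusing the model and processor from the example above. audio_array is assumed to be a 16 kHz mono waveform (not taken from the issue), and the token position assumes Whisper's <|startoftranscript|><|language|><|task|> prefix:

import torch

inputs = processor.feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(model.device, dtype=model.dtype)

with torch.no_grad():
    # A few tokens are enough: the language token is emitted right after
    # <|startoftranscript|> when no language is forced.
    generated = model.generate(input_features, max_new_tokens=5)

# e.g. "<|en|>"
language_token = processor.tokenizer.decode(generated[0, 1:2])
print(language_token)

This only looks at the first 30-second window, though, so it is not a full substitute for per-chunk language detection.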

@kamilakesbi
Contributor

Hi @robinderat,

Thank you so much for this question!

This is indeed a bug, and fixing it would be a very nice contribution. The second approach you suggest looks good as it would keep the structure of the output and just add the missing attributes.

Would you like to open a PR for this fix, given that you have implemented it in a forked project?

This would be a really valuable fix for the Whisper community :)
